

# Why My Coding Agent Stopped Repeating Errors with Hindsight
I thought better prompts would fix my coding assistant—until I watched it confidently repeat the same error across multiple attempts.
The agent kept suggesting asyncio.run() inside an already-running event loop. I'd corrected it. It apologized. Ten minutes later, same session, different file—same mistake. I'd written careful system prompt instructions. I'd added examples. None of it stuck. The model had no idea it had just made that exact error, because from its perspective, it hadn't. Every invocation was a blank slate.
That's the problem I built around: not intelligence, but amnesia.
## What the System Does
The project is a coding practice mentor—a Python agent that helps developers debug their own code. It doesn't just answer questions. It tracks which mistakes a user makes repeatedly, stores them, and uses that history to change how it responds over time.
The high-level flow is straightforward:
- User submits broken code and a description of what it should do.
- The agent diagnoses the issue and responds with a fix and explanation.
- That interaction—error type, context, how the user described the problem—gets stored as a memory.
- On the next interaction, the agent recalls relevant past mistakes before generating a response, and adjusts its suggestions accordingly.
The piece that makes this work is Hindsight—a memory system built specifically for AI agents, with a retrieval architecture that goes well beyond simple vector search.
## The Architecture in Three Layers
The system has three main modules: mentor.py (the agent loop), memory_store.py (the Hindsight integration), and session_tracker.py (user behavior tracking across sessions).
mentor.py is the orchestrator. It takes a user submission, calls memory_store.recall() to pull relevant past mistakes, builds a prompt that includes that history, calls the LLM, and then calls memory_store.retain() to commit the new interaction to memory. The loop is intentionally simple—the complexity lives in the memory layer, not the agent logic.
session_tracker.py maintains a lightweight record of per-user session behavior: which error categories appeared, how many times, and whether the user accepted or revised the agent's suggestion. This feeds into what gets retained and how it gets tagged.
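A minimal sketch of what such a tracker could look like, assuming an in-memory record per user. The `SessionRecord` class, its fields, and the `repeat_categories` helper are illustrative names, not the project's actual code.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SessionRecord:
    """Hypothetical per-user tally of error categories and suggestion outcomes."""
    user_id: str
    error_counts: Counter = field(default_factory=Counter)
    accepted: int = 0
    revised: int = 0

    def log(self, error_category: str, accepted: bool) -> None:
        # Count the category and note whether the user took the suggestion as-is.
        self.error_counts[error_category] += 1
        if accepted:
            self.accepted += 1
        else:
            self.revised += 1

    def repeat_categories(self, min_count: int = 2) -> list[str]:
        # Categories seen at least `min_count` times, most frequent first.
        return [c for c, n in self.error_counts.most_common() if n >= min_count]
```

A record like this gives the retain step something to tag with: a category that shows up in `repeat_categories()` is a candidate pattern, while a one-off is not.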
The interesting engineering is entirely in how memory flows through Hindsight's three primitives: retain, recall, and reflect.
## The Memory Layer: Why Simple Storage Fails
My first pass at the memory system was embarrassingly naive. I stored each error event as a JSON blob in a SQLite table with user_id, error_type, timestamp, and raw_description columns. On each request, I pulled the last 10 rows for that user and shoved them into the context window.
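For reference, that naive design can be sketched roughly as follows. The table name `error_events` and the helper names are assumptions made for illustration, and the JSON blob is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE error_events (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id TEXT NOT NULL,
        error_type TEXT NOT NULL,
        timestamp TEXT NOT NULL,
        raw_description TEXT NOT NULL
    )
    """
)


def store_event(user_id: str, error_type: str, timestamp: str, raw_description: str) -> None:
    # One row per error event; timestamps are ISO 8601 strings so they sort as text.
    conn.execute(
        "INSERT INTO error_events (user_id, error_type, timestamp, raw_description) "
        "VALUES (?, ?, ?, ?)",
        (user_id, error_type, timestamp, raw_description),
    )


def last_events(user_id: str, limit: int = 10) -> list[str]:
    # Recency is the only ranking signal: the newest rows win, relevant or not.
    rows = conn.execute(
        "SELECT raw_description FROM error_events "
        "WHERE user_id = ? ORDER BY timestamp DESC, id DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return [row[0] for row in rows]
```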
This worked for about a day of testing before it broke down in three ways.
First, the context got noisy fast. Ten raw error records—each with its own slightly different phrasing of the same underlying mistake—didn't help the model reason about patterns. It just created redundant signal.
Second, retrieval was purely recency-based. If a user had made a specific mistake three weeks ago, it wouldn't surface even if the current problem was nearly identical. Recent but irrelevant errors crowded out older but highly relevant ones.
Third, there was no consolidation. The model couldn't tell that "forgot to await coroutine," "missing await keyword," and "coroutine object is not awaitable" were the same conceptual mistake from three different sessions.
Hindsight's observation consolidation is what solved this. When you retain() a new fact, Hindsight doesn't just store it—it analyzes it against existing memories and synthesizes observations: higher-level abstractions that capture patterns across individual facts.
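```python
# memory_store.py
import os
from datetime import datetime, timezone

from hindsight import HindsightClient

# Assumption: the API key is supplied through an environment variable.
HINDSIGHT_API_KEY = os.environ["HINDSIGHT_API_KEY"]
client = HindsightClient(api_key=HINDSIGHT_API_KEY)


def retain_mistake(user_id: str, error_type: str, description: str, resolution: str) -> None:
    """Store one diagnosed mistake in the user's memory bank."""
    client.retain(
        bank_id=f"user-{user_id}",
        content=f"Error type: {error_type}. Description: {description}. Resolution: {resolution}.",
        metadata={
            "user_id": user_id,
            "error_category": error_type,
            # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    )
```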
After a few sessions, Hindsight consolidates individual retain() calls about async errors into an observation like: "This user consistently misuses asyncio—specifically, they attempt to call asyncio.run() in contexts where an event loop is already running. This has appeared 4 times across 3 sessions."
That observation is what gets surfaced during recall(). Not four separate raw facts—one synthesized, evidence-backed insight that the agent can actually reason about.
## TEMPR Retrieval: Not Just Semantic Search
The second thing I got wrong early was treating retrieval as a semantic similarity problem. My SQLite approach used cosine similarity on embeddings. That's fine for finding conceptually similar text—but it misses a lot.
Consider the query: "Why does my FastAPI endpoint block?" A pure semantic search might surface memories about blocking I/O, or about FastAPI generally. But what I actually want is: has this specific user made async mistakes before, and if so, which ones?
Hindsight's multi-strategy TEMPR retrieval runs four strategies in parallel: semantic (conceptual similarity), keyword/BM25 (exact term matching), graph (related entities and indirect connections), and temporal (recency and time-range awareness). The results are fused before being returned.
```python
# memory_store.py
def recall_relevant_history(user_id: str, current_problem: str) -> str:
    """Return the user's most relevant past mistakes as a bulleted string."""
    results = client.recall(
        bank_id=f"user-{user_id}",
        query=current_problem,
        top_k=5,
    )
    if not results:
        return ""
    # One bullet per recalled memory, with its relevance score.
    history_lines = [f"- {r.content} (relevance: {r.score:.2f})" for r in results]
    return "\n".join(history_lines)
```
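To make the fusion step concrete: one common way to merge several ranked lists is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that technique, not Hindsight's code, and whether Hindsight fuses results exactly this way is an assumption here. The function name, the memory IDs, and the constant `k = 60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of memory IDs into a single ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank), so items that several
            # strategies agree on accumulate a higher total.
            scores[memory_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by ID to keep the order deterministic.
    return sorted(scores, key=lambda m: (-scores[m], m))


# Hypothetical results from the four strategies for a single query.
semantic = ["m1", "m2", "m3"]
keyword = ["m2", "m4"]
graph = ["m2", "m1"]
temporal = ["m2", "m4"]
fused = reciprocal_rank_fusion([semantic, keyword, graph, temporal])
```

The point of fusing by rank rather than by raw score is that the strategies score on incompatible scales, and a memory that every strategy surfaces (here `m2`) rises to the top even when no single strategy ranked it first.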
The graph retrieval is the part that surprised me most. Hindsight maintains entity relationships across retained facts—so if the system knows "user X has an async mistake" and "async mistakes often relate to event loop misuse," it can surface that connection even without an exact semantic match to the current query. This matters for a coding mentor because mistakes often cluster: a user who misuses asyncio.run() probably also struggles with await placement, and both should surface together.
The temporal dimension matters too. Error patterns from last week are more relevant than errors from two months ago—not because the older ones are wrong, but because recency signals active struggle versus resolved understanding. The retrieval layer weights this without me having to build it.
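As a rough illustration of recency weighting, an exponential decay with a half-life is one simple option. This is not Hindsight's actual formula; the function name and the 14-day half-life are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def recency_weight(event_time: datetime, now: datetime, half_life_days: float = 14.0) -> float:
    """Weight in (0, 1] that halves every `half_life_days`."""
    # Clamp negative ages (events stamped in the future) to zero.
    age_days = max((now - event_time).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


now = datetime.now(timezone.utc)
last_week = recency_weight(now - timedelta(days=7), now)
two_months_ago = recency_weight(now - timedelta(days=60), now)
```

With these settings an error from last week keeps most of its weight while one from two months ago contributes very little, which matches the "active struggle versus resolved understanding" intuition without discarding the older memory outright.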
## Building the Prompt with Memory
Once recall returns results, the agent loop injects them into the system prompt before calling the LLM.
```python
# mentor.py
def build_prompt(user_code: str, problem_description: str, memory_context: str) -> list[dict]:
    system = (
        "You are a coding mentor helping a developer debug their code. "
        "You have access to this user's past mistakes and patterns. "
        "Use this history to avoid repeating suggestions that didn't work, "
        "and to flag patterns you've seen before.\n\n"
    )
    if memory_context:
        system += f"User's past error patterns:\n{memory_context}\n\n"
    system += "Be direct about what the problem is. Reference past patterns if relevant."
    return [
        {"role": "system", "content": system},
        {
            "role": "user",
            "content": f"Code:\n```python\n{user_code}\n```\n\nProblem: {problem_description}",
        },
    ]
```
The key design decision here is that the memory context is injected into the system prompt, not the user turn. I tried it in the user turn first—the model treated it as part of the question rather than as background context it should reason from. Moving it to the system prompt made a concrete difference in how the agent weighted that history.
After the LLM responds, the interaction is retained:
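```python
# mentor.py
from memory_store import recall_relevant_history, retain_mistake


def run_mentor_turn(user_id: str, code: str, description: str) -> str:
    # Recall first so the prompt carries the user's history.
    memory_context = recall_relevant_history(user_id, description)
    messages = build_prompt(code, description, memory_context)
    # llm_client is assumed to be a chat-completion client configured elsewhere in this module.
    response = llm_client.chat(messages)
    answer = response.choices[0].message.content
    error_type = classify_error(description)  # simple heuristic classifier
    # Retain after answering so the next turn can recall this interaction.
    retain_mistake(user_id, error_type, description, answer)
    return answer
```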
classify_error() is a lightweight function—currently just keyword matching against a set of error categories (async, type errors, scope issues, import errors, etc.). It's intentionally dumb because the sophistication lives in Hindsight's consolidation, not in my classification logic. I don't need a perfect categorization upfront; I need enough signal for Hindsight to consolidate correctly over time.
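A keyword matcher along those lines might look like the sketch below. The category names follow the list above, but the specific keywords and the `"other"` fallback are assumptions, not the project's actual tables.

```python
# Hypothetical keyword table; checked in insertion order, first match wins.
ERROR_KEYWORDS: dict[str, tuple[str, ...]] = {
    "async": ("await", "asyncio", "coroutine", "event loop"),
    "type": ("typeerror", "type error", "unsupported operand"),
    "scope": ("nameerror", "unboundlocalerror", "not defined"),
    "import": ("importerror", "modulenotfounderror", "no module named"),
}


def classify_error(description: str) -> str:
    """Return the first category whose keyword appears in the description."""
    text = description.lower()
    for category, keywords in ERROR_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    # Fallback label for anything the keyword table does not cover.
    return "other"
```

Even a table this small maps "forgot to await coroutine," "missing await keyword," and "coroutine object is not awaitable" to the same `async` label, which is about all the signal the retain step needs.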
## What Changes After Memory Accumulates
The behavioral difference is clearest after three or four sessions with the same user.
Before memory: A user submits code with a missing await. The agent explains the issue generically—here's what await does, here's the fix.
After four sessions of async mistakes: The recall returns an observation that Hindsight has synthesized: "This user has repeatedly made async-related mistakes across 4 sessions, specifically around event loop management and missing await keywords." The agent's response changes tone. Instead of explaining await from scratch, it flags the pattern: "This is the fourth time we've seen an async issue in your code. I want to highlight the underlying mental model here, not just fix this specific instance..."
That's not prompt engineering. That's the agent reasoning from actual accumulated evidence about a specific user.
The agent memory architecture on Vectorize describes this well: the goal is not just storage but continuous refinement—observations evolve as new evidence arrives, so the agent's model of a user sharpens over time rather than staying static.
## Lessons Learned
Raw fact storage doesn't scale. Dumping interaction records into a database and retrieving the last N is not memory—it's a log. Memory requires synthesis, and building your own synthesis layer is a significant amount of work. Hindsight's observation consolidation did in one API call what would have taken me weeks to approximate.
Retrieval strategy matters more than storage strategy. I spent too long thinking about how to structure what I stored, and not enough thinking about how it would be retrieved. The gap between "semantically similar" and "actually relevant given this user's specific history" is large. Multi-strategy retrieval closes that gap.
System prompt placement for memory context is non-trivial. Where you inject retrieved memories in the prompt affects how the model reasons about them. System prompt injection signals background context; user turn injection signals question content. They produce different behaviors.
Classification can be shallow if consolidation is deep. I was worried my simple classify_error() heuristic would produce noisy data. In practice, Hindsight's consolidation smooths over the noise—if three slightly different descriptions all point at the same underlying mistake, the observation captures the pattern regardless of how they were labeled on the way in.
The memory-bank-per-user model is the right abstraction. Giving each user their own bank (bank_id=f"user-{user_id}") meant I got isolation for free. No user's history bleeds into another's retrieval. The Hindsight documentation covers memory banks in detail, and it's worth reading before you start designing your retrieval logic.
The system isn't finished. Error classification needs to be smarter. The session tracker's behavioral signals (did the user revise the agent's answer? did they submit the same error again 10 minutes later?) aren't fully wired into what gets retained yet. There's more to build.
But the agent no longer repeats the same mistake to the same user twice. That was the goal. It's working.
https://github.com/geethaktgeethakt51-cmd/my-project/tree/main