DEV Community

Joshua Lorenzo

I Added Memory to My AI Agent. It Got Worse.

Adding memory to my coding mentor agent made it more confidently wrong. It stopped forgetting mistakes — it started cataloging them, then repeating the same useless explanation with slightly more context attached. That's when I realized I had confused storage with learning.


The Problem With Memory-as-Retrieval

Most "memory-enabled" AI agents work like this: you store past interactions, retrieve semantically similar ones, stuff them into a prompt, and hope the model does something useful with the context. This is RAG with extra steps, and it has a fundamental flaw — the agent's behavior never changes. It just has more words to reference.

My project was a coding mentor. Users would paste broken Python. The agent would explain what was wrong. Standard stuff. I added persistent memory using Hindsight so the system could recall past mistakes. On paper, brilliant. In practice, the agent would retrieve a user's three previous loop errors, acknowledge them with something like "I notice you've struggled with loops before," and then give the exact same technical explanation it would have given without any memory at all.

The output mentioned history. It didn't respond to it.

That's not learning. That's a better cover letter for the same rejection.


What I Actually Built

The stack is straightforward: FastAPI backend, Hindsight for persistent memory, and Groq running a large model for inference. The interesting part is none of those — it's the layer between memory retrieval and the LLM call.

Here's the actual flow on every /analyze request:

  1. Run a quick LLM pass to detect error categories in the submitted code
  2. Pull past mistakes from Hindsight for this user
  3. Count repeat occurrences per category
  4. Feed those counts into a decision engine that picks a teaching mode
  5. Build a prompt whose structural instructions change with the mode
  6. Call the LLM
  7. Store the result back to memory

Steps 3 and 4 are where this diverges from standard memory-augmented generation. The LLM never decides how to teach. A deterministic Python function does.
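To make the flow concrete, here is a minimal, self-contained sketch of the seven steps with the LLM and Hindsight calls stubbed out. Everything in it — the in-memory store, the trivial keyword "classifier", the prompt format — is an illustrative stand-in, not the project's real API:

```python
MEMORY: dict = {}  # user_id -> list of past error categories (stand-in for Hindsight)

MODES = ["normal", "reinforced", "step_by_step", "simplified_with_analogy"]

def analyze(user_id: str, code: str) -> str:
    # Step 1 (stubbed): the real system runs a fast LLM pass here
    category = "loops" if "range(" in code else "general"
    past = MEMORY.get(user_id, [])          # step 2: recall past mistakes
    repeats = past.count(category)          # step 3: count repeats per category
    mode = MODES[min(repeats, 3)]           # step 4: deterministic mode choice
    prompt = f"[mode={mode}] Explain the {category} bug in:\n{code}"  # step 5
    # Step 6, the actual LLM call, is omitted; step 7 stores the result
    MEMORY.setdefault(user_id, []).append(category)
    return prompt

# The same bug submitted four times escalates the teaching mode:
for _ in range(4):
    prompt = analyze("u1", "for i in range(len(arr)): print(arr[i+1])")
print(prompt.splitlines()[0])
# → [mode=simplified_with_analogy] Explain the loops bug in:
```

Note that the LLM never appears in the escalation path: the mode is a pure function of the repeat count.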


The Turning Point

The first working version retrieved memories and injected them into a generic prompt. Something like:

User's past mistakes: [loop error x2, syntax error x1]
Now explain what's wrong with this code.

The model would dutifully mention the past errors and then produce a response indistinguishable from one with no memory at all. It had the information. It didn't know what to do with it.

The insight that fixed everything was this: the prompt itself needs to structurally change, not just the data inside it. Telling an LLM "the user has seen this three times, be simpler" produces marginally simpler output. Giving it a completely different response schema — different required sections, different format, different tone rules — produces a completely different response.

The distinction matters: data adaptation versus behavior adaptation. Most memory systems do the former. This needed the latter.
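What behavior adaptation looks like in prompt terms can be sketched like this — the schema strings below are invented for the example, not the project's actual prompts. The data passed in is identical every time; only the required output structure changes with the mode:

```python
# Each teaching mode maps to a different required response schema.
# These schema strings are illustrative, not the project's real prompts.
SCHEMAS = {
    "normal": "Sections: DIAGNOSIS, FIX.",
    "reinforced": "Sections: PATTERN YOU KEEP HITTING, WHY IT RECURS, RULE TO REMEMBER.",
    "step_by_step": "Numbered sections: WHAT IS WRONG, WHY IT BREAKS, "
                    "THE FIX, VERIFY. End with a BEFORE/AFTER code pair.",
    "simplified_with_analogy": "Sections: ANALOGY (no jargon), "
                               "WHAT THIS MEANS IN CODE, THE ONE RULE.",
}

def build_prompt(code: str, history_note: str, mode: str) -> str:
    # Same data every time; only the structural instructions vary by mode.
    return f"{SCHEMAS[mode]}\nUser history: {history_note}\nCode:\n{code}"

p1 = build_prompt("for i in range(len(arr)): ...", "loops x3", "normal")
p4 = build_prompt("for i in range(len(arr)): ...", "loops x3", "simplified_with_analogy")
print(p1.splitlines()[0])   # → Sections: DIAGNOSIS, FIX.
```

Swapping the schema, rather than appending "be simpler" to a fixed template, is the whole trick.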


The Learning Engine

Errors get classified into six categories: loops, syntax, indexing, logic, functions, strings. The classifier uses keyword matching as a fast first pass:

CATEGORY_KEYWORDS = {
    "loops":    ["loop", "range", "while", "for", "iteration"],
    "syntax":   ["syntax", "bracket", "indent", "colon", "parenthes"],
    "indexing": ["index", "off-by-one", "out of range", "subscript"],
    "logic":    ["logic", "condition", "comparison", "operator", "boolean"],
    # "functions" and "strings" entries omitted here for brevity
}

def classify_to_category(error_type: str, topic: str = "") -> str:
    combined = (error_type + " " + topic).lower()
    for cat, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in combined for kw in keywords):
            return cat
    return "general"

After classification, Hindsight's memory API gives us every past record for this user. We count occurrences per category:

from collections import defaultdict
from typing import Dict, List

def build_repeat_counts(self, past_mistakes: List[Dict]) -> Dict[str, int]:
    counts: Dict[str, int] = defaultdict(int)
    for record in past_mistakes:
        meta = record.get("metadata", {})
        cat  = meta.get("category") or classify_to_category(
            meta.get("error_type", ""), meta.get("topic", "")
        )
        counts[cat] += 1
    # Example result: {"loops": 3, "syntax": 1}
    return dict(counts)

Then a decision function maps count to mode:

def decide_mode(repeat_count: int) -> str:
    if repeat_count == 0:   return "normal"
    elif repeat_count == 1: return "reinforced"
    elif repeat_count == 2: return "step_by_step"
    else:                   return "simplified_with_analogy"

When a submission contains multiple error categories, each gets its own mode. The dominant_mode function picks the most escalated one to set the overall response structure. If a user has three loop errors and one syntax error, the loop category drives the response format while the syntax error gets normal treatment.
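The post doesn't show dominant_mode itself, so here is a plausible sketch consistent with that description — per-category modes computed independently, with the most escalated one winning:

```python
ESCALATION = ["normal", "reinforced", "step_by_step", "simplified_with_analogy"]

def decide_mode(repeat_count: int) -> str:
    # Same mapping as decide_mode above, expressed as an index
    # into the escalation ladder.
    return ESCALATION[min(repeat_count, 3)]

def dominant_mode(categories: list, repeat_counts: dict) -> str:
    # Each detected category gets its own mode; the most escalated
    # one sets the overall response structure.
    modes = [decide_mode(repeat_counts.get(cat, 0)) for cat in categories]
    return max(modes, key=ESCALATION.index) if modes else "normal"

# Three prior loop errors, one syntax error: loops drives the format.
print(dominant_mode(["loops", "syntax"], {"loops": 3, "syntax": 1}))
# → simplified_with_analogy
```

Using `ESCALATION.index` as the max key means "most escalated" is just "furthest along the ladder", which keeps the tie-breaking deterministic.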


What Behavior Change Actually Looks Like

Same user, same loop bug, submitted four times across sessions. Here is what each response looks like structurally — not paraphrased, this is the actual format the LLM is instructed to produce:

Submission 1 (repeat_count = 0, mode: normal):

"The loop accesses arr[i+1] which causes an IndexError on the final iteration since i reaches len(arr)-1. Change the range to range(len(arr)-1) or use enumerate()."

Submission 2 (repeat_count = 1, mode: reinforced):

"You've seen this before — the pattern is accessing an element one position ahead of the current index. This keeps happening because range(len(arr)) includes the last index, but arr[i+1] tries to go one further. Rule: whenever you write arr[i+1] inside a loop, always ask if the range boundary accounts for it."

Submission 3 (repeat_count = 2, mode: step_by_step):

Step 1 — What is wrong: Line 2 accesses arr[i+1] when i = len(arr)-1
Step 2 — Why it breaks: Python raises IndexError because index len(arr) does not exist
Step 3 — The fix: Change range(len(arr)) to range(len(arr)-1)
Step 4 — Verify: The loop now stops one index short, arr[i+1] always has a valid target

BEFORE: for i in range(len(arr)): print(arr[i+1])
AFTER:  for i in range(len(arr)-1): print(arr[i+1])

Submission 4 (repeat_count = 3, mode: simplified_with_analogy):

ANALOGY: Imagine you're handing out programs at a 10-seat theater. You have
seats 0 through 9. If you try to hand a program to seat 10, it doesn't exist.
Your loop is doing exactly this — asking for a seat that was never built.

WHAT THIS MEANS IN CODE: range(len(arr)) gives you indices 0 to 9 for a
10-element list. Asking for arr[i+1] when i=9 is seat 10 — it does not exist.

THE ONE RULE: If you write arr[i+1] in a loop, your range must stop one early.

This is not rephrasing. The response schema is structurally different each time. The LLM is given different section headers it must fill, different format requirements, different tone constraints. The data inside changes because the user's history changes. The shape of the response changes because the decision engine says it must.


What Surprised Me

The keyword classifier was good enough. I spent two days building a more sophisticated embedding-based error classifier before realizing the keyword approach covered 90% of cases and was fast, deterministic, and debuggable. The fancier version introduced latency and hallucination risk without meaningfully better categorization.

The prompt structure mattered more than the prompt content. I spent a week tuning language in the prompt — better phrasing, more specific examples, clearer tone guidance. None of it moved the needle as much as replacing the instruction "be simpler" with a required output schema that had different sections. The LLM follows structure more reliably than it follows tone instructions.

Memory without a decision layer is expensive context stuffing. Without the decision engine, I was pushing 30 past records into every prompt and hoping the model synthesized something useful. It didn't. With the decision engine, I'm pushing a handful of targeted records plus explicit structural instructions. The responses got better while the token count dropped.

The two-LLM-call architecture was unavoidable. One fast call identifies the error categories. One call generates the adaptive response. I tried collapsing them into a single call. The model would frequently misclassify errors or apply the wrong teaching mode when asked to do both in one shot. Separating the concerns — classification versus generation — fixed this.
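The split can be sketched like this — `complete` is a deterministic stand-in for the actual Groq client (which the post doesn't show), and the model names are placeholders:

```python
def complete(prompt: str, model: str) -> str:
    # Deterministic stand-in so the sketch runs; a real implementation
    # would send `prompt` to the named model via the inference API.
    return f"[{model}] " + prompt.splitlines()[0]

def classify_errors(code: str) -> str:
    # Call 1: fast, cheap, and constrained to emit only category names.
    return complete("List the error categories in this code, "
                    f"comma-separated, nothing else:\n{code}", model="fast-model")

def generate_response(code: str, mode_instructions: str) -> str:
    # Call 2: the large model, given the structural instructions for the
    # mode the decision engine already picked — it never picks the mode.
    return complete(f"{mode_instructions}\nCode:\n{code}", model="large-model")
```

The point of the split: each call has one narrow job, so neither prompt has to hedge against the other task's failure modes.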


Lessons

  • Memory without behavior change is just a longer context window. If the agent's response structure doesn't change based on history, memory is decoration.
  • Deterministic decision logic beats prompt-based reasoning for escalation. Don't ask the LLM to decide when to switch modes. Write a four-line Python function and call it.
  • Category-level tracking is more useful than individual error tracking. "This user has made 3 loop-boundary errors" is more actionable than "this user made these three specific errors."
  • Force structural variation in the prompt, not just tonal variation. Required output sections that differ per mode are far more effective than adjectives like "simpler" or "more detailed."
  • The UI matters more than you think for demonstrating intelligence. Before building the before/after toggle, people couldn't easily see that the responses were structurally different. Making the behavior change visible to a non-technical observer is its own engineering problem.

This Is Not RAG

Standard RAG retrieves relevant documents and adds them to context. What I built uses memory differently: past interactions are evidence that feeds a decision engine, and the decision engine changes what the LLM is asked to produce, not just what it knows.

The distinction: RAG gives an LLM more information to answer the same question. This system uses past information to answer a different question — specifically, "given that this user has failed this way three times, what teaching structure has the best chance of breaking the pattern?"

That's closer to how a human tutor operates. A good tutor doesn't explain the same concept the same way indefinitely. After the second failure they slow down. After the third they draw a diagram. After the fourth they use an analogy. The content stays the same. The delivery method changes.

The full memory architecture runs on Hindsight, which handles the persistence and semantic retrieval cleanly. The learning engine on top of it — the categorization, the counting, the mode switching — is about 250 lines of Python. The complexity lives in that layer, not in the memory store or the LLM. Which is probably the right place for it.

The code isn't doing anything exotic. The idea is the part that took time to get right.
