I built these systems with heavy use of AI coding tools — prompting, directing, and iterating. The experiments and lessons are real.
Most AI agents are goldfish.
Every conversation starts from scratch. The agent has no idea what you asked yesterday, what format you prefer your answers in, or that it made the same mistake three sessions ago and you corrected it. For simple Q&A, this is fine. But for agents that are supposed to work with you over time — support agents, analytics assistants, coding copilots — this is a fundamental problem.
So I set out to add memory to an agent. How hard could it be?
This is the story of what I tried, what broke, and what I actually learned.
The naive mental model (and why it falls apart)
Before building anything, my mental model was simple:
- Store useful things the agent learns during conversations
- When a new conversation starts, retrieve relevant memories
- Inject them into the prompt
- Agent is now smarter
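The loop above can be sketched in a few lines. This is a deliberately minimal, in-process version: `embed` is a stand-in for whatever sentence-embedding model you use, and the "store" is just a Python list with cosine similarity over raw vectors.

```python
# Naive memory loop: store everything, retrieve by similarity, inject.
import math

memory_bank = []  # list of (text, vector) pairs

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def remember(summary, embed):
    # Store a conversation summary alongside its embedding.
    memory_bank.append((summary, embed(summary)))

def build_prompt(query, embed, k=3):
    # Retrieve the k most similar memories and prepend them to the prompt.
    qv = embed(query)
    ranked = sorted(memory_bank, key=lambda m: cosine(qv, m[1]), reverse=True)
    memories = "\n".join(text for text, _ in ranked[:k])
    return f"Relevant memories:\n{memories}\n\nUser: {query}"
```

Note there is no relevance gate here at all: the top-k memories get injected no matter how weak the similarity is. That omission is exactly where things start to break.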
All four steps are correct in principle. But every single one hides a minefield.
There’s another shift that’s easy to underestimate.
Memory makes your agent non-deterministic. Two identical prompts from two different users can now produce different responses based on historical state. Even the same user can get different answers across sessions depending on what was previously stored or retrieved.
This complicates debugging, reproducibility, and evaluation. When something goes wrong, you’re no longer inspecting just a prompt and a model — you’re inspecting a dynamic system with hidden state. Memory doesn’t just make the agent smarter. It makes the system more complex.
Attempt 1: Just store everything and retrieve by similarity
My first approach was the obvious one. After each conversation, summarize it into a memory entry. Store the embedding. When a new conversation comes in, embed the user's query, find the most similar memories, and inject them into the system prompt.
I set up a retrieval pipeline using dense vector search, embedded the memories, and ran an evaluation comparing the agent's responses with and without memory.
What happened: The agent got slightly better on average. But the retrieval gate was passing nearly 100% of the time — almost every turn got memory injected, whether it was relevant or not.
This is the first thing nobody warns you about: retrieval over-injection.
When you retrieve by similarity alone with a loose threshold, everything looks "somewhat similar" to everything else. The agent starts getting memories about shipping policies when the user is asking about product dimensions. The memories aren't wrong — they're just irrelevant. And irrelevant context is not neutral. It dilutes the agent's attention on the actual conversation and can bias it toward patterns from past sessions that don't apply.
The lesson: Retrieval quality matters more than memory quality. You can have perfect memories, but if you inject them at the wrong time, you make the agent worse.
Attempt 2: Calibrate the retrieval gate
The fix seemed straightforward — tighten the similarity threshold so only genuinely relevant memories get through. But what threshold?
I couldn't just pick a number. The right threshold depends on your embedding model, your memory content, and the distribution of queries you're serving. Too tight and memory never fires. Too loose and you're back to over-injection.
So I built a calibration step. The idea: define a target hit rate (what fraction of turns should plausibly get memory), then search for the threshold that lands in that band. I aimed for roughly 40–50% of turns getting memory injected — a range where retrieval is selective but not so rare that you can't measure its effect.
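The calibration search itself is simple to sketch. Assuming you've collected `top_scores` — the max similarity score per turn over a held-out set of queries — you sweep candidate thresholds and pick one whose hit rate lands in the target band (the band values here mirror the 40–50% I used, but they're tunable):

```python
# Offline threshold calibration: pick the similarity cutoff whose hit rate
# (fraction of turns where the top memory clears the threshold) lands in
# a target band.

def hit_rate(top_scores, threshold):
    return sum(s >= threshold for s in top_scores) / len(top_scores)

def calibrate(top_scores, target_low=0.40, target_high=0.50, steps=200):
    """Sweep candidate thresholds between the min and max observed score;
    return the first one inside the target band, or the closest to the
    band's midpoint if none lands inside."""
    lo, hi = min(top_scores), max(top_scores)
    mid = (target_low + target_high) / 2
    best, best_gap = lo, float("inf")
    for i in range(steps + 1):
        t = lo + (hi - lo) * i / steps
        r = hit_rate(top_scores, t)
        if target_low <= r <= target_high:
            return t
        if abs(r - mid) < best_gap:
            best, best_gap = t, abs(r - mid)
    return best
```

The key property: the threshold is derived from your actual score distribution, not hand-picked, so it survives swapping embedding models or memory content — you just re-run calibration.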
After calibration, things improved noticeably. The agent's responses were better when memory fired, and critically, they weren't worse when memory didn't fire. The evaluation showed clearer gains on the metrics I cared about.
The lesson: Treat retrieval as a decision, not a lookup. The threshold is a hyperparameter that needs tuning, just like learning rate in training. And you need an offline calibration step — you can't eyeball this.
Attempt 3: Better memories through distillation
With retrieval somewhat under control, I turned to the memory content itself. My initial memories were raw conversation summaries — verbose, overlapping, and noisy. When I synthesized memories from hundreds of conversations, I ended up with hundreds of candidates that said similar things in slightly different ways.
I tried a multi-stage distillation pipeline:
- Synthesize memory candidates from conversation shards in parallel
- Merge duplicates using lexical deduplication (matching on key fields)
- Semantic dedup using embeddings and clustering — collapse near-duplicate memories into representative entries
- Refine the survivors using a stronger model to clean up language and consolidate related entries
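The semantic-dedup stage can be sketched as a greedy single-pass clustering: each candidate joins the first cluster whose representative it's sufficiently similar to, otherwise it starts a new cluster. This is a simplified stand-in for the real pipeline (real clustering libraries give you more control), but it shows the shape of the operation:

```python
# Semantic dedup sketch: greedy clustering on cosine similarity.
# candidates: list of (text, vector) pairs; vectors are plain lists.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dedup(candidates, threshold=0.9):
    """Keep one representative per near-duplicate cluster."""
    reps = []  # (text, vector) representatives seen so far
    for text, vec in candidates:
        # Collapse into an existing cluster if any representative is close.
        if not any(cosine(vec, rv) >= threshold for _, rv in reps):
            reps.append((text, vec))
    return [text for text, _ in reps]
```

The `threshold` here is exactly the knob mentioned below: too high and near-duplicates survive, too low and distinct memories get collapsed together.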
This brought my memory bank from hundreds of noisy candidates down to a few dozen high-quality entries. The refined memories were cleaner, more actionable, and less redundant.
But I hit practical problems along the way. Embedding APIs have batch size limits that aren't always documented. Refinement calls with large context can time out if you're not generous with your HTTP timeouts. Clustering with a fixed similarity threshold can either over-collapse (losing distinct memories) or under-collapse (keeping near-duplicates). I ended up adding topic-balanced selection before refinement to make sure I wasn't losing coverage of important categories.
The lesson: Memory distillation is its own engineering problem. Don't expect a single LLM call to turn noisy candidates into clean memories. It's a pipeline, and each stage has failure modes.
Attempt 4: Hybrid retrieval (not just vectors)
Pure dense vector retrieval has a known weakness: it's great at semantic similarity but can miss lexical matches that matter. If a memory mentions a specific product code or error message, vector search might rank it lower than a semantically "close" but lexically different entry.
I added BM25 (keyword-based) retrieval alongside the dense vectors and combined them using Reciprocal Rank Fusion (RRF). The hybrid approach gave me the best of both worlds — semantic understanding from vectors, exact matching from BM25.
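RRF itself is pleasantly small. Each ranked list contributes `1 / (k + rank)` to a document's fused score, which rewards documents that appear near the top of either list without needing to reconcile the two systems' incompatible raw scores (`k=60` is the conventional constant):

```python
# Reciprocal Rank Fusion: combine multiple rankings (e.g. dense + BM25).
# Each input ranking is a list of doc ids ordered best-first.

def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # rank is 0-based here, so the best item scores 1 / (k + 1).
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, fusing `["a", "b", "c"]` (dense) with `["b", "c", "a"]` (BM25) promotes `b`, which both backends rank highly, above `a`, which only one does.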
Of course, the first run with hybrid retrieval silently fell back to BM25-only because the dense index had a bug (the vector database expected numeric IDs and I was passing string IDs). The pipeline completed successfully — it was designed to fail open — but I was measuring BM25 performance, not hybrid performance. I only caught this because I had retrieval mode tracking in the diagnostics.
The lesson: Hybrid retrieval is worth the complexity, but you must instrument your retrieval path. Know which backend actually answered each query. Fail-open designs are good for reliability but dangerous for evaluation — you can unknowingly evaluate the wrong system.
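A minimal version of that instrumentation looks something like this — the names `dense_search` and `bm25_search` are hypothetical stand-ins, but the pattern is the point: fail open if you must, yet always record which mode actually served the query so a silent fallback shows up in the diagnostics instead of in a misread evaluation.

```python
# Retrieval-path instrumentation sketch: record which backend answered.
import logging

log = logging.getLogger("retrieval")

def retrieve(query, dense_search, bm25_search, diagnostics):
    mode = "hybrid"
    try:
        dense_hits = dense_search(query)
    except Exception as exc:
        # Fail open, but record it loudly instead of swallowing it.
        log.warning("dense backend failed, falling back to BM25: %s", exc)
        dense_hits, mode = [], "bm25_only"
    bm25_hits = bm25_search(query)
    diagnostics.append({"query": query, "mode": mode,
                        "dense_n": len(dense_hits), "bm25_n": len(bm25_hits)})
    return dense_hits, bm25_hits
```

After an eval run, a quick count of `mode` values tells you whether you actually measured the system you thought you were measuring.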
Attempt 5: The hard cases don't move
By this point, my memory system was showing consistent, modest improvements on the overall evaluation. Responses were better with memory than without. The pairwise comparison showed the memory-augmented agent winning about 49% of head-to-heads versus 36% for the baseline, with 15% ties.
But when I sliced the results by difficulty — looking specifically at the hardest conversations, the ones with the most friction — memory made no difference.
This was the most important finding of the whole project.
The easy and medium cases improved. The hard cases were untouched. This makes intuitive sense once you think about it: the hardest cases are hard because they involve novel situations, edge cases, or complex multi-step problems that your memory bank simply doesn't cover well. Memory helps the agent avoid known mistakes and apply known patterns. It doesn't help with the unknown.
I hadn't yet decomposed why the hard cases didn't move. It could be a coverage gap (no useful memory exists for these situations), a retrieval miss (relevant memory exists but wasn't surfaced), or a utilization miss (relevant memory was injected but the model didn't act on it). Each has a different fix. Untangling this is high on my list of things to explore next.
The lesson: Memory has a ceiling, and it's lower than you'd hope for the cases that matter most. If your goal is specifically to improve the hardest interactions, memory alone won't get you there. You need better retrieval relevance for hard cases, better memory coverage of edge cases, or fundamentally different approaches for the tail.
The side experiment: can you just bake memory into the prompt?
At some point I asked the obvious question: do I even need a retrieval system? What if I just take everything the agent has learned and bake it directly into the system prompt?
I tried this — took the distilled memory and grounded it into the system prompt, then ran the agent without any retrieval component.
The results were surprising. The prompt-grounded agent showed better progress toward resolving user issues, but it also triggered more clarification requests and more user frustration. Especially on the hardest cases, the friction got worse.
My best explanation: without retrieval gating, there's no mechanism to decide which knowledge is relevant right now — the agent applied memories indiscriminately, which made it more ambitious but also more presumptuous. Though I should note that the increased prompt length itself may also play a role. I was changing two things at once (removing the relevance filter and increasing prompt size), so I can't fully isolate the cause.
The lesson: The retrieval gate isn't just a performance optimization — it's a relevance filter. Baking everything into the prompt is like giving someone a briefing document for every customer before every conversation. Some of it helps. A lot of it creates false confidence.
Attempt 6: Closing the loop with utility scoring
The last major thing I explored was making the memory system learn from its own performance. Inspired by the MemRL paper (which applies reinforcement learning ideas to agent memory), I experimented with adding utility scores to memories.
The idea: after each conversation, look at which memories were retrieved and what the outcome was. If retrieved memories correlated with good outcomes, increase their utility score. If not, decrease it. Over time, the retrieval system favors memories that actually help, not just memories that look relevant.
I implemented this as a two-phase retrieval:
- Phase A — standard semantic search to find candidate memories
- Phase B — re-rank candidates by a blend of similarity and utility score
The Q-value update uses a simple Bellman-style rule with a conservative learning rate, so a single bad outcome doesn't tank a previously good memory.
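The re-rank and update steps can be sketched as follows. This is a simplified version of what I described, not MemRL's exact formulation: `alpha` blends similarity against learned utility, and the update moves each utility a small step toward the observed reward so a single bad session can't tank a good memory.

```python
# Two-phase retrieval, phase B: utility-aware re-ranking, plus a
# conservative utility update after each session outcome.

utilities = {}  # memory_id -> learned utility score

def rerank(candidates, alpha=0.7):
    """candidates: (memory_id, similarity) pairs from semantic search.
    Blend similarity with learned utility; return ids best-first."""
    scored = [(alpha * sim + (1 - alpha) * utilities.get(mid, 0.0), mid)
              for mid, sim in candidates]
    return [mid for _, mid in sorted(scored, reverse=True)]

def update_utility(memory_id, reward, lr=0.1):
    """Move utility a small step toward the observed outcome
    (reward in [0, 1]); the low learning rate keeps it conservative."""
    q = utilities.get(memory_id, 0.0)
    utilities[memory_id] = q + lr * (reward - q)
```

With `lr=0.1`, a memory needs a sustained run of bad outcomes before its ranking meaningfully drops — which is the behavior you want when any single outcome is noisy evidence.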
This is conceptually elegant and I believe directionally correct. But I'll be honest: I haven't yet run this long enough in a production-like setting to report confident results on whether the utility loop meaningfully improves things over time. The credit assignment problem — knowing which memory caused the outcome — is real and unsolved in my implementation. A session might fail for reasons totally unrelated to the retrieved memory, but the memory's score still takes the hit.
The lesson: Closed-loop memory learning is the right direction, but credit assignment is genuinely hard. Correlation between "memory was present" and "session went well" is not causation. This is where I need to go deeper, and it's a big part of what I'll cover next.
What I'd tell someone starting this journey
Start with retrieval quality, not memory quality. Your first bottleneck will be injecting memory at the wrong time, not having the wrong memories.
Instrument everything. Track which retrieval backend answered, what the similarity scores were, whether memory was injected or gated, and what the outcome was. You will debug more retrieval issues than memory issues.
Calibrate your retrieval threshold offline. Don't guess. Define a target hit rate and search for the threshold that achieves it.
Distillation is a pipeline, not a prompt. Dedup, cluster, refine, and curate your memories through multiple stages.
Evaluate on hard cases separately. Overall metrics will make you optimistic. Hard-case metrics will keep you honest.
Don't skip the "no memory" baseline. Memory's value is comparative. Without a clean baseline on the same eval set, you're measuring nothing.
The retrieval gate is a feature, not a limitation. It prevents the agent from applying stale or irrelevant knowledge. Removing it (prompt-only grounding) can make things worse.
What I'm exploring next
There's a lot I still need to learn. The big open questions for me:
- Credit assignment — how to attribute outcomes to specific memories rather than just correlating retrieval with results
- Memory representations — natural language chunks vs. structured/graph-based approaches
- Memory interference — when injected memories cause the agent to ignore important context in the current conversation
- Scaling — how memory quality degrades as the memory bank grows from dozens to thousands of entries
- Theoretical grounding — connecting this work to the cognitive science of how memory actually works (complementary learning systems, consolidation, forgetting)
I'll write more once I've dug into these and have something worth sharing.
If you're working on memory for agents and have run into similar (or different) walls, I'd love to hear about it.