When Four Memory Systems Hit the Same Wall

#ai #architecture #knowledgemanagement #toolsforthought

I built a knowledge graph out of my own work sessions. Hundreds of them — transcripts of me building a system with LLMs, extracted into concepts, decisions, findings, and the edges between them. For a while it felt like the thing was working. I'd query it, get back a clean structured answer, and move on.

Then I ran a foreign model against it. I gave a different model my concept definitions and asked it to reconstruct the system, both the vocabulary and the relationships. It recovered 97.7% of the words. It recovered 61.1% of the structure.

That gap of nearly 37 points was the first time I could see the problem instead of just living inside it. The vocabulary transferred because the definitions were written carefully. The edges didn't, because the edges were the part I'd let the extraction handle. And the whole time, querying the graph had felt complete. The structure came back typed, connected, and confident-looking, so I stopped looking. I started calling it premature retrieval closure: the retrieval returns something shaped like a whole answer, which is exactly why I didn't notice the parts that were missing.
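A minimal sketch of how a gap like this could be measured, assuming both scores are plain set-overlap recall: the share of known concepts, and of known edges, that the second model's reconstruction contains. The `recall` helper and the toy data are hypothetical illustrations, not the real graph or the scoring actually used.

```python
def recall(expected: set, recovered: set) -> float:
    """Fraction of expected items that the reconstruction recovered."""
    if not expected:
        return 1.0
    return len(expected & recovered) / len(expected)


# Hypothetical toy data standing in for the real graph.
true_concepts = {"session", "concept", "decision", "finding"}
true_edges = {
    ("session", "produces", "decision"),
    ("decision", "supports", "finding"),
    ("concept", "defines", "finding"),
}

# What a second model rebuilt from the concept definitions alone.
rebuilt_concepts = {"session", "concept", "decision", "finding"}
rebuilt_edges = {("session", "produces", "decision")}

vocab_recall = recall(true_concepts, rebuilt_concepts)
structure_recall = recall(true_edges, rebuilt_edges)
gap_points = (vocab_recall - structure_recall) * 100

print(f"vocabulary {vocab_recall:.1%}, structure {structure_recall:.1%}, "
      f"gap {gap_points:.1f} points")
```

Scoring the two layers separately is the point: a single blended score would let strong vocabulary recall mask weak structural recall.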

Part 10 of Building at the Edges of LLM Tooling. If you're running a long-term project through an LLM-backed memory system (anything that turns raw sessions into structured, persistent memory), this is about the step where the structure starts lying about how complete it is. Start here.

Why It Breaks

Every memory system of this kind does the same move. An LLM reads raw interaction (a conversation, a document, a session log) and lifts structured memory out of it: entities, facts, rules, summaries. That structured memory becomes the thing the agent reads later, instead of the raw record.

The lift is where fidelity goes. Pulling clean structure out of messy text means making decisions the text didn't make explicit: which entity this pronoun refers to, whether a relationship is real or inferred, what to keep and what to drop. Those decisions can be wrong, and when they are, the error gets stored as structure: typed, indexed, ready to retrieve.

What makes it hard to catch is the second half. The output of extraction looks authoritative. A typed entity with a confidence score and three edges reads as more trustworthy than the paragraph it came from, not less. The polish is real even when the fidelity isn't. So the gap between what was said and what got stored becomes invisible at exactly the point where I'd want to catch it — when I query and get a confident answer back.

That's the wall. Not "extraction is hard," which is hardly news. The wall is that extracted structure manufactures a feeling of completeness the underlying extraction can't actually back.

What I Tried, and Where I Saw It Everywhere

My own fix wasn't better extraction. It was demotion. I stopped letting the extracted graph be the source of truth and kept it as derived, secondary evidence. The raw session record stays primary. The graph is allowed to be wrong, because nothing answers from it as ground truth. That's a design choice I made to keep the structure from being load-bearing; I can't claim it makes the extraction itself any more faithful.
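One way to picture the demotion, as a sketch under stated assumptions rather than the actual implementation: raw sessions are the primary store, edges are derived and carry the ids of the sessions they were lifted from, and every answer returns the raw text alongside the edge. The `Edge` and `Memory` names and fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str
    target: str
    evidence: tuple[str, ...] = ()  # ids of raw sessions this edge was lifted from


class Memory:
    def __init__(self, sessions: dict[str, str]):
        self.sessions = sessions      # primary: raw session text, never rewritten
        self.edges: list[Edge] = []   # derived: rebuildable, allowed to be wrong

    def add_edge(self, edge: Edge) -> None:
        self.edges.append(edge)

    def answer(self, source: str, relation: str) -> list[tuple[Edge, list[str]]]:
        """Return matching edges together with the raw text that backs them."""
        results = []
        for edge in self.edges:
            if edge.source == source and edge.relation == relation:
                raw = [self.sessions[s] for s in edge.evidence if s in self.sessions]
                results.append((edge, raw))
        return results


memory = Memory({"s1": "We decided the raw log stays primary."})
memory.add_edge(Edge("raw log", "is", "primary", evidence=("s1",)))
memory.add_edge(Edge("raw log", "is", "indexed", evidence=("s9",)))  # s9 not in store

for edge, raw in memory.answer("raw log", "is"):
    status = "backed by raw text" if raw else "NO raw support in store"
    print(edge.target, "->", status)
```

An edge that comes back with an empty evidence list is still returned, but it is visibly unsupported instead of looking as finished as its neighbours.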

What I didn't expect was how much company I had. Once I knew the shape of the problem, I started seeing it in projects far more serious and better-resourced than mine, each acknowledging the difficulty somewhere and each treating the extraction step itself as the part that's basically handled.

Letta, the project that grew out of the MemGPT research, handles long conversations with compaction: when the context outgrows the window, it summarizes the oldest messages to make room. The raw messages stay in a database; the summary is what the agent sees. Their own issue tracker is the most honest documentation of what that costs. One open issue describes compaction running twice and summarizing an already-summarized history: what it calls "lossy compression on lossy compression." Another describes a mode that silently wipes the full history and leaves only the summary. The summarizer prompt itself pleads with the model to "preserve identifiers verbatim." The machinery is built by people who clearly know where the loss lives.

CASS Memory System, by Jeffrey Emanuel, names the problem as a reason for its design. Its documentation says plainly that "naive summarization loses critical nuances and details," and its response is to remove the LLM from the final merge step entirely and hand that step to a deterministic curator, so the model can't drift as it refines its own output. That's a careful defense against iterative drift. But the step that feeds the curator is still an LLM reading a session and writing rules, and the validator checks whether a rule was useful elsewhere, not whether it faithfully reflects the session it came from. One issue notes a real playbook where 99% of rules sat unvalidated.

Volodymyr Pavlyshyn's agentic-memory architecture, built on the LadybugDB graph database, lifts raw conversation up four layers: entities, then facts, then events, then memories. He builds a certainty score into every extracted fact (stated, implied, inferred, speculative) and flags "extraction errors" as something detectable in the graph's shape. The mechanics of the pipeline are specified in detail. The conceptual chapter on the extraction process itself is, in his own design doc, still an unwritten checklist, which reads to me as an honest marker of where the settled part ends.
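A per-fact certainty label of this kind can be sketched as an ordered enum. This is an illustration only, not Pavlyshyn's schema or the LadybugDB API: the `Certainty`, `Fact`, and `trusted` names are hypothetical, and the ranking (stated above implied above inferred above speculative) is an assumption read off the order in which the levels are listed.

```python
from dataclasses import dataclass
from enum import IntEnum


class Certainty(IntEnum):
    # Assumed ordering: higher value means closer to what was literally said.
    SPECULATIVE = 0
    INFERRED = 1
    IMPLIED = 2
    STATED = 3


@dataclass(frozen=True)
class Fact:
    text: str
    certainty: Certainty


def trusted(facts: list[Fact], floor: Certainty = Certainty.IMPLIED) -> list[Fact]:
    """Keep only facts at or above the given certainty floor."""
    return [f for f in facts if f.certainty >= floor]


facts = [
    Fact("The user prefers the raw log as primary.", Certainty.STATED),
    Fact("The user distrusts summaries in general.", Certainty.SPECULATIVE),
    Fact("The graph is treated as secondary.", Certainty.IMPLIED),
]

for fact in trusted(facts):
    print(fact.certainty.name, "-", fact.text)
```

The label is itself produced by the extraction step, so it records how confident the extractor was, not whether the extractor was right.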

Hyperspell sells a memory layer that indexes your connected accounts into a "memory graph," extracting the important people, projects, and facts automatically. On the marketing surface, extraction is the easy part — plug it in and it works. But the API tells a quieter story: it includes conflict detection, staleness checks, three-way merges, and an endpoint for scoring results so they improve over time. That machinery only needs to exist if the extracted memory drifts from the source. The correction layer is the admission the landing page doesn't make.

What It Revealed

Four architectures, four teams, the same shape. None of them is naive about memory: Letta documents its losses in public, CASS engineers around drift, Pavlyshyn scores his own uncertainty, Hyperspell ships a correction stack. The difficulty is acknowledged in all four. It's just acknowledged next to the extraction step rather than at it. The actual unstructured-to-structured transform (the summarization, the rule-writing, the entity-lifting, the indexing) is the part each system treats as solved enough to build on.

And the thing they build on top is the thing that hides the gap. A merged graph, a scored fact, a distilled rule, a recursive summary — each is a structure that looks complete. The more sophisticated the structure, the more convincing the completeness, and the less anyone re-checks it against what was actually said. My 61.1% was the same wall with a number attached: the structural layer was where my system was weakest, and it was also the layer that had felt most done.

The Reusable Rule

If you're running a long-term project through an LLM-backed memory system, find out one thing: what does your agent treat as the source of truth?

The diagnostic: when your system answers from memory, is it reading the thing that happened, or a confident summary of the thing that happened? Trace one answer back. If it bottoms out in the raw record, the structure is a convenience. If it bottoms out in extracted structure with no path back to the source, the polish is doing work the fidelity hasn't earned.
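The trace-back can be sketched as a walk over provenance links, assuming the system records what each answer, summary, or fact was derived from. The `derived_from` map, the node ids, and the function name are all hypothetical; a real system would need its own way of exposing this lineage.

```python
def bottoms_out_in_raw(node_id, derived_from, raw_ids, _seen=None):
    """True if some provenance path from node_id reaches a raw record."""
    seen = set() if _seen is None else _seen
    if node_id in raw_ids:
        return True
    if node_id in seen:  # already visited: a loop of summaries citing each other
        return False
    seen.add(node_id)
    parents = derived_from.get(node_id, ())
    return any(bottoms_out_in_raw(p, derived_from, raw_ids, seen) for p in parents)


# Hypothetical lineage: each key was derived from the ids in its list.
derived_from = {
    "answer": ["summary-2"],
    "summary-2": ["summary-1"],
    "summary-1": ["session-12"],
    "orphan-fact": ["summary-x"],
    "summary-x": ["orphan-fact"],  # circular: never reaches a session
}
raw_ids = {"session-12"}

print(bottoms_out_in_raw("answer", derived_from, raw_ids))
print(bottoms_out_in_raw("orphan-fact", derived_from, raw_ids))
```

A node with no recorded parents is treated as a dead end, which is the case the diagnostic is meant to surface: structure with no path back to the source.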

Extracted structure looks more complete than it is. Build so that the thing it's a summary of is still there to check.

Update (June 14, 2026): I tested the diagnostic this post ends with against my own system. Using an off-the-shelf faithfulness metric (RAGAS) with my manual verdicts as labels, claim-vs-source checking reproduced my hand rankings cleanly, at about three cents per claim. Then I ran it over a sample of my own knowledge graph: 37% of edges were fully supported by the store's own content; 27% had no in-store support at all. Those were true edges whose evidence lives in sessions the edge never references. The structure looking complete while nearly two-thirds of it (63%) can't fully justify itself is the wall this post described, measured on the system that wrote it. Verification turned out to be cheap. Evidence that travels with claims is what's missing.

DEV Community
