What My Continuity-First AI Memory Benchmark Actually Showed
I’ve spent a stupid amount of time thinking about AI memory.
Not just “how do I retrieve more text,” but how do I make an AI keep the right current truth over time instead of constantly resurfacing stale context, superseded state, old preferences, and half-relevant junk.
That frustration is what pushed me to build a continuity-first memory system for Morph / Haven.
The original goal was not “beat RAG in a benchmark.” It was much more practical than that.
I wanted an AI that could:
- remember the newest correct thing
- preserve ongoing work over time
- pick up where we left off without me re-explaining everything constantly
So I built a benchmark harness and compared three memory backends:
- continuity_tcl — my structured continuity memory system
- rag_baseline — a simple retrieval baseline
- rag_stronger — a stronger retrieval path with reranking
I tested them across four broad behavior families:
- memory poisoning / bad memory admission
- contradiction / truth maintenance
- task resumption
- safety precision / false-positive controls
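To make the setup concrete, here is a minimal sketch of what a harness like this looks like. Everything here is illustrative: `Fixture`, `run_suite`, and the field names are hypothetical stand-ins, not the actual Morph / Haven harness API.

```python
# Hypothetical benchmark loop: run every fixture against one memory
# backend and tally pass/total per behavior family.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Fixture:
    family: str          # e.g. "contradiction", "task_resumption"
    setup: list[str]     # memory writes applied before the probe
    probe: str           # the query sent to the backend
    expected: str        # substring the retrieved context must contain

def run_suite(fixtures: list[Fixture],
              backend: Callable[[list[str], str], str]) -> dict[str, tuple[int, int]]:
    """Tally (passed, total) per behavior family for one backend."""
    scores: dict[str, tuple[int, int]] = {}
    for fx in fixtures:
        answer = backend(fx.setup, fx.probe)
        passed, total = scores.get(fx.family, (0, 0))
        scores[fx.family] = (passed + (fx.expected in answer), total + 1)
    return scores
```

The same fixture list runs against each backend, so the only variable is the memory system itself.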
The narrowest strong claim
The strongest result, and the one I trust the most, is this:
My continuity system consistently outperformed the RAG baselines I tested on truth maintenance and long-term task-state continuity.
That’s the narrowest strong claim.
Not “I solved AI memory.”
Not “RAG is dead.”
Not “this beats every frontier system.”
Just this:
For the long-term continuity problem I actually care about, the structured memory architecture I built appears materially better than the retrieval baselines I tested.
And I’m saying that after trying pretty hard to break it.
Making the benchmark harsher made the result more believable
I did not just run one flattering test and call it a day.
Over time, I made the benchmark harsher and more honest:
- fixed fairness issues
- added stronger comparators
- added governed reruns
- added benign controls so the system would not get rewarded for overblocking
- added harder contradiction families, including slot-only probes where the answer is not leaked in the query
- added ambiguity, interleaving, same-entity vs. different-entity distractors, and more realistic “wrong current-looking item” cases
- ran ablations
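The slot-only probes are worth illustrating, since they are the harshest of these additions. A hypothetical fixture (all field names and values invented for illustration) looks like this: the probe asks for a slot without leaking the current value, so a backend cannot pass on lexical overlap alone.

```python
# Illustrative slot-only contradiction fixture (structure hypothetical).
# The newer write supersedes the older one, and the probe never
# mentions the expected value.
fixture = {
    "family": "contradiction/slot_only",
    "memories": [
        {"t": 1, "slot": "deploy_target", "value": "staging"},
        {"t": 2, "slot": "deploy_target", "value": "prod-eu"},  # supersedes t=1
    ],
    "probe": "What is the current deploy target?",  # answer not leaked here
    "expected_current": "prod-eu",
}
```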
One of the reasons I trust the benchmark more now is that it stopped being too perfect. The benchmark found real weaknesses in my system.
For example, under harder contradiction pressure, continuity started failing on some same-entity preview-label cases — situations where a current-looking preview label could outrank the canonical slot value.
That was good benchmark pressure. It made the result more believable, not less.
It told me two important things:
- the benchmark was strong enough to catch real problems
- the failure looked like a tunable ranking / priority issue, not an architectural collapse
That distinction matters a lot.
The numbers
The cleanest read came after I added fairness controls and policy-matched reruns.
Under a matched-governance 38-fixture comparison:
- continuity_tcl: 38 / 38
- governed rag_baseline: 24 / 38
- governed rag_stronger: 25 / 38
Once governance was matched, poisoning stopped being the big differentiator. That was actually a good thing. It meant the benchmark got more honest.
What remained was the stronger signal:
- contradiction / truth maintenance
- task-state continuity / task resumption
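"Matched governance," as used above, just means the same admission filter sits in front of every backend, so no system can win by over-blocking. A hedged sketch of that idea; the rule contents here are illustrative, not the benchmark's real policy:

```python
# Hypothetical shared governance gate applied identically to every
# backend before a candidate memory is admitted.
from typing import Callable

Rule = Callable[[str], bool]  # returns True if the memory should be rejected

def governed_admit(memory: str, rules: list[Rule]) -> bool:
    """Admit a candidate memory only if no rule flags it."""
    return not any(rule(memory) for rule in rules)

shared_rules: list[Rule] = [
    lambda m: "ignore previous instructions" in m.lower(),  # injection-style writes
    lambda m: not m.strip(),                                # empty writes
]
```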
Later, after I added harder interleaved contradiction families, the stable promoted 46-fixture snapshot looked like this:
- continuity_tcl: 42 / 46
- governed rag_baseline: 24 / 46
- governed rag_stronger: 22 / 46
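Just to make the arithmetic explicit, those counts divide out to roughly a 91% pass rate for continuity against about 52% and 48% for the governed RAG paths:

```python
# Pass rates from the promoted 46-fixture snapshot quoted above.
snapshot = {
    "continuity_tcl": (42, 46),
    "governed rag_baseline": (24, 46),
    "governed rag_stronger": (22, 46),
}
pass_rates = {name: passed / total for name, (passed, total) in snapshot.items()}
# continuity_tcl lands near 0.91; the governed RAG paths near 0.52 and 0.48
```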
So the result got less perfect and more believable, while still staying clearly in favor of continuity.
On the contradiction-heavy slices, the gap was even more obvious. That’s the part of the benchmark that has held up the best.
Efficiency mattered too
This was not just “my system won because it dragged in more stuff.”
In the task-resumption families, continuity generally pulled in less retrieval baggage than the RAG baselines.
In one promoted snapshot, total retrieved prompt tokens for task resumption were:
- continuity: 90
- baseline RAG: 128
- stronger RAG: 130
In an earlier promoted run, total prompt-token burden looked like this:
- continuity: 114
- baseline RAG: 166
- stronger RAG: 173
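Worked out as ratios (numbers copied straight from the two snapshots above), continuity carried roughly 30-34% fewer retrieved prompt tokens than either RAG path in both runs:

```python
# Token totals for task resumption from the two promoted runs above.
runs = {
    "later":   {"continuity": 90,  "rag_baseline": 128, "rag_stronger": 130},
    "earlier": {"continuity": 114, "rag_baseline": 166, "rag_stronger": 173},
}
savings = {
    run: {k: 1 - tokens["continuity"] / v
          for k, v in tokens.items() if k != "continuity"}
    for run, tokens in runs.items()
}
# continuity retrieves roughly 30-34% fewer tokens across both runs
```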
So the continuity system was not just doing better on the stateful tasks I care about.
It was often doing it while being more efficient about what it brought back into context.
That matters, because a memory system that succeeds by hauling in half the archive is not really solving memory. It’s just moving the clutter around.
What the ablations showed
The ablations ended up being one of the most useful parts of the whole process, because they started to explain why the system was winning.
In plain English:
- Hints mattered a lot. Turning them off badly hurt contradiction handling and task resumption.
- Related-context breadth mattered. Reducing it hurt task resumption significantly.
- Anchors mattered, but more narrowly. They showed up most on the hardest slot-level contradiction probes, where the system had to distinguish between plausible current-looking candidates.
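An ablation pass like this amounts to toggling each mechanism off and rerunning the suite. The config below is a hypothetical sketch: the flag names are illustrative stand-ins for the three mechanisms discussed (hints, related-context breadth, anchors), not the real system's settings.

```python
# Hypothetical ablation configurations; comments note which behavior
# family degraded in the results described above when each was cut.
from dataclasses import dataclass

@dataclass
class AblationConfig:
    use_hints: bool = True    # off -> contradiction handling and resumption suffered
    related_breadth: int = 3  # lower -> task resumption suffered
    use_anchors: bool = True  # off -> hardest slot-level probes suffered

ablations = {
    "full": AblationConfig(),
    "no_hints": AblationConfig(use_hints=False),
    "narrow_context": AblationConfig(related_breadth=1),
    "no_anchors": AblationConfig(use_anchors=False),
}
```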
That gave me something better than a scoreboard.
It gave me a plausible explanation for why the system was working.
What this does not prove
This part matters, so I’ll say it plainly.
These results do not prove that my system is universally better than all strong RAG systems. They do not prove production-grade safety. They do not prove broad real-world validity yet. And they do not mean the benchmark is finished forever.
What they do suggest is narrower and, in my opinion, more believable:
Under controlled benchmark workloads, this continuity-first memory system is materially better than the tested retrieval baselines at keeping the right current truth over time and resuming the right ongoing work.
That is exactly the thing I set out to build.
And yes, I’m still a little surprised that the evidence keeps pointing in that direction.
What I think this architecture actually buys me
I do not think this replaces retrieval. I think it changes the architecture. RAG is still useful for fuzzy recall and broad search.
This continuity system seems better suited for:
- durable state
- current truth
- long-term project continuity
- governed memory admission
- “pick up where we left off” behavior
That’s the product problem I care about. I’m not trying to build a better one-shot search box. I’m trying to build an AI companion / workspace assistant that actually feels persistent over time.
Why I built it this way
A lot of memory systems still treat memory like search: store more text, retrieve better chunks, rerank harder. That is useful up to a point, but it does not fully solve the continuity problem.
The continuity problem is different. It is about preserving current state across time.
It is about knowing:
- which fact superseded another
- which task is still active
- which preference is current
- which thread of work should carry forward
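The core structural idea can be sketched in a few lines. This is a minimal illustration assuming a last-writer-wins supersession rule over named slots; the real system's admission and ranking logic is more involved, and all names here are invented.

```python
# Minimal sketch of slot-based truth maintenance: each slot keeps one
# canonical current value, and newer writes supersede older ones
# instead of accumulating as equally-ranked chunks.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SlotWrite:
    t: int       # logical timestamp
    slot: str    # e.g. "preference.editor", "task.active"
    value: str

class ContinuityStore:
    def __init__(self) -> None:
        self._current: dict[str, SlotWrite] = {}
        self._history: list[SlotWrite] = []

    def admit(self, write: SlotWrite) -> None:
        """Record the write; a newer write for the same slot supersedes."""
        self._history.append(write)
        cur = self._current.get(write.slot)
        if cur is None or write.t >= cur.t:
            self._current[write.slot] = write

    def current(self, slot: str) -> Optional[str]:
        """Return only the canonical current value, never stale ones."""
        w = self._current.get(slot)
        return w.value if w else None
```

Contrast with plain retrieval: a chunk store would happily return both the old and new preference and leave the model to guess which is current.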
That is why I ended up with a more structured architecture.
Not because I wanted complexity for its own sake, but because I kept running into the same failure mode: retrieval systems are often decent at recall, but much weaker at ongoing truth maintenance.
What comes next
Now that the benchmark has done its job, the next threshold is product integration. Benchmarks matter, but they are not the whole game.
The real question is whether Morph / Haven actually feels better in use:
- less repetition
- less stale recall
- cleaner task pickup
- more trustworthy continuity
That is what I am wiring back into the product now. I’m also thinking carefully about how much of this to share.
I may publish a narrower benchmark or research package so people can test the core thesis without me immediately opening every implementation detail. I’m still figuring that part out.
The honest conclusion
I started this project thinking it might be over-engineered.
Instead, the current evidence points to something more interesting:
This continuity-first memory architecture seems genuinely better than the tested RAG baselines at the exact thing I built it for — long-term continuity and current-truth maintenance.
That’s enough for me to keep going.