
Todd Hendricks


"Memory adherence is a systems problem. So which model lets you build the system?"

https://dev.to/krupali_gadhiy left a comment that felt like a good way to end this series: they asked whether I could notice a difference between the hyperscale frontier models. This post breaks down where I am, the carve-outs, and the direction.

There are differences, but they are not really about the model. They are about how much of the turn each family lets you control. If adherence is a systems problem, then the thing that actually decides write reliability is which family hands you enough control surface to build the system in the first place.

Treat this as a field report, not a benchmark, and an early one. I have not pushed every family equally hard, and that matters for how much weight the comparison can carry.

Claude is where I have gone the deepest

This is the family I have spent the most time with, so weigh the claim with that in mind. What I can say is that Claude gives you a full ladder, from shallow to deep, and you can stop at whatever rung the job needs.

The shallow rung is nudges: the settings page and system instructions. This is the weakest form, and it is the one that decays, the same "tell it harder" failure I wrote about last time. It helps a little, and it does not hold.

Above that are skills and hooks. Hooks are the first rung where adherence stops being a request and becomes an event: something fires at the start of the turn, at write time, at the end, whether the model felt like it or not. They work. They can also be finicky to get right.
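To make "adherence becomes an event" concrete, here is a minimal sketch of a harness that fires handlers at fixed points in the turn. The event names, the `HookRegistry` class, and `run_turn_with_hooks` are all hypothetical and are not any vendor's hook API; the stand-in model is a plain function. The only point is that the harness calls the handlers, so they run whether or not the model cooperates.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical event names for illustration; not any vendor's real hook names.
TURN_START, WRITE_TIME, TURN_END = "turn_start", "write_time", "turn_end"

Handler = Callable[[dict], None]


class HookRegistry:
    """Maps lifecycle events to handlers that the harness always runs."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def on(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def fire(self, event: str, ctx: dict) -> None:
        # Called by the harness, not by the model, so it cannot be skipped.
        for handler in self._handlers[event]:
            handler(ctx)


def run_turn_with_hooks(hooks: HookRegistry, prompt: str,
                        model: Callable[[str], str]) -> dict:
    ctx = {"prompt": prompt, "log": []}
    hooks.fire(TURN_START, ctx)             # e.g. read from the store
    ctx["response"] = model(ctx["prompt"])  # the only non-deterministic step
    hooks.fire(WRITE_TIME, ctx)             # e.g. persist what was learned
    hooks.fire(TURN_END, ctx)               # e.g. audit the turn
    return ctx


hooks = HookRegistry()
hooks.on(TURN_START, lambda ctx: ctx["log"].append("read memory"))
hooks.on(WRITE_TIME, lambda ctx: ctx["log"].append("write memory"))
hooks.on(TURN_END, lambda ctx: ctx["log"].append("audit turn"))

# A stand-in model that knows nothing about memory; the hooks still fire.
result = run_turn_with_hooks(hooks, "hello", lambda p: p.upper())
print(result["log"])
```

The finicky part in practice is usually deciding what each handler receives and when; this sketch sidesteps that by passing one shared `ctx` dict.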

The deep rung is the SDK. With it, you get targeted control of the model's turn, the whole prompt-to-response lifecycle, beginning to end. This is where the read, reason, respond, write wiring carries actual weight. Nothing else I tested hands you the full lifecycle like this.
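As a rough illustration of what owning the full lifecycle buys you, the sketch below wires read, reason, respond, and write into one function that the harness controls. It does not use any real SDK: `MemoryStore`, `run_turn`, `fake_model`, and the key-matching retrieval are assumptions made up for the example, and the validation check is deliberately trivial.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class MemoryStore:
    """In-memory stand-in for an external store (hypothetical)."""
    facts: Dict[str, str] = field(default_factory=dict)

    def read(self, prompt: str) -> Dict[str, str]:
        # Naive retrieval: return facts whose key appears in the prompt.
        return {k: v for k, v in self.facts.items() if k in prompt.lower()}

    def write(self, key: str, value: str) -> None:
        self.facts[key] = value


# A model takes (prompt, recalled facts) and returns (response, proposed writes).
Model = Callable[[str, Dict[str, str]], Tuple[str, List[Tuple[str, str]]]]


def run_turn(store: MemoryStore, model: Model, prompt: str) -> Tuple[str, List[str]]:
    recalled = store.read(prompt)                 # read: always happens
    response, proposed = model(prompt, recalled)  # reason + respond
    written = []
    for key, value in proposed:                   # write: gated by code
        if key and value.strip():                 # trivial validation gate
            store.write(key, value)
            written.append(key)
    return response, written


def fake_model(prompt: str, recalled: Dict[str, str]):
    # Stand-in for a real model call.
    if "editor" in recalled:
        return f"You use {recalled['editor']}.", []
    return "Noted.", [("editor", "vim"), ("", "dropped: empty key")]


store = MemoryStore()
first = run_turn(store, fake_model, "My editor is vim")
second = run_turn(store, fake_model, "Which editor do I use?")
print(first, second)
```

The model only proposes writes here; whether they land is decided by ordinary code, which is what makes the write step deterministic.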

The point is not that any single rung is magic. It is that Claude lets you go as deep as the reliability you need, and the deeper you go, the more deterministic the write becomes.

ChatGPT and Codex are close, through AGENTS.md

Codex was a lot easier than I expected, because its instruction set is the AGENTS.md file. That one anchor did most of the work, and the net result was only slightly less consistent than Claude.

Honest caveat: I have not dug into the Codex SDK yet. So the ceiling there is probably higher than what I have measured, and the real gap to Claude may be smaller than it looks right now.

Gemini and Grok lean on their own memory

The Gemini and Grok CLIs are a different story. They seem to want to do the least amount of work possible, and they lean on their own KV and incoming user prompts more than they reach for an outside store. You end up working against a default that would rather not call your database at all.

The exception is WebSearch, which comes back noisy and minimal. It gets actively counterproductive when it surfaces strongly correlated evidence that diverges from your store, because the model then over-trusts the search and skips the store it should have read. Getting real adherence out of those two means digging into their architectures to optimize for it, and that is still on my horizon, not something I have solved.

Why this is a field report and not a benchmark yet

Building SENTINEL, a real benchmarking suite for this, is one of my higher priorities right now. The honest problem is that most memory systems do not expose the turn the way you would need to in order to measure whether a write happened at the right moment, so you cannot make them play the same game. The way I keep it fair is to grade what each system's outputs actually demonstrate, on a capability ladder, rather than the levers it was never built to expose. A system that does not surface something scores lower on that axis without being punished for it, and I say plainly which axis is push and which is pull, so a gap reads as a different shape, not a worse one. The full design is its own post.
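SENTINEL's actual design is not described in this post, so the following is only one possible reading of "scores lower on that axis without being punished for it". Each axis gets a rung on its own ladder and a push or pull label; an axis the system does not expose sits at rung 0 and is flagged, but is left out of the summary figure. The axis names, their push/pull assignment, the ladder heights, and the functions `grade` and `mean_over_exposed` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Axis:
    name: str
    mode: str      # "push" or "pull"
    max_rung: int  # top of the ladder for this axis


def grade(axes: List[Axis], observed: Dict[str, Optional[int]]) -> Dict[str, dict]:
    """Build a per-axis profile; None means the system does not expose the axis."""
    report = {}
    for axis in axes:
        rung = observed.get(axis.name)
        exposed = rung is not None
        report[axis.name] = {
            "mode": axis.mode,
            "exposed": exposed,
            "rung": min(rung, axis.max_rung) if exposed else 0,
            "max_rung": axis.max_rung,
        }
    return report


def mean_over_exposed(report: Dict[str, dict]) -> Optional[float]:
    """Average ladder fraction over exposed axes only, so gaps are not penalized."""
    fractions = [r["rung"] / r["max_rung"] for r in report.values() if r["exposed"]]
    return sum(fractions) / len(fractions) if fractions else None


# Hypothetical axes and observations for a system that hides write timing.
axes = [Axis("write_timing", "push", 4), Axis("recall_quality", "pull", 4)]
report = grade(axes, {"write_timing": None, "recall_quality": 3})
print(report["write_timing"], mean_over_exposed(report))
```

Reporting the profile alongside the summary keeps the gap visible as a different shape instead of folding it into a single lower number.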

A benchmark value only means something if there is an aggregate baseline and a calibrated anchor. A bench that only your system can run is not a benchmark; it is a demo. It also risks attributing the result to the wrong layer: blaming the store when the model just queried it badly, or crediting the system when the model did the work. So the first job is devising ways to avoid bias: cold-start cutoffs, normalizations that balance the different ways correlated axes report their values, scale parity, outlier skew, and correct attribution. Only then does a leaderboard's "I came, I saw, I conquered" mean something.
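For scale parity and outlier skew, one common option (not necessarily what SENTINEL uses) is to rescale each axis against its baseline with the median and the median absolute deviation, which a single extreme baseline run cannot inflate. The function name `robust_scale` and the sample numbers are made up for the illustration.

```python
from statistics import median
from typing import List, Sequence


def robust_scale(values: Sequence[float], baseline: Sequence[float]) -> List[float]:
    """Express values as deviations from the baseline median, in MAD units."""
    center = median(baseline)
    mad = median([abs(x - center) for x in baseline])
    if mad == 0:
        raise ValueError("baseline has no spread; cannot scale")
    return [(v - center) / mad for v in values]


# Made-up baselines on two axes with different units and one outlier (900 ms).
latency_ms_baseline = [100, 110, 120, 130, 900]
accuracy_pct_baseline = [50, 55, 60, 65, 70]

latency_score = robust_scale([140], latency_ms_baseline)
accuracy_score = robust_scale([70], accuracy_pct_baseline)
print(latency_score, accuracy_score)
```

Both results land on the same unitless scale even though one axis is in milliseconds and the other in percent. Note that the sign still has to be interpreted per axis: higher latency is worse, higher accuracy is better.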

It feels harder to build an honest comparison between separate systems with similar functions than to engineer the thing being compared.

The takeaway

The model is not the variable that matters most for write reliability. The control surface is. The deterministic parts of a good memory system, the validation gate and the demotion math, hold the same on any of these. What changes across families is how much of the turn you are allowed to act on, and how much of the input-output mechanics you can instrument. So far, Claude exposed the whole lifecycle, Codex gave a strong instruction anchor through AGENTS.md, and Gemini and Grok leaned hardest on their own context. I went deepest on Claude's hooks and SDK, have only scratched the surface on Codex with its SDK up next, and have barely pushed Gemini or Grok past their defaults. So this is a subjective verdict on which family is most controllable. Making that comparison fair, with equal effort and correct attribution, is exactly what SENTINEL is for.

So "which model writes more reliably" turns out to be the wrong question. The real one is which model lets you build the most system around the userprompt_init, read, write, and turn_end mechanics. That is the same point the commenter started with, followed all the way down.

github.com/H-XX-D/recall-memory-substrate
