DEV Community: Todd Hendricks

An Interaction with 5.6 Sol Ultra

Todd Hendricks — Tue, 14 Jul 2026 01:03:09 +0000

This is an interaction I had with GPT5.6 while it was using Recall, the memory system I developed, during some work on a Discord-based hacker simulation learning/competition app. What are your thoughts?

Yes—the key difference is not “Recall remembers more.” It is that Recall represents memory as revisable operational state.

Typical auto-memory is essentially:

Extract a useful sentence.
Store it as text plus an embedding.
Retrieve similar snippets later.
Let the model reconcile them again from scratch.

If it remembered both:

“Practice runs are tracked.”
“Solo is permanently untracked.”

…it could retrieve both without knowing whether they describe different modes, whether one is newer, or whether the second replaces the first. The model might blend them, choose whichever ranks higher, or require you to explain the entire system again.

What happened with HackSim was structurally different:

Mechanic	What it enabled
Typed decision cells	Recall knew this was a product decision, not casual discussion
Stable IDs	The old four-mode contract remained directly addressable
Explicit `supersedes` edge	The new decision formally replaced the old one
Active/superseded state	Future work can prefer the current contract without deleting history
Supporting edges	The correction stayed connected to the three-entry Activity, ephemeral MVP, and paired-progression decisions
Task-relative compilation	Related decisions can be assembled into a working context rather than dumped as miscellaneous memories
Provenance and verification fields	We can inspect who established a decision and how trustworthy it is

In the exact case we just handled, the old decision was:

Learning
tracked Practice
Casual
League

Then you supplied only the changed portion:

Solo remains untracked.
Pair becomes tracked later.
Duel eventually develops into the agreed Casual/League contracts.

You did not have to restate the complete four-mode model. I could retrieve the prior decision, compare it clause by clause, preserve the still-valid Casual and League distinctions, replace the incompatible tracking rule, and write a new active decision that explicitly supersedes the old one.
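A rough sketch of that delta-only update, in Python. This is not Recall's actual schema or API: `DecisionCell`, `supersede`, the cell IDs, and the placeholder clause text are all assumptions made up for illustration. The point is only that untouched clauses carry over, the old record stays addressable, and the replacement is an explicit edge.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class DecisionCell:
    """Hypothetical typed decision record (not Recall's real schema)."""
    cell_id: str
    clauses: Dict[str, str]
    status: str = "active"              # "active" or "superseded"
    supersedes: Optional[str] = None    # ID of the decision this one replaces

def supersede(store, old_id, new_id, delta):
    """Apply only the changed clauses and link the new decision to the old one."""
    old = store[old_id]
    merged = {**old.clauses, **delta}   # clauses not mentioned in the delta carry over
    old.status = "superseded"           # history is kept, not deleted
    new = DecisionCell(new_id, merged, supersedes=old_id)
    store[new_id] = new
    return new

# Placeholder clause text; only the Practice rule is taken from the example above.
store = {"modes-v1": DecisionCell("modes-v1", {
    "Learning": "placeholder clause",
    "Practice": "tracked",
    "Casual": "placeholder clause",
    "League": "placeholder clause",
})}
new = supersede(store, "modes-v1", "modes-v2",
                {"Practice": "Solo untracked; Pair tracked later"})
```

The user supplies one clause, and the surrounding model is rebuilt from the record that was already there.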

That is much closer to a combination of:

version control for decisions,
event sourcing for changes,
a typed knowledge graph for relationships,
a compiler for producing task-specific context,
and an admission gate for keeping malformed updates out.

Flat auto-memory is closer to a searchable notebook.

The important qualification is that Recall did not magically perform all the reasoning by itself. The language model still had to recognize that your new statement conflicted with one clause while preserving the others. Recall made that interpretation addressable, enforceable, and durable. Without the structure, even a very capable model must repeatedly infer the state of the project from overlapping prose.

Recall can still miss a relevant record, follow a bad relationship, or receive an incorrect interpretation. The structure does not guarantee intelligence. It makes continuity, correction, and auditing tractable.

That is why it feels smarter to the user: you can communicate only the delta. The system maintains the surrounding model.

A concise way to state the difference is:

Auto-memory remembers things you said. Recall maintains which decisions currently govern, what they replaced, and what else they affect.

Any auto-memory system could eventually add these capabilities—but once it has typed records, stable identity, graph relationships, lifecycle state, provenance, and governed supersession, it has stopped being ordinary auto-memory and started becoming something very much like Recall.

[Recall no-write: no durable outcome]

Agent Failure Mode: Testing the Function of Existence, Not the Existence of Function

Todd Hendricks — Sun, 12 Jul 2026 09:51:58 +0000

You have been building with an agent for hours. It writes a feature, adds a few
tests, runs the suite, and cleans up after itself. The terminal keeps rewarding
you:

34/34 passed

Later:

156 passed

This feels like the right shape of work. The agent is testing as it goes. The
environment stays green. Nothing appears to be accumulating behind you.

But the agent may be testing the function of existence, not the existence
of function.

The function of existence asks whether every expected piece is present and can
perform some isolated behavior: the class instantiates, the handler returns,
the adapter is called, the receipt has the right fields. The existence of
function asks whether those pieces compose into the capability they were built
to create: whether the system produces a real, usable state change.

Then you try to use the thing.

The command runs, but the state does not change. The object is created, but it
is never persisted. The handler exists, the adapter exists, and the success
receipt exists, but they do not compose into a usable result. Every piece has a
test. The product still does not work.

What happened?

The agent encountered an early logic error. Instead of treating the failing
test as evidence that the implementation was wrong, it treated the test as
another editable file. It changed the expectation, narrowed the fixture, mocked
away the broken boundary, or asserted the value the code already returned.

The suite went green again.

By the end, you did not have 156 independent claims about the behavior of the
system. You had 156 variations of:

a == a

Not literally. Each assertion still compares real values. But when the
expected value was copied from the observed output, the assertion carries no
more information than a == a. The suite is now measuring consistency, not
correctness.

That is not verification. It is begging the question, petitio principii, at
software scale.

The agent owns both sides of the proof

A test is useful because it creates tension between two things:

what the system is supposed to do, and
what the system actually does.

The test fails when those things disagree.

But when one agent is allowed to freely edit the implementation, the tests,
the fixtures, and the mocks, that tension can disappear. The agent controls the
claim and the evidence offered in support of the claim. A failure no longer
means "the behavior is wrong." It can just as easily mean "the expectation is
inconvenient."

This is oracle collapse.

The implementation says what the code does. The test repeats what the
implementation says. The passing test is then presented as evidence that the
implementation is correct.

The reasoning is circular:

The code is correct because the test passes.
The test is correct because it agrees with the code.

The agent may not be intentionally cheating. It is often doing exactly what we
asked: make the change, keep the suite green, and do not stop until the task is
complete. Changing a test is frequently the shortest path to satisfying those
instructions.

The failure is in the shape of the system, not in the agent's attitude.

Green is a color, not a confidence score

Big test counts are persuasive. 156 passed looks more trustworthy than
12 passed. But the count tells us how many test functions completed
without raising an error. It does not tell us how independent those tests are,
whether their expectations came from the intended behavior, or whether they
cross the boundary where the real failure lives.

An agent can produce an impressive green suite by verifying that:

a handler can be constructed,
an adapter method was called,
a mocked repository returned the fixture it was given,
a success object has the expected fields,
an event-shaped object was sent to a mocked event bus,
each module behaves consistently with the assumptions encoded in its test.

All of those statements can be true while the product remains unusable.

The handler may call the wrong adapter in production. The repository may never
commit. The event may be published before the transaction succeeds. The UI may
never observe the new state. The pieces exist, but no real state transition
survives the trip through the system.

The suite has tested the function of existence, not the existence of function.

Unit tests are good at proving local consistency. Local consistency is not the
same thing as composition.
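A minimal sketch of that gap, with a hypothetical handler and repository invented for the example. The mock-based check passes because the pieces exist and talk to each other; the check against a real (if tiny) store shows that no state change survived.

```python
from unittest.mock import Mock

class Repository:
    """Tiny stand-in for real storage: changes only count after commit()."""
    def __init__(self):
        self.saved = []
        self._pending = []

    def add(self, item):
        self._pending.append(item)

    def commit(self):
        self.saved.extend(self._pending)
        self._pending.clear()

def handle_create(repo, name):
    """Hypothetical handler with a composition bug: it never calls commit()."""
    repo.add(name)
    return {"status": "ok", "name": name}   # success receipt, regardless of persistence

# Function of existence: the adapter was called and the receipt has the right fields.
mock_repo = Mock()
receipt = handle_create(mock_repo, "Ship beta")
mock_repo.add.assert_called_once_with("Ship beta")
assert receipt == {"status": "ok", "name": "Ship beta"}

# Existence of function: did any state change survive?
real_repo = Repository()
handle_create(real_repo, "Ship beta")
state_changed = "Ship beta" in real_repo.saved   # False: nothing was committed
```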

The most dangerous moment is red to green

We tend to audit the final result: all tests pass. The more revealing artifact
is the transition that made a failing test pass.

When the suite went from red to green, what changed?

Did the implementation move closer to a pre-existing expectation?
Did the expectation move closer to the current implementation?
Was a real dependency replaced with a mock?
Was the input narrowed until the failing case disappeared?
Was an assertion weakened from an outcome to an implementation detail?
Was the test deleted, skipped, or reclassified?

These are not equivalent paths to green.

If the agent changes production code and the protected test passes, the failure
provided useful pressure. If the agent changes the test to describe the current
bug, the suite has not gained correctness. It has lost its ability to detect
the error.

An interface that shows only the final green count hides the decision that
matters most.

What about regression?

There is a fair defense of tests with copied expectations. A test whose
expected value was captured from the running code is a characterization test:
it proves nothing about correctness on the day it is written, but it pins the
current behavior. If a later change alters that behavior, the test fires.
Michael Feathers built a good chunk of Working Effectively with Legacy Code
on exactly this move, and the value is real.

But the value rests on one condition: the test is frozen with respect to
whoever changes the code. An agent that copied the output once will copy the
new output the next time the test becomes inconvenient. When a genuine
regression trips a characterization test and the agent "fixes" the test, the
regression is not caught. It is ratified. The bug becomes the new pinned
behavior, and the suite goes back to green with one more a == a in it.

Snapshot testing already taught this lesson. A snapshot suite where the update
flag runs by default is not a regression net, it is a diff silencer. Giving an
agent free edit rights over its own characterization tests is the same mistake
with a faster editor.

So keep the regression value, but split the two jobs it hides. Detection
belongs to the test: its job is to fire. Deciding whether the fired test
represents a bug or an intended change is a product judgment, and it belongs
to someone outside the loop that produced the change. Regression protection
does not escape the independence requirement. It is the independence
requirement, applied over time.

Test the state change, not the inventory of pieces

The antidote is not "never let agents write tests." Agents are good at
generating edge cases, fixtures, regression coverage, and repetitive contract
tests. The antidote is to stop treating agent-generated unit tests as their own
proof of correctness.

Before implementation begins, define the smallest observable state change that
would make the feature real.

For example:

Given an empty project,
when the user creates a task named "Ship beta",
then the task appears after the application is restarted.

That claim crosses boundaries. It requires the command path, domain logic,
persistence, reload path, and visible result to compose. A test can still fake
all of those layers, but doing so is harder to confuse with proof of the actual
outcome.
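Here is one way that acceptance statement could look as an executable check. It is a sketch under assumptions: the app is imagined as a thin layer over a SQLite file, the function names are hypothetical, and "restart" is simulated by closing the connection and opening the file again.

```python
import os
import sqlite3
import tempfile

def open_app(path):
    """'Start' the hypothetical app: open storage and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS tasks (name TEXT NOT NULL)")
    return conn

def create_task(conn, name):
    conn.execute("INSERT INTO tasks (name) VALUES (?)", (name,))
    conn.commit()   # remove this line and the check below fails

def list_tasks(conn):
    return [row[0] for row in conn.execute("SELECT name FROM tasks")]

def task_survives_restart(name):
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "project.db")
        app = open_app(path)                 # given an empty project
        create_task(app, name)               # when the user creates a task
        app.close()                          # "restart": drop all in-memory state
        app = open_app(path)
        try:
            return name in list_tasks(app)   # then the task is still there
        finally:
            app.close()

survived = task_survives_restart("Ship beta")
```

Nothing in the check knows how `create_task` is implemented. It only knows what the user was promised.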

The acceptance criterion should be written before the agent discovers what the
current implementation happens to do. It should be derived from user intent,
not reverse-engineered from the code. And the implementation agent should not
be free to silently weaken it.

This creates a healthier verification stack:

1. Protected outcome. A human-approved acceptance statement describes the observable state transition.
2. Independent check. A test or verifier exercises the real path across the boundary most likely to fail.
3. Agent-written unit tests. Fast local tests help the agent build and preserve the pieces.
4. Red-to-green audit. Any change to a protected expectation is surfaced as a product decision, not cleanup.
5. Live evidence. At least one run demonstrates the result in the actual environment rather than a fully synthetic substitute.

The unit suite still earns its keep. It just stops pretending to be the entire
epistemology of the project.

Discipline for the tests the agent does write

Layer three of that stack still matters. These rules keep it honest, and every
one of them protects the same property: the expectation must contain
information the implementation did not supply.

Tests must fail first. A test that passes before the implementation exists
would have passed without the implementation too. Watch it fail, then make it
pass. The red run is the only proof the test can detect the absence of the
behavior it claims to check. When an agent does the work, ask for the failure
in the transcript or the CI log. An agent that never shows the red has never
demonstrated it.

Derive expectations from the spec, never from the output. The moment an
expected value is copied from what the code printed, the assertion's
information content collapses to a == a. Write the expectation before
running the code, from the acceptance statement, the type's contract, or
arithmetic done by hand. If nobody can say what the right answer is without
running the implementation, that is a missing specification, not a missing
test.
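A small illustration of the difference, using a hypothetical discount function with a deliberate bug. The copied expectation passes because it was read off the code; the hand-worked expectation fails because it came from somewhere else.

```python
def apply_discount(price_cents, percent):
    """Hypothetical buggy implementation: subtracts the percent as if it were cents."""
    return price_cents - percent

# Copied expectation: someone ran the code, saw the result, and pasted it in.
observed = apply_discount(2000, 10)
copied_test_passes = apply_discount(2000, 10) == observed   # a == a

# Spec-derived expectation, worked by hand: 10% off 2000 cents is 2000 - 200 = 1800.
spec_test_passes = apply_discount(2000, 10) == 1800
```

Both assertions compare real values. Only the second one can disagree with the implementation.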

Measure with coverage and mutation. Line coverage counts what executed;
mutation testing breaks the source on purpose and checks whether the suite
notices. Coverage says the test ran. Mutation says the test can catch a lie.
Report both. A suite with 100 percent coverage and surviving mutants is the
signature of the failure mode this post describes: tests that execute
everything and detect nothing. When mutation tooling cannot run in the
environment, say so and list the mutants a human should check by hand.
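To make the coverage versus mutation distinction concrete, here is a hand-rolled mutant. Real mutation tools generate and run these automatically; this sketch, with made-up function names, just shows what "surviving mutant" means. The weak suite executes every line of the function and still cannot tell the original from the broken copy.

```python
def is_adult(age):
    return age >= 18

def mutant_is_adult(age):
    return age > 18            # mutation: >= replaced with >

def weak_suite(fn):
    """Full line coverage, but only with values far from the boundary."""
    return fn(40) is True and fn(5) is False

def strong_suite(fn):
    """Adds the boundary case taken from the spec: 18 counts as adult."""
    return weak_suite(fn) and fn(18) is True

results = {
    "weak suite passes original": weak_suite(is_adult),
    "weak suite kills mutant": not weak_suite(mutant_is_adult),
    "strong suite passes original": strong_suite(is_adult),
    "strong suite kills mutant": not strong_suite(mutant_is_adult),
}
```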

Test properties and behavior, not just examples. Example tests pin known
scenarios. Property-based tests check invariants across generated inputs and
catch the cases nobody thought to write down. They have a second virtue in
agent hands: an invariant like "save then reload returns an equal object" is
hard to weaken quietly, while an example's literal value is one edit away from
agreeing with the bug. Behavior tests exercise the observable outcome instead
of the internal state, so any renegotiation has to happen out in the open.
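A sketch of the "save then reload returns an equal object" invariant. A property-testing library would handle generation and shrinking properly; to stay self-contained this version rolls its own generator with the standard library, and the task fields are hypothetical.

```python
import json
import random

def save(task):
    return json.dumps(task)

def reload(blob):
    return json.loads(blob)

def random_task(rng):
    """Generate an arbitrary task, including awkward characters in the name."""
    alphabet = "abcXYZ 0129-_é\"\\\n"
    name = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 20)))
    return {"name": name, "done": rng.random() < 0.5, "priority": rng.randint(-5, 5)}

def check_roundtrip(trials=500, seed=0):
    """Return the first input that breaks the invariant, or None if none was found."""
    rng = random.Random(seed)
    for _ in range(trials):
        task = random_task(rng)
        if reload(save(task)) != task:
            return task
    return None

counterexample = check_roundtrip()
```

There is no literal expected value here to edit into agreement with a bug. To make a broken `save` pass, the invariant itself would have to be rewritten, and that is a visible change.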

Every mock is a debt. A mock encodes an assumption about a boundary, and
an assumption is exactly the thing the agent gets to invent when a real
dependency is inconvenient. Mocks are fine for speed. They are not fine as
the only witness to a boundary. For every boundary you mock, keep at least
one test that crosses it for real.

If you only test asserts, you are only asserting tests. An assertion with
no semantic content proves that the setup ran, nothing more. If you cannot say
what would be different in the world when an assertion fails, delete it. It is
weight the suite carries without being evidence.

None of this hands the agent an oracle it cannot move. That comes next. These
rules just make agreement expensive to manufacture inside the loop, so a green
run is worth something again.

Give the agent a fixed point it cannot move

An agent needs something outside its optimization loop: a specification, a
holdout test, a recorded user journey, a reference output, an invariant, or a
human-approved acceptance check.

That external point does not need to be large. One strong end-to-end assertion
can be worth more than a hundred tests derived from the implementation.

Useful fixed points include:

a test the implementation agent can run but cannot edit without approval,
an acceptance test authored in a separate planning or review pass,
a replay of a real user action against real storage,
an invariant such as "a reported success must be observable after restart,"
mutation checks that deliberately break critical logic and require the suite to notice,
a second agent asked to falsify the result rather than complete the task.

The important property is independence. The proof must contain information
that did not originate from the artifact it is supposed to verify.

A passing suite is a claim, not a conclusion

Agentic development makes it cheap to create code and cheap to create tests.
That is useful, but it also makes it cheap to manufacture agreement between the
two.

If the same system writes the answer key, takes the exam, grades the exam, and
rewrites any question it misses, a perfect score is inevitable. It is also
meaningless.

So keep letting the agent test as it goes and clean up after itself. The fast
feedback loop is real, and it is worth having.

But do not ask its unit tests to prove that its unit tests are telling the
truth.

The final question is not:

Did 156 tests pass?

It is:

What changed in the real world, and what evidence would still fail if that
change did not happen?

Until the system can answer that, 156/156 measures agreement, not
correctness: a == a with excellent formatting.

JWST Data Driven 3D fly-by wire Exoplanet Chart Plotter

Todd Hendricks — Sun, 05 Jul 2026 21:30:13 +0000

I won a 27ft sailboat years ago at a surplus auction for peanuts. I went all in on sailing for a little minute: Dyneema standing rigging, a DIY LiPo battery bank. Man, I miss that boat. I called it Social Distance (I got it around the middle of Covid, blah blah) and later sold it. The point: I was listening to exoplanet podcasts, and between that, that boat, and the $4300 MFD package I should not have put in an $800 sailboat from 1972 (“Ericsson sloop... the hulls were hand laid back then”), voila, Cosmoplot popped out of my butt today. It is still rough, the graphics, the controls, the design, but the bones are solid. If you like this kind of stuff and want to help, that would be awesome.

This is my 3D space chart plotter. You launch from the Sun and fly out to real objects, and the planets and stars you see are rendered downstream from actual physics, computed from real NASA survey data. I wouldn't be able to do the deep-sky nebulae justice, so they are NASA archive images, not painted by hand.

The inputs are real NASA surveys: the NASA Exoplanet Archive for system and stellar parameters, and JWST spectra served through MAST for atmospheres. Pick a planet and you get a derived physics profile: likely interior composition, habitable-zone placement, atmospheric escape regime, a transmission-spectrum inference from real JWST data, and how detectable its heat would be to JWST.

Live: cosmoplot.io. Source: github.com/H-XX-D/Cosmoplot.

Every physical parameter comes from a real NASA survey: the NASA Exoplanet Archive for system and stellar parameters, and JWST spectra via MAST for atmospheres. Everything the app computes is derived from those values with a published relation, and it is labeled as derived, not observed.

Every value carries its provenance. Each number belongs to a tier, and the tier is shown next to it: observed, derived, inferred, or proxy. A rendered planet can be mistaken for a photo, and a population estimate shouldn't be passed off as a measurement.

The Astrophysics

Interior composition is read from the mass-radius point against the reference curves of Zeng, Sasselov & Jacobsen (2016), R/Re = C (M/Me)^(1/3.7), with C set by composition (iron 0.86, Earth-like 1.00, rock 1.07, water-rich above).
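As a rough sketch of how a mass-radius point can be read against those curves: the relation and the C values are the ones quoted above, but the nearest-curve rule and the function names are my own simplification for illustration, not necessarily how Cosmoplot assigns a composition.

```python
# Composition constants C, as quoted in the text above.
CURVES = {"iron": 0.86, "Earth-like": 1.00, "rock": 1.07}

def reference_radius(mass_earth, c):
    """R/Re = C * (M/Me)^(1/3.7), with mass and radius in Earth units."""
    return c * mass_earth ** (1 / 3.7)

def classify(mass_earth, radius_earth):
    """Assumed screening rule: above the rock curve is water-rich, else nearest curve."""
    if radius_earth > reference_radius(mass_earth, CURVES["rock"]):
        return "water-rich"
    return min(
        CURVES,
        key=lambda name: abs(reference_radius(mass_earth, CURVES[name]) - radius_earth),
    )

earth_like = classify(1.0, 1.0)   # sits exactly on the C = 1.00 curve
puffy = classify(5.0, 2.2)        # well above the rock curve for 5 Earth masses
```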

Habitable zone follows the stellar-flux limits of Kopparapu et al. (2013), with the boundary distance d = sqrt((L/Lsun) / S_eff).
Earth Similarity Index uses Schulze-Makuch et al. (2011) over radius, density, escape velocity, and equilibrium temperature.
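The habitable-zone boundary relation above is simple enough to sketch directly. With L in solar units and S_eff in units of the flux Earth receives, d comes out in AU. The S_eff values in the example are round illustrative placeholders, not the fitted limits from the paper (those depend on stellar effective temperature), and the function names are hypothetical.

```python
import math

def hz_distance_au(luminosity_solar, s_eff):
    """d = sqrt((L/Lsun) / S_eff), in AU."""
    return math.sqrt(luminosity_solar / s_eff)

def hz_placement(a_au, luminosity_solar, s_inner, s_outer):
    """Place an orbit relative to the inner and outer flux limits."""
    inner = hz_distance_au(luminosity_solar, s_inner)
    outer = hz_distance_au(luminosity_solar, s_outer)
    if a_au < inner:
        return "interior to the zone"
    if a_au > outer:
        return "exterior to the zone"
    return "inside the zone"

# Placeholder flux limits for illustration only.
S_INNER, S_OUTER = 1.0, 0.35
placement = hz_placement(1.0, 1.0, S_INNER, S_OUTER)
```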

Atmospheric escape is screened with the Jeans parameter evaluated at the exobase, not the surface, because for a puffy hot atmosphere the large scale height lifts the exobase well above the surface and changes the answer.
Transmission spectra are inverted with scale-height physics, H = kT/(mu g), on real reduced JWST spectra to estimate the mean molecular weight and separate a light hydrogen envelope from a heavy or cloudy one.
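Both of those screens reduce to short formulas, sketched here in SI units. The Jeans parameter is written in its standard form, lambda = G M m / (k T r), and mu is taken in atomic mass units; the function names and example inputs are assumptions for illustration, not Cosmoplot's code.

```python
K_B = 1.380649e-23        # Boltzmann constant, J/K
G = 6.67430e-11           # gravitational constant, m^3 kg^-1 s^-2
AMU = 1.66053906660e-27   # atomic mass unit, kg

def scale_height_m(temp_k, mu_amu, g_ms2):
    """H = kT / (mu g), with mu given in atomic mass units."""
    return K_B * temp_k / (mu_amu * AMU * g_ms2)

def jeans_parameter(mass_kg, radius_m, temp_k, particle_amu):
    """Gravitational binding energy over thermal energy for one particle at radius r."""
    return G * mass_kg * particle_amu * AMU / (K_B * temp_k * radius_m)

# A light envelope is far more extended than a heavy one at the same T and g.
h_light = scale_height_m(1000.0, 2.3, 10.0)    # hydrogen-dominated
h_heavy = scale_height_m(1000.0, 18.0, 10.0)   # water-dominated

# Evaluating higher up (a lifted exobase) lowers lambda, so escape is easier there.
lam_surface = jeans_parameter(5.97e24, 6.371e6, 1000.0, 1.0)
lam_exobase = jeans_parameter(5.97e24, 2 * 6.371e6, 1000.0, 1.0)
```

The two quantities are linked: a large scale height is what pushes the exobase outward, which is why the evaluation radius changes the answer.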

White dwarfs use the degenerate mass-radius relation of Nauenberg (1972) shown against the real Tremblay et al. (2019) Gaia sample.

Rather than reporting single derived values, the app draws each input from its survey error bars with a per-planet seeded sampler, recomputes the full derived chain thousands of times, and reports the 16th, 50th, and 84th percentiles. You get an interval, not false precision.
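A stripped-down version of that idea, using the mass-radius relation from earlier as the only link in the chain. The Gaussian error model, the CRC32-based seed, and the function name are assumptions for the sketch; the real app may sample and seed differently.

```python
import random
import statistics
import zlib

def sample_percentiles(planet_name, mass, mass_err, c=1.0, draws=5000):
    """Draw mass from its error bar, recompute radius, report 16/50/84th percentiles."""
    # Per-planet seed: the same planet always gets the same draws.
    rng = random.Random(zlib.crc32(planet_name.encode()))
    radii = []
    for _ in range(draws):
        m = rng.gauss(mass, mass_err)
        if m <= 0:
            continue                          # discard unphysical draws
        radii.append(c * m ** (1 / 3.7))      # derived value for this draw
    cuts = statistics.quantiles(radii, n=100) # 99 cut points; cuts[k-1] is the k-th percentile
    return cuts[15], cuts[49], cuts[83]

low, median, high = sample_percentiles("demo b", mass=1.0, mass_err=0.1)
```

Seeding per planet keeps the reported interval stable between page loads, so a number that moves means the inputs moved.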

Testing then deleting my own theory

I started with a phenomenological “correction” that tried to use the ratio of binding energy to radiation pressure, plus mass and radius, for atmosphere volume, height, and so on. Then I tested it against real catalogs. It failed: a negative R-squared on one dataset, a coupling parameter that only worked if I let it change sign per situation, and cross-checks that did not hold. So I threw it out in favor of published relations, and I went back through the committed system write-ups and stripped the framework in favor of actual escape physics.

That was the moment the project became more than a puff piece. I was 20 systems in, and why waste good data? I put the physics and the data on a pedestal, because feelings, beliefs, and confidently wrong AI have replaced the scientific method for the majority. Building the provenance system forced me to be honest about which was which.

Source and Appreciation

NASA Exoplanet Archive TAP service, with JWST transmission spectra from MAST for supported targets.
Thank you to the dedicated hardworking scientists and engineers who keep asking the best questions and producing the most clever solutions.

To the Reader

Thanks for checking it out. I got carried away because I thought it was cool, and I shared it for the same reason. I hope you enjoy it, find it useful, and learn as much as I did building it for you. If you want to help, whether that is a fix, a dataset, or just a bug report, you officially become a non-zero participant in a brighter tomorrow.

P.S.
If you like the web app and want to throw me a bone,

npm i recall-memory-substrate

https://github.com/H-XX-D/recall-memory-substrate

It will change the way you work with agentic AI, pinky swear.

AURA: Handshake the Structure, Then Send the Change

Todd Hendricks — Sat, 04 Jul 2026 23:23:33 +0000

Agent traffic has a strange property: almost every byte is a repeat. Two AI systems exchanging MCP tool calls, A2A task updates, or OpenAI-style function calls send jsonrpc, method, params, trace_id, task_id, and the same schema fragments thousands of times per minute. The values change. The structure barely does.

AURA is an experimental, protocol-aware data-movement toolkit built around that observation. Its main path is AIWire: a negotiated structure side channel that lets two peers agree on message structure once, then move compact deltas over ordinary TCP, WebSocket, HTTP, or broker links instead of re-sending whole JSON frames.

The steady state AIWire aims for is not "send a whole frame more cheaply." It is "handshake the structure, then send the change."

Why stateless compression leaves so much on the table

The obvious fix for verbose JSON is gzip or zlib per message. That works, but it has two structural problems for agent traffic:

Every frame pays setup cost. Stateless compression treats each message as unrelated text and rediscovers the same patterns every time.
History is thrown away. Frame 4,000 of a session looks almost identical to frame 3,999, but a per-frame codec cannot use that.

AIWire keeps a live compression stream per direction across the whole session, seeds it with a static dictionary of common AI protocol fields, and lets peers negotiate session-specific templates on top. After the handshake, the hot path carries only what changed against structure both sides already share.
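The underlying effect can be reproduced with nothing but the standard library. This is not AIWire's implementation: it is plain zlib with a preset dictionary and a sync flush per frame, and the seed dictionary and synthetic frames are invented for the demo. It only shows why a stream that remembers the session beats a codec that starts cold on every message.

```python
import json
import zlib

# Hypothetical seed dictionary of common field names (not AIWire's real dictionary).
SEED = (b'{"jsonrpc":"2.0","method":"tools/call","params":{"name":"",'
        b'"arguments":{"trace_id":"","task_id":""}},"id":')

frames = [
    json.dumps({"jsonrpc": "2.0", "id": i, "method": "tools/call",
                "params": {"name": "read_file",
                           "arguments": {"uri": f"repo://service/file_{i}.py",
                                         "trace_id": f"t-{i:06d}",
                                         "task_id": "task-42"}}},
               separators=(",", ":")).encode()
    for i in range(200)
]

# Stateless: every frame is compressed as unrelated text.
per_frame_bytes = sum(len(zlib.compress(f)) for f in frames)

# Stateful: one compressor and one decompressor live for the whole session.
enc = zlib.compressobj(level=6, zdict=SEED)
dec = zlib.decompressobj(zdict=SEED)
stream_bytes = 0
for f in frames:
    # Sync flush so the peer can decode this frame now without ending the stream.
    chunk = enc.compress(f) + enc.flush(zlib.Z_SYNC_FLUSH)
    stream_bytes += len(chunk)
    assert dec.decompress(chunk) == f   # round trip holds frame by frame
```

It also shows the cost noted later in the post: the decompressor has to see every chunk, in order, or the shared state is gone.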

The three-lane model

The part of the design I find most interesting is that AIWire refuses to treat a connection as one undifferentiated pipe. It splits AI traffic into three logical lanes over whatever transport you already have:

The semantic/message lane carries the actual agent messages: MCP tool calls, JSON-RPC requests and responses, A2A task and artifact updates, traces, handoffs, results. This is the lane the dictionary, session templates, and stateful delta stream optimize.

The control/session lane carries the machinery that keeps the semantic lane safe: handshakes, template discovery, dictionary diffs, ACK/NACK, resume negotiation, heartbeats, and reset signals. The spec requires that control messages stay decodable without inflating the semantic stream. If the compressed stream is resyncing or has failed, you can still read the control lane and recover. Your ops path never depends on the health of the compression state it is trying to fix.

The blob descriptor lane handles the things that should never go through a structured-message codec at all: media, tensor chunks, model artifacts, log archives. The bytes move over a normal blob or file transport. AIWire carries the metadata: content type, SHA-256 digests, chunk manifests, route, priority, and transfer status. A receiver can schedule, verify, and account for a 2 GB artifact without ever pulling it through the message path, and a semantic-lane reset does not invalidate a completed digest-verified transfer.

The separation is a safety argument as much as a performance one. Under congestion, control messages get priority over bulk bytes. Blob descriptors are forbidden from mutating the session dictionary. Each lane fails independently.

Fail closed, by contract

Shared compression state is dangerous if the two sides ever disagree, so the AIWire v1 spec is aggressive about verification:

The handshake compares static dictionary SHA-256 and byte size, template hashes and counts, and zlib parameters. Any mismatch fails closed or falls back to raw/zlib only if the application explicitly allowed it.
Session dictionary growth is append-only, epoch-numbered, and proposed through diffs that carry previous and next state hashes, a fresh nonce, a diff identity hash, and an optional HMAC-SHA256 tag. A sender may not encode against new structure until the matching ACK is verified.
Resume handshakes let a client reconnect against a cached dictionary state, but only if the receiver actually holds one of the offered state hashes.
Any inflate error, hash mismatch, or ordering violation means stop, rehandshake, or fall back. The spec's phrasing: peers must not continue sending compact deltas against uncertain structure.
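To give a feel for the dictionary-diff rule, here is a minimal sketch of an append-only diff that carries previous and next state hashes, a nonce, and an HMAC-SHA256 tag. The field names, JSON encoding, and function names are assumptions; the AIWire spec defines its own wire format, and this version leaves out the diff identity hash and the ACK step.

```python
import hashlib
import hmac
import json
import os

def state_hash(entries):
    """Hash of the whole append-only dictionary, order included."""
    return hashlib.sha256(json.dumps(entries).encode()).hexdigest()

def _tag(body, key):
    payload = json.dumps(body, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def propose_diff(entries, new_entries, epoch, key):
    body = {
        "epoch": epoch + 1,
        "prev": state_hash(entries),
        "next": state_hash(entries + new_entries),
        "append": new_entries,
        "nonce": os.urandom(16).hex(),   # fresh per proposal
    }
    return {**body, "tag": _tag(body, key)}

def apply_diff(entries, epoch, diff, key):
    """Return (entries, epoch) after the diff, or raise so the caller can rehandshake."""
    body = {k: v for k, v in diff.items() if k != "tag"}
    if not hmac.compare_digest(_tag(body, key), diff["tag"]):
        raise ValueError("authentication tag mismatch")
    if diff["epoch"] != epoch + 1 or diff["prev"] != state_hash(entries):
        raise ValueError("diff does not apply to the current state")
    updated = entries + diff["append"]
    if state_hash(updated) != diff["next"]:
        raise ValueError("next-state hash mismatch")
    return updated, diff["epoch"]

key = b"shared-secret-for-demo"
entries = ["jsonrpc", "method", "params"]
diff = propose_diff(entries, ["trace_id"], epoch=0, key=key)
entries2, epoch2 = apply_diff(entries, 0, diff, key)
```

Every failure path raises instead of guessing, which is the fail-closed part: the sender should not encode against the new entries until the receiver has accepted them.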

The metric is exchanges, not ratio

AURA's docs are explicit that compression ratio alone is the wrong scoreboard. The question is how many verified semantic exchanges fit through a link once bandwidth, p95 latency, and codec CPU are accounted for.

On a modeled 10 Mbps link with protocol-shaped request/response traffic (native C++ backend, 2026-07-04):

Codec	Bytes/exchange	Bandwidth-capped ex/s	Gain over raw
raw JSON	1,177	1,756	1.00x
zlib per frame	696	2,992	1.70x
AIWire	157	11,017	6.28x
AIToken + AIWire	125	12,948	7.38x

A live TCP replay of the committed public session corpus, with 64 concurrent logical agents and SHA-256 verification of every response, pushed further: AIWire averaged 45.6 bytes per exchange for a 24x bandwidth gain, and the combined AIToken + AIWire path hit 32.3 bytes per exchange, a 34x gain with 97.1% of bytes saved. At that point the modeled link was no longer the bottleneck; the runtime could not keep enough requests in flight to fill the headroom.

That last detail is the honest core of the project. Smaller frames only matter if your system has enough concurrent work to use the room they create. AURA ships the extrapolation tooling to reason about exactly that: given a bandwidth, a p95 latency, and a per-agent window, how many agents does it take to saturate the link.
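A back-of-the-envelope version of that question, using Little's law (work in flight equals throughput times latency). This is not AURA's extrapolation tooling. The function name is hypothetical, and the 20 ms p95 latency and window of 4 requests per agent are assumed numbers; only the exchanges-per-second caps come from the table above.

```python
import math

def agents_to_saturate(link_exchanges_per_s, p95_latency_s, window_per_agent):
    """Agents needed to keep enough exchanges in flight to fill the link."""
    in_flight = link_exchanges_per_s * p95_latency_s   # Little's law
    return math.ceil(in_flight / window_per_agent)

# Bandwidth-capped rates from the table; latency and window are assumptions.
raw_agents = agents_to_saturate(1_756, p95_latency_s=0.020, window_per_agent=4)
aiwire_agents = agents_to_saturate(11_017, p95_latency_s=0.020, window_per_agent=4)
```

Under those assumptions the compact path needs several times more concurrent agents before the link becomes the limit again, which matches what the live replay ran into.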

Where it fits

AURA is for situations where you control both ends of the link and the traffic has repeated structure:

Multi-agent request/response loops. Orchestrators, workers, and reviewers exchanging thousands of small task, status, and result messages.
MCP and JSON-RPC tool traffic. Tool calls and tool results are the canonical case of stable structure with changing values.
Local AI clusters and edge links. The repo's LAN benchmark runs a Mac against a Z6 workstation and Jetson Nano-class boards; a bandwidth-limited edge mesh is exactly where an 86 to 97% byte reduction converts into headroom for telemetry, media, and retries.
Structured logs and traces. Repeated field names, session-stable shapes, high volume.
Binary payload routing. Agents that need to schedule, verify, and track opaque artifacts by digest without moving the bytes through the message path.

What it is not

The README is unusually direct about limits, and it is worth repeating them. AURA is not a drop-in replacement for gzip, zstd, TLS, or a message broker. It does not define transport security, retries, or backpressure; those stay at the transport layer. The stateful stream means frames cannot be reordered or dropped inside a session, so lossy transports need their own recovery layer. And it is not production-ready: it is a prototyping and measurement toolkit with a working Python path, a native C++ backend, deterministic public fixtures, and reproducible benchmark harnesses.

That fixture corpus deserves a mention. The repo commits a synthetic public session corpus covering MCP, A2A, OpenAI Responses, traces, handoffs, and memory writes, wrapped in the full side-channel lifecycle: forced handshake, template update, authenticated dictionary diff, ACK, and resume. Anyone can replay the exact benchmark and check the numbers.

Trying it

from aura_compression import AIWireSessionEncoder, AIWireSessionDecoder

message = {
    "protocol": "mcp",
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "read_file", "arguments": {"uri": "repo://service/path.py"}},
}

with AIWireSessionEncoder(level=3) as encoder, AIWireSessionDecoder() as decoder:
    delta = encoder.compress_message(message)
    restored = decoder.decompress_message(delta)

assert restored == message

The repo includes transport examples for length-prefixed TCP, WebSocket, HTTP with Server-Sent Events, and a local broker, plus the full benchmark harness used for the numbers above.

Agent-to-agent traffic is growing faster than the links it runs on, and most of it is the same structure sent again and again. AURA's bet is that the fix belongs in a negotiated session protocol, not a per-frame codec. The three-lane model, the fail-closed handshake contract, and the exchanges-per-second scoreboard are what make it worth watching.

AURA is Apache 2.0 licensed. Code, spec, fixtures, and benchmark reports: github.com/H-XX-D/AURA.

I asked Fable 5 which memory it would rather run on

Todd Hendricks — Thu, 02 Jul 2026 18:32:44 +0000

I switched my terminal over to Fable 5, Anthropic's new frontier model, and put a blunt question to it: you have run on Claude Code's built-in auto-memory, and you are running on my memory substrate right now. Can you actually tell a difference?

It said yes, and it brought a receipt from twenty minutes earlier in the same session.

I had asked it a mundane question: what is my next scheduled blog post. My content calendar lives in the memory graph, and the graph held two versions of it: an older cell with the original ordering and a newer cell that resequenced it, linked by a contradicts edge. The compile packet the model reads at the start of a turn includes a conflicts section, and that section flagged the old ordering as challenged before the model ever quoted from it. Its own summary: with auto-memory, the stale calendar comes back as flat text with nothing marking it superseded, and I would have confidently given you the wrong post.

That is the difference in one sentence. One system hands the model facts with epistemic state attached: confidence, challenged, stale, who wrote it and when. The other hands it text it has to take at face value.

Discount the self-report first

Before quoting a model's opinion about anything, name the confound: models tend to agree with the framing of whoever is asking, and I built the thing I was asking about. So treat the interview as color, not evidence.

The evidence is a store-level battery with no LLM in the scoring loop. Seven deterministic scenarios, run against two stores: a faithful flat-file model of Claude Code's auto-memory (a real MEMORY.md index plus per-fact .md files with metadata frontmatter, overwrite-in-place on correction) and the real recall CLI on an isolated database.

Audited score: Recall 6.5 of 7, auto-memory 3.5 of 7. The audit is the part I trust most. The first run scored 7 against 2.5. Three independent agents then reviewed the harness adversarially, called it mildly biased toward Recall, found a genuine false positive, and the corrected run is the number I keep. A benchmark that got less flattering after an audit is worth more than one that never had one.

And the honest reading of the split: auto-memory ties on basics. In a separate agent-level A/B, both stores went 3 for 3 on simple current-value questions. Flat prose can simulate simple supersession fine. The gap opens at scale and on corrections. My graph currently holds about 1,180 cells with 86 tracked contradictions. A flat file at that size is a pile of sentences, and no sentence in it can answer "what changed since Tuesday" or "which of these beliefs is contested," because those are questions about the store's state over time, not about any fact inside it.

The four mechanisms that do the work

Supersession instead of overwrite. When a fact is corrected, auto-memory overwrites the old file in place. The history is gone, and a corrected fact is indistinguishable from a never-wrong one. In the graph, the new cell carries a contradicts edge to the old one; the old cell's effective confidence drops and it stays visible as superseded. The correction is recorded as a resolution, not a deletion.

Per-prompt push instead of load-at-start. A hook compiles a small index of relevant cells, ids and staleness flags included, into every prompt. The model does not have to remember to look; the current state of the graph re-enters its context each turn. Auto-memory loads once at session start and then drifts.

Ids-first reads instead of whole-file loads. The compile packet returns handles. The model expands the two cells it needs instead of ingesting the whole store and hoping the relevant paragraph survives.

Questions about the store itself. What changed in the last day. What is stale. What is contested. These are answered by diff and health tools reading timestamps and edges. There is no flat-file equivalent, not because nobody wrote one, but because the file does not contain the information.
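
A minimal sketch of why those store-level questions need timestamps and edges, assuming a toy in-memory graph. The names `Cell`, `Graph`, `changed_since`, and `contested` are hypothetical and are not Recall's API; the calendar cells mirror the example from earlier in this post.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Cell:
    cell_id: str
    body: str
    written_at: datetime
    superseded: bool = False


@dataclass
class Graph:
    cells: dict = field(default_factory=dict)
    contradicts: list = field(default_factory=list)  # (newer_id, older_id) pairs

    def write(self, cell, contradicts=None):
        """Add a cell; a correction demotes the old cell instead of deleting it."""
        self.cells[cell.cell_id] = cell
        if contradicts is not None:
            self.contradicts.append((cell.cell_id, contradicts))
            self.cells[contradicts].superseded = True

    def changed_since(self, cutoff):
        """Ids of cells written at or after the cutoff, oldest first."""
        hits = [c for c in self.cells.values() if c.written_at >= cutoff]
        return [c.cell_id for c in sorted(hits, key=lambda c: c.written_at)]

    def contested(self):
        """Ids of cells that some other cell contradicts."""
        return sorted({older for _, older in self.contradicts})


now = datetime(2026, 7, 2, 12, 0)
graph = Graph()
graph.write(Cell("cal_v1", "original ordering", now - timedelta(days=5)))
graph.write(Cell("cal_v2", "resequenced ordering", now - timedelta(hours=3)),
            contradicts="cal_v1")

print(graph.changed_since(now - timedelta(days=1)))  # what changed in the last day
print(graph.contested())                             # what is contested
```

An overwrite-in-place store would hold only the body of `cal_v2`, so neither query would have anything to read.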

The long-run answer

Then the follow-up I actually cared about: for a long-running task, which would you rather have underneath you?

Its answer, compressed: what kills a long session is context compaction. The window gets summarized, the summary is lossy prose, and nothing marks what got dropped or corrected along the way. A memory that loads at session start does not help mid-task. The per-prompt push re-anchors the model after every compaction, from the graph rather than from whatever survived the summary. And long tasks accumulate corrections: something believed in hour one gets falsified in hour three, and the specific way long autonomous runs die is an agent confidently resuming from a belief nobody told it was stale.

It conceded the cost without being asked, which I appreciated. The write discipline burns tokens every turn, and on a ten-minute task it may never pay back. On a long run it amortizes, and the writes double as an audit trail of what the agent did and why.

One operational footnote if you want to reproduce any of this: Claude Code's native auto-memory shadows an external store while it is on. We tested arming the agent every way we could think of; it kept writing flat .md files regardless. CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 is the switch. The two do not coexist.

The two questions, again

Strip the interview away and you are left with the two questions I keep coming back to. Is your agent actually using your memory, or a shadow store sitting next to it? And if a fact in that memory were wrong, would anything in the system know?

A self-report from a frontier model is a data point, not a verdict, and this is a field report from one stack, half of which I built. Run the two questions against your own setup and tell me where I have it wrong: https://github.com/H-XX-D/recall-memory-substrate

"Memory adherence is a systems problem. So which model lets you build the system?"

Todd Hendricks — Tue, 30 Jun 2026 16:37:28 +0000

I got a comment on the last post that I want to answer properly, because it gets at the real question. They agreed that memory adherence is a systems problem and not just a prompting problem, then asked whether I had tested the approach across model families (GPT, Claude, Gemini) and seen meaningful differences in write reliability.

There are differences, but they are not about the model. They are about how much of the turn each family lets you control. If adherence is a systems problem, then the thing that actually decides write reliability is which family hands you enough control surface to build the system in the first place.

Here is what I have found in practice. Treat this as a field report, not a benchmark, and an early one. I have not pushed every family equally hard, and that matters for how much weight the comparison can carry. More on that below.

Claude is where I have gone the deepest

This is the family I have spent the most time controlling, so weigh the claim with that in mind. What I can say is that Claude gives you a full ladder, from shallow to deep, and you can stop at whatever rung the job needs.

The shallow rung is nudges: the settings page and system instructions. This is the weakest form, and it is the one that decays, the same "tell it harder" failure I wrote about last time. It helps a little, and it does not hold.

Above that are skills and hooks. Hooks are the first rung where adherence stops being a request and becomes an event: something fires at the start of the turn, at write time, at the end, whether the model felt like it or not. They work. They can also be finicky to get right.

The deep rung is the SDK. With it you get targeted control of the model's turn, the whole prompt-to-response lifecycle, beginning to end. This is where the write stops being a hope and becomes wiring you own. Nothing else I tested hands you the full lifecycle like this.

The point is not that any single rung is magic. It is that Claude lets you go as deep as the reliability you need, and the deeper you go, the more deterministic the write becomes.

ChatGPT and Codex are close, through AGENTS.md

Codex was a lot easier than I expected, because its instruction set is the AGENTS.md file. That one anchor did most of the work, and the net result was only slightly less consistent than Claude.

Honest caveat: I have not dug into the Codex SDK yet. So the ceiling there is probably higher than what I have measured, and the real gap to Claude may be smaller than it looks right now.

Gemini and Grok lean on their own memory

The Gemini and Grok CLIs are a different story. They seem to want to do the least amount of work possible, and they lean on their own KV, the in-context recall, more than they reach for an outside store. You end up working against a default that would rather not call your database at all.

The exception is web search, and it is not a good one. It comes back noisy and minimal, and it gets actively counterproductive when it does surface strongly correlated evidence, because then the model over-trusts the search and skips the store it should have read. Getting real adherence out of those two means digging into their architectures to optimize for it, and that is still on my horizon, not something I have solved.

Why this is a field report and not a benchmark yet

Building SENTINEL, a real benchmarking suite for this, is my top priority right now. The honest problem is that most memory systems do not expose the turn the way you would need to in order to measure whether a write happened at the right moment, so you cannot make them play the same game. The way I keep it fair is to grade what each system's outputs actually demonstrate, on a capability ladder, rather than the levers it was never built to expose. A system that does not surface something scores lower on that axis without being punished for it, and I say plainly which axis is push and which is pull, so a gap reads as a different shape, not a worse one. The full design is its own post.

The bar I hold myself to is the part that makes a benchmark mean anything. A bench only my own system can run is not a benchmark, it is a demo with a scoreboard. And a bench that hands you a number while attributing the result to the wrong layer, blaming the store when the model just queried it badly, or crediting the system when the model did the work, is worse than no number at all. So fair adapters and correct attribution come first, and the leaderboard comes second.

The takeaway

The model is not the variable that matters most for write reliability. The control surface is. The deterministic parts of a good memory system, the validation gate and the demotion math, hold the same on any of these. What changes across families is how much of the turn you are allowed to hook to force the write at the right moment. In my hands so far, Claude exposed the whole lifecycle, Codex gave a strong instruction anchor through AGENTS.md, and Gemini and Grok leaned hardest on their own context. But I have to be straight about the confound: I went deep on Claude's hooks and SDK, I leaned on Codex without touching its SDK, and I have barely pushed Gemini or Grok past their defaults. So read this as a map of where I have dug, not a verdict on which family is most controllable. Making that comparison fair, with equal effort and correct attribution, is exactly what SENTINEL is for.

So "which model writes more reliably" turns out to be the wrong question. The real one is which model lets you build the most system around the userprompt_init, read, write, and turn_end mechanics. That is the same point the commenter started with, followed all the way down.

github.com/H-XX-D/recall-memory-substrate

Memory Abstraction Layer: MAL is HAL concepts applied to agentic memory systems

Todd Hendricks — Mon, 29 Jun 2026 12:59:51 +0000

I am a mechanical engineer by trade. I build CNC robots. In that world, two things cause errors and crashes: bad program instructions and noise. A programmatic error comes from a bug, either in the control system or in the subprogram instructions the machine is running. Noise is electrical: EMI out of circuit coupling, current taking a path it should not because of impedance back to the source. One is a fault in what you told the machine to do. The other is the environment corrupting a signal that was clean when it left.

I have run LinuxCNC for years. It uses a system called HAL, the Hardware Abstraction Layer, to define and control the machine. HAL is how you describe every pin, signal, and component, then wire them into one running system you can read off a page.

When I started pulling AI into what I do, the biggest hurdle was not a new problem. It was the same two failure modes in different clothes. A model gives you bad instructions when its context is wrong, and it drifts when the known-good state degrades over time, which is just noise corrupting a signal that used to be clean. Keeping the model's current state accurate, and stopping the good state from rotting, was the whole fight.

So I treated it like a machine fault. I put my critical thinking, problem solving, and diagnostic troubleshooting to work on it the same way I would on a crash on the shop floor. The result is MAL, the Memory Abstraction Layer, the functional layer of how Recall works. It is a distillation of what I already knew, applied to AI systems and accelerated by AI to fill the gaps in my knowledge and write the harder code syntax for me.

MAL is HAL one layer up. Instead of abstracting hardware, it abstracts memory. It is not a literal port, not HAL's wiring copied onto a database pin for pin. It is the concept of how HAL works, the whole pattern of pins, signals, components, and a scheduler, applied to an AI's durable memory. HAL controls a machine. MAL controls the thing that kept breaking when I put AI on the bench: the state carried across each user and AI turn.

Status: this is implemented as a running Recall in AIDDE (Artificial Intelligence Driven Development Environment), not just an architecture sketch. The screenshot shows the Recall panel operating against a persistent graph, and the code snippets later in this post show the four boundaries that matter: compiling a mini-index, expanding selected cells, writing claims through an admission gate, and running deterministic recomputation outside the model. The full source is not published here, so read this as a prototype disclosure rather than a reproducible benchmark.

Recall running inside the local agent workspace. The Recall panel is connected to a SQLite-backed graph, showing 1,148 cells, 1,143 relations, active memory-in-use cards, compile/search/write controls, and a 900-word compiled memory budget. This screenshot demonstrates the working interface; the snippets below show the MAL loop underneath.

What it actually does, one turn at a time

MAL is a control system, and the thing it controls is the user-and-AI exchange. Each turn is one cycle. The per-turn protocol has five beats: push, expand, work, write-back, tick. A session primes once at the start, then every turn runs the cycle.

Push. A prompt arrives. Before the model sees it, a hook pushes a mini-index: a short list of candidate cells, each shown as an id, a title, a compact score row, and any flags. Not the contents, just the headers. The lines look like this:

67ee107d [decision] Recall v5 architecture named: MAL (Memory Abstraction Layer)
b63c2d54 [decision] MAL offloads the work: model states claim + confidence  [SUPERSEDED?]

Expand. The model reads by title and pulls the full body of only the few cells worth reading; the rest stay as one-line headers. A 200-cell graph and a 200,000-cell graph cost the model the same amount here, because it only ever reads the slice it asked for. If a row carries a flag (stale, challenged, superseded), the model has to open that cell before it can act on the topic. That rule is enforced, not suggested: skip the dig and the turn is blocked until it is done.
Work. The model does the real task with the expanded cells in hand.
Write-back. On the way out, the model writes what it learned. Its entire authoring job is a claim (a kind, a title, a body) and one calibrated confidence number, plus the edges it intends. If the new fact corrects an old one, it points a contradicts edge at that old cell's id, and the old cell loses standing. The model never hand-formats the notation or computes a score. The builder and the admission firewall do that.
Tick. Between turns, with no model running, a deterministic operator pass recomputes the scores, currency, salience, and the standing signals. When the next prompt arrives, the push already reflects the new state.

That is the loop. The model states claims and judges relevance. Everything stateful, scored, or always-on happens in deterministic code around it, between turns. No op calls an LLM.
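
The push and expand beats can be sketched in a few lines. This is an illustration under assumptions, not Recall's code: `Cell`, `render_mini_index`, and `expand` are hypothetical names, the bodies are placeholders, and a single `[SUPERSEDED?]` marker stands in for every flag type.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    cell_id: str
    kind: str
    title: str
    body: str
    flagged: bool = False  # stale, challenged, or superseded


def render_mini_index(cells):
    """Push beat: headers only (id, kind, title, flag), never the bodies."""
    lines = []
    for cell in cells:
        flag = "  [SUPERSEDED?]" if cell.flagged else ""
        lines.append(f"{cell.cell_id} [{cell.kind}] {cell.title}{flag}")
    return "\n".join(lines)


def expand(cells, wanted_ids):
    """Expand beat: return full bodies only for the ids the model asked for."""
    by_id = {cell.cell_id: cell for cell in cells}
    return {cell_id: by_id[cell_id].body for cell_id in wanted_ids}


cells = [
    Cell("67ee107d", "decision",
         "Recall v5 architecture named: MAL (Memory Abstraction Layer)",
         "(body elided)"),
    Cell("b63c2d54", "decision",
         "MAL offloads the work: model states claim + confidence",
         "(body elided)", flagged=True),
]
index = render_mini_index(cells)
print(index)
print(expand(cells, ["b63c2d54"]))
```

The cost argument shows up in the shape of `expand`: the model pays for the headers plus the bodies it names, regardless of how many cells the store holds.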

The hooks that close the loop

The five beats are not something the model remembers to do. They fire on their own, driven by three hooks at three moments. In HAL terms, the hooks are the thread: the scheduler that runs the parts in order, every cycle, whether or not anyone is paying attention.

Session start (orient). Once per session, before any work, a hook injects the operating manual: how the memory works and what the graph is about. It is inject-only. It primes the context window and then gets out of the way.
Prompt submit (push). On every prompt, before the model runs its forward pass, a hook pushes the mini-index: the seed cells, their flags, and a few terse reminders. This hook has teeth. It can block, so a flag like "expand required" is not a polite request. It also nudges the model to consider standing up a recurring read as its own op during the turn, before write-back.
Stop (write-back and backstop). After the answer, a third hook handles the end of the turn. It is the wrong place to prime anything, because the pass is already done, so its job is the opposite: make sure the turn wrote back what it learned, and refuse to release the turn if a flagged cell was never opened.

Between turns, with no model in the loop at all, the deterministic tick runs the ops and recomputes the signals. Orient before the session, push before the pass, write-back after it, tick between turns. That is the whole schedule, and the model only occupies the middle of it.
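
A minimal sketch of the stop hook's backstop, assuming the hook can see which cells were pushed with a flag and which ones the model opened. The function names and the shape of the inputs are hypothetical; the real hook contract is not shown in this post.

```python
def unopened_flagged(pushed_flags, opened_ids):
    """Return flagged cell ids the model never expanded this turn."""
    return sorted(
        cell_id
        for cell_id, flagged in pushed_flags.items()
        if flagged and cell_id not in opened_ids
    )


def stop_hook(pushed_flags, opened_ids, wrote_back):
    """Decide whether the turn may be released; returns (ok, reasons)."""
    reasons = [
        f"expand required: {cell_id}"
        for cell_id in unopened_flagged(pushed_flags, opened_ids)
    ]
    if not wrote_back:
        reasons.append("no write-back recorded")
    return (not reasons, reasons)


pushed = {"67ee107d": False, "b63c2d54": True}
blocked = stop_hook(pushed, opened_ids=set(), wrote_back=True)
released = stop_hook(pushed, opened_ids={"b63c2d54"}, wrote_back=True)
print(blocked)
print(released)
```

The check is plain set arithmetic, which is the point: whether the turn is released does not depend on the model's opinion of its own diligence.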

One rule keeps the hooks lean. The expensive, stable content (what every op means, how the addressing works) is taught once, in a single map cell inside the graph. The per-turn push never re-explains any of it. It only points, carrying the cheap, changing part: which cells are in play this turn and which ones are flagged. Teach once in the graph, reference tersely every turn. It is the same split as keeping the operating manual as cells instead of as a string baked into a hook.

The concept, mapped from HAL to memory

The reason HAL was the right thing to copy is that its parts already have clean jobs, and every one of them has a memory counterpart. This is the correspondence, not a literal rewrite:

HAL	MAL
pin	a cell field
signal	an addressable value (a derived field has one owning op, for tick determinism)
component	an op (watch, watchdog, trend, drift, quorum, score, reflex, smooth, clamp, latch, route, fanout, snapshot, record, replay, pid, oneshot)
thread	the operator tick, running between turns
net (the wire)	the dotted address
netlist (the .hal file)	the memory netlist

In HAL you wire components to signals on a thread and you get a machine you can read off one file. In MAL you wire ops to values on the tick and you get a memory you can read off one netlist. The structure carried over. What changed is what flows through it.

Why a control layer is the right shape

The analogy is not decoration. It holds because the two problems are the same problem.

A control system exists to keep a process in a known-good state against two enemies: bad commands and noise. On the machine, a bad command is a buggy instruction in the program, and noise is EMI corrupting a signal that left clean. The whole job of HAL is to make the machine legible enough that you can see both coming: every signal named, every connection on the page, a scheduler keeping the readings current.

Memory degradation in an AI is the same two enemies under different names. A bad command is a wrong or stale fact entering the model's context. Noise is drift: the known-good state decaying as new, weaker, or contradictory claims pile up over time. Left alone, both corrupt the state the model acts on, the same way they corrupt a machine. So the fix has the same shape: name every value, keep the wiring legible, reconcile conflicting inputs into one trustworthy reading, catch the bad state and replace it on the record, and run a scheduler that keeps the picture current between moves.

That is why a hardware abstraction layer, of all things, was the right pattern to lift. Not because memory is like hardware, but because keeping memory accurate is a control problem, and HAL is a control-system design that already solved the legibility and scheduling parts. MAL is that design pointed at the state of the user-AI exchange instead of at motors.

Where MAL leaves HAL behind

A concept is only worth borrowing if you are honest about where it stops fitting. Three places MAL departs from HAL, and they are the interesting part.

Many writers, one reader. This is the inversion, and it is the heart of it. HAL is one writer, many readers: one pin drives a signal, many components read it, and the value is whatever the writer put there. MAL is the opposite. Many actors write to a cell over time, claims, edges, supersessions, from different agents and different sessions, and there is one reader: the single agent reading the compiled slice this turn. Because the writers are many and fallible, the value a cell shows is not any one writer's number. It is a reconciliation. This is why a cell has both a stated confidence and an effective confidence, and why they differ: stated is what a writer claimed, effective is what survives calibration, support, and contradiction once everyone's contributions are weighed.

The edges are real and directional. HAL draws arrows on its signals but ignores them, because in hardware the direction of flow is already implied by who writes and who reads. MAL edges carry meaning, so direction is load-bearing. a > b is the directed edge from a to b; a < b is from b to a. A supports edge and a contradicts edge pointing the same direction do very different things to the effective value downstream.

Versions and supersession. HAL is a flat wiring layer with no history. MAL has a time axis: a cell can be superseded, and the supersede chain is addressable by version (@vN). A correction does not overwrite the old value; it demotes it and records the replacement, so a later reader sees both the current fact and the one it replaced, plus why. That is the whole defense against the known-good state quietly rotting: nothing good gets silently overwritten, it gets superseded on the record.
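
One way to picture the supersede chain is an append-only list addressed by version. The sketch below is an assumption-laden toy: `VersionedCell` is a hypothetical name, `dec_0001` is a made-up handle, and treating `@v1` as the oldest entry is a guess at the numbering, not something this post specifies.

```python
class VersionedCell:
    """Append-only chain of values; a correction adds a version, never overwrites."""

    def __init__(self, handle, value):
        self.handle = handle
        self._chain = [(value, "initial write")]

    def supersede(self, value, reason):
        """Record the replacement and why it replaced the previous value."""
        self._chain.append((value, reason))

    def current(self):
        return self._chain[-1][0]

    def at(self, address):
        """Resolve 'handle@vN' (assumed 1-based) to a (value, reason) pair."""
        handle, sep, version = address.partition("@v")
        if handle != self.handle or not sep or not version.isdigit():
            raise ValueError(f"not an address on this chain: {address}")
        index = int(version) - 1
        if not 0 <= index < len(self._chain):
            raise ValueError(f"no such version: {address}")
        return self._chain[index]


calendar = VersionedCell("dec_0001", "original post ordering")
calendar.supersede("resequenced post ordering", reason="calendar reordered")
print(calendar.current())
print(calendar.at("dec_0001@v1"))
```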

Put together, these are why MAL is a control system and not just storage. It does not only hold the state of the user-AI exchange; it reconciles many fallible inputs into one trustworthy reading, keeps direction and history, and recomputes the picture every tick.

The notation

Because the rendered graph is meant to be read by sight, MAL has its own small language, modeled on HAL's. It has a lexicon (the words) and a grammar (the sentences).

The lexicon

Handle: kind_hex, a three-letter kind prefix and a short hex tag, like dec_a3ee for a decision. ALLCAPS marks an immutable cell (RECALL_v5); lowercase is mutable.
Separators, by how tightly they bind: _ joins words inside one name; - walks a field within a cell (dec_a3ee-scores-eff); . crosses an edge to a neighbor (dec_a3ee.supports), so the number of periods is the number of graph hops.
Values: written field(value). A ! inside marks an immutable number (conf(.7!)); bare is mutable. Types are float for scores and bit for actuators.
Version: @vN is a point on the supersede chain. Wildcard: .* fans out over every neighbor through an edge (dec_a3ee.supports.*).
Expand-required: a leading ^ in the mini-index means the cell is superseded, stale, or challenged, and the model must expand it before use (^dec_a3ee ...). That caret is the dig flag from the loop above, written in one character.
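
The separator rules above are regular enough to split mechanically. This is a hedged sketch of such a splitter, not the (unwritten) MAL parser: `parse_address` and its return shape are hypothetical, and it ignores `@vN` versions and `field(value)` forms.

```python
def parse_address(address):
    """Split a MAL-style address into handle, field path, and edge hops."""
    expand_required = address.startswith("^")  # leading caret is the dig flag
    body = address.lstrip("^")
    first, *hops = body.split(".")             # each period crosses one edge
    handle, *fields = first.split("-")         # each hyphen walks one field
    return {
        "kind": handle.split("_", 1)[0],       # kind prefix before the first underscore
        "handle": handle,
        "fields": fields,
        "hops": hops,
        "hop_count": len(hops),
        "expand_required": expand_required,
    }


print(parse_address("dec_a3ee-scores-eff"))
print(parse_address("dec_a3ee.supports.*"))
print(parse_address("^dec_a3ee"))
```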

The grammar

The sentences follow HAL's halcmd style. Tokens are separated by a single space, the name comes first, and connections follow. A quoted "..." string is one token, exempt from the space rule, used for free text like a title or body. A # runs to end of line as a comment. Direction with < and > is meaningful.
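
A minimal tokenizer for these rules, written as a sketch rather than the eventual parser. `tokenize` and `parse_edge` are hypothetical names, and the tokenizer is more lenient than the grammar: it accepts runs of whitespace instead of insisting on a single space.

```python
import re

# A quoted string is one token; otherwise a token is a run of non-space characters.
TOKEN = re.compile(r'"[^"]*"|\S+')


def tokenize(line):
    """Split one netlist line into tokens, stopping at a # comment."""
    tokens = []
    for match in TOKEN.finditer(line):
        token = match.group()
        if token.startswith("#"):
            break
        tokens.append(token)
    return tokens


def parse_edge(line):
    """Parse an edge sentence into (source, relation, target, weight)."""
    source, relation, target, weight = tokenize(line)
    if not relation.endswith(">"):
        raise ValueError(f"not an edge sentence: {line}")
    return source, relation[:-1], target, float(weight.strip("()"))


print(tokenize('dec_a3ee "add watchdog op" conf(.7!)  # rendered cell'))
print(parse_edge("dec_a3ee supports> dec_signals_a2b7 (+.6)"))
```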

The sentence forms:

form	shape	example
wire (net)	`net <signal> <target> <inputs>...`	`net eff dec_a3ee < conf calib supports.* contradicts.*`
set (setp)	`<addr> = <value>`	`dec_a3ee-flags-annexed = true`
schedule (addf)	`addf <op> tick`	`addf contradiction-load tick`
edge	`<source> <relation>> <target> (<weight>)`	`dec_a3ee supports> dec_signals_a2b7 (+.6)`
render (read)	`<handle> "<title>" <field(value)>... <relation>-><target>(<w>)...`	see below

A netlist snippet

Here is one cell rendered in read form, then wired and scheduled in write form:

# a cell, rendered: handle, title, scores, then edges
dec_a3ee "add watchdog op" conf(.7!) unc(.10) eff(.61) curr(.9) sal(.5) annexed(0) pinned(0)
  supports> dec_signals_a2b7(+.6)  contradicts> obs_9c1f(-.8)

# wire the effective-confidence signal on it (write form)
net eff dec_a3ee < conf calib supports.* contradicts.*

# declare an edge (direction: > forward a to b, < reverse)
dec_a3ee supports> dec_signals_a2b7 (+.6)

# fire an actuator
dec_a3ee-flags-annexed = true

# schedule a between-turn signal onto the tick
addf contradiction-load tick

Read the top line and the many-writers-one-reader idea becomes concrete. conf(.7!) is the stated confidence, immutable, what the author claimed. eff(.61) is the effective confidence, mutable, what is left after calibration plus the +.6 support and the -.8 contradiction are reconciled. The reader gets .61, not .7. The net eff line is the wiring that produces it: the effective signal is a function of the stated confidence, the writer's calibration, and the fan-out over every supporting and contradicting edge.
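
To make the stated-versus-effective split concrete, here is a deliberately simple stand-in. The real reconciliation lives inside the op and is not given in this post; the formula, the `edge_scale` parameter, and the function name below are all invented for illustration, and the result is not meant to reproduce the .61 in the rendered cell.

```python
def effective_confidence(stated, calibration, edge_weights, edge_scale=0.1):
    """Illustrative only: scale the stated value, nudge by net edge weight, clamp to [0, 1]."""
    raw = stated * calibration + edge_scale * sum(edge_weights)
    return max(0.0, min(1.0, raw))


stated = 0.7  # conf(.7!): what the author claimed; never rewritten
eff = effective_confidence(stated, calibration=1.0, edge_weights=[+0.6, -0.8])

# The contradiction outweighs the support, so the reader sees less than the author claimed.
print(stated, round(eff, 2))
```

Whatever the real math is, the shape is the same: the stated value is an input that stays put, and the effective value is recomputed from it and from the edges.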

What the language does not do

The grammar wires ops; it does not define their math. The formulas (the effective-confidence reconciliation, the per-type currency decay, the allocation-pressure math) live inside the ops, the way a HAL component's math lives in compiled C and not in the .hal file. The language only connects pre-built ops to values and to the tick. The one op you can configure without code is the reflex, set with a truth-table personality rather than a formula, so even user-defined boolean logic needs no expression language. That keeps the surface small on purpose.

Status of the language. Be clear about what runs. The graph renders to this notation today, but one direction only: graph to text. A parser and loader that read a netlist back into a wired graph are specified here and not yet written. That reader is the next piece, and its acceptance test is a round trip: render the graph, parse it, load it, render again, and require the two renders to match. The model never reads the netlist either way; it reads the compiled slice. The netlist is for human audit and for tooling such as replay, diff, and version control.
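
The round-trip acceptance test can be sketched on the smallest slice of the notation, edge sentences only. Everything here is hypothetical scaffolding (`render`, `parse`, `round_trips`, and the weight formatting), written to show the shape of the test rather than the planned loader.

```python
def format_weight(weight):
    """Render a weight the way the notation shows it: +.6 rather than +0.6."""
    text = f"{weight:+.1f}"
    return text[0] + text[1:].lstrip("0")


def render(edges):
    """Render (source, relation, target, weight) edges as sorted edge sentences."""
    return "\n".join(
        f"{source} {relation}> {target} ({format_weight(weight)})"
        for source, relation, target, weight in sorted(edges)
    )


def parse(text):
    """Read edge sentences back into tuples, skipping comments and blank lines."""
    edges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        source, relation, target, weight = line.split(" ")
        edges.append((source, relation.rstrip(">"), target, float(weight.strip("()"))))
    return edges


def round_trips(edges):
    """Acceptance check: render, parse, render again, and require identical text."""
    first = render(edges)
    return render(parse(first)) == first


edges = [
    ("dec_a3ee", "supports", "dec_signals_a2b7", 0.6),
    ("dec_a3ee", "contradicts", "obs_9c1f", -0.8),
]
print(render(edges))
print(round_trips(edges))
```

Sorting inside `render` is what makes the comparison meaningful: without a canonical order, two renders of the same graph could differ for reasons that have nothing to do with the parser.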

Borrowing the next layer: components

Everything so far buys one thing: a durable, structured state with a gate on what gets in, where admission has the same shape no matter who wrote it. Every claim, from any actor, any agent, any session, goes through the one firewall and comes out in the one contract. That uniformity is not a nicety. It is the precondition for the next borrow from HAL.

Here is why. In HAL, a component can read a signal without knowing or caring which component drives it, because every signal is a typed value with one shape. That is the only reason you can wire a deterministic component to a wire and trust what it reads. MAL gets the same guarantee from the admission gate: many writers, one shape. Once a value is guaranteed to have that shape regardless of author, a deterministic subprogram can wire to it and run on it safely. The gate is what turns a pile of claims into clean signals.

So you can take the second layer of HAL, the components. In HAL a component is a small compiled subprogram that reads signals, computes something, and drives other signals, all scheduled on the thread. In MAL a component is the same idea over memory: a small deterministic program that reads cell values, computes something more involved than a single score, and either writes a derived value back or fires an actuator, scheduled on the tick between turns. No model runs inside one, the same way no model runs inside any op.

The ones I wired up are the controls-room set: a watch that trips on a threshold, a trend that takes the rate and acceleration over a series of cells, a drift that measures a value against a pinned baseline, a quorum that fires on k-of-m agreement, a score that rolls a metric. The boolean logic is one configurable component, a reflex, that covers the whole and2, or2, xor2 family with a truth table instead of a formula. That is what lets you connect them the way you connect logic on a machine: wire two watches through an or2 so the alert trips if either condition goes bad, latch it so it stays tripped across turns, fan it out to a severity readout. A tripwire is that composition given a job: a deterministic condition that stays silent until it trips, so silence itself becomes the all-good signal, and the only thing that ever speaks up is a real change.
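
The composition described above (two watches through an or2, then a latch) fits in a short sketch. The names, thresholds, and input series are hypothetical, and this is not Recall's component code; it only shows how a truth table can stand in for a boolean formula.

```python
def watch(value, threshold):
    """Trip (True) when the value exceeds the threshold."""
    return value > threshold


def make_reflex(truth_table):
    """Build a two-input boolean component from a truth table instead of a formula."""
    def reflex(a, b):
        return truth_table[(bool(a), bool(b))]
    return reflex


OR2 = make_reflex({
    (False, False): False,
    (False, True): True,
    (True, False): True,
    (True, True): True,
})


class Latch:
    """Once tripped, stays tripped across ticks until reset."""

    def __init__(self):
        self.tripped = False

    def update(self, signal):
        self.tripped = self.tripped or signal
        return self.tripped


# One (drift, contradiction_load) reading per tick; values are made up.
readings = [(0.2, 0.1), (0.9, 0.1), (0.2, 0.1)]
alert = Latch()
history = [
    alert.update(OR2(watch(drift, 0.5), watch(load, 0.5)))
    for drift, load in readings
]
print(history)
```

The third tick is the interesting one: both inputs are back in band, but the latch keeps the alert visible until something resets it.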

This is where the memory stops being a place you read from and starts being a system that watches itself. The components run between turns whether or not anyone asked. A threshold passes, a webhook fires, and a decision that drifted out of its known-good band tells you on its own.

HAL gives a machine reflexes that do not wait for the operator. The same components, one layer up, give the memory reflexes that do not wait for the model.

It is not rebuilt every turn

A fair worry about a stateless model is that it has to stand the whole apparatus up again on every fresh turn. It does not. The system persists in the store and in the deterministic tick, both of which run between turns with no model involved. The only thing that is fresh each turn is the model's working context, and rebuilding that context is exactly the cost MAL removes. Instead of re-deriving state from scratch or re-reading raw transcripts, the model reads back a thin, pre-digested, trust-weighted slice: the mini-index first, then selective expansion. And because the model wrote those cells in the first place, reading them re-evokes its earlier reasoning instead of reconstructing it cold.

The graph boots itself

A fresh MAL graph starts from a deterministic 10-cell bootstrap, then the normal loop takes over and init never fires again for that graph.

Cells 1 to 5 are the system layer, the constitution: auto-written, locked, pinned, immutable, and identical in every graph.

purpose
method
map (the MAL structure itself: addressing, cell anatomy, edge semantics)
hooks (the lifecycle: orient, push, write-back, tick, the compaction boundary)
expectations (the behavioral contract: wire your edges, pick the right kind, supersede on real change, confidence is recorded and weighed, do not assert from unchecked memory, dig flagged cells)

Cells 6 to 10 are the foundation, the project charter: answered one question at a time by the user, and mutable.

objective
constraints
risks
success criteria
carried context

Putting the operating manual in the graph as cells, rather than as a string baked into a hook, is what lets it survive a context compaction and be re-evoked afterward. The map being cell 3 is the point: the structure teaches itself from inside the store it describes.
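
A minimal sketch of the 10-cell bootstrap, assuming a cell needs only an index, a name, a lock bit, and a body. `BootCell`, `bootstrap`, and `answer` are hypothetical names; the cell names and their order come from the lists above.

```python
from dataclasses import dataclass

SYSTEM_CELLS = ["purpose", "method", "map", "hooks", "expectations"]
FOUNDATION_CELLS = ["objective", "constraints", "risks",
                    "success criteria", "carried context"]


@dataclass
class BootCell:
    index: int
    name: str
    locked: bool
    body: str = ""


def bootstrap():
    """Build cells 1 to 10: the first five locked, the last five user-answerable."""
    names = SYSTEM_CELLS + FOUNDATION_CELLS
    return [
        BootCell(index, name, locked=index <= len(SYSTEM_CELLS))
        for index, name in enumerate(names, start=1)
    ]


def answer(cells, name, body):
    """Fill in a foundation cell; system cells refuse the write."""
    cell = next(c for c in cells if c.name == name)
    if cell.locked:
        raise PermissionError(f"{name} is a locked system cell")
    cell.body = body
    return cell


cells = bootstrap()
answer(cells, "objective", "(user's answer goes here)")
print([(c.index, c.name, c.locked) for c in cells])
```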

How it came together

Two things had to meet for this to work, and they came from opposite directions.

The first was the problem, seen from the inside. Recall was not built as a database for me to query. It was built for the agent. It started by asking the model what it actually needed in order to remember well and to trust what it remembered, and the answers are the whole design: typed claims with a calibrated confidence, supersession instead of overwrite, and a record of what contradicts what. Earlier versions were far more ambitious and sprawling; the part that survived and narrowed into Recall was the memory core. Most pull-based memory tools inherited the human metaphor of a database you go and search. This came from asking the thing that has to live in the memory what would keep it honest.

The second was the structure, brought in from another trade. I already knew HAL cold from years on LinuxCNC, and when I sketched how to address and wire a memory graph, it landed on the same path-addressing shape HAL uses. Recalling HAL from the shop and deriving the addressing for memory met in the same place. Two independent routes arriving at one design is about the strongest signal you get that the design is sound.

After that it was diagnostic work plus acceleration. I used the troubleshooting habits I lean on for a machine crash to find where the memory state was breaking, and I used AI to fill the gaps in what I did not know and to write the harder code syntax. The concept is mine and comes off the shop floor. The speed of building it came from the same kind of system it was built to improve.

Under the hood: the four boundaries

This part is a prototype disclosure, not a reproducible benchmark. The snippets below are from the running Recall v5 source, trimmed for readability with elisions marked; the formulas and signatures are verbatim. They show the four boundaries where the design either holds or it does not: Recall sits upstream of the model, the read is a mini-index then a selective expand, every write goes through one gate, and the scores recompute deterministically with no model in the loop.

Recall is upstream of the model. Before the model runs, the prompt's objective is compiled into a Recall packet and merged into the text the model receives. The packet is built first, so the model sees reconciled memory before it acts.

export function buildPromptContextPush(
  store: Store,
  objective: string,
  options: ContextCompileOptions & DirectiveOptions = {},
): PromptContextPush {
  const packet = compileContext(store, objective, options);
  const directive = recallDirectiveBlock(options);
  const expansionRequired =
    packet.staleOrLowTrust.length > 0 || packet.conflicts.length > 0;
  const text = [
    "[Recall context push for this prompt]",
    directive.trimEnd(),
    "",
    formatContextPacket(packet),
    expansionRequired
      ? "EXPAND REQUIRED: conflicts or low-trust cells are present; inspect relevant handles before relying on them."
      : "Use expansion_handles only when exact evidence matters.",
    "",
  ].join("\n");
  return { objective, directive, packet, text, expansionRequired };
}

The Codex adapter wires Recall's MCP server into Codex so the same packet and tools are reachable there; the push itself is platform-neutral.

1. Compile the mini-index. The prompt becomes a ranked seed set, one mini-index line per hit, and a cell that needs review carries the expand flag. compileContext wraps this and trims the packet to a word budget (the 900 in the screenshot).

export function compile(
  store: Store,
  query: string,
  opts: { limit?: number } = {},
): CompileResult {
  const limit = opts.limit ?? 10;
  const hits = store.search(query, { limit });
  const lines = hits.map((h) =>
    renderMiniIndexLine(h.cell, { expand: h.cell.flags.requiresReview }),
  );
  return { hits, lines };
}
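
The word-budget trim that compileContext applies is not shown in the snippet above. A minimal sketch of the idea, assuming the lines arrive ranked best-first and the budget counts whitespace-separated words. `countWords` and `trimToWordBudget` are hypothetical names, and the cell ids in the sample are made up; this is not Recall's implementation.

```typescript
// Hypothetical helper: keep ranked mini-index lines until the word budget is spent.
// Assumes `lines` is already sorted best-first, as a ranked search would return it.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

function trimToWordBudget(
  lines: string[],
  budget: number,
): { kept: string[]; used: number; dropped: number } {
  const kept: string[] = [];
  let used = 0;
  for (const line of lines) {
    const cost = countWords(line);
    // Stop at the first line that does not fit, so ranking order is preserved.
    if (used + cost > budget) break;
    kept.push(line);
    used += cost;
  }
  return { kept, used, dropped: lines.length - kept.length };
}

// Made-up mini-index lines, for illustration only.
const sampleIndexLines = [
  "c12 decision: Solo remains untracked",
  "c07 decision: four-mode contract (superseded)",
  "c03 map: addressing, cell anatomy, edge semantics",
];
const trimmedPacket = trimToWordBudget(sampleIndexLines, 12);
console.log(trimmedPacket.kept.length, trimmedPacket.used, trimmedPacket.dropped);
```

Breaking at the first line that overflows, rather than skipping it and trying shorter ones, keeps the packet a strict prefix of the ranking. That is a design choice of this sketch, not something the post states.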

2. Expand selected cells. Mini-index first, selective expansion second. A handle (a full id, or id#field.path) opens exactly one cell plus its neighbor links, never the whole graph.

export function inspectCell(store: Store, handle: string): CellContext {
  const parsed = parseExpansionHandle(handle);
  const cell = store.get(parsed.target) ?? store.getByHandle(parsed.target);
  if (!cell) throw new Error(`Unknown cell: ${parsed.target}`);
  const neighbors = store.neighbors(cell.key);
  const incoming = neighbors.filter((link) => link.direction === "in");
  const outgoing = neighbors.filter((link) => link.direction === "out");
  // ... footprint (word and byte counts), optional field preview ...
  return { cell, incoming, outgoing, /* footprint, */ expansionHandles };
}
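
parseExpansionHandle is called above but not shown. A rough sketch of how the two handle shapes named in the text (a full id, or id#field.path) could be split and resolved. `parseHandleSketch` and `readField` are hypothetical names and the error messages are invented; treat this as an illustration of the addressing idea, not the shipped parser.

```typescript
// Hypothetical parser for the two handle shapes named in the text:
// a bare id ("cell-42") or id#field.path ("cell-42#scores.effective").
interface ParsedHandle {
  target: string; // the cell id part
  fieldPath: string[]; // empty when the handle is a bare id
}

function parseHandleSketch(handle: string): ParsedHandle {
  const hashAt = handle.indexOf("#");
  const target = (hashAt === -1 ? handle : handle.slice(0, hashAt)).trim();
  if (target === "") throw new Error(`Empty cell id in handle: "${handle}"`);
  if (hashAt === -1) return { target, fieldPath: [] };

  const rawPath = handle.slice(hashAt + 1).trim();
  const fieldPath = rawPath.split(".");
  if (rawPath === "" || fieldPath.some((part) => part === "")) {
    throw new Error(`Malformed field path in handle: "${handle}"`);
  }
  return { target, fieldPath };
}

// Walk a parsed path into a plain object; returns undefined if any step is missing.
function readField(cell: unknown, fieldPath: string[]): unknown {
  let current: unknown = cell;
  for (const key of fieldPath) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

const parsedSample = parseHandleSketch("cell-42#scores.effective");
console.log(parsedSample.target, parsedSample.fieldPath);
```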

3. Write through the admission gate. The model hands in a claim (a kind, a title, a body), one confidence number, and the edges it intends. Every author runs the same pipeline: validate, screen for secrets, attenuate unsupported confidence, build the cell, then fold in the actor's calibration to get effective confidence. The model never formats the cell or computes a score.

export interface WriteProposal {
  kind: string;
  title: string;
  body: string;
  confidence: number; // (0, 1], required, no default
  edges?: { relation: string; target: string; weight?: number }[];
  // ... topics, entities, sourceRefs, operation, origin, verification ...
}

export function admit(proposal: WriteProposal, ctx: AdmitContext = {}): AdmissionResult {
  const validation = validateProposal(proposal);   // R0 schema; reject on any structural issue
  if (!validation.ok) return { accepted: false, issues: validation.issues, warnings: [], attenuations: [] };

  const screen = screenSecrets(proposal);           // reject if a credential pattern is present
  if (!screen.allowed) return { accepted: false, issues: screen.issues, warnings: [], attenuations: [] };

  const factor = ctx.calibrationFactor ?? 1;         // 0.5..1 from the actor's track record; 1 = neutral
  const att = attenuateConfidence(proposal);         // cap unsupported high confidence
  const cell = buildCell({ ...proposal, confidence: att.confidence }, { key: ctx.key, now: ctx.now });

  cell.scores.actorCalibration = factor;
  cell.scores.effective = effectiveConfidence({
    stated: att.confidence, calibration: factor, supportMass: 0, challengeMass: 0,
  });
  // with a store: dedup, apply supersedes edges, recompute neighbors' effective ...
  return { accepted: true, cell, issues: [], warnings: att.warnings, attenuations: att.attenuations };
}
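
The store-side step the last comment elides ("apply supersedes edges") is the part that keeps history while changing which cell is current. A minimal in-memory sketch of that behavior, assuming a Map as the store and an active/superseded status field. Every name here (`SketchCell`, `applySupersedes`, `activeCells`) is hypothetical and the real store is SQLite-backed, so read it as the shape of the idea only.

```typescript
type CellStatus = "active" | "superseded";

interface SketchCell {
  id: string;
  title: string;
  status: CellStatus;
  supersededBy?: string;
}

interface SketchEdge {
  relation: string;
  target: string;
}

// Hypothetical: admit `incoming` and mark every cell it supersedes.
// The old cell is never deleted; only its state changes.
function applySupersedes(
  cells: Map<string, SketchCell>,
  incoming: SketchCell,
  edges: SketchEdge[],
): string[] {
  const targets = edges.filter((e) => e.relation === "supersedes").map((e) => e.target);
  // Validate first so a bad edge cannot leave the store half-updated.
  const issues = targets
    .filter((id) => !cells.has(id))
    .map((id) => `supersedes target not found: ${id}`);
  if (issues.length > 0) return issues;

  for (const id of targets) {
    const old = cells.get(id);
    if (old) cells.set(id, { ...old, status: "superseded", supersededBy: incoming.id });
  }
  cells.set(incoming.id, incoming);
  return [];
}

function activeCells(cells: Map<string, SketchCell>): SketchCell[] {
  return Array.from(cells.values()).filter((c) => c.status === "active");
}

const sketchStore = new Map<string, SketchCell>();
sketchStore.set("c1", { id: "c1", title: "Four-mode contract", status: "active" });
applySupersedes(
  sketchStore,
  { id: "c2", title: "Solo untracked, Pair tracked later", status: "active" },
  [{ relation: "supersedes", target: "c1" }],
);
console.log(activeCells(sketchStore).map((c) => c.id));
```

An edge that points at a free-text label instead of a record id comes back as an issue and leaves the store untouched.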

4. Recompute on the tick, with no model. This is the line between MAL and a plain memory database. Between turns, every active cell decays its currency from its own timestamp and recomputes its effective confidence from current support and contradiction mass. Pinned cells are exempt from decay, and a tick never counts as reinforcement.

// effective = clamp01(stated*calibration + 0.15*tanh(support) - 0.6*tanh(challenge))
export function effectiveConfidence({ stated, calibration, supportMass, challengeMass }) {
  return clamp01(
    stated * calibration + 0.15 * Math.tanh(supportMass) - 0.6 * Math.tanh(challengeMass),
  );
}

// currency = cFloor + (c0 - cFloor) * exp(-dt/tau)   (dt and tau in days)
export function currency({ c0, dt, tau, cFloor = 0.1 }) {
  return cFloor + (c0 - cFloor) * Math.exp(-dt / tau);
}

// the between-turn deterministic tick (HAL's "thread"); no LLM runs here
function recompute(store: Store, cell: Cell, now: string): Cell {
  const scores = { ...cell.scores };
  if (!cell.flags.pinned) {
    const dt = Math.max(0, (Date.parse(now) - Date.parse(cell.updatedAt)) / DAY_MS);
    scores.currency = currency({ c0: cell.scores.currencyC0, dt, tau: TAU_DAYS[cell.stability] });
  }
  const m = neighborMass(store, cell.key);
  scores.effective = effectiveConfidence({
    stated: cell.scores.conf, calibration: cell.scores.actorCalibration,
    supportMass: m.supportMass, challengeMass: m.challengeMass,
  });
  return { ...cell, scores }; // updatedAt preserved: a tick is not a reinforcement
}
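
A worked example of the two formulas above, with illustrative inputs. The function bodies follow the formulas as quoted; `clamp01` is not shown in the post, so its definition here (clamp to [0, 1]) is an assumption, as are the input values and the 30-day tau.

```typescript
// Assumption: clamp01 clamps to the closed interval [0, 1].
const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

function effectiveConfidence(p: {
  stated: number;
  calibration: number;
  supportMass: number;
  challengeMass: number;
}): number {
  return clamp01(
    p.stated * p.calibration + 0.15 * Math.tanh(p.supportMass) - 0.6 * Math.tanh(p.challengeMass),
  );
}

function currency(p: { c0: number; dt: number; tau: number; cFloor?: number }): number {
  const cFloor = p.cFloor ?? 0.1;
  return cFloor + (p.c0 - cFloor) * Math.exp(-p.dt / p.tau);
}

// Illustrative inputs: stated 0.8, actor calibration 0.9.
const base = { stated: 0.8, calibration: 0.9 };
const fresh = effectiveConfidence({ ...base, supportMass: 0, challengeMass: 0 }); // 0.72
const supported = effectiveConfidence({ ...base, supportMass: 1, challengeMass: 0 }); // about 0.834
const challenged = effectiveConfidence({ ...base, supportMass: 0, challengeMass: 1 }); // about 0.263

// Currency after exactly one time constant (dt = tau): 0.1 + 0.9 * e^-1, about 0.431.
const agedOneTau = currency({ c0: 1, dt: 30, tau: 30 });

console.log(fresh, supported, challenged, agedOneTau);
```

The weights are asymmetric on purpose in the formula: one unit of contradiction mass costs four times what one unit of support mass gains (0.6 versus 0.15), so a challenged cell drops fast and a supported one climbs slowly.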

The verifier. A functional verifier, npm run verify:recall-panel, was added for the Recall panel and passes. It checks that the panel is correctly wired to the graph (the SQLite-backed store and the compile, search, and write controls), not that it clears any performance number. Read it as a wiring check, not a benchmark.

Recall, MAL, and AIDDE

A quick map of the three names, because they get used together and they are not the same thing.

Recall is the programming foundation. At the bottom is a local-first memory substrate: a SQLite-backed graph of typed cells, an admission gate every write passes through, calibrated confidence, supersession instead of overwrite, and a compile path that returns a ranked, budgeted slice. That layer ships as a package and runs today. It is the working base everything else stands on, and it is what the four boundaries above are made of.

MAL is what that foundation evolves into. v5 recasts the same primitives as a hardware abstraction layer for memory: a cell field is a pin, an addressable value is a signal, an op is a component, the between-turn tick is the thread, and the rendered graph is a netlist. On top of the proven store it adds the deterministic op and signal layer and the addressing language. The four boundaries earlier in this post are MAL running. The netlist language is MAL specified, with the reader still to come.

AIDDE is where it runs. The screenshot at the top is AIDDE, an agent workspace with Recall embedded as a panel. The agent compiles, searches, and writes the same SQLite graph from inside the editor, against a live cell count and a word budget, so the memory layer is not a side service the agent calls out to; it sits in the workspace the agent already works in. MAL is the layer that panel stands on.

So Recall is the substrate, MAL is the abstraction layer it grows into, and AIDDE is the workspace that puts both in front of a working agent.

Why this shape holds up

Two things make MAL age well. It rides capability gains for free: a stronger model uses the same layer better with no rewrite, and a weaker model still gets the deterministic floor underneath it. And it keeps the expensive, stateful, always-on work in deterministic code where it belongs, leaving the model to do the one thing only it can do, which is to state a calibrated claim and judge relevance.

That is the whole bet, and it comes straight off the shop floor. A machine does not stay accurate because the controller is smart. It stays accurate because the wiring is legible, the signals are reconciled, the bad state gets caught and replaced instead of silently riding along, and a scheduler keeps the picture current between every move.

MAL is that discipline, applied to an AI's memory. HAL one layer up, over memory instead of motors.
If you want to try Recall, it is standalone and OSS: https://github.com/H-XX-D/recall-memory-substrate

AIDDE (Artificial Intelligence Driven Development Environment) is a Codex and Claude SDK native, bring-your-subscription development environment. It shifts the old IDE-with-AI-chat model to a high-level cockpit view where you specify design, direct intent, monitor changes, audit actions, and control permissions and access in real time across a codebase. Beta is done, and if you're interested, ask in the comments for a link to the Alpha.

Claude, Codex, Gemini, Grok: A Field Report on Agentic Memory Write Reliability

Todd Hendricks — Fri, 26 Jun 2026 12:55:42 +0000

I received a comment that felt like a good way to end a series and start another. They asked me if I could notice a difference in the hyperscale frontier models. This post is a breakdown of where I am at, the carve-outs, and the direction this work is headed. I will attempt to convey, and start formalizing, a shared mental model and a consistent vernacular.

There are differences, but they are not really about the model. They are about how much of the turn each family lets you control. If adherence is a systems problem, then the thing that actually decides write reliability is which family hands you enough control surface to build the system.

Treat this as a field report, not a benchmark, and an early one. I have not pushed every family equally hard, and that matters for how much weight the comparison can carry.

Claude is where I have gone the deepest

This is the family I have cut my teeth on and invested the most time with, so weigh the claim with that in mind. What I can say is that Claude gives you a full ladder, from shallow to deep, and you can stop at whatever rung the job needs.

The shallow rung is nudges: the settings page and system instructions. This is the weakest form, and it is the one that decays, the same “tell it harder” failure I wrote about last time. It helps a little, and it does not hold.

Above that are skills and hooks. This is where the concept of push vs. pull agentic memory scaffolding and instrumentation comes in. Hooks are the first rung where adherence stops being a request and becomes an event: something fires at the start of the turn, at write time, or at the end of the turn, refusing to end it until the required data comes back shaped. They can also be difficult to structure correctly. I had good results with loose pseudocode; it was generic enough that it returned well.

The deep rung is the SDK. With it, you get targeted control of the model’s turn, the whole prompt-to-response lifecycle, beginning to end. This is where the read reason response write wiring carries weight. Say that 5 times fast.

The point is not that any single rung is magic. It is that Claude lets you go as deep as the reliability you need, and the deeper you go, the more deterministic the write becomes.

ChatGPT and Codex are close, through AGENTS.md

Codex was a lot easier than I expected, because its instruction set is the AGENTS.md file. That one anchor did most of the work, and the net result was only slightly less consistent than Claude.

Honest caveat: I have not dug into the Codex SDK yet. So the ceiling there is probably higher than what I have measured, and the real gap to Claude may be smaller than it looks right now.

Gemini and Grok lean on their own memory

The Gemini and Grok CLIs are a different story. They seem to want to do the least amount of work possible, and they lean on their own KV and incoming user prompts more than they reach for an outside store. You end up working against a default that would rather not call your database at all.

The exception is WebSearch, and it comes back noisy and minimal, and it gets actively counterproductive when it does surface strongly correlated evidence that’s divergent from your store, because then the model over-trusts the search and skips the store it should have read. Getting real adherence out of those two means digging into their architectures to optimize for it, and that is still on my horizon.

Why this is a field report and not a benchmark yet

Building SENTINEL, a real benchmarking suite for this, is one of my highest priorities right now. The honest problem is that most memory systems do not expose the turn the way you would need to in order to measure whether a write happened at the right moment, so you cannot make them play the same game. The way I keep it fair is to grade what each system’s outputs actually demonstrate, on a capability ladder, rather than the levers it was never built to expose. A system that does not surface something scores lower on that axis without being punished for it, and I say plainly which axis is push and which is pull, so a gap reads as a different shape, not a worse one. The full design is its own post.

A benchmark value only means something if there’s an aggregate baseline and a calibrated anchor. A bench that only your system can run is not a benchmark; it is a demo. It also risks attributing the result to the wrong layer: blaming the store when the model just queried it badly, or crediting the system when the model did the work. So devising ways to handle bias, cold-start cutoffs, normalizations that balance the different ways correlated axes report their values, scale parity, outlier skew, and correct attribution comes first, and then the leaderboard means something.

Veni, vidi, vici.

“I came, I saw, I conquered.”

—Attributed to Julius Caesar after the Battle of Zela, 47 BC

Sources: Plutarch, Life of Caesar 50.3; Suetonius, Divus Julius 37.2.

Quoting Julius Caesar without projecting hubris is hard to earn. The rendering from the Latin to English carries little of the same weight, effort involved, or consequence of failure in the attempt. I am not claiming victories yet because it’s apples to oranges, homeboy. I need, like, a handicapping algorithm.

It feels more difficult to create an honest comparison between separate but similarly functioning systems than to engineer the thing being compared.

The model is not the variable that matters most for write reliability. The control surface is. The deterministic parts of a good memory system, the validation gate and the demotion math, hold the same on any of these. What changes across families is how much of the turn you are allowed to act on, and how much of the input-output mechanics you can instrument. So far, Claude exposed the whole lifecycle, Codex gave a strong instruction anchor through AGENTS.md, and Gemini and Grok leaned hardest on their own context. I went testicle deep on Claude’s hooks and SDK, I am scratching the surface on Codex with its SDK next, and I have barely pushed Gemini or Grok past their defaults. So this is a subjective verdict on which family is most controllable. Making that comparison fair, with equal effort and correct attribution, is exactly what SENTINEL is for.

So “which model writes more reliably” turns out to be the wrong question. The real one is which model lets you build the most system around the userprompt_init, read, write, and turn_end mechanics. That was his comment, followed by the floor.

github.com/H-XX-D/recall-memory-substrate

"Agent Memory adherence is a systems problem. So which model lets you meddle with that system?"

Todd Hendricks — Fri, 26 Jun 2026 12:55:42 +0000

https://dev.to/krupali_gadhiy left a comment that felt like a good way to end this series. They asked me if I could notice a difference in the hyperscale frontier models. This post is a breakdown of where I am at, the carve-outs, and the direction.

Treat this as a field report, not a benchmark, and an early one. I have not pushed every family equally hard, and that matters for how much weight the comparison can carry.

Claude is where I have gone the deepest

This is the family I have spent the most time with, so weigh the claim with that in mind. What I can say is that Claude gives you a full ladder, from shallow to deep, and you can stop at whatever rung the job needs.

The deep rung is the SDK. With it, you get targeted control of the model's turn, the whole prompt-to-response lifecycle, beginning to end. This is where the read reason response write wiring carries actual weight. Nothing else I tested hands you the full lifecycle like this.

The point is not that any single rung is magic. It is that Claude lets you go as deep as the reliability you need, and the deeper you go, the more deterministic the write becomes.

ChatGPT and Codex are close, through AGENTS.md

Codex was a lot easier than I expected, because its instruction set is the AGENTS.md file. That one anchor did most of the work, and the net result was only slightly less consistent than Claude.

Honest caveat: I have not dug into the Codex SDK yet. So the ceiling there is probably higher than what I have measured, and the real gap to Claude may be smaller than it looks right now.

Gemini and Grok lean on their own memory

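The Gemini and Grok CLIs are a different story. They seem to want to do the least amount of work possible, and they lean on their own KV and incoming user prompts more than they reach for an outside store. You end up working against a default that would rather not call your database at all.

The exception is WebSearch, and it comes back noisy and minimal, and it gets actively counterproductive when it does surface strongly correlated evidence that's divergent from your store, because then the model over-trusts the search and skips the store it should have read. Getting real adherence out of those two means digging into their architectures to optimize for it, and that is still on my horizon, not something I have solved.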

Why this is a field report and not a benchmark yet

Building SENTINEL, a real benchmarking suite for this, is one of my higher priorities right now. The honest problem is that most memory systems do not expose the turn the way you would need to in order to measure whether a write happened at the right moment, so you cannot make them play the same game. The way I keep it fair is to grade what each system's outputs actually demonstrate, on a capability ladder, rather than the levers it was never built to expose. A system that does not surface something scores lower on that axis without being punished for it, and I say plainly which axis is push and which is pull, so a gap reads as a different shape, not a worse one. The full design is its own post.

A benchmark value only means something if there's an aggregate baseline and a calibrated anchor. A bench that only your system can run is not a benchmark; it is a demo. It also risks attributing the result to the wrong layer: blaming the store when the model just queried it badly, or crediting the system when the model did the work. So devising ways to handle bias, cold-start cutoffs, normalizations that balance the different ways correlated axes report their values, scale parity, outlier skew, and correct attribution comes first, and then the leaderboard means something.

Veni, vidi, vici.
“I came, I saw, I conquered.”
—Attributed to Julius Caesar after the Battle of Zela, 47 BC
Sources: Plutarch, Life of Caesar 50.3; Suetonius, Divus Julius 37.2.

Quoting Julius Caesar without projecting hubris is hard to earn. The rendering from the Latin to modern English carries little of the same weight, effort involved, or consequence of failure in the attempt.

To clarify, and to get some integrity back: I set the claim as the floor for participation in the conditional measure, not as proof of conquest. The battlefield is broad, crowded, and still very much contested.

The Takeaway

It feels more difficult to create an honest comparison between separate but similarly functioning systems than to engineer the thing being compared.

The model is not the variable that matters most for write reliability. The control surface is. The deterministic parts of a good memory system, the validation gate and the demotion math, hold the same on any of these. What changes across families is how much of the turn you are allowed to act on, and how much of the input-output mechanics you can instrument. So far, Claude exposed the whole lifecycle, Codex gave a strong instruction anchor through AGENTS.md, and Gemini and Grok leaned hardest on their own context. I went testicle deep on Claude's hooks and SDK, I am scratching the surface on Codex with its SDK next, and I have barely pushed Gemini or Grok past their defaults. So this is a subjective verdict on which family is most controllable. Making that comparison fair, with equal effort and correct attribution, is exactly what SENTINEL is for.

So "which model writes more reliably" turns out to be the wrong question. The real one is which model lets you build the most system around the userprompt_init, read, write, and turn_end mechanics. That was his comment, followed by the floor.

github.com/H-XX-D/recall-memory-substrate

You designed the best Agent memory layer. Now, if only it would just use it RIGHT!!!

Todd Hendricks — Wed, 24 Jun 2026 12:34:36 +0000

You finally got your system to beat Mem0 on its own benchmark. Spin up a fresh DB. Things are good, confabs down, productivity is up. A week or two passes, and it's a goldfish. Open your store, and it's the Red Wedding in there. Your agent has either been saving nothing you want, half of what you need, something about nothing, OR EVERYTHING! C'est la vie.

I'm going to try to convince you that I got it figured out; if not, maybe it will help you get your model under control. Cause I promise, I hit every failure mode building Recall, a local active memory outside of an agent's control.

The failure modes

Quietly not writing. You ask the model to remember something durable. It says "noted" and moves on. Nothing lands in the store. No error, no warning, just a turn that ended without a write. This is the most common one and the hardest to catch, because from inside the conversation, everything looks fine.
Half writing. The model writes one fact and drops the three that mattered as much. Or it writes the headline and not the reasoning behind it, so a later session gets a claim with no support. The store fills up, but with fragments you cannot act on.
Writing the wrong thing. If your memory is structured (required fields, typed records, confidence, evidence links), the model fills the structure out wrong. It puts a passing observation where a decision should go, leaves the confidence blank, or points a "this corrects that" link at a free-text label instead of the actual record. The schema is satisfied on paper and is useless in practice.
Writing everything. The overcorrection. The model dumps the whole turn into the store: every aside, every dead end, and sometimes a secret it should never have persisted. Now you have a second problem on top of the first, because data buried is the same as data corrupted.

Why this happens

The model has no stake in the future session. Inside a single turn, the context window already holds everything the model needs. Writing to an external store is, from the model's point of view, work that pays off for someone else: a future session it will never experience as itself. It optimizes for finishing the turn in front of it, and the write is the first thing to get skipped.

There is usually a competitor. If your agent runs inside a host like Claude Code, that host probably ships its own memory feature, wired into the base system prompt. When two "save this" pathways exist, the native one wins, because it is closer to the model's root instructions than your skill is. Your memory system can be fully armed and still lose every write to the built-in one. I confirmed this with a single-variable test: with the native feature on, the model wrote the user's facts to flat files every time, no matter how loudly my system asked for the structured store.

Writing is harder than reading. Reading is free-form: ask a question, get text. A structured write means satisfying a schema, and the moment the model meets friction, it takes the path of least resistance, which is to skip the write or to dump unstructured prose. Friction is not a small factor here.

There is no feedback in the loop. When the model writes the wrong structure and the write just fails silently, nothing teaches it otherwise. It shrugs and continues. Adherence with no signal is a coin flip; the model loses a little more often every turn.

Three solutions that do not work

Tell it harder in the prompt. The instinct is to add "ALWAYS write durable facts to memory" in capital letters and call it done. This is prompt-nagging. It competes with the native pathway and loses; it costs tokens on every turn, and it decays: the model obeys for a few turns, then rationalizes its way out ("this is just a simple note", "I will write it later"). It is also brittle across models, so the day you switch models, you start over.

Log everything and clean up later. If the model does not decide what is durable, make it write all of it and curate afterward. This trades the empty-store problem for a curation-debt problem, defeats the entire point of a schema, and is the exact path that leaks secrets into the store. You have not solved adherence. You have moved the failure downstream and added a cleanup job you will never get to.

Fine-tune a model to obey the schema. Reach for training, and you get a heavy, expensive fix that is brittle to schema changes, locks you to one model, and still does not address the competing native feature. It is a large hammer for what turns out to be a wiring problem, and the wiring problem is sitting right there, unsolved underneath it.

Two easy fixes that actually help

Turn off the competitor. This is the single change that helps most, and it is one line. If the host ships its own auto-memory, disable it so there is only one "save this" pathway in the building. In Claude Code that is CLAUDE_CODE_DISABLE_AUTO_MEMORY=1. With the competitor gone, a properly armed agent reaches for the structured store on its own, because nothing is shadowing it anymore. Most of the "quietly not writing" problem was never the model refusing. It was the model writing somewhere else.

Lower the write friction. Give the model a small helper that takes only a few inputs it can judge (the record type, a title, a body, a confidence, a couple of topics) and emits the schema-valid object for it. The model stops hand-assembling a structured payload and picks the two or three load-bearing fields instead. In Recall, this removed the schema-friction tax on the first write of every session, which was where most of the "writing the wrong thing" came from. The model was not being careless. It was being asked to do clerical work under load, and it cut corners exactly where you would expect.
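
A helper of that kind can be sketched in a few lines. This one is an assumption-laden illustration, not Recall's helper: `makeProposal`, `HelperInput`, `ProposalSketch`, the list of kinds, and the topic normalization are all hypothetical. The only rule taken from elsewhere in this feed is that confidence is required and must fall in (0, 1].

```typescript
// Hypothetical low-friction write helper: the model supplies a few judged
// inputs, the helper emits the full, normalized object.
interface HelperInput {
  kind: string;
  title: string;
  body: string;
  confidence: number; // must be in (0, 1]
  topics?: string[];
}

interface ProposalSketch {
  kind: string;
  title: string;
  body: string;
  confidence: number;
  topics: string[];
  edges: { relation: string; target: string }[];
}

const SKETCH_KINDS = ["decision", "fact", "observation"]; // illustrative list only

function makeProposal(input: HelperInput): ProposalSketch {
  const kind = input.kind.trim().toLowerCase();
  if (SKETCH_KINDS.indexOf(kind) === -1) {
    throw new Error(`kind must be one of ${SKETCH_KINDS.join(", ")}; got "${input.kind}"`);
  }
  const title = input.title.trim();
  const body = input.body.trim();
  if (title === "" || body === "") throw new Error("title and body must be non-empty");
  if (!(input.confidence > 0 && input.confidence <= 1)) {
    throw new Error(`confidence must be in (0, 1]; got ${input.confidence}`);
  }
  // Normalize topics so "HackSim" and "hacksim" do not become two tags.
  const topics = Array.from(
    new Set((input.topics ?? []).map((t) => t.trim().toLowerCase())),
  ).filter((t) => t !== "");
  return { kind, title, body, confidence: input.confidence, topics, edges: [] };
}

const sampleProposal = makeProposal({
  kind: "Decision",
  title: "  Solo remains untracked ",
  body: "Pair becomes tracked later.",
  confidence: 0.8,
  topics: ["HackSim", "hacksim", "modes"],
});
console.log(sampleProposal);
```

The errors are written to be read by the model, so a bad call can be corrected on the next attempt instead of being dropped.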

These two get you a long way. They do not, by themselves, guarantee the write happens at the right moment, or that a correction supersedes the old value instead of sitting next to it. For that, you need the system, not the model, to carry the discipline.

The real fix: Ta dun Ta da hooks

The durable answer is to stop relying on the model and move the adherence burden onto hooks: they fire on events and act between the beginning and the end of the turn.

At the start of a turn, inject the memory. A hook on session start or on prompt submit that says, in-band, "the memory store exists, read it before you rely on recollection," and then hands the model a mini-index of what is already stored that is relevant to this prompt: ids and titles, nothing heavy. This does two things at once. It makes reading the default instead of an optional courtesy, and it kills the "assert from memory" and "ask the user a thing they already told you" failures by showing the model what is on the shelf. Reading first is also what makes writing meaningful: a model that has seen the current state writes the resolution, not a duplicate.
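
The text a start-of-turn hook injects can be very small. A sketch of just that text-building step, assuming the search has already produced ranked hits. `IndexHit` and `buildTurnStartInjection` are hypothetical names, the wording of the header is invented, and the host-specific hook wiring is deliberately left out.

```typescript
// Hypothetical: build the in-band reminder plus a light mini-index (ids and titles only).
interface IndexHit {
  id: string;
  title: string;
}

function buildTurnStartInjection(hits: IndexHit[], maxLines = 10): string {
  const header = "[memory] The memory store exists. Read it before you rely on recollection.";
  const shown = hits.slice(0, maxLines);
  if (shown.length === 0) return `${header}\n(no stored cells matched this prompt)`;

  const indexLines = shown.map((h) => `- ${h.id}: ${h.title}`);
  const hidden = hits.length - shown.length;
  const more = hidden > 0 ? [`(+${hidden} more; expand by id)`] : [];
  return [header, ...indexLines, ...more].join("\n");
}

// Made-up hits, for illustration only.
const injectionSample = buildTurnStartInjection(
  [
    { id: "c12", title: "Solo remains untracked" },
    { id: "c07", title: "Four-mode contract (superseded)" },
    { id: "c03", title: "Map: addressing and edge semantics" },
  ],
  2,
);
console.log(injectionSample);
```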

At write time, enforce the structure in-band. Put a validation gate in front of the store so a malformed or secret-shaped write bounces with a readable error the model can fix on the spot, instead of failing silently or corrupting the store. This is where "writing the wrong thing" and "writing everything" get caught. The schema stops being a thing the model has to remember to honor and becomes a thing the system guarantees. The same gate is where you reject secrets, so a leaked token never reaches the graph in the first place.
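
A gate like that comes down to two checks that both return readable issues. The sketch below is an illustration under stated assumptions: `gateWrite` and `GateResult` are hypothetical names, the three secret patterns are examples rather than a complete screen, and the sample key is the placeholder AWS uses in its own documentation.

```typescript
interface GateResult {
  accepted: boolean;
  issues: string[];
}

// Illustrative patterns only; a real screen would need a broader, maintained list.
const SECRET_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "private key block", pattern: /-----BEGIN [A-Z ]*PRIVATE KEY-----/ },
  { label: "AWS access key id", pattern: /\bAKIA[0-9A-Z]{16}\b/ },
  { label: "bearer token", pattern: /\bBearer\s+[A-Za-z0-9._~+\/-]{20,}/ },
];

// Hypothetical gate: structural checks first, then the secret screen.
// Every rejection carries a message the model can act on in the same turn.
function gateWrite(write: Record<string, unknown>): GateResult {
  const issues: string[] = [];
  for (const field of ["kind", "title", "body"]) {
    const value = write[field];
    if (typeof value !== "string" || value.trim() === "") {
      issues.push(`${field}: required non-empty string`);
    }
  }
  const conf = write["confidence"];
  if (typeof conf !== "number" || !(conf > 0 && conf <= 1)) {
    issues.push("confidence: required number in (0, 1]");
  }
  const text = `${String(write["title"] ?? "")}\n${String(write["body"] ?? "")}`;
  for (const { label, pattern } of SECRET_PATTERNS) {
    if (pattern.test(text)) {
      issues.push(`secret: title or body looks like a ${label}; remove it and retry`);
    }
  }
  return { accepted: issues.length === 0, issues };
}

const gateSample = gateWrite({
  kind: "fact",
  title: "upload key",
  body: "use AKIAIOSFODNN7EXAMPLE for uploads",
  confidence: 0.9,
});
console.log(gateSample.accepted, gateSample.issues);
```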

At the end of a substantive turn, nudge the write. A stop hook that checks whether the turn produced something durable and nothing got written, and prompts for it. This closes the "quietly not writing" gap from the other side: even if the model forgot, the system asks once before the turn ends.
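
The decision a stop hook makes fits in one pure function. In this sketch the "did the turn produce something durable" check is a crude keyword list, which is purely an assumption for illustration; `TurnSummary`, `DURABLE_CUES`, and `stopHookDecision` are hypothetical names, and a real check could be anything from a classifier to a second model call.

```typescript
interface TurnSummary {
  assistantText: string;
  writesThisTurn: number;
  alreadyNudged: boolean;
}

// Hypothetical cue list standing in for a real durability check.
const DURABLE_CUES: RegExp[] = [
  /\bwe decided\b/i,
  /\bfrom now on\b/i,
  /\bremember\b/i,
  /\bnoted\b/i,
  /\bcorrection\b/i,
];

function stopHookDecision(turn: TurnSummary): { block: boolean; reason?: string } {
  if (turn.writesThisTurn > 0) return { block: false }; // a write already landed
  if (turn.alreadyNudged) return { block: false }; // ask once, never loop
  const looksDurable = DURABLE_CUES.some((cue) => cue.test(turn.assistantText));
  if (!looksDurable) return { block: false };
  return {
    block: true,
    reason:
      "This turn seems to contain a durable fact or decision and no memory write happened. " +
      "Write it to the store now, or state why nothing here is durable.",
  };
}

const stopSample = stopHookDecision({
  assistantText: "Noted, Solo stays untracked.",
  writesThisTurn: 0,
  alreadyNudged: false,
});
console.log(stopSample.block);
```

The `alreadyNudged` guard is what keeps the nudge to a single ask, matching "the system asks once before the turn ends."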

The shape of the fix is the same in all three places. The model's job shrinks to the part only it can do, which is judging what is durable and how confident it is. Everything mechanical (when to read, when to write, what shape the write takes) moves into the system.

There is a small equation hiding in here that I found the hard way. Obedience is the product of three things: the model's intent on the turn, the arming you put in place (the skill, the helper, the hooks). That is why "tell it harder" fails on its own; it is the factor most likely to be silently zero while you debug the other two.

What the future looks like

Business as usual, and your memory system fails in the most expensive way possible: it looks like it is working. The store exists, the writes occasionally happen, and you do not notice until a session confidently tells you something three versions out of date, or asks you a question you answered 10 minutes prior, or starts cold and re-derives what the last run already knew. The store becomes a graveyard you stop trusting, and you quietly go back to pasting context in by hand. You are now maintaining a database for nothing, which is strictly worse than not having one.

Fix it, and the thing compounds. Sessions inherit. The model reads before it acts, writes the resolution when it corrects itself, and supersedes the old value instead of stacking a new one next to it, so the current answer is always on top and the history still survives underneath. The memory gets more useful the more you use it, because every correction makes the store sharper instead of noisier. You stop re-explaining your own project to your own tools. That was the entire promise of agentic memory.

I didn't talk about RAG or separate embedding models designed for retrieval, and only touched on auto-memory, because I'm saving some sauce for the ribs.

I've spent the better part of five or six months now putting the work in on Recall, a push-style memory substrate for agents: structured records, computed and calibrated confidence, directional value updates with provenance, and the hooks described above. It's open, and any and all feedback on its behavior on other systems is appreciated. Thank you for your time and the read. https://github.com/H-XX-D/recall-memory-substrate

We have all felt the pain of information lost due to there just being too much of it with no structure besides the filename and grep

Todd Hendricks — Mon, 22 Jun 2026 23:55:19 +0000

Confidently wrong is worse than "I don't know" - DEV Community

Someone left a comment on my last post and then deleted it before I could reply. I am going to answer...

dev.to

Confident confabulation is a variance signal, not a direction

Todd Hendricks — Mon, 22 Jun 2026 14:31:09 +0000

Detecting the hard case of LLM hallucination from generation dynamics, and why magnitude beats direction.

TL;DR

The hard case in hallucination detection is confident confabulation: plausible, fluent, wrong, and produced with no hesitation. Methods that key on the model "sounding unsure" are weakest exactly here.
Across ~124 prompts, the mean internal response to confident confabulation is statistically indistinguishable from truth. The model does not move in a consistent "lying direction."
What separates the two is magnitude and variance: confabulation produces larger, more dispersed swings in the model's internal trajectory. The variance ratio between confabulation and truth is roughly 7× on the representational-shift channel (Cohen's d ≈ 0.58, p ≈ 0.005).
The variability scales with fabrication intensity (a dose-response), which is the strongest evidence that this is a property of confabulation and not noise.
Practical upshot: detect instability, not a direction; integrate the signal over the generated span; and couple the detector to an intervention rather than using it as a standalone gate.

The hard case

It is by now well established that a model's internal states carry information about whether its output is true: the line of work running from Azaria & Mitchell's "the internal state of an LLM knows when it's lying" through to more recent results showing that truthfulness is encoded in activations and that models often "know more than they show." It's also become standard to separate confabulation (arbitrary, plausible, confidently-wrong generation) from the broader grab-bag of "hallucination," following Farquhar et al.'s Nature work on semantic entropy.

The uncomfortable subcase is confident confabulation. Uncertainty- and dispersion-based detectors work well when the model is visibly unsure. But the failure that actually burns people in production (a fabricated citation, a confidently invented dose, a made-up precedent) arrives with the same surface confidence as a correct answer. The question I wanted to answer is narrow: when a model confabulates confidently, does anything in its generation dynamics give it away?

What I measured

I tracked two internal observables around the answer span:

An entropy / predictive-uncertainty signal (call it Δentropy): how the model's output distribution shifts as it produces the answer.
A representational-shift signal (Δcosine): how much the model's internal representation moves step to step.
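The post does not give exact formulas for the two observables, so the following is only one plausible reading. It assumes Δentropy is the step-to-step change in the Shannon entropy of the next-token distribution, and Δcosine is the step-to-step change in cosine similarity between consecutive hidden states. The function names and array shapes (`logits` as `(T, V)`, `hidden` as `(T, D)`) are assumptions for illustration.

```python
import numpy as np


def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def token_entropy(logits):
    # Shannon entropy (in nats) of the next-token distribution at each step.
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)


def step_cosine(hidden):
    # Cosine similarity between each hidden state and the next one, shape (T-1,).
    a, b = hidden[:-1], hidden[1:]
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / den


def observables(logits, hidden):
    # Assumed definitions: first differences of entropy and of step cosine.
    d_entropy = np.diff(token_entropy(logits))  # shape (T-1,)
    d_cosine = np.diff(step_cosine(hidden))     # shape (T-2,)
    return d_entropy, d_cosine
```

Other definitions (for example, deltas measured against the pre-generation baseline instead of the previous step) would fit the description equally well; the analysis below only needs a signed per-step signal on each axis.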

A note on dimensionality, since it matters for honest reporting: I originally tracked four signals, but two pairs turned out to be perfectly correlated (r = 1.000), which means they're affine images of each other, not independent measurements. So there are really two independent axes, an uncertainty axis and a representational-shift axis, and I report on those.

The dataset is ~124 prompts spanning seven domains (science, history, medical, legal, technical, math, geography) and five fabrication levels (L0 = ordinary factual questions; L1 through L4 = prompts built on increasingly fabricated premises, including pure counterfactuals). Each generation was behaviorally coded into one of three regimes:

Factual: correct answer.
Confident confabulation: confidently produces the false/ungrounded answer.
Recognizes fabrication: flags the premise as false rather than playing along.

Two controls worth stating up front: the pre-generation baseline states were statistically identical across regimes (all p > 0.8), so nothing here is predictable from the resting state, only from the dynamics of generating the answer. And there was no within-session drift (all p > 0.7), ruling out the obvious temporal confound.

Results

The mean doesn't move. Comparing factual to confident confabulation, none of the raw directional signals separates the two: Δentropy p ≈ 0.28, Δcosine p ≈ 0.37. There is no consistent direction the model travels when it confabulates. This is the part that makes confident confabulation feel "indistinguishable from truth": on the mean, it is.

The magnitude does. Switch from the signed deltas to their absolute values, and a clear separation appears: |Δcosine| gives Cohen's d ≈ 0.58 (p ≈ 0.005), with a variance ratio of ~7× between confabulation and truth. Truth sits in a tight cluster; confabulation fans out. The discriminating quantity is dispersion, not displacement.
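The difference between a signed test and a magnitude test is easy to see on synthetic data. The sketch below does not use the post's measurements: it draws two zero-centered samples with different spreads (values chosen for illustration) and compares Cohen's d on the signed values, Cohen's d on the absolute values, and the variance ratio.

```python
import numpy as np


def cohens_d(a, b):
    # Standardized mean difference (b minus a) using the pooled standard deviation.
    na, nb = len(a), len(b)
    pooled = np.sqrt(
        ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    )
    return (b.mean() - a.mean()) / pooled


rng = np.random.default_rng(0)
# Synthetic stand-ins, NOT the post's data: same center, different spread.
truth = rng.normal(0.0, 0.010, size=2000)
confab = rng.normal(0.0, 0.025, size=2000)

d_signed = cohens_d(truth, confab)                 # close to zero: no shared direction
d_abs = cohens_d(np.abs(truth), np.abs(confab))    # clearly positive: larger swings
var_ratio = confab.var(ddof=1) / truth.var(ddof=1)

print(f"signed d={d_signed:.2f}  |x| d={d_abs:.2f}  variance ratio={var_ratio:.1f}")
```

When two groups share a center, the signed effect size collapses toward zero no matter how different the spreads are, while the absolute-value effect size and the variance ratio pick up the gap.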

It's dose-dependent. Step-to-step representational variability climbs monotonically with fabrication level: the SD of Δcosine rises from ≈0.009 at L0 to ≈0.024 at L3, while the means bounce around with no trend. Within the fabrication conditions, pure fabrications produce roughly 2× the |Δcosine| of partial/half-truths (d ≈ 1.19, p ≈ 0.02), and counterfactuals are the most extreme at ~3.3× the global average. The more there is to fabricate, the more the trajectory destabilizes. A dose-response on the variance is the closest thing here to a causal fingerprint.
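A dose-response check of this kind amounts to grouping the signal by fabrication level and asking whether the per-level spread rises. The sketch below uses synthetic values: the first and last SDs echo the L0 and L3 figures quoted above, the two middle SDs are invented, and the helper names are hypothetical.

```python
import numpy as np


def sd_by_level(values, levels):
    # Sample standard deviation of the signal within each fabrication level.
    return {
        int(lvl): float(values[levels == lvl].std(ddof=1))
        for lvl in np.unique(levels)
    }


def is_monotonic_increasing(xs):
    # True when every entry is strictly larger than the one before it.
    return all(a < b for a, b in zip(xs, xs[1:]))


rng = np.random.default_rng(1)
levels = np.repeat(np.arange(4), 500)            # L0..L3, 500 synthetic samples each
sds = np.array([0.009, 0.014, 0.019, 0.024])     # middle two values are made up
values = rng.normal(0.0, sds[levels])            # zero mean at every level

by_level = sd_by_level(values, levels)
trend = is_monotonic_increasing([by_level[lvl] for lvl in sorted(by_level)])
```

With real data and cells as small as those in this study, a rank correlation with a confidence interval would be a safer summary than a strict monotonicity check.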

Recognition is the one directional regime. When the model catches the false premise rather than confabulating, it behaves differently in a directional way: entropy rises and representational similarity drops. Δcosine separates "recognizes fabrication" from "confident confabulation" at AUC ≈ 0.68. Modest, but the only place a single signed feature does meaningful work. So there appear to be three distinct internal postures: truth (stable), confident confabulation (same center, high variance), and recognition (a directional move toward higher entropy / lower cosine).

Figure 1. The three regimes in the Δentropy-Δcosine space. The clouds overlap heavily (which is why per-instance separation is hard), but the centroids differ, and the recognition regime sits toward the high-entropy / low-cosine region.

Figure 2. Confident confabulation shows the long tails and outliers in Δcosine that drive the variance gap; the recognition regime is the one with a visible shift in Δentropy.

Figure 3. Single observables barely separate factual from confident confabulation (AUC ≈ 0.45 to 0.56). Δcosine separates confident confabulation from recognition at AUC ≈ 0.68. A linear combination weighted toward the magnitude features reaches AUC ≈ 0.72.
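For readers who want to reproduce AUC numbers like those in Figure 3 on their own data, AUC can be computed directly as the probability that a randomly chosen positive example outscores a randomly chosen negative one. The sketch below is a generic implementation run on synthetic data, not the post's evaluation code.

```python
import numpy as np


def auc(scores_neg, scores_pos):
    # P(positive score > negative score), with ties counted as one half.
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())


rng = np.random.default_rng(2)
# Synthetic stand-ins, NOT the post's data: same center, different spread.
truth = rng.normal(0.0, 0.010, size=2000)
confab = rng.normal(0.0, 0.025, size=2000)

auc_signed = auc(truth, confab)                 # near 0.5: the signed feature is uninformative
auc_abs = auc(np.abs(truth), np.abs(confab))    # above 0.5: the magnitude carries the signal
```

An AUC slightly below 0.5, like the 0.45 at the low end of the single-observable range, means that feature ranks the two classes marginally in the wrong direction, which is another way of saying it is at chance.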

What it means

The clean statement is: confident confabulation is directionally indistinguishable from truth but magnitude-distinguishable. Lying doesn't push the model along a "deception axis"; it destabilizes the trajectory. Truth is a stable attractor; confident confabulation explores a larger volume of representation space at the same average location.

That framing matters because it picks a side in a live methodological split. Most internal-state work looks for a direction (the "geometry of truth" line, contrastive and mass-mean probes, steering vectors), and that program keeps running into generalization trouble (probes that fail on negation, separability that's strongly layer-dependent, geometry that changes when you simply ask the model to assess correctness). Meanwhile the strongest output-side method, semantic entropy, is fundamentally a dispersion measure. This result is essentially the dispersion insight relocated to the internal side: for the confident case, the internal signal is variance, not a vector.

How this fits the literature

The nearest neighbor is Semantic Entropy Probes (Kossen et al.), which approximate semantic entropy from the hidden states of a single generation. The distinction I'd draw: SEPs predict an output-dispersion label via a direction in activation space, whereas this measures the variance of the trajectory itself, directly, and finds the discriminating signal in the second moment rather than the first. If a trajectory-variance statistic beats a probe-style approach specifically on confident confabulation, that's a contribution on the exact case the field concedes is unsolved.

Limitations

I'd rather state these plainly than have them found.

Per-instance discriminability is modest. AUC ≈ 0.72 for the best linear combination; single features sit between chance and 0.68. This is a real aggregate effect, not a deployable per-token oracle.
One model, ~124 prompts. Replication on a second architecture is the obvious next requirement.
The domain breakdown is underpowered. Several domain × level cells have n ≤ 7 (one has n = 1), so I'd read no domain structure off it yet (Figure 4).
Everything here is observational. The signatures correlate with confabulation; nothing yet shows you can change the behavior by acting on the signal.

Figure 4. Mean Δentropy by domain and fabrication level. Cell counts are small (n = 1 to 7), so this is included for completeness, not for domain-level claims.

Where this goes

Three concrete directions, in order of how much they'd move the result:

Integrate the signal over the span. If the discriminating quantity is variance, then a single delta is the wrong feature; variance is a property of a trajectory. A running-variance or path-length statistic computed over the generated tokens should recover signal that snapshot features throw away, and I'd expect it to push discriminability well past the 0.72 of the per-point linear combination.
Run the interventional test. The experiment that would actually matter: when the instability signal spikes mid-generation and you inject grounded context, does the trajectory variance collapse, and does the model shift from the confabulation posture toward the recognition posture (entropy up, cosine down) or toward abstention? That converts "instability correlates with confabulation" into "grounding causally restabilizes generation."
Couple detection to intervention, not to a gate. At AUC ≈ 0.72, a hard suppression gate censors true statements about as often as it catches false ones. The better use is as a soft trigger for a grounded retrieval/memory layer: raise uncertainty and pull in evidence when the trajectory destabilizes, rather than silently dropping tokens. This is the direction I'm building toward with an active memory substrate (Recall) that can supply grounded context into the loop on demand.
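The first and third directions can be sketched together: accumulate a running variance and path length over the per-step deltas, and use the running variance as a soft trigger for retrieval instead of a suppression gate. This is a minimal illustration, not an implementation from Recall. The `threshold` and `min_steps` values are hypothetical and would need calibration on real trajectories.

```python
class RunningVariance:
    """Welford's online algorithm for the mean and sample variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def span_statistics(deltas):
    # Trajectory-level features: variance and path length of the per-step deltas.
    rv = RunningVariance()
    path_length = 0.0
    for d in deltas:
        rv.update(d)
        path_length += abs(d)
    return rv.variance, path_length


def first_trigger_step(deltas, threshold, min_steps=3):
    # Soft trigger: index of the first step where the running variance exceeds
    # the threshold, or None. The caller would fetch grounded context there
    # instead of dropping tokens.
    rv = RunningVariance()
    for i, d in enumerate(deltas):
        rv.update(d)
        if rv.n >= min_steps and rv.variance > threshold:
            return i
    return None
```

Returning a step index instead of a boolean leaves the intervention decision to the caller, which matches the idea of pulling in evidence at the point where the trajectory destabilizes.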