Todd Hendricks

Posted on Jul 2 • Edited on Jul 4

I asked Fable 5 which memory it would rather run on

#ai #llm #agents #buildinpublic

I switched my terminal over to Fable 5, Anthropic's new frontier model, and put a blunt question to it: you have run on Claude Code's built-in auto-memory, and you are running on my memory substrate right now. Can you actually tell a difference?

It said yes, and it brought a receipt from twenty minutes earlier in the same session.

I had asked it a mundane question: what is my next scheduled blog post. My content calendar lives in the memory graph, and the graph held two versions of it: an older cell with the original ordering and a newer cell that resequenced it, linked by a contradicts edge. The compile packet the model reads at the start of a turn includes a conflicts section, and that section flagged the old ordering as challenged before the model ever quoted from it. Its own summary: with auto-memory, the stale calendar comes back as flat text with nothing marking it superseded, and I would have confidently given you the wrong post.

That is the difference in one sentence. One system hands the model facts with epistemic state attached: confidence, challenged, stale, who wrote it and when. The other hands it text it has to take at face value.
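A minimal sketch of what "facts with epistemic state attached" could look like when rendered into a prompt. The field names, the `render_for_prompt` helper, and the sample values are all hypothetical, chosen for illustration; this is not Recall's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Cell:
    """One remembered fact plus the state a reader needs to judge it (hypothetical schema)."""
    cell_id: str
    text: str
    confidence: float          # 0.0 to 1.0, as recorded at write time
    author: str
    written_at: datetime
    challenged: bool = False   # True once a newer cell contradicts this one
    stale: bool = False


def render_for_prompt(cell: Cell) -> str:
    """Format a cell so its status is visible before its text is quoted."""
    flags = [name for name, on in (("CHALLENGED", cell.challenged), ("STALE", cell.stale)) if on]
    status = ",".join(flags) if flags else "ok"
    return (f"[{cell.cell_id} | {status} | conf={cell.confidence:.2f} | "
            f"{cell.author} @ {cell.written_at:%Y-%m-%d}] {cell.text}")


# Made-up sample data: an older calendar cell that has since been challenged.
old = Cell("c1", "Next post: benchmarks", 0.9, "user",
           datetime(2025, 6, 1, tzinfo=timezone.utc), challenged=True)
print(render_for_prompt(old))
```

A flat file would carry only the text at the end of that line; everything inside the brackets is what the model would otherwise have to guess.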

Discount the self-report first

Before quoting a model's opinion about anything, name the confound: models tend to agree with the framing of whoever is asking, and I built the thing I was asking about. So treat the interview as color, not evidence.

The evidence is a store-level battery with no LLM in the scoring loop. Seven deterministic scenarios, run against two stores: a faithful flat-file model of Claude Code's auto-memory (a real MEMORY.md index plus per-fact .md files with metadata frontmatter, overwrite-in-place on correction) and the real recall CLI on an isolated database.

Audited score: Recall 6.5 of 7, auto-memory 3.5 of 7. The audit is the part I trust most. The first run scored 7 against 2.5. Three independent agents then reviewed the harness adversarially, called it mildly biased toward Recall, found a genuine false positive, and the corrected run is the number I keep. A benchmark that got less flattering after an audit is worth more than one that never had one.

And the honest reading of the split: auto-memory ties on basics. In a separate agent-level A/B, both stores went 3 for 3 on simple current-value questions. Flat prose can simulate simple supersession fine. The gap opens at scale and on corrections. My graph currently holds about 1,180 cells with 86 tracked contradictions. A flat file at that size is a pile of sentences, and no sentence in it can answer "what changed since Tuesday" or "which of these beliefs is contested," because those are questions about the store's state over time, not about any fact inside it.

The four mechanisms that do the work

Supersession instead of overwrite. When a fact is corrected, auto-memory overwrites the old file in place. The history is gone, and a corrected fact is indistinguishable from a never-wrong one. In the graph, the new cell carries a contradicts edge to the old one; the old cell's effective confidence drops and it stays visible as superseded. The correction is recorded as a resolution, not a deletion.
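The supersession idea can be sketched in a few lines. This is an illustration under stated assumptions: the `Graph` class, the `contradicted_by` field, and the halving penalty are all invented here, and the post does not say how Recall actually computes effective confidence.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cell:
    cell_id: str
    text: str
    confidence: float
    contradicted_by: list[str] = field(default_factory=list)


class Graph:
    PENALTY = 0.5  # assumed multiplier per contradicting cell, not Recall's real rule

    def __init__(self) -> None:
        self.cells: dict[str, Cell] = {}

    def write(self, cell: Cell, contradicts: str | None = None) -> None:
        """Add a cell; a correction links to the old cell instead of replacing it."""
        self.cells[cell.cell_id] = cell
        if contradicts is not None:
            self.cells[contradicts].contradicted_by.append(cell.cell_id)

    def effective_confidence(self, cell_id: str) -> float:
        """Written confidence, reduced once for each cell that contradicts this one."""
        cell = self.cells[cell_id]
        return cell.confidence * self.PENALTY ** len(cell.contradicted_by)

    def is_superseded(self, cell_id: str) -> bool:
        return bool(self.cells[cell_id].contradicted_by)


g = Graph()
g.write(Cell("old", "Post A, then post B", 0.9))
g.write(Cell("new", "Post B, then post A", 0.9), contradicts="old")
print(g.is_superseded("old"), g.effective_confidence("old"))
```

The old cell is still in `g.cells` after the correction, which is the whole point: the history survives and the reader can see that it was wrong.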

Per-prompt push instead of load-at-start. A hook compiles a small index of relevant cells, ids and staleness flags included, into every prompt. The model does not have to remember to look; the current state of the graph re-enters its context each turn. Auto-memory loads once at session start and then drifts.
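A rough sketch of a per-prompt push, assuming a hook that can rewrite the prompt before the model sees it. The function names, the `<memory>` wrapper, and the keyword-overlap relevance test are placeholders for illustration; the post does not describe how the real hook ranks cells.

```python
from __future__ import annotations


def compile_packet(prompt: str, cells: list[dict], limit: int = 5) -> str:
    """Build a small index of cells relevant to this prompt.

    Relevance here is naive keyword overlap, an assumption made for brevity.
    """
    words = set(prompt.lower().split())
    scored = []
    for cell in cells:
        overlap = len(words & set(cell["text"].lower().split()))
        if overlap:
            scored.append((overlap, cell))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Ids and staleness flags go in; the full text is truncated to a preview.
    lines = [f"{c['id']} stale={c['stale']} :: {c['text'][:40]}" for _, c in scored[:limit]]
    return "\n".join(lines)


def on_prompt(prompt: str, cells: list[dict]) -> str:
    """Hypothetical per-turn hook: prepend the packet to every prompt."""
    packet = compile_packet(prompt, cells)
    return f"<memory>\n{packet}\n</memory>\n{prompt}" if packet else prompt


cells = [
    {"id": "c1", "text": "blog calendar ordering", "stale": True},
    {"id": "c2", "text": "database password rotation", "stale": False},
]
print(on_prompt("what is next on the blog calendar", cells))
```

Because the hook runs on every turn, a cell that went stale five minutes ago shows up flagged on the very next prompt, with no action required from the model.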

Ids-first reads instead of whole-file loads. The compile packet returns handles. The model expands the two cells it needs instead of ingesting the whole store and hoping the relevant paragraph survives.
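As a sketch of the ids-first pattern, assuming a store keyed by cell id: the first call returns only handles, and the second expands just the ones the model asks for. `compile_handles`, `expand`, and the sample store are hypothetical names and data.

```python
from __future__ import annotations

# Made-up store contents for illustration.
store = {
    "c1": {"topic": "calendar", "text": "Original post ordering"},
    "c2": {"topic": "calendar", "text": "Resequenced post ordering"},
    "c3": {"topic": "infra", "text": "Backup schedule"},
}


def compile_handles(store: dict[str, dict], topic: str) -> list[str]:
    """Return only the ids of matching cells, not their contents."""
    return sorted(cid for cid, cell in store.items() if cell["topic"] == topic)


def expand(store: dict[str, dict], ids: list[str]) -> dict[str, str]:
    """Fetch full text for the handles the caller actually wants."""
    return {cid: store[cid]["text"] for cid in ids}


handles = compile_handles(store, "calendar")
print(handles)
print(expand(store, handles[-1:]))
```

The context cost scales with the number of cells expanded, not with the size of the store.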

Questions about the store itself. What changed in the last day. What is stale. What is contested. These are answered by diff and health tools reading timestamps and edges. There is no flat-file equivalent, not because nobody wrote one, but because the file does not contain the information.
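To make the point concrete, here is a minimal sketch of two store-level queries, assuming each cell records an `updated_at` timestamp and edges are `(kind, source, target)` tuples. Both the layout and the function names are assumptions for illustration, not the real diff or health tools.

```python
from __future__ import annotations

from datetime import datetime, timezone

UTC = timezone.utc

# Made-up sample data.
cells = {
    "c1": {"updated_at": datetime(2025, 7, 1, tzinfo=UTC)},
    "c2": {"updated_at": datetime(2025, 7, 3, tzinfo=UTC)},
}
edges = [("contradicts", "c2", "c1"), ("supports", "c3", "c2")]


def changed_since(cells: dict[str, dict], cutoff: datetime) -> list[str]:
    """Ids of cells written or updated at or after the cutoff."""
    return sorted(cid for cid, cell in cells.items() if cell["updated_at"] >= cutoff)


def contested(edges: list[tuple[str, str, str]]) -> list[str]:
    """Ids of cells that are the target of at least one contradicts edge."""
    return sorted({target for kind, _source, target in edges if kind == "contradicts"})


print(changed_since(cells, datetime(2025, 7, 2, tzinfo=UTC)))
print(contested(edges))
```

Neither function reads a single word of any fact. They only read timestamps and edges, which is exactly the information a flat file never stored.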

The long-run answer

Then the follow-up I actually cared about: for a long-running task, which would you rather have underneath you?

Its answer, compressed: what kills a long session is context compaction. The window gets summarized, the summary is lossy prose, and nothing marks what got dropped or corrected along the way. A memory that loads at session start does not help mid-task. The per-prompt push re-anchors the model after every compaction, from the graph rather than from whatever survived the summary. And long tasks accumulate corrections: something believed in hour one gets falsified in hour three, and the specific way long autonomous runs die is an agent confidently resuming from a belief nobody told it was stale.

It conceded the cost without being asked, which I appreciated. The write discipline burns tokens every turn, and on a ten-minute task it may never pay back. On a long run it amortizes, and the writes double as an audit trail of what the agent did and why.

One operational footnote if you want to reproduce any of this: Claude Code's native auto-memory shadows an external store while it is on. We tested arming the agent every way we could think of; it kept writing flat .md files regardless. CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 is the switch. The two do not coexist.
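A small guard along these lines can catch the shadow-store case before a run starts. It only inspects and copies environment variables; the helper names are hypothetical, and treating exactly the value 1 as "off" follows the setting quoted above rather than any documented list of accepted values.

```python
from __future__ import annotations

import os

FLAG = "CLAUDE_CODE_DISABLE_AUTO_MEMORY"


def auto_memory_disabled(env: dict[str, str] | None = None) -> bool:
    """True when the switch is set to "1" in the given (or current) environment."""
    source = os.environ if env is None else env
    return source.get(FLAG) == "1"


def env_with_auto_memory_off(env: dict[str, str] | None = None) -> dict[str, str]:
    """Copy of the environment with the switch set, for launching a child process."""
    child = dict(os.environ if env is None else env)
    child[FLAG] = "1"
    return child


if not auto_memory_disabled():
    print(f"warning: {FLAG} is not set to 1; native auto-memory may shadow the external store")
```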

The two questions, again

Strip the interview away and you are left with the two questions I keep coming back to. Is your agent actually using your memory, or a shadow store sitting next to it? And if a fact in that memory were wrong, would anything in the system know?

A self-report from a frontier model is a data point, not a verdict, and this is a field report from one stack, half of which I built. Run the two questions against your own setup and tell me where I have it wrong: https://github.com/H-XX-D/recall-memory-substrate

Top comments (12)

Mike Czerwinski • Jul 3

Triangulating from the other side. My substack post named markdown's three failure modes: loading (no retrieval router), precedence (files quietly disagree), staleness (a snapshot doesn't know it aged). Argued they weren't file-level bugs but abstraction-ceiling ones.

Your four mechanisms are what living above that ceiling looks like. Supersession-as-edge instead of overwrite. Per-prompt push instead of load-at-start. Id-handles instead of whole-file reads. Questions the store answers about itself. None of them can be simulated by a fifth markdown file, because they aren't about content. They're about relationships between content that the file abstraction has no place to carry.

Two things I'd steal specifically. The audit that made the benchmark less flattering than the first run. That is the receipt shape I keep pushing for, and "worth more than one that never had one" is exactly the ethos. And the CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 footnote. Shadow stores are a class of failure I hadn't named yet.

The two questions you leave people with are the honest ones. Adding a third: if a fact were wrong, would anything in the system know it needed to know, or does someone have to remember to ask.

Todd Hendricks • Jul 4

Corrections to wrong information get superseded at write time, and the standing programs (tripwire, quorum, trend) can track a fact losing support, with a trigger gate or slope trigger that notifies the model that the world has changed relative to what it took as truth when that fact was created.

Mike Czerwinski • Jul 5

A slope trigger on "losing support" is a good instrument, but worth being precise about what it's an instrument for. If "support" means agreement among raters who all read the same corpus, a declining slope tells you the model's own confidence is wobbling, not that the world moved. Those are different events and call for different corrections: wobbling confidence needs a tie-breaker, an independent reader; an actual world-change needs a new observation the standing corpus doesn't have yet.

The dangerous case is a fact that's still fully supported internally, every rater agrees, because nobody's fed the raters the new observation yet. The slope trigger stays flat right through the moment it should have fired, because nothing in the consensus itself changed, only the world outside it did. So the trigger probably needs two channels: one on internal consensus, which catches wobble, and one on the freshness of the sources feeding the raters, which catches staleness the consensus can't see because it's consensus about a stale snapshot.
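The two-channel idea can be sketched like this, assuming support is sampled as a number per tick and each fact knows the age of its newest source. The thresholds, the alert names, and the least-squares slope are illustrative choices, not anything either system is known to use.

```python
from __future__ import annotations


def slope(values: list[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def check_fact(support_history: list[float], source_age_days: float,
               slope_limit: float = -0.05, max_age_days: float = 30) -> list[str]:
    """Channel one watches internal consensus; channel two watches source freshness."""
    alerts = []
    if slope(support_history) < slope_limit:
        alerts.append("consensus_wobble")
    if source_age_days > max_age_days:
        alerts.append("stale_sources")
    return alerts


# Flat consensus over old sources: only the freshness channel fires.
print(check_fact([1.0, 1.0, 1.0], source_age_days=90))
```

The first call is the dangerous case: the slope is exactly zero, so a consensus-only trigger would stay silent.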

Todd Hendricks • Jul 5

Or a different standing program. I have them for quorum, drift, trend, or2, and2, and lut5. These are suggested at the write firewall, and the schema carries standing-program membership because its relations are already in context.

Mike Czerwinski • Jul 6

I want to make sure I'm tracking this right before responding to the specifics. A couple of the terms here (quorum drift trend, the or2/and2/lut5 markers) aren't ones I can confidently parse. If I'm reading the shape correctly: you're proposing the schema itself carries standing membership in a drift-tracking program at write time, so the relations needed to evaluate later staleness are already sitting in context rather than reconstructed after the fact. That's a real difference from the two-channel split I described: front-loading the tracking relationship at write time instead of computing it at read time. If that's the intent, the tradeoff I'd want to understand is what happens when the program itself needs to change. Does a fact written under program membership V1 silently keep evaluating against V1 rules forever, or does the program versioning also need its own staleness check?

Todd Hendricks • Jul 6

The firewall is attached to the end-turn hook and requires the model to replace every default value in the schema (each default is an instruction describing what that key does). Part of that is the daemon controls. Because the model is still carrying the fresh context it just used to do everything, it knows which cells the incoming prompt delta is affecting and which standing programs the other cells are members of. And because session start primes the context window with clear instructions and rails, it understands how and when to connect these subprograms on every turn. So as exchanges happen and work gets done, it is accumulating triggers, trends with triggers, trend flags, and gates that drive notifications: user lead, email shot, Slack blast, and so on. There is a lot you can do, really, and it is just arithmetic, so it is sub-millisecond fast on a CPU. I'm up to around 6,300 cells with 100 to 200 tripwires active. The global DB scopes to the project DB, then the folder it is in, then the path back to home, so everything remains accessible but takes an extra step to reach.

Todd Hendricks • Jul 6

Auto-memory for Claude and Codex has been wiped and off for three-plus weeks, and I don't remember what it was like without this running. I would be handicapped hard having to fall back.

Mike Czerwinski • Jul 6

I read through your other replies and skimmed the AURA repo, and I want to be honest before I try to respond: the vocabulary (cells, tripwires, standing programs, MAL, HAL, primitives, gates, subprograms) may cohere internally, but from outside there's no anchor point, every term is defined by others in the same set. Before I can respond to any specific claim, I'd need one paragraph that reads for someone who hasn't been in your head. What is a "cell," what mutates it, what is a "tripwire" (a database trigger, a scheduled query, a hook), what problem does the composed system solve that couldn't be solved without inventing the terms. Not a critique of the ideas, I can't tell yet, but a translation cost that's currently entirely on your reader.

Todd Hendricks • Jul 7

That is an outstanding idea. HAL is the Hardware Abstraction Layer from LinuxCNC. It ties the physical machine to the digital control plane of a CNC robot, and it is one writer, many readers, for safety and deterministic, traceable machine behavior.

HAL example:
net spindle-on motion.digital-out-00 parport.0.pin-17-out

Reads as: net assigns the signal generated by the digital component pin (motion.digital-out-00) to the signal name (spindle-on) and sends it to the physical hardware pin of the machine (parport.0.pin-17-out).

or2 is a component like the motion one above. It is a predefined program/script that connects to two separate signals, and if either one triggers, it creates a new third signal that is sent to something else, like an LED, or that turns on dust collection. and2 is similar but requires both signals to be actively triggered before it sends a new signal.

MAL is a term I coined: Memory Abstraction Layer. Its function is to operationalize the keys:values of the JSON schema stored as a SQLite row of the memory database.

The main difference is that MAL is many writers, one reader. Both compose as a single line with defined notation, using symbols like . - _ < > and sentence structure.

MAL example:

addf watch tick [topics: api-status] [measure: effective_confidence] [delta: 0.20] [limit: 50]

Reads as: addf watch tick schedules a watchdog on the tick loop that tracks whether the effective confidence of the cells (SQLite rows) tagged with "api-status" changes by plus or minus 0.20 since the previous turn, with a limit of 50 members. When it does, that sentence triggers a flag to notify the model that the truth of its stored memory state just changed by a large amount.

Same example in Python:

def watchdog_mal(query, delta=0.15, measure="effective_confidence", concern_target=None):
    """Compose a MAL watchdog line from its parts."""
    parts = [
        "addf watch tick",
        f"[query: {query}]",
        f"[measure: {measure}]",
        f"[delta: {delta}]",
    ]
    # concernTarget is optional and appended only when given
    if concern_target:
        parts.append(f"[concernTarget: {concern_target}]")
    return " ".join(parts)

print(watchdog_mal("payment gateway outage", concern_target="checkout"))
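For readers outside the project, here is a hedged sketch of what evaluating such a watchdog on a tick might involve. It assumes the runtime keeps the previous and current value of the measure per cell, and it reads "limit" as a cap on how many member cells are checked. `evaluate_watch` and its arguments are hypothetical and not part of MAL.

```python
from __future__ import annotations


def evaluate_watch(previous: dict[str, float], current: dict[str, float],
                   tagged_ids: list[str], delta: float = 0.20, limit: int = 50) -> list[str]:
    """Return ids of watched cells whose measure moved by at least `delta` since the last tick."""
    flagged = []
    for cell_id in tagged_ids[:limit]:
        if cell_id in previous and cell_id in current:
            if abs(current[cell_id] - previous[cell_id]) >= delta:
                flagged.append(cell_id)
    return flagged


# Made-up effective_confidence values for two cells tagged "api-status".
previous = {"a": 0.9, "b": 0.5}
current = {"a": 0.6, "b": 0.45}
print(evaluate_watch(previous, current, ["a", "b"]))
```

This is plain arithmetic over stored numbers, which fits the claim that these checks are cheap to run on a CPU.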

You gave me a great idea. This is directly usable as a framework with dialects. GAL: Graph Abstraction Layer.
TY

Mike Czerwinski • Jul 7

Cells + tripwires + programs external to cores is the closest anyone in this thread has come to the shape I was pointing at. Programs composed from schema keys, computed on ticks after model write turns, that split is what markdown alone cannot express: store is data, retrieval is code, code does not mutate what it reads.

One question if you have not solved it: bi-temporal timestamps on cells. Written-at vs last-confirmed-against-current-schema. Schema drift is the silent failure. A cell true when the tripwire fired and still on disk after the tripwire evolved is invisible to both store and runtime unless one carries both dates.
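A minimal sketch of the bi-temporal check, assuming each cell stores both a written-at time and the time it was last confirmed, and that the runtime knows when the schema last changed. All names and dates here are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone

UTC = timezone.utc


@dataclass
class Cell:
    cell_id: str
    written_at: datetime     # when the fact was recorded
    confirmed_at: datetime   # when it was last checked against the schema in force


def needs_reconfirmation(cells: list[Cell], schema_changed_at: datetime) -> list[str]:
    """Ids of cells last confirmed before the most recent schema change."""
    return [c.cell_id for c in cells if c.confirmed_at < schema_changed_at]


# Made-up sample data: c1 was never re-checked after the schema moved on.
cells = [
    Cell("c1", datetime(2025, 6, 1, tzinfo=UTC), datetime(2025, 6, 1, tzinfo=UTC)),
    Cell("c2", datetime(2025, 6, 1, tzinfo=UTC), datetime(2025, 7, 5, tzinfo=UTC)),
]
print(needs_reconfirmation(cells, schema_changed_at=datetime(2025, 7, 1, tzinfo=UTC)))
```

With only a written-at date, c1 and c2 would be indistinguishable; the second timestamp is what makes the drift visible.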

Todd Hendricks • Jul 6

The programs are constructed from the schema keys, so they are freely composable, because they are outside the cores and don't mutate the store. They are computed on ticks, which follow the model write turns. Need another one? Write a new one. That is where the MAL idea sprang from: the programs are the HAL components, the primitives are pins, and the signals are the values being strung together. Run them.
