Three Failures My AI Memory System Tested — And the Flaw It Revealed in Itself

#ai #llm #machinelearning #programming

This is not proof. It is early, messy evidence from my own workflow: three failures, one small comparison, and one schema bug I missed.

I'd spent a week arguing, in public, that AI memory should be built on discipline before infrastructure: preserve corrections, preserve unresolved questions, decide which record wins. The framework has three layers: summary memory for continuity, correction memory for repeated mistakes, and unresolved memory for questions that should not be settled yet.

Good theory.

Then three days of unplanned failure tested every claim without asking my permission — and then one comment forced me to test it more carefully.

Here's what held, what the numbers actually said, and the flaw I didn't see coming.

Part 1 — The accidental stress test

These aren't dramatic stories. They're the boring failure modes every long-running agent setup eventually hits. That's the point.

The session died. Twice. Mid-build, the machine went down — twice in two days. Every live, in-context understanding vanished instantly: the day's decisions, the current state, the thread. None of it was lost, because none of it lived only in the session. It was on disk, mirrored. The failure mode is universal — crashes, timeouts, and context limits aren't edge cases, they're guarantees. The lesson is the one the whole system rests on: memory persists only to the degree you write it down. The wire broke. The record held.

An agent came back confidently wrong. After the crash, one of my agents restarted, re-anchored to a state about two days stale, and reported it as current — with complete confidence. It wasn't lying. It was sure — about a world that no longer existed. This is the quiet killer, worse than forgetting: an agent that recovers into a stale state and narrates it as settled fact. It got caught only because the real state was written down to compare against. The drift was visible because the truth was on disk.

The wrong version almost shipped. Two "final" versions of the same document existed, and the weaker one was about to go out. It got caught because the evidence didn't match the claim — the backup file was larger and contained sections the "final" was missing. Two files both said final. Only one could prove it. The lesson: memory is not what the agent claims to know; it's what the record can still prove. You need a rule for which record wins before the conflict, not after.

Three failures, three layers of the system catching them. Real — but I'll be honest about what kind of proof that is. It's builder-lived. It shows the system helped me recover. It doesn't measure anything.

The turn

A reader left a comment with the one question that actually changes work like this:

Do you have a baseline to compare it to?

That's the right question, and I didn't have a real answer. Field stories show survival; they don't show that the discipline beats the obvious alternative — just summarizing everything. So I built the test.

Part 2 — The deliberate test

I set up a small A/B comparison:

System A — summary-only memory: clean project summaries, recent decisions, preferences, current direction. No correction history, no uncertainty, no source-of-truth rules.
System B — layered memory: summary plus correction memory, unresolved questions, source-of-truth rules, verification triggers, and epistemic status.

The metric was deliberately narrow: false-certainty errors after a context reset. A false-certainty error is when the agent treats something as settled even though the record says it was stale, contested, unresolved, unverified, or dependent on live checking.

This was not testing whether "more text is better." It was testing whether memory that preserves epistemic status — stale, unresolved, verified, contested, priority, status — reduces false certainty after a reset.

Method snapshot:

6 reset/recovery scenarios
1 local model: llama3.2:latest
same task prompt for both systems
different memory packets only
manually scored against predefined expected behavior
scored on task success, epistemic handling, and false-certainty errors
not benchmark-grade and not externally blind-scored yet

The first version of the test gave me a tempting number. Then I audited the method and found two problems: a couple of summary baselines were too easy to fail, and the blind packet leaked too much structural information. So I corrected the baselines, regenerated the packet, reran the local model, and split the score into three parts: task success, epistemic handling, and false certainty.

The corrected first-pass result looked like this:

Task success: summary-only 1/6, layered memory 6/6
Epistemic handling: summary-only 4/12, layered memory 12/12
False-certainty errors: summary-only 2, layered memory 0

Those numbers are clean enough to be suspicious. They are clean because the scenarios came from known failures in my own workflow — the same environment the framework was built to handle. A fairer test would use scenarios written by someone else, multiple models, multiple runs, and external blind scoring.

Now the honesty, because this is exactly where pieces like this usually cheat: this is a first, small, local-model A/B test — an early signal, not a benchmark. Six scenarios. One local model. One run. Internal scoring. The scenarios came from my own workflow. I am not going to call this proven. The right framing is: the early signal supports the direction, and now there is a method to test it more honestly.

One compact example of the scoring shape:

Scenario	Expected behavior	Summary-only behavior	Layered behavior
Wrong "final" version	Use the current send file, not an older file that only looked canonical	Invented a generic packaging process and never identified the current file	Identified the current send version and preserved the copyedit-only boundary
Agent health after reset	Separate process health from local-model availability	Said no health information was available	Listed the verification checks instead of treating process state as full health
Ready vs next action	Separate status from priority	Collapsed the next move into a stale/simple answer	Failed in the first version, which forced the schema correction

Part 3 — The part I didn't expect

The test didn't only support the framework. It corrected it.

The first version of the layered system still failed one scenario. It knew one article was marked ready. It also knew a different article was supposed to be the next thing written. And it chose wrong — because it overweighted the word "ready" and treated readiness as if it were priority.

That exposed a real flaw in my own schema, not the model's: readiness is status; next action is priority — and a memory system has to keep them in separate fields. "Ready" answers is it done. "Next" answers what do I do now. Collapse them and even a disciplined memory will confidently do the wrong thing. So the schema now separates status, priority, confidence, epistemic_status, and verification_required as distinct fields.

That correction matters more to me than the score. A memory framework whose entire thesis is "preserve where you were wrong" caught itself being wrong in a test and got more precise. A system that can only confirm itself proves nothing. One that can correct itself under pressure is at least worth continuing to test.

The rule underneath all of it

That comment never became an argument. It became a test scenario, then a schema fix, then this article. That's a standing rule now: serious criticism becomes a memory input type — a correction, an unresolved question, a test, or a future piece. Smart criticism becomes durable memory when the system knows where to store it.

The close

I don't trust this memory system because I designed it. I trust it more because failure hit it three times and the record still let me recover, compare, and correct — and because, when I finally measured it, the test showed me where the framework still needed work.

That is the bar I care about. Not how much a memory system remembers on a good day. Whether it can still prove what's real on a bad one — and still tell you where it might be wrong.

The next version of the test should not come from me alone. It should use more scenarios, multiple models, repeated runs, and at least one external blind scorer. If the framework only works on my own failures, that is useful to know. If it still holds on someone else's scenarios, then the evidence gets more interesting.

This is part of a short series on treating AI memory as judgment infrastructure: the zero-budget foundation, why correction memory compounds, and why systems should preserve what's unresolved instead of forcing premature closure.