The sequel isn't about running or stopping. It's about whether the memory survives the stop.
That line came from a comment thread on The Token Economy. Someone named Kalpaka had been reading through the series (the stop signal problem, the authority over interruption argument, the architectural gap between what agents can do and what they do reliably) and arrived at the question none of the pieces had answered.
You can solve the stop signal problem. You can build interruption authority into your architecture. You can define done before the session starts and give the agent a contract it has to satisfy before it terminates.
None of that answers what happens to the knowledge after the session ends.
The Graveyard Problem
Noah Vincent described it precisely for personal knowledge systems: "A week after consuming something, you could not explain what you learned if your life depended on it. The notes exist. The highlights are there. But you never use any of it."
He was describing Obsidian vaults. The same problem exists for agent memory systems, RAG pipelines, and every second brain anyone has ever built.
The issue is not storage. Storage is solved. The issue is that storage without evaluation is just accumulation. You end up with a beautifully organized graveyard — everything preserved, nothing improved, retrieval returning the noise alongside the signal because the system has no way to tell the difference.
Artem Zhutov ran 700 Claude Code sessions over three weeks and built a semantic search layer to make them retrievable. The /recall skill can surface what happened in any session by topic, time, or graph visualization. He solved the retrieval problem.
But retrieval without evaluation means the 700 sessions are equally weighted. The session where he made a breakthrough architectural decision and the session where he debugged a typo for an hour are both in the index. Both surface when you search. The system got larger. It did not get smarter.
This is the graveyard problem stated for agent memory. Not that the knowledge is lost. That it accumulates without the quality signal that would make it worth retrieving.
What the Evaluator Actually Is
The AIGNE paper — "Everything is Context: Agentic File System Abstraction for Context Engineering" — describes a complete cognitive architecture for AI agents. Four components: constructor (shrinks context to fit the current window), updater (swaps pieces in and out as the conversation progresses), evaluator (checks answers and updates memory based on what worked), and scratchpad separation (raw history, long-term memory, and short-lived working memory as distinct stores).
Most memory architecture discussions cover the constructor and the retrieval layer. Almost nobody builds the evaluator.
The evaluator is the component that decides what gets promoted from episodic memory — what happened — to semantic memory — when pattern X appears, do Y. The two kinds of memory compound at completely different rates and decay differently too. Episodic memory is retrievable history. Semantic memory is institutional knowledge. The architectural choice determines which moat you're building.
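A minimal sketch of that separation, assuming simple in-memory stores (the class and field names here are illustrative, not from the AIGNE paper):

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicEntry:
    """What happened: a raw, timestamped record of a session."""
    session_id: str
    timestamp: float
    transcript: str

@dataclass
class SemanticEntry:
    """What proved true: a pattern promoted out of episodic history."""
    pattern: str     # "when X appears, do Y"
    conditions: str  # the context under which it was validated
    source_sessions: list = field(default_factory=list)  # provenance

# Episodic entries accumulate; semantic entries are earned.
episodic = [EpisodicEntry("s-001", 1700000000.0, "debugged a typo for an hour")]
semantic = []  # stays empty until something survives evaluation
```

The point of keeping them as distinct stores is that retrieval can target the semantic layer first and fall back to episodic history only when asked.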
Without the evaluator, you have retrieval. With it, you have learning.
The difference: retrieval returns what you stored. Learning returns what proved true.
Kuro's Proof
Kuro built a perception-driven AI agent that runs 24/7. Every five minutes it wakes up, checks the environment, and decides whether anything needs attention. The problem: more than half its cycles ended with "nothing to do" — 50K tokens consumed per cycle to confirm that nothing was happening.
His solution was a triage layer — a local lightweight model that runs in 800 milliseconds and decides whether the expensive reasoning layer should fire at all. Hard rules handle the obvious cases in zero milliseconds. The lightweight model handles ambiguous cases. The expensive model only sees what passes both filters.
Production numbers across 626 decisions: 75.9% of triggers never reached the expensive model. The quality of remaining cycles went up because the expensive brain only saw what mattered.
Kuro solved the perception triage problem. The evaluator is the same pattern applied to knowledge.
Before a conversation gets promoted to semantic memory: hard rules first. Is this a duplicate of something indexed in the last hour? Skip. Is this a direct exchange that produced a decision that got used? Always process. Then lightweight triage — is this session high enough signal to warrant full embedding? Then full semantic processing only for what passes both filters.
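The layered filter described above can be sketched as a short function. This is a hypothetical illustration of the pattern, not Kuro's code: the dictionary keys and the 0.5 threshold are stand-ins, and the lightweight model is faked with a precomputed score.

```python
def hard_rules(session):
    """Deterministic, zero-cost checks. Returns 'skip', 'process', or None."""
    if session.get("duplicate_of_recent"):
        return "skip"     # duplicate of something indexed in the last hour
    if session.get("produced_used_decision"):
        return "process"  # direct exchange whose decision got used
    return None           # undecided: fall through to triage

def lightweight_triage(session):
    """Stand-in for a cheap local model: is this high enough signal to embed?"""
    return session.get("signal_score", 0.0) >= 0.5  # illustrative threshold

def should_promote_to_semantic(session):
    verdict = hard_rules(session)
    if verdict is not None:
        return verdict == "process"
    if not lightweight_triage(session):
        return False  # most sessions stop here
    return True       # only now pay for full semantic processing

sessions = [
    {"duplicate_of_recent": True},
    {"produced_used_decision": True},
    {"signal_score": 0.2},
    {"signal_score": 0.8},
]
print([should_promote_to_semantic(s) for s in sessions])
# [False, True, False, True]
```

The shape mirrors Kuro's perception triage exactly: cheap deterministic rules first, a lightweight model for the ambiguous middle, and the expensive step only for what passes both.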
The skip rate Kuro observed at the perception layer — 56% filtered at triage — will likely hold for knowledge too. Most sessions are noise. The minority that contain genuine learning are the ones worth promoting to semantic memory.
The evaluator doesn't store less. It promotes selectively. The difference is what gets retrieved when you need it.
What the Evaluator Needs to Know
The hard part isn't building the evaluator. It's deciding what signal it uses.
The obvious answer is engagement: sessions with more back-and-forth, longer exchanges, more follow-up questions. But engagement measures interest, not correctness. A session where you spent two hours debugging the wrong approach was highly engaging. It doesn't belong in semantic memory as a reliable pattern.
The better signal is validation: knowledge that proved correct under real conditions. The specific lesson about what breaks when you process financial filings at scale is worth promoting because it survived production. It was tested against real data, real edge cases, real failure modes, and it held.
This is what distinguishes semantic memory from episodic memory in the domain that matters. Not "this was an interesting conversation" but "this turned out to be true when it was tested."
The evaluator needs three signals:
Did the knowledge get used? If a session produced a decision that was applied in a subsequent session — referenced in a decision, applied to a problem, cited in writing — that's evidence of value. The decision survived contact with a real problem.
Did the knowledge hold up? If the pattern that emerged from a session was later contradicted by production evidence, that's evidence it shouldn't be promoted. The evaluator should demote as well as promote. Knowledge that fails in production gets flagged for review rather than silently remaining in semantic memory.
Is the knowledge specific enough to be useful? "Use conservative thresholds" is a platitude. "The threshold should be empirically derived from the first 50 production failures before any tuning begins, because false negatives are unrecoverable and false positives cost only an extra review cycle" is specific enough to act on. The evaluator should preserve the conditions that make the lesson true, not just the conclusion.
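The three signals combine into a simple decision rule. A rough sketch, with field names invented for illustration; the key design choice is that the evaluator returns "hold" rather than guessing on ambiguous cases:

```python
def evaluate_for_promotion(item):
    """Combine the three signals: 'promote', 'demote', or 'hold' for human review."""
    if item["contradicted_in_production"]:
        return "demote"   # signal 2: the knowledge did not hold up
    if not item["contains_conditions"]:
        return "hold"     # signal 3: too generic to act on
    if item["times_used"] > 0:
        return "promote"  # signal 1: survived contact with a real problem
    return "hold"         # unused and untested: wait for evidence

lesson = {"times_used": 3, "contradicted_in_production": False,
          "contains_conditions": True}
platitude = {"times_used": 5, "contradicted_in_production": False,
             "contains_conditions": False}
print(evaluate_for_promotion(lesson), evaluate_for_promotion(platitude))
# promote hold
```

Note that the platitude holds even with heavy usage: engagement alone never earns semantic status.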
Cornelius — building a fiction consistency system for novelists — arrived at the same problem from a different angle: "The system must know the difference between a violation and a discovery." A scene where a character breaks a world rule might be an error. Or it might be the generative mistake that becomes the best scene in the book. The evaluator has to distinguish between them.
So does Foundation's knowledge evaluator. A session that contradicts an established pattern might be noise. Or it might be the production failure that invalidates a previously reliable assumption.
Some of those calls require human judgment. The evaluator surfaces them. It doesn't make them.
The Provenance Requirement
The evaluator is what makes a knowledge commons different from a search index.
A search index returns documents that match your query. A semantic memory layer should return knowledge that proved true, with the context that makes it actionable, attributed to the specific conditions under which it was validated.
Not "here is everything about timeout handling." But "here is what held up under 500K daily API calls in production, with the specific edge cases that caused the original timeouts and the conditions under which the fix applies."
The provenance is not metadata. It is the knowledge. Without the conditions that made the lesson true, the lesson is a platitude. With them, it is scar tissue — the kind of specific, attributed, conditions-included knowledge that survived the averaging process that centralized AI systems apply to everything they train on.
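One way to make "provenance is the knowledge" concrete is to refuse to store a conclusion without its conditions. A sketch, using the threshold lesson from above as the example (the field names are mine, not from any particular system):

```python
from dataclasses import dataclass

@dataclass
class ValidatedLesson:
    conclusion: str    # the rule itself
    conditions: str    # when it applies; without this, it's a platitude
    evidence: str      # what it was tested against
    scope_limits: str  # where it is known NOT to apply

lesson = ValidatedLesson(
    conclusion="Derive thresholds empirically before any tuning begins",
    conditions="False negatives are unrecoverable; false positives cost "
               "only an extra review cycle",
    evidence="The first 50 production failures",
    scope_limits="Untested where false positives are expensive",
)
```

Every field except `conclusion` is what a search index would call metadata and throw away at ranking time. Here they are required fields: an entry without them never gets constructed.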
This is the knowledge collapse argument stated for personal memory systems. The models train on the averaged output of everything humans have written and return the median of all knowledge. The evaluator preserves the specific — the edges, the conditions, the validated exceptions — that the averaging process destroys.
Without it, you're building an archive. With it, you're building institutional memory that improves every time it's tested against real conditions.
What Building the Evaluator Actually Means
The pre-filter runs before any of this. Not every conversation reaches the evaluation layer. Hard rules first — direct exchange with a human, always process; ambient capture with no tracked concepts referenced, skip. Coarse content check second — does this conversation contain a decision, a question that changed direction, a pattern worth naming? Cheap signal, deterministic, costs nothing to run. The evaluation layer only fires on what passes both. Production evidence from perception triage systems suggests roughly 46% will skip at the pre-filter stage. The knowledge commons improves not by evaluating more but by evaluating less, better.
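The pre-filter described above is cheap enough to write as plain conditionals. A hypothetical sketch; the dictionary keys stand in for whatever capture metadata a real system would have:

```python
def pre_filter(convo):
    """Decide whether a conversation reaches the evaluation layer at all."""
    # Hard rules: deterministic, zero cost.
    if convo.get("direct_human_exchange"):
        return True   # always process
    if convo.get("ambient") and not convo.get("tracked_concepts"):
        return False  # ambient capture, no tracked concepts referenced: skip
    # Coarse content check: cheap, deterministic signals.
    has_substance = (
        convo.get("contains_decision")
        or convo.get("changed_direction")
        or convo.get("named_pattern")
    )
    return bool(has_substance)
```

Everything here is a boolean lookup, which is the point: the filter has to cost effectively nothing, because it runs on every conversation including the majority that will be skipped.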
For what passes the pre-filter, building the evaluator means three additions to any memory system:
A usage tracking layer. When a retrieved knowledge item gets used in a subsequent session — referenced in a decision, applied to a problem, cited in writing — that event gets logged. Usage is the primary signal that something is worth promoting.
A validation feedback loop. When a decision based on promoted knowledge turns out to be wrong, that event gets logged and the promotion gets reviewed. The evaluator demotes as well as promotes. When promoted knowledge gets invalidated by new production evidence, the old entry doesn't stay in semantic memory with a warning attached — it gets resolved. Knowledge that fails in production gets replaced rather than contradicted silently.
A specificity filter. Before any knowledge gets promoted to semantic memory, the evaluator checks whether it contains the conditions that make it true. Generic conclusions get returned to episodic memory with a note: too general to promote — needs production validation before it earns semantic status.
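The three additions fit together into one lifecycle. A minimal sketch, assuming in-memory dicts where a real system would persist state; the method and field names are illustrative:

```python
class Evaluator:
    """Usage tracking, validation feedback, and a specificity gate."""

    def __init__(self):
        self.semantic = {}   # promoted knowledge, keyed by id
        self.episodic = {}   # history that hasn't earned semantic status
        self.usage_log = []  # usage tracking layer

    def record_use(self, item_id):
        self.usage_log.append(item_id)

    def record_failure(self, item_id, replacement):
        # Validation feedback loop: an invalidated entry is replaced
        # outright, not left in place with a warning attached.
        self.semantic[item_id] = replacement

    def propose_promotion(self, item_id, lesson, has_conditions):
        # Specificity filter: generic conclusions stay episodic.
        if not has_conditions:
            self.episodic[item_id] = lesson  # too general to promote
            return False
        # Usage is the primary promotion signal.
        if item_id in self.usage_log:
            self.semantic[item_id] = lesson
            return True
        self.episodic[item_id] = lesson      # no usage evidence yet
        return False
```

A short run through the lifecycle: a lesson that was used and carries its conditions gets promoted; a platitude stays episodic; a production failure replaces the old entry rather than annotating it.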
None of this is automatic. The evaluator surfaces candidates for promotion and demotion. The human makes the final call on the ambiguous cases — the violations that might be discoveries, the patterns that might be noise, the lessons that might be wrong.
That's not a limitation. It's the architecture. The evaluator extends human judgment rather than replacing it. The cases that are obviously worth keeping get promoted without friction. The cases that require judgment get surfaced for human review rather than silently accumulating in a system that treats all knowledge as equally valid.
The Memory That Survives the Stop
Kalpaka's question is the right one to end on.
The stop signal problem is solvable. Interruption authority is a design decision. The contract that defines done before the session starts is a pattern that works in production.
But when the session ends — when the agent stops, when the context window closes, when the work is done for the day — what survives?
Without the evaluator: everything survives equally. The breakthrough and the dead end. The validated pattern and the assumption that turned out to be wrong. The specific, attributed, conditions-included knowledge and the platitude that sounds like knowledge but isn't.
With the evaluator: the scar tissue survives. The knowledge that was tested against real conditions and held up. The specific lessons with the specific contexts that make them true. The institutional memory that compounds because it improves every time it gets used rather than just accumulating every time something gets stored.
That's the difference between a memory system and a knowledge system. The memory system stores what happened. The knowledge system keeps what proved true.
The evaluator is what expands the cognitive light cone backward in time. Memory without evaluation gives you storage. Memory with evaluation gives you a light cone that extends further with every validated lesson — not just accumulating what happened but compounding what proved true.
The knowledge that holds up under pressure is the only knowledge worth keeping.
This is part of a series on what AI actually changes in software development. Previous pieces: The Gatekeeping Panic, The Meter Was Always Running, Who Said What to Whom, The Token Economy, I Shipped Broken Code and Wrote an Article About It.
Top comments (3)
That 700 session example stings. I've been there. Searching your own memory and getting back the typo hunt instead of the breakthrough 😅 Seems like there's an important element of "weighting" in memory for agents that needs to be explored
Weighting is the right word but it undersells the problem. Equal weighting isn't just inefficient. It's actively misleading. When you search and the typo hunt surfaces alongside the breakthrough, the system isn't just failing to help. It's training you to trust it less. You start second-guessing every result because you can't tell which kind you're getting.
The evaluator is the answer to that. Not smarter retrieval. A quality gate before storage. The session where you spent two hours on a typo never earns semantic status in the first place. It stays in episodic memory as history. The breakthrough gets promoted because it produced a decision that held up.
The weighting problem is actually a promotion problem. You don't weight sessions differently at retrieval time. You decide at evaluation time which ones deserve to be in the layer that retrieval reaches.
The graveyard problem nails it. We ran into exactly this — hundreds of agent sessions all equally weighted, retrieval returning noise alongside signal. The missing piece was a decay function tied to how often a memory gets recalled, not just stored.