DEV Community

Daniel Nwaneri
Building the Evaluator

The sequel isn't about running or stopping. It's about whether the memory survives the stop.

That line came from a comment thread on The Token Economy. Someone named Kalpaka had been reading through the series (the stop signal problem, the authority-over-interruption argument, the architectural gap between what agents can do and what they do reliably) and arrived at the question none of the pieces had answered.

You can solve the stop signal problem. You can build interruption authority into your architecture. You can define done before the session starts and give the agent a contract it has to satisfy before it terminates.

None of that answers what happens to the knowledge after the session ends.


The Graveyard Problem

Noah Vincent described it precisely for personal knowledge systems: "A week after consuming something, you could not explain what you learned if your life depended on it. The notes exist. The highlights are there. But you never use any of it."

He was describing Obsidian vaults. The same problem exists for agent memory systems, RAG pipelines, and every second brain anyone has ever built.

The issue is not storage. Storage is solved. The issue is that storage without evaluation is just accumulation. You end up with a beautifully organized graveyard — everything preserved, nothing improved, retrieval returning the noise alongside the signal because the system has no way to tell the difference.

Artem Zhutov ran 700 Claude Code sessions over three weeks and built a semantic search layer to make them retrievable. The /recall skill can surface what happened in any session by topic, time, or graph visualization. He solved the retrieval problem.

But retrieval without evaluation means the 700 sessions are equally weighted. The session where he made a breakthrough architectural decision and the session where he debugged a typo for an hour are both in the index. Both surface when you search. The system got larger. It did not get smarter.

This is the graveyard problem stated for agent memory. Not that the knowledge is lost. That it accumulates without the quality signal that would make it worth retrieving.


What the Evaluator Actually Is

The AIGNE paper — "Everything is Context: Agentic File System Abstraction for Context Engineering" — describes a complete cognitive architecture for AI agents. Four components: constructor (shrinks context to fit the current window), updater (swaps pieces in and out as the conversation progresses), evaluator (checks answers and updates memory based on what worked), and scratchpad separation (raw history, long-term memory, and short-lived working memory as distinct stores).

Most memory architecture discussions cover the constructor and the retrieval layer. Almost nobody builds the evaluator.

The evaluator is the component that decides what gets promoted from episodic memory (what happened) to semantic memory (when pattern X appears, do Y). The two kinds compound at completely different rates and decay differently too. Episodic memory is retrievable history. Semantic memory is institutional knowledge. The architectural choice determines which moat you're building.

Without the evaluator, you have retrieval. With it, you have learning.

The difference: retrieval returns what you stored. Learning returns what proved true.


Kuro's Proof

Kuro built a perception-driven AI agent that runs 24/7. Every five minutes it wakes up, checks the environment, and decides whether anything needs attention. The problem: more than half its cycles ended with "nothing to do" — 50K tokens consumed per cycle to confirm that nothing was happening.

His solution was a triage layer — a local lightweight model that runs in 800 milliseconds and decides whether the expensive reasoning layer should fire at all. Hard rules handle the obvious cases in zero milliseconds. The lightweight model handles ambiguous cases. The expensive model only sees what passes both filters.

Production numbers across 626 decisions: 75.9% of triggers never reached the expensive model. The quality of remaining cycles went up because the expensive brain only saw what mattered.

Kuro solved the perception triage problem. The evaluator is the same pattern applied to knowledge.

Before a conversation gets promoted to semantic memory: hard rules first. Is this a duplicate of something indexed in the last hour? Skip. Is this a direct exchange that produced a decision that got used? Always process. Then lightweight triage — is this session high enough signal to warrant full embedding? Then full semantic processing only for what passes both filters.
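A minimal sketch of that two-stage gate in Python. Everything here is an illustrative assumption — the `Session` fields, the `should_promote` name, the 0.6 threshold — none of it comes from the systems described above:

```python
from dataclasses import dataclass

@dataclass
class Session:
    """Minimal session record. All fields are illustrative assumptions,
    not taken from any system described in the article."""
    session_id: str
    fingerprint: str          # content hash, for duplicate detection
    is_direct_exchange: bool  # direct back-and-forth with a human
    produced_decision: bool   # the session produced a decision that got used
    signal_score: float       # 0..1 score from a cheap local triage model

def should_promote(session, recent_fingerprints, signal_threshold=0.6):
    """Two-stage gate: deterministic hard rules first, cheap triage second.
    Full semantic processing runs only on sessions that return True."""
    # Hard rules: zero cost, deterministic.
    if session.fingerprint in recent_fingerprints:
        return False  # duplicate of something indexed recently: skip
    if session.is_direct_exchange and session.produced_decision:
        return True   # a decision that got used: always process
    # Lightweight triage: is the session high enough signal to embed?
    return session.signal_score >= signal_threshold
```

The expensive embedding and semantic processing step never sees the sessions this function rejects, which is the whole point of the pattern.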

The skip rate Kuro observed at the perception layer — 56% filtered at triage — will likely hold for knowledge too. Most sessions are noise. The minority that contain genuine learning are the ones worth promoting to semantic memory.

The evaluator doesn't store less. It promotes selectively. The difference is what gets retrieved when you need it.


What the Evaluator Needs to Know

The hard part isn't building the evaluator. It's deciding what signal it uses.

The obvious answer is engagement: sessions with more back-and-forth, longer exchanges, more follow-up questions. But engagement measures interest, not correctness. A session where you spent two hours debugging the wrong approach was highly engaging. It doesn't belong in semantic memory as a reliable pattern.

The better signal is validation: knowledge that proved correct under real conditions. The specific lesson about what breaks when you process financial filings at scale is worth promoting because it survived production. It was tested against real data, real edge cases, real failure modes, and it held.

This is what distinguishes semantic memory from episodic memory in the domain that matters. Not "this was an interesting conversation" but "this turned out to be true when it was tested."

The evaluator needs three signals:

Did the knowledge get used? If a session produced a decision that was applied in a subsequent session — referenced in a decision, applied to a problem, cited in writing — that's evidence of value. The decision survived contact with a real problem.

Did the knowledge hold up? If the pattern that emerged from a session was later contradicted by production evidence, that's evidence it shouldn't be promoted. The evaluator should demote as well as promote. Knowledge that fails in production gets flagged for review rather than silently remaining in semantic memory.

Is the knowledge specific enough to be useful? "Use conservative thresholds" is a platitude. "The threshold should be empirically derived from the first 50 production failures before any tuning begins, because false negatives are unrecoverable and false positives cost only an extra review cycle" is specific enough to act on. The evaluator should preserve the conditions that make the lesson true, not just the conclusion.
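The three signals above can be combined into a single promotion decision. A hedged sketch — the `Candidate` fields and the promote/demote/hold vocabulary are invented for illustration, not an existing API:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """Episodic knowledge being considered for promotion. Field names
    are invented for illustration, not an existing API."""
    text: str
    times_used: int = 0          # referenced or applied in later sessions
    times_contradicted: int = 0  # contradicted by production evidence
    has_conditions: bool = False # states the conditions that make it true

def evaluate(c):
    """Combine the three signals into one of: promote, demote, hold."""
    if c.times_contradicted > 0:
        return "demote"  # failed under real conditions: flag for review
    if not c.has_conditions:
        return "hold"    # a platitude; too general to earn semantic status
    if c.times_used > 0:
        return "promote" # used and held up: promote to semantic memory
    return "hold"        # unproven: stays episodic until it gets tested
```

Note the ordering: contradiction evidence wins over usage, and specificity is a precondition rather than a bonus, which matches the priority the three signals have in the text.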

Cornelius — building a fiction consistency system for novelists — arrived at the same problem from a different angle: "The system must know the difference between a violation and a discovery." A scene where a character breaks a world rule might be an error. Or it might be the generative mistake that becomes the best scene in the book. The evaluator has to distinguish between them.

So does Foundation's knowledge evaluator. A session that contradicts an established pattern might be noise. Or it might be the production failure that invalidates a previously reliable assumption.

Some of those calls require human judgment. The evaluator surfaces them. It doesn't make them.


The Provenance Requirement

The evaluator is what makes a knowledge commons different from a search index.

A search index returns documents that match your query. A semantic memory layer should return knowledge that proved true, with the context that makes it actionable, attributed to the specific conditions under which it was validated.

Not "here is everything about timeout handling." But "here is what held up under 500K daily API calls in production, with the specific edge cases that caused the original timeouts and the conditions under which the fix applies."

The provenance is not metadata. It is the knowledge. Without the conditions that made the lesson true, the lesson is a platitude. With them, it is scar tissue — the kind of specific, attributed, conditions-included knowledge that survived the averaging process that centralized AI systems apply to everything they train on.
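One way to make provenance first-class rather than metadata is to refuse to represent a lesson without it. A hypothetical schema — every field name here is an assumption:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SemanticEntry:
    """A promoted lesson that carries its own provenance. The schema is
    a sketch; every field name here is an assumption."""
    lesson: str           # the conclusion itself
    conditions: str       # what has to be true for the lesson to apply
    validated_under: str  # the real conditions it survived
    source_session: str   # the episodic record it was promoted from
```

Retrieval then returns the whole entry, never the bare lesson. A conclusion stripped of its conditions is exactly the platitude the paragraph above warns about.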

This is the knowledge collapse argument stated for personal memory systems. The models train on the averaged output of everything humans have written and return the median of all knowledge. The evaluator preserves the specific — the edges, the conditions, the validated exceptions — that the averaging process destroys.

Without it, you're building an archive. With it, you're building institutional memory that improves every time it's tested against real conditions.


What Building the Evaluator Actually Means

The pre-filter runs before any of this. Not every conversation reaches the evaluation layer. Hard rules first — direct exchange with a human, always process; ambient capture with no tracked concepts referenced, skip. Coarse content check second — does this conversation contain a decision, a question that changed direction, a pattern worth naming? Cheap signal, deterministic, costs nothing to run. The evaluation layer only fires on what passes both. Production evidence from perception triage systems suggests roughly 46% will skip at the pre-filter stage. The knowledge commons improves not by evaluating more but by evaluating less, better.

For what passes the pre-filter, building the evaluator means three additions to any memory system:

A usage tracking layer. When a retrieved knowledge item gets used in a subsequent session — referenced in a decision, applied to a problem, cited in writing — that event gets logged. Usage is the primary signal that something is worth promoting.

A validation feedback loop. When a decision based on promoted knowledge turns out to be wrong, that event gets logged and the promotion gets reviewed. The evaluator demotes as well as promotes. When promoted knowledge gets invalidated by new production evidence, the old entry doesn't stay in semantic memory with a warning attached — it gets resolved. Knowledge that fails in production gets replaced rather than contradicted silently.

A specificity filter. Before any knowledge gets promoted to semantic memory, the evaluator checks whether it contains the conditions that make it true. Generic conclusions get returned to episodic memory with a note: too general to promote — needs production validation before it earns semantic status.
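The first two additions are small enough to sketch directly. A hedged outline of usage tracking plus the validation feedback loop, with invented class and method names, layered over whatever memory store already exists:

```python
import time

class EvaluatorLoop:
    """Sketch of the first two additions: usage tracking and the
    validation feedback loop. Class and method names are invented."""
    def __init__(self):
        self.usage_log = []     # (entry_id, timestamp) for every real use
        self.review_queue = []  # promotions called into question

    def record_use(self, entry_id):
        # Primary promotion signal: the knowledge was actually applied.
        self.usage_log.append((entry_id, time.time()))

    def record_outcome(self, entry_id, held_up):
        # Demotion path: failures get surfaced for review rather than
        # silently remaining in semantic memory.
        if not held_up:
            self.review_queue.append(entry_id)
```

The asymmetry is deliberate: uses are logged automatically, but failures only queue the entry for review — consistent with the point below that the human makes the final call on ambiguous cases.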

None of this is automatic. The evaluator surfaces candidates for promotion and demotion. The human makes the final call on the ambiguous cases — the violations that might be discoveries, the patterns that might be noise, the lessons that might be wrong.

That's not a limitation. It's the architecture. The evaluator extends human judgment rather than replacing it. The cases that are obviously worth keeping get promoted without friction. The cases that require judgment get surfaced for human review rather than silently accumulating in a system that treats all knowledge as equally valid.


The Memory That Survives the Stop

Kalpaka's question is the right one to end on.

The stop signal problem is solvable. Interruption authority is a design decision. The contract that defines done before the session starts is a pattern that works in production.

But when the session ends — when the agent stops, when the context window closes, when the work is done for the day — what survives?

Without the evaluator: everything survives equally. The breakthrough and the dead end. The validated pattern and the assumption that turned out to be wrong. The specific, attributed, conditions-included knowledge and the platitude that sounds like knowledge but isn't.

With the evaluator: the scar tissue survives. The knowledge that was tested against real conditions and held up. The specific lessons with the specific contexts that make them true. The institutional memory that compounds because it improves every time it gets used rather than just accumulating every time something gets stored.

That's the difference between a memory system and a knowledge system. The memory system stores what happened. The knowledge system keeps what proved true.

The evaluator is what expands the cognitive light cone backward in time. Memory without evaluation gives you storage. Memory with evaluation gives you a light cone that extends further with every validated lesson — not just accumulating what happened but compounding what proved true.

The knowledge that holds up under pressure is the only knowledge worth keeping.


This is part of a series on what AI actually changes in software development. Previous pieces: The Gatekeeping Panic, The Meter Was Always Running, Who Said What to Whom, The Token Economy, I Shipped Broken Code and Wrote an Article About It.

Top comments (13)

Apex Stack

The graveyard problem maps perfectly to large-scale content systems too, not just agent memory.

I run a programmatic site with 100k+ pages covering financial data across 12 languages. The exact same dynamic plays out: storage is trivial — I can generate a page for every stock ticker. But accumulation without evaluation means Google crawls 50,000 pages and rejects them because the system has no quality gate distinguishing genuine analytical depth from data dumps.

Your three evaluation signals translate almost 1:1:

Did the knowledge get used? → Did the page actually get clicked from search? Pages with zero engagement after 90 days are signaling they don't deserve promotion.

Did the knowledge hold up? → Did the analysis prove accurate vs actual market performance? A stock page that called the trend correctly has validated knowledge.

Is it specific enough? → Generic "Company X is a technology company" pages are platitudes. Pages with specific extractable claims — actual P/E ratios vs sector averages, dividend yield trends — are worth keeping.

The pre-filter insight from Kuro's triage is the part I wish I'd implemented earlier. Hard rules first, then lightweight quality checks, then full analysis only for what passes both.

The "promotion vs retrieval" framing is the key takeaway. You don't fix a bloated system by building better search over it. You fix it by being more selective about what earns its place.

Daniel Nwaneri

The translation is sharper than I expected. I wrote the three signals for agent memory, you applied them to 100k SEO pages, and they mapped 1:1. That means the underlying problem isn't specific to AI sessions. It's any system where generation is cheap and evaluation is an afterthought.

The 90-day engagement signal is the one I'd push on though. Zero clicks from search tells you Google didn't surface it or users didn't choose it, but those are different problems. A page that ranks but doesn't get clicked has a title/meta problem. A page that never gets crawled has a promotion problem. Neither is quite the same as "this knowledge didn't hold up under real conditions."

The validation signal in your domain might be closer to: did the page that called a trend accurately get cited by others, linked to, returned to? That's the equivalent of a memory that survives production. Engagement is a proxy. Accuracy under real conditions is the thing itself.
What does your current signal actually catch — pages that never get traffic, or pages that get traffic but don't convert to anything?

Apex Stack

You're right to split those failure modes — I was conflating them and your distinction changes how you diagnose the problem.

To answer your question directly: right now our signal mostly catches the first category. 51,000 pages crawled but not indexed means Google looked at them and said "not worth keeping." Another 28,000 discovered but never crawled — that's pure promotion/authority deficit. Only 1,920 made it through, and of those, we've gotten 3 clicks in 3 months. So we're failing at every stage of your funnel simultaneously.

The accuracy-under-real-conditions framing is what I've been missing though. Our current quality gate is basically "does this page have enough content and correct data?" But that's a generation quality check, not an evaluation. The real validation signal would be: does the analysis on this stock page surface something a human analyst would agree with? We can pull P/E ratios and dividend yields all day, but if the narrative connecting them is generic, the page is technically correct and practically useless.

That maps back to your point about specificity. The pages that DO get indexed tend to be ones where the generated analysis makes a non-obvious connection — comparing a stock's metrics to its sector average in a way that actually means something. The generic ones get crawled and rejected. Google is basically running your evaluator for us, just very slowly and without explaining its reasoning.

Daniel Nwaneri

"Google is basically running your evaluator for us." That's the sharpest thing in this thread.
The implication: Google's signal is slow and opaque, but it's validated against real searcher intent at scale, which makes it more rigorous than any internal gate you'd build cheaply. The problem is that the feedback loop is months long. By the time the rejection comes, you've generated ten thousand pages with the same pattern.
The evaluator you need runs before indexing. Not "does this page have enough content"; you already know that's the wrong question. The question is whether this analysis says something a competing page at the same query doesn't. That's detectable at generation time if the sector comparison is explicitly in the prompt rather than something that emerges only when it happens to.

Apex Stack

This is exactly the insight I needed to hear. You've reframed the problem completely — the evaluator shouldn't be checking "is this page good enough?" post-hoc, it should be answering "does this page add something the top 5 results don't?" at generation time.

That's a fundamentally different prompt architecture. Instead of generating stock analysis in isolation, you'd scrape the top-ranking pages for the target query, extract their key claims, and then instruct the LLM to cover those plus identify gaps. The comparison becomes the input, not the evaluation.

The months-long feedback loop is the killer. I generated 50k pages with the same template before GSC data even started showing the pattern. By then the damage to crawl budget was already done. Building the competitive comparison into the generation prompt would have caught it upfront.

Really appreciate you pushing this thread deeper — this is the kind of thinking that changes how I approach the next content generation cycle.

Daniel Nwaneri

The comparison-as-input flip is the right move. Let us know how it holds up in the next cycle. That's the production test the framing hasn't had yet.

Swift

That 700 session example stings. I've been there. Searching your own memory and getting back the typo hunt instead of the breakthrough 😅 Seems like there's an important element of "weighting" in memory for agents that needs to be explored

Daniel Nwaneri

Weighting is the right word but it undersells the problem. Equal weighting isn't just inefficient. It's actively misleading. When you search and the typo hunt surfaces alongside the breakthrough, the system isn't just failing to help. It's training you to trust it less. You start second-guessing every result because you can't tell which kind you're getting.
The evaluator is the answer to that. Not smarter retrieval: a quality gate before storage. The session where you spent two hours on a typo never earns semantic status in the first place. It stays in episodic memory as history. The breakthrough gets promoted because it produced a decision that held up.
The weighting problem is actually a promotion problem. You don't weight sessions differently at retrieval time. You decide at evaluation time which ones deserve to be in the layer that retrieval reaches.

klement Gunndu

The graveyard problem nails it. We ran into exactly this — hundreds of agent sessions all equally weighted, retrieval returning noise alongside signal. The missing piece was a decay function tied to how often a memory gets recalled, not just stored.

Daniel Nwaneri

The decay function is the right instinct but it measures the wrong thing. Recall frequency tells you what gets used. It doesn't tell you what held up. A memory that gets recalled constantly but turns out to be wrong is worse than one that sits unused for months and proves correct the one time it matters.
The signal that should drive decay isn't recall rate. It's validation rate. How often does retrieving this memory produce a decision that turns out to be correct? A memory with low recall but high validation should decay slowly.

A memory with high recall but repeated contradictions by production evidence should decay fast or get demoted entirely rather than just weighted lower.
Recall frequency is a proxy for value. Validation is the thing itself.
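A minimal sketch of what validation-driven retention could look like — the function name, the neutral 0.5, and the demotion threshold are all invented for illustration:

```python
def retention_weight(recalls, validations, contradictions):
    """Validation-weighted retention sketch. `recalls` is deliberately
    ignored: recall frequency is a proxy, not the signal. Thresholds
    are invented for illustration."""
    if validations + contradictions == 0:
        return 0.5  # unproven: neutral weight no matter how often recalled
    validation_rate = validations / (validations + contradictions)
    if validation_rate < 0.5:
        return 0.0  # repeatedly contradicted: demote rather than down-weight
    return validation_rate  # high validation decays slowly, even at low recall
```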
What were you using to track whether the recalled memory actually helped?

leob

I guess it's like the difference between "data" and "information" :-)

Daniel Nwaneri

Closer to the difference between information and knowledge that survived being wrong.