The sequel isn't about running or stopping. It's about whether the memory survives the stop.
That line came from a comment thread on The Token Eco...
The graveyard problem maps perfectly to large-scale content systems too, not just agent memory.
I run a programmatic site with 100k+ pages covering financial data across 12 languages. The exact same dynamic plays out: storage is trivial — I can generate a page for every stock ticker. But accumulation without evaluation means Google crawls 50,000 pages and rejects them because the system has no quality gate distinguishing genuine analytical depth from data dumps.
Your three evaluation signals translate almost 1:1:
Did the knowledge get used? → Did the page actually get clicked from search? Pages with zero engagement after 90 days are signaling they don't deserve promotion.
Did the knowledge hold up? → Did the analysis prove accurate vs actual market performance? A stock page that called the trend correctly has validated knowledge.
Is it specific enough? → Generic "Company X is a technology company" pages are platitudes. Pages with specific extractable claims — actual P/E ratios vs sector averages, dividend yield trends — are worth keeping.
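In code, that three-signal gate could look something like this (the thresholds, field names, and two-of-three rule are made up here just to show the shape):

```python
from dataclasses import dataclass

@dataclass
class Page:
    clicks_90d: int          # engagement from search over 90 days
    prediction_correct: bool # did the analysis match actual market performance?
    specific_claims: int     # extractable claims (P/E vs sector, yield trends)

def worth_keeping(page: Page) -> bool:
    """A page earns its place by passing at least two of the three signals."""
    signals = [
        page.clicks_90d > 0,        # did the knowledge get used?
        page.prediction_correct,    # did the knowledge hold up?
        page.specific_claims >= 3,  # is it specific enough?
    ]
    return sum(signals) >= 2
```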
The pre-filter insight from Kuro's triage is the part I wish I'd implemented earlier. Hard rules first, then lightweight quality checks, then full analysis only for what passes both.
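A rough sketch of that triage order, with hypothetical rules and field names (the point is only that the expensive check runs last):

```python
def triage(page: dict, full_analysis) -> str:
    """Cheapest checks first; full analysis only for what survives both."""
    # Stage 1: hard rules -- reject outright, no model call needed
    if page["word_count"] < 200 or page["data_age_days"] > 365:
        return "reject"
    # Stage 2: lightweight quality check, e.g. density of specific
    # claims computed from a cached parse
    if page["claim_density"] < 0.1:
        return "reject"
    # Stage 3: full (expensive) analysis only for survivors
    return "keep" if full_analysis(page) else "reject"
```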
The "promotion vs retrieval" framing is the key takeaway. You don't fix a bloated system by building better search over it. You fix it by being more selective about what earns its place.
The translation is sharper than I expected. I wrote the three signals for agent memory; you applied them to 100k SEO pages and they mapped 1:1, which means the underlying problem isn't specific to AI sessions. It's any system where generation is cheap and evaluation is an afterthought.
The 90-day engagement signal is the one I'd push on, though. Zero clicks from search tells you Google didn't surface it or users didn't choose it, but those are different problems. A page that ranks but doesn't get clicked has a title/meta problem. A page that never gets crawled has a promotion problem. Neither is quite the same as "this knowledge didn't hold up under real conditions."
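That split is easy to make mechanical. A sketch, assuming you have GSC-style crawl/index/impression/click data per page:

```python
def failure_mode(crawled: bool, indexed: bool,
                 impressions: int, clicks: int) -> str:
    """Each stage of the funnel points at a different fix."""
    if not crawled:
        return "promotion problem: never crawled, needs authority/links"
    if not indexed:
        return "quality problem: crawled but rejected by the index"
    if impressions > 0 and clicks == 0:
        return "title/meta problem: ranked but not chosen by searchers"
    if impressions == 0:
        return "visibility problem: indexed but never surfaced"
    return "in play: now validate against real-world accuracy"
```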
The validation signal in your domain might be closer to: did the page that called a trend accurately get cited by others, linked to, returned to? That's the equivalent of a memory that survives production. Engagement is a proxy. Accuracy under real conditions is the thing itself.
What does your current signal actually catch — pages that never get traffic, or pages that get traffic but don't convert to anything?
You're right to split those failure modes — I was conflating them and your distinction changes how you diagnose the problem.
To answer your question directly: right now our signal mostly catches the first category. 51,000 pages crawled but not indexed means Google looked at them and said "not worth keeping." Another 28,000 discovered but never crawled — that's pure promotion/authority deficit. Only 1,920 made it through, and of those, we've gotten 3 clicks in 3 months. So we're failing at every stage of your funnel simultaneously.
The accuracy-under-real-conditions framing is what I've been missing though. Our current quality gate is basically "does this page have enough content and correct data?" But that's a generation quality check, not an evaluation. The real validation signal would be: does the analysis on this stock page surface something a human analyst would agree with? We can pull P/E ratios and dividend yields all day, but if the narrative connecting them is generic, the page is technically correct and practically useless.
That maps back to your point about specificity. The pages that DO get indexed tend to be ones where the generated analysis makes a non-obvious connection — comparing a stock's metrics to its sector average in a way that actually means something. The generic ones get crawled and rejected. Google is basically running your evaluator for us, just very slowly and without explaining its reasoning.
"Google is basically running your evaluator for us." That's the sharpest thing in this thread.
The implication: Google's signal is slow and opaque, but it's validated against real searcher intent at scale, more rigorous than any internal gate you could build cheaply. The problem is that the feedback loop is months long. By the time the rejection comes, you've generated ten thousand pages with the same pattern.
The evaluator you need runs before indexing. Not "does this page have enough content?" You already know that's the wrong question. The question is whether this analysis says something a competing page at the same query doesn't. That's detectable at generation time if the sector comparison is explicitly in the prompt rather than left to emerge when it happens to.
This is exactly the insight I needed to hear. You've reframed the problem completely — the evaluator shouldn't be checking "is this page good enough?" post-hoc, it should be answering "does this page add something the top 5 results don't?" at generation time.
That's a fundamentally different prompt architecture. Instead of generating stock analysis in isolation, you'd scrape the top-ranking pages for the target query, extract their key claims, and then instruct the LLM to cover those plus identify gaps. The comparison becomes the input, not the evaluation.
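Something like this, where `competitor_claims` comes from whatever scraper/extractor you already run over the top results (the `NO_GAP` sentinel and the wording are my invention, not a tested prompt):

```python
def build_prompt(ticker: str, competitor_claims: list[str]) -> str:
    """Comparison as input: the top-ranking pages' claims go INTO the
    generation prompt instead of being checked post-hoc."""
    covered = "\n".join(f"- {c}" for c in competitor_claims)
    return (
        f"Write an analysis of {ticker}.\n"
        f"Competing pages already state:\n{covered}\n"
        "Cover these accurately, then add at least two specific claims "
        "they do not make (e.g. metrics vs sector average). "
        "If you cannot add anything non-obvious, output NO_GAP."
    )
```

The `NO_GAP` escape hatch is the quality gate: a page the model can't differentiate never gets generated, which protects crawl budget before GSC ever sees it.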
The months-long feedback loop is the killer. I generated 50k pages with the same template before GSC data even started showing the pattern. By then the damage to crawl budget was already done. Building the competitive comparison into the generation prompt would have caught it upfront.
Really appreciate you pushing this thread deeper — this is the kind of thinking that changes how I approach the next content generation cycle.
The comparison-as-input flip is the right move. Let us know how it holds up in the next cycle. That's the production test the framing hasn't had yet...
That 700-session example stings. I've been there. Searching your own memory and getting back the typo hunt instead of the breakthrough 😅 Seems like there's an important element of "weighting" in memory for agents that needs to be explored.
Weighting is the right word but it undersells the problem. Equal weighting isn't just inefficient; it's actively misleading. When you search and the typo hunt surfaces alongside the breakthrough, the system isn't just failing to help. It's training you to trust it less. You start second-guessing every result because you can't tell which kind you're getting.
The evaluator is the answer to that. Not smarter retrieval: a quality gate before storage. The session where you spent two hours on a typo never earns semantic status in the first place. It stays in episodic memory as history. The breakthrough gets promoted because it produced a decision that held up.
The weighting problem is actually a promotion problem. You don't weight sessions differently at retrieval time. You decide at evaluation time which ones deserve to be in the layer that retrieval reaches.
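As a sketch, the promotion decision itself is almost trivially small; the hard part is producing the `decision_held_up` signal, which is just assumed here:

```python
def promote(session: dict) -> str:
    """Evaluation at storage time: sessions stay episodic (history)
    unless they produced a decision that held up in practice."""
    produced_decision = session.get("decision") is not None
    validated = session.get("decision_held_up", False)
    if produced_decision and validated:
        return "semantic"   # the layer retrieval reaches
    return "episodic"       # kept as history, never surfaced as knowledge
```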
The graveyard problem nails it. We ran into exactly this — hundreds of agent sessions all equally weighted, retrieval returning noise alongside signal. The missing piece was a decay function tied to how often a memory gets recalled, not just stored.
The decay function is the right instinct but it measures the wrong thing. Recall frequency tells you what gets used; it doesn't tell you what held up. A memory that gets recalled constantly but turns out to be wrong is worse than one that sits unused for months and proves correct the one time it matters.
The signal that should drive decay isn't recall rate. It's validation rate. How often does retrieving this memory produce a decision that turns out to be correct? A memory with low recall but high validation should decay slowly.
A memory with high recall but repeated contradictions by production evidence should decay fast or get demoted entirely rather than just weighted lower.
Recall frequency is a proxy for value. Validation is the thing itself.
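A toy decay function along those lines (the formula and constants are illustrative, not anything from the post):

```python
def decay_rate(recalls: int, validations: int, contradictions: int,
               base: float = 0.01) -> float:
    """Decay driven by validation, not recall frequency.
    Returns a per-day decay rate; higher means the memory fades faster."""
    if contradictions > validations:
        return 1.0          # demote: contradicted by production evidence
    if validations > 0:
        # low recall + high validation -> decay slowly
        return base / (1 + validations)
    # never validated: decay at the base rate regardless of recall count
    return base
```

Note that `recalls` deliberately doesn't appear in the formula: a memory recalled fifty times but never validated decays exactly as fast as one recalled once.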
What were you using to track whether the recalled memory actually helped?
I guess it's like the difference between "data" and "information" :-)
Closer to the difference between information and knowledge that survived being wrong.