Every few weeks someone opens a ticket that says some version of "I think the AI is getting worse?" The answers are still fluent, still confident, still cited. They're just subtly wrong, often enough that people notice and rarely enough that nothing obviously breaks. Then a few days quietly disappear into it.
The instinct is always to look at the model or the prompt. Almost every time I've chased one of these, the model did exactly what it was told. It read the top documents and answered from them. The problem was upstream, in what got retrieved and handed to it, and the reason it took days to find is that the retrieval step was a black box. We log the final answer. Sometimes we log the citations. We almost never log what the retriever actually saw and chose between.
You can't debug what you didn't instrument.
What to actually log
For every answer, I keep a small retrieval manifest next to it. Three things:
- What was retrieved. The whole candidate set with scores, not just the ones that got cited. This is the part you'd expect.
- What was excluded, and why. Each dropped candidate with a reason code: below the rank cutoff, filtered out by metadata, superseded or stale, out of license, deduplicated. This is the part nobody logs, and it's exactly where the blind spots live.
- What was cited. What actually made it into the answer.
Here is roughly the shape of one entry:
```json
{
  "query": "what is our refund window for enterprise?",
  "retrieved": [
    {"id": "policy-2024-11", "score": 0.86, "cited": true},
    {"id": "policy-2026-05", "score": 0.78, "cited": false}
  ],
  "excluded": [
    {"id": "policy-2026-05-draft", "reason": "status:superseded"},
    {"id": "sales-deck-q1", "reason": "below_rank_cutoff"}
  ]
}
```
Look at that for a second. Going by the IDs, the cited document is eighteen months older than the current one, yet it scored higher (0.86 against 0.78), purely because it happened to be written more cleanly. In the answer, that is invisible. In the manifest, it is the first thing you see.
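One way to produce an entry like this is to have the ranking loop record each keep-or-drop decision at the moment it makes it. The sketch below is a minimal illustration, not rag-quality's actual API: the `Candidate` class, its `status` field, the `retrieve_with_manifest` and `mark_cited` helpers, and the scores given to the two excluded documents are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    id: str
    score: float
    status: str = "current"  # hypothetical metadata field


def retrieve_with_manifest(query, candidates, top_k=2):
    """Rank candidates and record every keep/drop decision as it happens."""
    retrieved, excluded, seen = [], [], set()
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if cand.id in seen:
            excluded.append({"id": cand.id, "reason": "deduplicated"})
        elif cand.status == "superseded":
            excluded.append({"id": cand.id, "reason": "status:superseded"})
        elif len(retrieved) >= top_k:
            excluded.append({"id": cand.id, "reason": "below_rank_cutoff"})
        else:
            retrieved.append({"id": cand.id, "score": cand.score, "cited": False})
        seen.add(cand.id)
    return {"query": query, "retrieved": retrieved, "excluded": excluded}


def mark_cited(manifest, cited_ids):
    """Fill in the 'cited' flags once the answer has been generated."""
    for entry in manifest["retrieved"]:
        entry["cited"] = entry["id"] in cited_ids
    return manifest


# Scores for the two excluded documents are invented for illustration.
candidates = [
    Candidate("policy-2024-11", 0.86),
    Candidate("policy-2026-05", 0.78),
    Candidate("policy-2026-05-draft", 0.81, status="superseded"),
    Candidate("sales-deck-q1", 0.40),
]
manifest = retrieve_with_manifest(
    "what is our refund window for enterprise?", candidates
)
manifest = mark_cited(manifest, {"policy-2024-11"})
```

The point of the design is that the reason code is written in the same branch that drops the candidate, so the log cannot describe a decision the code did not make.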
What it buys you
Two things that used to be guesswork become mechanical.
You can tell a reasoning problem from an evidence problem. When two runs disagree, or two deployments of the same model give different answers, diff the manifests first. Same evidence set and different answers means it is the model or nondeterminism. Different evidence sets means it is retrieval, and you were never going to fix that by tweaking the prompt. Right now most people debug this backwards, staring at the outputs, because the boundary was never captured.
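As a rough sketch of that first diff, assuming manifests shaped like the entry above: compare the sets of retrieved IDs and let the result decide where to look. The function names and the `faq-refunds` document are hypothetical, and a real comparison might also want to consider rank order and scores, which this one ignores.

```python
def evidence_ids(manifest):
    """The set of document IDs the model was actually handed."""
    return {entry["id"] for entry in manifest["retrieved"]}


def diff_manifests(a, b):
    """Compare the evidence sets of two runs."""
    ids_a, ids_b = evidence_ids(a), evidence_ids(b)
    return {
        "only_in_a": sorted(ids_a - ids_b),
        "only_in_b": sorted(ids_b - ids_a),
        "same_evidence": ids_a == ids_b,
    }


def triage(a, b):
    """Same evidence points at the model; different evidence points at retrieval."""
    if diff_manifests(a, b)["same_evidence"]:
        return "model_or_nondeterminism"
    return "retrieval"


run_a = {"retrieved": [{"id": "policy-2024-11"}, {"id": "policy-2026-05"}]}
run_b = {"retrieved": [{"id": "policy-2026-05"}, {"id": "faq-refunds"}]}

diff = diff_manifests(run_a, run_b)
verdict = triage(run_a, run_b)
```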
The stale-document bug surfaces in minutes instead of days. The classic failure, where an outdated doc quietly outranks the current one, does not show up in the answer at all. It shows up immediately in the manifest as a top result with an old timestamp. You stop guessing and start reading.
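That check can be automated if each retrieved entry carries a timestamp. The sketch below assumes an `updated_at` ISO date on every entry, which the sample entry earlier does not include; the field name, the dates, and the `stale_top_result` helper are all hypothetical.

```python
from datetime import date


def stale_top_result(manifest):
    """Warn when the top-scored result is older than another retrieved candidate."""
    ranked = sorted(manifest["retrieved"], key=lambda e: e["score"], reverse=True)
    if len(ranked) < 2:
        return None
    top = ranked[0]
    # max() keeps the first maximal item, so a date tie resolves to the top result.
    newest = max(ranked, key=lambda e: date.fromisoformat(e["updated_at"]))
    if newest["id"] == top["id"]:
        return None
    gap = date.fromisoformat(newest["updated_at"]) - date.fromisoformat(top["updated_at"])
    return {"top": top["id"], "newer": newest["id"], "gap_days": gap.days}


# Dates are invented for illustration.
manifest = {
    "retrieved": [
        {"id": "policy-2024-11", "score": 0.86, "updated_at": "2024-11-01"},
        {"id": "policy-2026-05", "score": 0.78, "updated_at": "2026-05-01"},
    ]
}
warning = stale_top_result(manifest)
```

A warning here is a prompt to go read the manifest, not proof of a bug: sometimes the older document really is the right one.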
The part people get wrong
The exclusion log is noisy. You are not going to read it on every query, and if you try you will drown. So log it always, but surface it only when an answer gets flagged or when two results disagree. It is a black box recorder, not a dashboard.
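A minimal sketch of that split, assuming an append-only JSON Lines file and an `answer_id` that ties each manifest to its answer. The file name, the IDs, and both helper functions are hypothetical, and a real system would likely use whatever log store it already has.

```python
import json
import tempfile
from pathlib import Path


def log_manifest(path, answer_id, manifest):
    """Always called: append one manifest per answer, never read on the hot path."""
    record = {"answer_id": answer_id, **manifest}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def surface_exclusions(path, flagged_answer_id):
    """Called only when an answer is flagged: pull its exclusion log back out."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["answer_id"] == flagged_answer_id:
                return record["excluded"]
    return None


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "manifests.jsonl"
    log_manifest(log_path, "ans-1", {"retrieved": [], "excluded": []})
    log_manifest(
        log_path,
        "ans-2",
        {
            "retrieved": [{"id": "policy-2024-11", "score": 0.86, "cited": True}],
            "excluded": [{"id": "sales-deck-q1", "reason": "below_rank_cutoff"}],
        },
    )
    flagged = surface_exclusions(log_path, "ans-2")
    missing = surface_exclusions(log_path, "ans-404")
```

The linear scan is fine for a sketch; at volume you would index by `answer_id`, but the principle is the same: writing is unconditional, reading is triggered.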
The other trap is drift. The manifest only helps if the retrieval code emits it as it runs. The moment you rebuild it after the fact, or maintain it by hand, it becomes one more thing that can quietly disagree with reality, and now you are debugging your debugging.
The one-line version
Citations tell you what supported the answer. The exclusion log tells you what the answer was blind to. You need both to trust the thing, and almost everyone keeps only the first.
Most "the model is hallucinating" tickets are really "the retriever handed it the wrong evidence and it used it faithfully." Instrument the boundary and the model stops being the default suspect. That is the idea I have been building rag-quality around: the retrieval step should measure and report on itself instead of being trusted on faith.
So I am curious: what do you actually log from your retriever today? Just the citations, the full candidate set, or nothing until something breaks?