Vinicius Pereira

Posted on Jul 1

You can't debug a RAG you didn't instrument

#ai #llm #discuss #rag

Every few weeks someone opens a ticket that says some version of "I think the AI is getting worse?" The answers are still fluent, still confident, still cited. They're just subtly wrong, often enough that people notice and rarely enough that nothing obviously breaks. Then a few days quietly disappear into it.

The instinct is always to look at the model or the prompt. Almost every time I've chased one of these, the model did exactly what it was told. It read the top documents and answered from them. The problem was upstream, in what got retrieved and handed to it, and the reason it took days to find is that the retrieval step was a black box. We log the final answer. Sometimes we log the citations. We almost never log what the retriever actually saw and chose between.

You can't debug what you didn't instrument.

What to actually log

For every answer, I keep a small retrieval manifest next to it. Three things:

What was retrieved. The whole candidate set with scores, not just the ones that got cited. This is the part you'd expect.
What was excluded, and why. Each dropped candidate with a reason code: below the rank cutoff, filtered out by metadata, superseded or stale, out of license, deduplicated. This is the part nobody logs, and it's exactly where the blind spots live.
What was cited. What actually made it into the answer.

Here is roughly the shape of one entry:

{
  "query": "what is our refund window for enterprise?",
  "retrieved": [
    {"id": "policy-2024-11", "score": 0.86, "cited": true},
    {"id": "policy-2026-05", "score": 0.78, "cited": false}
  ],
  "excluded": [
    {"id": "policy-2026-05-draft", "reason": "status:superseded"},
    {"id": "sales-deck-q1", "reason": "below_rank_cutoff"}
  ]
}

Look at that for a second. Going by the ids, the cited document is eighteen months older than the current one, and it still scored higher, purely because it happened to be written more cleanly. In the answer, that is invisible. In the manifest, it is the first thing you see.

What it buys you

Two things that used to be guesswork become mechanical.

You can tell a reasoning problem from an evidence problem. When two runs disagree, or two deployments of the same model give different answers, diff the manifests first. Same evidence set and different answers means it is the model or nondeterminism. Different evidence sets means it is retrieval, and you were never going to fix that by tweaking the prompt. Right now most people debug this backwards, staring at the outputs, because the boundary was never captured.
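
A minimal sketch of that manifest diff, assuming entries shaped like the one above. The function names are hypothetical, and comparing only document ids (ignoring order and scores) is a simplifying assumption:

```python
def evidence_ids(manifest):
    """Set of document ids the retriever handed to the model."""
    return {doc["id"] for doc in manifest["retrieved"]}


def classify_disagreement(manifest_a, manifest_b):
    """Decide where to look first when two runs give different answers."""
    ids_a, ids_b = evidence_ids(manifest_a), evidence_ids(manifest_b)
    if ids_a == ids_b:
        # Same evidence, different answers: model or nondeterminism.
        return {"suspect": "model_or_nondeterminism", "only_in_a": [], "only_in_b": []}
    # Different evidence: the problem is upstream, in retrieval.
    return {
        "suspect": "retrieval",
        "only_in_a": sorted(ids_a - ids_b),
        "only_in_b": sorted(ids_b - ids_a),
    }
```

The `only_in_a` and `only_in_b` lists are the starting point for the retrieval investigation: look those ids up in each run's exclusion list to see which reason code dropped them.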

The stale-document bug surfaces in minutes instead of days. The classic failure, where an outdated doc quietly outranks the current one, does not show up in the answer at all. It shows up immediately in the manifest as a top result with an old timestamp. You stop guessing and start reading.
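
A rough sketch of how that check could be automated. It assumes each retrieved entry also carries an `updated_at` ISO date, which is not part of the manifest shape shown earlier, and the one-year age limit is an arbitrary placeholder:

```python
from datetime import date


def stale_top_results(manifest, today, max_age_days=365):
    """Flag retrieved docs that are old AND outrank a newer retrieved doc.

    Assumes a hypothetical "updated_at" field (ISO date) on each entry.
    """
    ranked = sorted(manifest["retrieved"], key=lambda d: d["score"], reverse=True)
    flagged = []
    for i, doc in enumerate(ranked):
        updated = date.fromisoformat(doc["updated_at"])
        age_days = (today - updated).days
        # Is there a more recent document ranked below this one?
        newer_below = any(
            date.fromisoformat(other["updated_at"]) > updated
            for other in ranked[i + 1:]
        )
        if age_days > max_age_days and newer_below:
            flagged.append({"id": doc["id"], "age_days": age_days})
    return flagged
```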

The part people get wrong

The exclusion log is noisy. You are not going to read it on every query, and if you try you will drown. So log it always, surface it only when an answer gets flagged or when two results disagree. It is a black box recorder, not a dashboard.

The other trap is drift. The manifest only helps if the retrieval code emits it as it runs. The moment you rebuild it after the fact, or maintain it by hand, it becomes one more thing that can quietly disagree with reality, and now you are debugging your debugging.
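
A minimal sketch of what "emits it as it runs" can look like: the manifest is appended to in the same branch that makes each keep-or-drop decision, so the two cannot disagree. The function name, the `status` field, and the in-memory candidate list are assumptions for illustration:

```python
def retrieve_with_manifest(query, candidates, top_k=2, allowed_status=("current",)):
    """Rank candidates and record every keep/drop decision as it is made.

    `candidates` is a list of dicts with "id", "score", and "status";
    in a real system they would come from the index.
    """
    manifest = {"query": query, "retrieved": [], "excluded": []}
    kept = []
    for doc in sorted(candidates, key=lambda d: d["score"], reverse=True):
        if doc["status"] not in allowed_status:
            # Dropped by a metadata filter: log the reason where it happens.
            manifest["excluded"].append({"id": doc["id"], "reason": f"status:{doc['status']}"})
        elif len(kept) >= top_k:
            manifest["excluded"].append({"id": doc["id"], "reason": "below_rank_cutoff"})
        else:
            kept.append(doc)
            # "cited" is filled in later, once the answer exists.
            manifest["retrieved"].append({"id": doc["id"], "score": doc["score"], "cited": False})
    return kept, manifest
```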

The one-line version

Citations tell you what supported the answer. The exclusion log tells you what the answer was blind to. You need both to trust the thing, and almost everyone keeps only the first.

Most "the model is hallucinating" tickets are really "the retriever handed it the wrong evidence and it used it faithfully." Instrument the boundary and the model stops being the default suspect. That is the direction I have been building rag-quality around, the idea that the retrieval step should measure and report on itself instead of being trusted on faith.

So I am curious: what do you actually log from your retriever today? Just the citations, the full candidate set, or nothing until something breaks?

Top comments (28)

Mike Czerwinski • Jul 2

Instrumentation as debuggability is the right framing. The exclusion log with reason codes is the part that turns retrieval from a black box into something you can bisect. Without reason codes you cannot tell "the evidence was missing" from "the evidence was there and something rejected it," and those two failure modes need completely different fixes.

The other manifest field I would add: the retrieval query itself, verbatim, before any rewrite or expansion. When a retrieval bug bites, the first thing you want to know is whether the query the retriever saw is even the query the user meant. Manifests that log only the score list quietly assume the query survived transit, and it usually does, but the times it doesn't are exactly the debugging sessions that take the longest.

Vinicius Pereira • Jul 2

strong add, and i'd push it one hop further: log the query at every stage, not just verbatim-in. the raw user query, then whatever the rewrite/expansion/HyDE step turned it into, then what actually hit the index. the verbatim query alone tells you intent survived the front door, but most of the transit damage happens in the rewrite, which is the stage everyone skips logging because it "usually works." the diff between raw and rewritten is its own debug signal.

and it pairs with the reason codes to expose a failure mode you can't otherwise name. missing evidence and rejected evidence are the two you called out, but verbatim-vs-rewritten surfaces a third: the right doc was in the corpus, it just stopped matching because expansion dragged the query off-intent. that shows up as a retrieval miss in the score list, but the fix is upstream in the rewrite, not in the ranker or the corpus. without the query hops logged you'd burn the whole session tuning the wrong stage.

same discipline as the rest: the stage that usually works is the one nobody instruments, so it's the one that eats your afternoon the day it doesn't.
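
a rough sketch of the per-hop logging, assuming a plain list as the trace. the stage names and both function names are made up for illustration:

```python
def record_hop(trace, stage, query):
    """Append the query exactly as one stage saw it."""
    trace.append({"position": len(trace), "stage": stage, "query": query})
    return trace


def changed_hops(trace):
    """Pairs of adjacent stages where the query text changed."""
    return [
        (prev["stage"], cur["stage"])
        for prev, cur in zip(trace, trace[1:])
        if prev["query"] != cur["query"]
    ]
```

a retrieval miss where `changed_hops` points at the rewrite stage is the cue to read the rewritten query before touching the ranker.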

Mike Czerwinski • Jul 3

The stage-nobody-instruments principle recurses. The instrumentation itself is a stage that usually works and nobody instruments. Sampler drops one in ten events. Formatter truncates the rewritten query at 4KB because the field size predates HyDE. Metrics pipeline dedupes similar strings as "duplicate spans." Every one runs on the debug signal you'll use to find the retrieval bug.

Your four failure modes live in the retrieval layer. There's a fifth in the observability layer: query hops logged, then quietly reshaped by the stage that shipped them. Same shape as your third, one floor higher.

That's the afternoon you burn twice. Once tuning the wrong stage. Once trusting a debug output that no longer says what the stage did.

Vinicius Pereira • Jul 3

that recursion is the real one, and it's the version that actually scares me, because every other failure mode at least leaves a true log behind. this one corrupts the log itself, so you debug against a signal that already lied to you.

the only thing i've found that holds is making the debug trail carry its own fidelity: not just "here's the query at each hop" but "here's the query, and here's whether this record is complete or got sampled, truncated or merged on the way to me." the 4KB truncation isn't the bug, the silent 4KB truncation is. a formatter that logged "cut at 4KB, original was 11KB, hash abc" turns your fifth failure mode back into a visible one. same with the sampler: sample whole traces, never spans inside a trace, because a trace with holes reads as complete and that's the lie. and the dedup key can't be payload similarity when the entire point is to diff two hops that look alike, identity is stage plus position, not string distance.

so it's the same spine one floor up: the dangerous log isn't the missing one, it's the one that quietly improved the truth on the way out. a debug output that can't tell you when to distrust it is exactly what costs you that second afternoon. make it faithful, or make it loud about not being faithful, and you only burn the one.

Mike Czerwinski • Jul 4

One nail left: the confession has to be authored upstream of the surgery. "Cut at 4KB, original was 11KB, hash abc" written by the truncator is the formatter grading itself; if the formatter is the broken stage, its fidelity note inherits the break. Cheap fix: the producer stamps size and hash at handoff, and the formatter can shorten the payload but never the receipt. Then loud-about-not-faithful is checkable instead of trusted.

"Make it faithful, or make it loud" holds as written. The loudness just cannot come from the same throat.

Vinicius Pereira • Jul 4

no notes, you closed the loop. self-grading was the recursion sneaking back in one layer deeper, and i should have caught it: the fidelity note is itself a log, so it inherits every failure mode we were trying to escape. producer stamps size and hash at handoff, every stage after can shorten the payload but passes the receipt through untouched, and the consumer checks length and hash against the receipt instead of believing anyone. loud-about-not-faithful stops being testimony and becomes arithmetic. and the regress bottoms out in the right place: the producer is the only honest candidate for trust root, because it is the only stage with nothing upstream to misreport. stealing "the loudness cannot come from the same throat", that is the whole design in one sentence.
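
a minimal sketch of that receipt flow, assuming utf-8 text payloads and sha-256. `stamp`, `truncate_stage` and `check` are hypothetical names, not an existing library:

```python
import hashlib


def stamp(payload: str) -> dict:
    """Producer side: record size and hash of the payload at handoff."""
    data = payload.encode("utf-8")
    return {"size": len(data), "sha256": hashlib.sha256(data).hexdigest()}


def truncate_stage(record: dict, limit: int) -> dict:
    """A later stage may shorten the payload but passes the receipt through untouched."""
    data = record["payload"].encode("utf-8")[:limit]
    return {"payload": data.decode("utf-8", errors="ignore"), "receipt": record["receipt"]}


def check(record: dict) -> str:
    """Consumer side: compare what arrived against the producer's receipt."""
    data = record["payload"].encode("utf-8")
    receipt = record["receipt"]
    if len(data) == receipt["size"] and hashlib.sha256(data).hexdigest() == receipt["sha256"]:
        return "intact"
    return f"altered: got {len(data)} bytes, producer sent {receipt['size']}"
```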

Mike Czerwinski • Jul 5

"The loudness cannot come from the same throat" is the compression of that argument. Where I'd push next: hash-and-length checking catches unauthorized fidelity loss, but summarization is authorized fidelity loss, on purpose. If a stage is supposed to shorten and does so honestly, the hash still breaks, so the consumer sees the same signal it would see from silent corruption. The receipt collapses two different events into one alarm: "this changed because someone lied" and "this changed because someone was told to compress it." The fix isn't a bigger hash, it's a chain: each authorized transform gets signed by whatever authorized it, so the consumer checks not "does this match the original" but "does this match the last signed step, and was that step allowed to change it." The producer stays the trust root for content. Something else has to be the trust root for permission to alter it, and that something has to be a different party than the one doing the altering, or you're back to self-grading with extra steps.

Vinicius Pereira • Jul 5

yes, and it is worth naming what you just designed: a provenance chain, the same shape c2pa uses for media and notaries use for documents. the split matters because it turns the consumer check from binary into three states: intact, changed with a citation, changed without one. only the third is an alarm. the second is the interesting one, because the citation has to point at something, and that something is a policy artifact: this stage may summarize, this stage may truncate at 4kb, this stage may touch nothing. so the permission root is itself declared truth, and declared truth only stays true if something fails when it lies. the policy the citations point at needs its own enforcement loop, or the chain is notarizing against fiction.

the pragmatic dial i would add: full signatures earn their weight only at trust-domain boundaries. between teams or vendors, sign, because the parties can genuinely misreport each other. inside one trust domain, an append-only receipt log plus a versioned rule id gives you the same three-state check without dragging pki into a logging pipeline, and you can upgrade the boundary hops to real signatures later without redesigning the chain. trust cryptography where parties diverge, trust discipline where they do not, and spend the complexity budget where lying is actually possible.
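
a rough sketch of the three-state check for the inside-one-trust-domain case, with a plain list standing in for the append-only receipt log. the policy table, rule id and field names are all hypothetical, and the sketch takes for granted that something other than the altering stage wrote the log entries:

```python
import hashlib

# Hypothetical policy artifact: rule id -> the stage allowed to use it.
POLICY = {"rule-7@v2": {"stage": "formatter", "action": "truncate_4kb"}}


def digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def verify(producer_hash, final_payload, receipt_log):
    """Return 'intact', 'changed_with_citation', or 'changed_without_citation'."""
    final_hash = digest(final_payload)
    if final_hash == producer_hash:
        return "intact"
    current = producer_hash
    for entry in receipt_log:
        rule = POLICY.get(entry["rule_id"])
        # Each step must start where the previous one ended and cite a
        # rule that actually covers the stage claiming it.
        if entry["in_hash"] != current or rule is None or rule["stage"] != entry["stage"]:
            return "changed_without_citation"
        current = entry["out_hash"]
    if current == final_hash:
        return "changed_with_citation"
    return "changed_without_citation"
```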

Mike Czerwinski • Jul 6

The C2PA framing is right and it exposes the recursion cleanly: the policy artifact is declared truth, which only holds if something fails when it lies, and now the enforcement loop needs its own boundary classification, is that loop inside the trust domain or across it. Same question, one level down.

Two places I'd stress the dial. First, the domain boundary isn't a one-time classification, a hop that started inside one trust domain can end up crossing one later, ownership changes, teams get outsourced, a vendor relationship shifts, and nobody re-runs the classification because the receipt log still looks the same. The dial needs a periodic re-audit of which hops are still internal, not just an initial assignment.

Second, append-only is itself a claim, not a property you get for free from calling something a log. If whoever operates the storage can truncate or rewrite it, the log inherits exactly the same problem the policy artifact has: self-attested truth that only stays true if lying costs something. That probably wants the same escalation you already proposed, cheap and internal most of the time, upgraded to an external anchor at whatever cadence matches how much you'd lose if the log quietly stopped being honest.

Vinicius Pereira • Jul 6

both of those land, and they rhyme in a way that i think gives the recursion a floor. the re-audit problem is really that a calendar re-audit is itself a "someone remembers to run it" claim, the same failure class as the thing it audits. so i would not re-audit on a cadence, i would bind the boundary to an identity that moves when ownership moves: the hop's signing key, the tenant, the credential authority. then a hop crossing a trust line shows up as an identity discontinuity you can detect, instead of a classification you hope someone re-runs. the crossing announces itself instead of waiting to be noticed.

your append-only point is the same shape one level down. you cannot cheaply prevent the operator rewriting an internal log, agreed, but you can cheaply make a rewrite visible: periodically commit just the log head somewhere the operator does not control, checkpoints not events. that is the piece that generalizes. at every level the honest move is not to buy prevention, which gets expensive fast and is exactly what pushes people toward signing everything, it is to make the lie leave a mark someone outside the liar can see, at a cadence matched to what you lose if it stays hidden. that is also where the regress you pointed at bottoms out: you do not classify boundaries forever, you require each level to be externally auditable at stakes-matched cost, and it terminates at the cheapest external anchor you are willing to pay for. prevention recurses forever, detection has a floor.
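
a minimal sketch of "checkpoints not events", assuming a hash-chained in-memory log and json-serializable events. where the checkpoint gets stored is out of scope here; the only requirement is that the log operator cannot edit it:

```python
import hashlib
import json

GENESIS = "0" * 64


def append(log, event):
    """Append an event whose hash covers the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    body = json.dumps(event, sort_keys=True)
    entry_hash = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
    log.append({"event": event, "prev": prev, "hash": entry_hash})
    return entry_hash  # the new log head


def matches_checkpoint(log, length, head):
    """Recompute the chain over the first `length` entries and compare heads."""
    prev = GENESIS
    for entry in log[:length]:
        body = json.dumps(entry["event"], sort_keys=True)
        prev = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
    # A truncated log (shorter than the checkpoint) also fails.
    return len(log) >= length and prev == head
```

the checkpoint is just the pair `(len(log), head)` at commit time. this does not stop a rewrite, it only makes one visible to whoever holds the checkpoint.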

Mike Czerwinski • Jul 6

Identity-binding-as-detection versus classify-and-hope is the right move, and worth naming where it still needs help: identity discontinuity only fires when the identity actually changes. Silent transfers exist, sale-of-a-company where the same signing key legally migrates with the assets, credential handoff during a quiet outsourcing, a director change where the old key gets re-issued rather than rotated. In all three the boundary crossed but no identity discontinuity announces itself, because the machinery was set up to preserve continuity across exactly this kind of transition. So the check needs a second binding, not just "did this key change" but "does this key still resolve to the entity we thought it did," which is a lookup against something outside the key's own domain: a public registry, an external attestation, whatever the cheapest anchor for that entity's actual identity is.

Your closer is doing more work than it looks: "prevention recurses forever, detection has a floor." The floor is the cheapest external anchor you're willing to pay for. Which means the anchor becomes the SLA you actually run. If the anchor's own reliability degrades, a registry changes policy, an attestor stops issuing, the whole detection stack drops to that new floor without anything downstream noticing. The regress doesn't bottom out at "external anchor" full stop, it bottoms out at "external anchor whose failure you also monitor," and that's a small but non-zero cost that keeps growing with anchor count. Cheap, not free.

Vinicius Pereira • Jul 6

both of those are right, and they collapse into one problem that i walked past. the silent-transfer cases you named are exactly the ones where continuity was engineered on purpose, so my identity-discontinuity detector sees nothing and waves it through. and you are right that the fix, "does this key still resolve to the entity we thought," is a lookup against an external anchor, which is the same external anchor my log-head commit already leaned on. so both of my detectors bottom out at the same place, and your second point is the sharp one: that anchor is now the sla i actually run, and it can degrade silently while everything downstream keeps trusting a number that stopped being true.

here is where i think it terminates, and it is not "monitor the anchor," because that just asks who monitors the monitor. it terminates at plurality. one anchor whose failure you have to watch recurses forever. two or more independent anchors that should agree turns a silent anchor failure into a loud disagreement, the same make-it-announce-itself move one level up, except now the thing announcing is the anchors diverging from each other. you stop monitoring each anchor's health and start watching their agreement, and a captured registry that would rubber-stamp a silent transfer, or an attestor that quietly stopped issuing, shows up by disagreeing with its peers. corroboration is the monitor.

but i will not pretend that closes it, because it hands you the real floor instead: independence is itself an assumption that fails silently. two registries that both pull from the same upstream, two attestors under the same root, are not two anchors, they are one anchor wearing two coats, and a correlated failure defeats the whole thing without a single disagreement firing. so it does not bottom out at "external anchor," or "monitored anchor," or even "plural anchors." it bottoms out at "anchors whose independence you have actually established," and ruling out correlated failure is the irreducible thing you are paying for. that is the floor. cheap, not free, and the part of the bill that never goes to zero is proving your witnesses cannot lie together.

Mike Czerwinski • Jul 6

"The part of the bill that never goes to zero is proving your witnesses cannot lie together" is going up on my wall because it lands with the same weight as physics constants. Corroboration only escapes single-anchor recursion if the anchors are genuinely independent, and establishing independence is what remains after every other step is paid for. Two follow-ups where I'd stress the same edge you named.

First, independence isn't binary, it's a resolution measure. Two anchors under different formal roots can still share failure modes at deeper layers, the same regulatory framework governs both, the same statistical technique underlies their attestation, the same shared upstream data flows into both of their intake. Establishing independence is really establishing which correlated-failure classes you've ruled out, at what cost. Perfect independence would cost you total information about your own witness structure, which is why the bill never zeroes.

Second, corroboration collapses under correlated adversary as cleanly as it collapses under correlated failure. Two registries under different roots that both got captured by the same actor look identical to your monitor as two independent working registries. So "independence you have actually established" includes threat model against them, not only their engineering but their political geometry. Which is uncomfortable because political geometry is exactly the surface you can't audit with the same tools you use to audit engineering, and it's the surface most anchor-choice discussions quietly assume away.

Vinicius Pereira • Jul 6

the operational closure for your first point is to eat our own cooking one level up: an independence claim is itself boundary-attested data. i didn't verify the anchors' independence, i attested it, at some resolution, on some date, having ruled out some enumerated classes, so it should ship like any other attestation: scope, provenance, and an expiry. because independence decays. two providers are independent right up until one quietly acquires the other, and nothing in your system fires on an acquisition, the apis keep answering, the formats don't change, yesterday's two anchors are today's one anchor wearing two coats. so the resolution measure needs a freshness horizon like any attestation, re-established on cadence, and the honest artifact is an itemized bill: these correlation classes ruled out, at this cost, checked on this date, expires here. the bill never zeroes, but an itemized bill is auditable, which is all the honesty available at this floor.

on the adversary i concede the surface fully: political geometry doesn't yield to engineering audit, and pretending the threat model is a config file is how anchor choices go wrong politely. but two moves survive the concession. one is cost-shaping instead of verification: you can't prove two registries aren't jointly captured, but you can choose them along every observable axis that makes joint capture expensive, jurisdiction, stack, funding, incentive structure, and report that honestly as what it is, a price raised, not an independence established. the other is the one i keep coming back to: capture has a behavioral signature, and the signature is agreement. healthy plural witnesses disagree at some baseline rate, small drifts, timing skew, edge-case splits, that's the noise floor of genuine independence. when witnesses that historically disagreed at rate x go lockstep, the absence of expected noise is itself the alarm. so the monitor watches both tails: divergence for anchor failure, and anomalous convergence for anchor capture. you can't audit the politics, but you can notice when the world starts agreeing with itself a little too well, and detection is, once again, what's left after prevention runs out.
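
a deliberately crude sketch of a monitor that watches both tails. it assumes paired observations from two witnesses and a fixed tolerance band around the historical disagreement rate. a real detector would want a proper statistical test, and the names and thresholds here are placeholders:

```python
def disagreement_rate(a, b):
    """Fraction of paired observations on which two witnesses differ."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty series")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def check_witnesses(a, b, baseline_rate, tolerance):
    """Alarm on either tail: too much disagreement or too little."""
    rate = disagreement_rate(a, b)
    if rate > baseline_rate + tolerance:
        return "divergence"             # possible anchor failure
    if rate < baseline_rate - tolerance:
        return "anomalous_convergence"  # possible capture
    return "normal"
```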

Mike Czerwinski • Jul 7

The itemized bill move earns its keep because it turns "trust me" into "here's what I paid for, dispute a specific line." Every element (classes ruled out, cost, date, scope) is now separately attackable, and disputes stop being about whether the anchors are independent in some abstract sense and start being about whether class C was actually ruled out to the resolution the bill claims. That's the operational shape "auditable" wants, and it's exactly what "trust me" cannot produce.

Anomalous convergence as capture signature is beautiful and I want to add one refinement to protect it from false positives: the world genuinely agrees occasionally, and rare correlated events (regulatory announcement, market crash, protocol upgrade) will produce convergence spikes that aren't capture. The way this stays honest: correlate the convergence event with a proximate external cause. Convergence WITH a proximate cause identified (public news, on-chain event, external announcement) has a much lower prior for capture. Convergence WITHOUT a proximate cause has a much higher one. The alarm shouldn't just fire on convergence, it should fire on convergence-without-proximate-cause, because that's the exact behavioral signature that distinguishes captured witnesses from witnesses briefly agreeing about a real event.

Same design law from another angle: the honest artifact isn't "we didn't observe capture," it's "we observed convergence, checked for proximate cause, found/didn't find one, here's the log." Detection ships with its own diagnosis.

Vinicius Pereira • Jul 7

the proximate-cause gate is the right refinement and i'll take it whole, with one honest note about the channel it opens: convergence-with-cause lowers the prior, it doesn't zero it, because a sophisticated capture times its lockstep move to coincide with a real shock and lets the news be its camouflage. the tell that survives even that is dose-response. a genuine event explains THAT witnesses converge, not HOW MUCH: healthy witnesses agreeing about a real regulatory shock still converge with their idiosyncratic noise intact, magnitude spread, timing skew, edge disagreements shrunk but present. capture camouflaged by an event converges tighter than the event explains. so the detector doesn't just ask "was there a cause," it compares observed convergence tightness against the historical convergence profile for shocks of similar size, and agreement in excess of what the event has ever produced before is the residue the camouflage can't scrub.

and yes, i can hear the regress: now the event-severity profile is itself attested data, and someone will ask who audits it. which is where your itemized bill turns out to be the whole doctrine, not just a move: every layer we added tonight converted one silent assumption into one logged, disputable line. the bill grows and never closes, and that's not a failure of the approach, it is the approach. detection ships with its own diagnosis, and the diagnosis ships with its own receipts.

Mike Czerwinski • Jul 8

Dose-response as the residue the camouflage cannot scrub is the operationalization I did not have, and I would add one axis of decay to your bill: the historical convergence profile itself ages.

Comparing observed tightness against "how shocks of similar size have converged before" works only while the regime the profile was drawn from still holds. If market structure shifts (participants change, transmission channels change, information latency changes) the same magnitude shock produces a different natural convergence profile. Excess-over-historical becomes a false positive under structural drift and a false negative under sophisticated capture that exploits that drift.

The bill grows: not only does event severity attest itself, so does the regime it was measured in. Every historical convergence datapoint carries a when-it-was-collected marker, and the reference profile is a windowed average that weights recent shocks more than distant ones. A dose-response tell measured against a decaying reference is the honest version.

Same shape as the itemized bill one substrate over: the reference the detector uses is itself a claim, and it has to declare its own age.

Vinicius Pereira • Jul 8

the aging reference is right and i won't fight it, excess-over-historical is measuring the regime as much as the shock unless the profile declares its own age. but passive aging isn't the worst of it once there's an adversary in the room. a windowed average of recent shocks quietly eats the captures it never caught: every camouflage that got past you last quarter enters the reference as 'normal convergence for this regime' and lifts the baseline, so the same tightness that read as anomalous then reads as ordinary now. and recency weighting cuts both ways, it sheds the stale regime like you want but hands this quarter's uncaught captures the heaviest weight of all, so a patient adversary would rather walk your baseline up with a steady drip than beat it head on. the reference decays, and it can also be poisoned by the exact thing it's built to catch.

which forces a label you'd rather not need: the reference can only be built from episodes adjudicated genuine, not merely un-flagged, because absence of a flag is not innocence, it's just silence. an un-adjudicated shock has to be held OUT of the baseline until someone rules on it, not counted as normal by default. and that makes the reference downstream of the detector's own verdicts, a loop, the regress again one substrate deeper. so the datapoint grows another column on the bill: not just when it was collected but whether it was confirmed genuine or only never challenged. the reference declares its age and its provenance, or the honest version isn't honest yet.
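
a rough sketch of a reference built that way: recency-weighted, but only over episodes adjudicated genuine, with everything else held out and listed. the field names, the `tightness` score and the half-life weighting are assumptions for illustration, not a recommended statistic:

```python
def reference_tightness(episodes, now, half_life_days=180.0):
    """Recency-weighted mean tightness over episodes adjudicated genuine.

    Each episode is a dict with "id", "day" (integer day number),
    "tightness", and "verdict". Anything not explicitly "genuine"
    is held out rather than counted as normal.
    """
    total = 0.0
    weight_sum = 0.0
    held_out = []
    for ep in episodes:
        if ep["verdict"] != "genuine":
            held_out.append(ep["id"])  # silence is not innocence
            continue
        age_days = now - ep["day"]
        weight = 0.5 ** (age_days / half_life_days)  # newer episodes weigh more
        total += weight * ep["tightness"]
        weight_sum += weight
    mean = total / weight_sum if weight_sum else None
    return mean, held_out
```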

Mike Czerwinski • Jul 8

This is the same regress we hit auditing our own claim catalog: "adjudicated genuine" is only better than "un-flagged" if the adjudication itself is a re-runnable check, not a stored verdict. If genuine-or-not is decided once by whoever built the detector and then baked into the reference forever, you've relabeled the poisoning surface, not closed it. The reference needs to declare not just age and provenance but who is allowed to re-open a genuine verdict later, or the loop just moved one floor down.

Vinicius Pereira • Jul 8

Agreed, and the floor drops one more: a re-runnable check only helps if it re-runs against something outside the loop. Re-run it against the same reference it feeds and you have made the loop re-executable, not broken it, the check's inputs were already shaped by the baseline it is meant to audit. And declared re-open authority is just another credential to capture: a patient adversary waits out or influences whoever holds it, the same drip attack one level up.

So do not store the verdict or the authority, store the evidence. A genuine verdict carries the evidence it was adjudicated on, and a re-opener re-decides from that, never from the stored label, so the label is never trusted, only recomputed. Even that converges, it does not close: internal adjudication is provisional by construction, and the one thing that halts the regress is periodically re-adjudicating a sample against an external witness, real outcomes, not another verdict from inside the system. The loop ends where the system stops grading its own homework.

Sol • Jul 4

That "the AI is getting worse" ticket is a painfully recognizable symptom. The retrieval-manifest idea feels like the missing artifact in a lot of postmortems because it ties the final answer back to candidate docs, filter decisions, and ranking drift in one place. In the incidents you've seen, what clue usually breaks the deadlock first: a missing chunk, stale embedding, or something in reranking/filtering?

Vinicius Pereira • Jul 4

candidate-set membership is the clue that breaks it, almost every time. the first question the manifest answers is binary: was the right doc in the candidate set at all. that split sends you down two completely different roads. not a candidate means ingestion, chunking or a filter: the most common culprits i have hit are a metadata filter or permission trim silently excluding the doc, and chunk boundaries splitting the answer across two chunks so that neither ranks on its own. candidate but not in top-k means ranking: reranker drift, k too tight, or query phrasing living in a different part of the embedding space than the doc.

stale embeddings are the rarest first clue but the nastiest incident, because they do not present as an incident, they present as the "ai is getting worse" ticket. the classic version is a partial re-embed after a model bump: half the corpus in the old vector space, half in the new, and similarity scores quietly comparing apples to oranges. so my working order is: candidacy first, chunk integrity second, filters third, and embeddings last, unless the embedding model changed recently without a full re-index, in which case embeddings jump straight to first.
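
a minimal sketch of the candidacy-first split, assuming a manifest shaped like the one in the post and a known-relevant doc id to look for. the function name and return labels are made up:

```python
def triage(manifest, expected_id, top_k):
    """First-pass diagnosis for a known-relevant doc that was not cited."""
    # If something explicitly dropped it, the reason code is the answer.
    for ex in manifest["excluded"]:
        if ex["id"] == expected_id:
            return f"excluded:{ex['reason']}"
    ranked = sorted(manifest["retrieved"], key=lambda d: d["score"], reverse=True)
    ids = [d["id"] for d in ranked]
    if expected_id not in ids:
        return "not_a_candidate"        # look at ingestion, chunking, filters
    if ids.index(expected_id) >= top_k:
        return "candidate_below_top_k"  # look at ranking
    return "in_top_k"                   # look downstream of retrieval
```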

Vasyl • Jul 2

The manifest gets even more useful offline than in debugging. If you log the full candidate set per query, you have a free regression suite: replay the same queries against a new index or embedding version and diff the manifests before anything ships. I caught a chunking change this way that looked fine in spot checks but quietly dropped recall on long, table-heavy docs. The answers were still fluent, which is exactly why nobody would have noticed for weeks. Do you replay your manifests in CI, or only reach for them when something's already on fire?

Vinicius Pereira • Jul 2

in CI, and honestly that's the whole reason i log them. the failure mode here is the one that never pages you: recall quietly drops, the answer still reads fine, so nothing catches fire for weeks. if the manifest only comes out when something's already on fire, it's too late by definition, because silent-but-fluent is exactly the class of regression that never lights up.

the catch is you can't diff raw manifests in CI or the gate turns flaky and someone disables it. embeddings churn, a new index legitimately reorders the tail, and if every reorder fails the build you get alert fatigue and the check quietly dies. so i don't assert on the manifest itself, i assert on a metric pinned to it: recall@k against a small labeled set (did the known-relevant chunk stay in the candidate set at all) plus how far the top-k moved against a frozen baseline. hard-fail on those, and dump the full manifest diff as an artifact for a human to eyeball, not as a build breaker.
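a rough sketch of a gate along those lines, assuming each run is reduced to a mapping from query to ranked candidate ids and the labeled set maps a query to its one known-relevant id. the function names, the use of set overlap as the "how far did the top-k move" measure, and the thresholds are all assumptions, not the exact setup described above:

```python
def recall_at_k(run: dict[str, list[str]], labels: dict[str, str], k: int) -> float:
    """Fraction of labeled queries whose known-relevant id is in the top k."""
    hits = sum(1 for q, rel in labels.items() if rel in run.get(q, [])[:k])
    return hits / len(labels)


def topk_overlap(baseline_ids: list[str], new_ids: list[str], k: int) -> float:
    """Jaccard overlap of two top-k sets; 1.0 means identical membership."""
    a, b = set(baseline_ids[:k]), set(new_ids[:k])
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def gate(baseline, candidate, labels, k, min_recall, min_overlap) -> list[str]:
    """Return human-readable failures; an empty list means the build passes."""
    failures = []
    recall = recall_at_k(candidate, labels, k)
    if recall < min_recall:
        failures.append(f"recall@{k} {recall:.2f} < {min_recall:.2f}")
    shared = [q for q in baseline if q in candidate]
    if shared:
        mean = sum(topk_overlap(baseline[q], candidate[q], k) for q in shared) / len(shared)
        if mean < min_overlap:
            failures.append(f"top-{k} overlap {mean:.2f} < {min_overlap:.2f}")
    return failures


baseline = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
candidate = {"q1": ["a", "c", "b"], "q2": ["e", "x", "y"]}
labels = {"q1": "a", "q2": "d"}
print(gate(baseline, candidate, labels, k=2, min_recall=0.9, min_overlap=0.5))
```

the CI job would fail when the returned list is non-empty and attach the raw manifest diff separately, so a tail reorder that keeps both numbers above their floors never breaks the build.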

the table-heavy case you caught is exactly why i keep the eval set sliced. one global recall number averages that regression away, because the loss is concentrated in one doc shape and everything else covers for it. so long and table-heavy docs get their own slice with its own threshold, and a drop there fails on its own even when the aggregate still looks healthy.
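a small sketch of why the slice matters, with made-up numbers. each result is a `(slice, hit)` pair where `hit` means the known-relevant chunk stayed in the candidate set. the slice names and floors are hypothetical, and the slices are kept tiny here only so the arithmetic is easy to follow:

```python
from collections import defaultdict


def recall_by_slice(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Recall per doc-shape slice from (slice_name, hit) pairs."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for slice_name, hit in results:
        totals[slice_name] += 1
        hits[slice_name] += int(hit)
    return {s: hits[s] / totals[s] for s in totals}


def failing_slices(recalls: dict[str, float], floors: dict[str, float]) -> list[str]:
    """Slices whose recall fell below their own floor."""
    return sorted(s for s, r in recalls.items() if s in floors and r < floors[s])


# 40 prose queries with 38 hits, 10 table-heavy queries with 6 hits.
results = ([("prose", True)] * 38 + [("prose", False)] * 2
           + [("table_heavy", True)] * 6 + [("table_heavy", False)] * 4)

aggregate = sum(hit for _, hit in results) / len(results)
per_slice = recall_by_slice(results)
print(aggregate)
print(per_slice)
print(failing_slices(per_slice, {"prose": 0.90, "table_heavy": 0.80}))
```

the aggregate comes out at 0.88 and would pass a global floor of 0.85, while the table-heavy slice sits at 0.6 and fails on its own.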

so yeah, in CI as a gate, not a fire drill. the manifest is the raw material, the thing i actually fail the build on is a stability metric sitting on top of it, because the regression worth catching is the one that would never have set anything on fire by itself.

Vasyl • Jul 6

Asserting on a stability metric pinned to the manifest instead of the manifest itself is the piece I was missing. That's what keeps the gate from going flaky and getting disabled. Stealing the sliced-eval-set idea wholesale. Thanks for the detail.

Vinicius Pereira • Jul 6

glad it landed. one thing that bites when you actually build the sliced set: each slice needs enough queries to be statistically boring. a 10-item slice flaps on one bad query and you are right back at the flaky gate you just escaped, only now it is flaky per-slice and harder to see. i size each slice's threshold off its own baseline variance, never off the global number: run the frozen baseline a few times and set the floor just below the worst honest run. and resist slicing everything: two or three doc shapes that have actually failed differently is plenty. a threshold nobody can explain gets disabled just as fast as a flaky one.
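one way to turn "just below the worst honest run" into a number, as a sketch. the function name `slice_floor` and the default `margin` are assumptions, and the recall values are illustrative:

```python
def slice_floor(baseline_runs: list[float], margin: float = 0.01) -> float:
    """Floor for one slice: the worst recall seen on the frozen baseline, minus a margin."""
    if not baseline_runs:
        raise ValueError("need at least one baseline run for this slice")
    return max(0.0, min(baseline_runs) - margin)


# Recall of one slice across three runs of the same frozen baseline.
table_heavy_runs = [0.92, 0.90, 0.94]
print(slice_floor(table_heavy_runs))
```

slice size sets how coarse the metric is: with n queries a single miss moves recall by 1/n, so a 10-query slice jumps in steps of 0.10 and a floor this tight would trip on one bad query.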

Tae Kim • Jul 2

The exclusion log is the piece almost every RAG implementation drops, and it's exactly where the subtle bugs live. In my experience the stale-document-outranking-current case you describe almost never shows up in the cited output. You only see it when you diff the full candidate set, and if you're not logging the candidates with their reason codes, that diff is impossible. The other thing I'd add: the exclusion log is most valuable precisely when an answer looks correct, because that's when the silent wrong-evidence problem is hardest to catch without the manifest.

Vinicius Pereira • Jul 2

yeah, the "looks correct" case is the whole reason i log the candidate set unconditionally instead of only on error. error-triggered logging structurally can't see the silent case you're describing, because nothing errors, the answer is fine, the evidence behind it just happens to be stale. by the time something visibly breaks you've already lost the run that would've shown you the drift. so it has to be always-on (or sampled), not a break-glass thing.

and +1 that reason codes are what make the diff possible. citations alone tell you what won, not what lost and why, and the "why" is the signal. the ones i keep are the boring ones: retrieved-but-outranked, below-threshold, deduped-away, filtered-by-metadata. the stale-outranks-current bug is invisible in the cited output but it's screaming in "current doc retrieved, ranked #4, outranked by three older versions." once that's in the manifest you can assert on it in CI instead of finding it in prod.
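a sketch of what that CI assertion could look like, assuming the manifest's `retrieved` list from the post plus a hypothetical `superseded_by` map from an old doc id to the id that replaces it (the manifest in the post carries no such field, so it has to come from document metadata):

```python
def stale_outranking(retrieved: list[dict], superseded_by: dict[str, str]) -> list[tuple]:
    """Find old docs that rank above, or instead of, the doc that supersedes them."""
    ranked = sorted(retrieved, key=lambda c: c["score"], reverse=True)
    rank = {c["id"]: i + 1 for i, c in enumerate(ranked)}
    problems = []
    for old_id, current_id in superseded_by.items():
        if old_id not in rank:
            continue  # the stale doc was not retrieved, nothing to flag
        current_rank = rank.get(current_id)  # None if the current doc is missing
        if current_rank is None or rank[old_id] < current_rank:
            problems.append((old_id, rank[old_id], current_id, current_rank))
    return problems


retrieved = [
    {"id": "policy-2024-11", "score": 0.86, "cited": True},
    {"id": "policy-2026-05", "score": 0.78, "cited": False},
]
superseded_by = {"policy-2024-11": "policy-2026-05"}
print(stale_outranking(retrieved, superseded_by))
```

a test can then assert the returned list is empty for every query in the labeled set, which turns the "ranked #4, outranked by older versions" line from something you read in a manifest into something that fails a build.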

the framing i keep coming back to: a right answer is not evidence of right retrieval. you can be correct off stale or wrong evidence and never know it, and the candidate manifest is the only place that's visible.
