This is a follow-up to an earlier post where I found that my context-recall
metric over-reported retrieval failure (it flagged 33/100 answers that were
actually fine). This post is about the opposite and more dangerous failure: a
metric that under-reports. Retrieval quietly gets worse, your generation
metrics stay green, and the dashboard shows nothing. I packaged the detector
intoeval-sanityv0.2.
The failure mode
Here is a pattern that shows up repeatedly in production RAG postmortems. A
system ships with a healthy offline eval — say faithfulness around 0.9. Weeks
later, users start reporting that some fraction of answers miss a key fact. The
team checks the dashboard: faithfulness is still ~0.9. Nothing looks wrong.
What actually happened: the retriever degraded — a re-index, an embedding model
swap, a chunking change — and started missing relevant documents on a subset of
queries. But the generator kept producing fluent, internally-consistent answers
from whatever partial context it received. Faithfulness measures "is the answer
grounded in the retrieved context," not "was the right context retrieved." So
faithfulness stayed high while retrieval silently fell off.
If your dashboard tracks only generation-stage metrics, this regression is
invisible. That's the trap: the two metrics move independently, and the healthy
one masks the broken one.
Why a single eval run can't catch it
You can't see this in one snapshot. A faithfulness of 0.9 looks fine in
isolation. The signal only exists in the comparison between two runs — a
baseline and a current — where you ask: did retrieval drop while generation
held steady? That specific divergence is the fingerprint of a silent
regression.
But naive comparison creates a different problem: noise. Eval scores wobble
between runs from judge variance and sampling. If you alarm on "current is
lower than baseline," you'll fire constantly on noise, and an alarm that cries
wolf gets ignored by the second week. So the detection has to separate real
movement from jitter.
What eval-sanity v0.2 does
pip install eval-sanity
detect_regression takes two eval runs — the retrieved/relevant doc IDs you
already have, plus your generation scores (faithfulness or similar) passed in —
and reports which of four states you're in:
- silent regression (alarm): retrieval dropped significantly, generation did not move
- visible regression: both dropped — your dashboard already shows this, no alarm needed
- generation-only: generation moved, retrieval held
- stable / noise: nothing moved beyond jitter
The "significantly" is the important part. Every delta goes through a paired
bootstrap (10k resamples, fixed seed) and a 95% confidence interval. A
change counts only if its CI excludes zero. This is what keeps it from firing
on noise. It runs in a fraction of a second, with zero dependencies and no model
calls — it's pure deterministic math on the IDs and scores you already have.
A worked example
Here's a synthetic case that makes the divergence concrete (numbers from the
package's demo, not a real client system):
A baseline run with recall@5 = 0.95 and faithfulness = 0.90. A current run where
retrieval has degraded to recall@5 = 0.667, but faithfulness is unchanged at
0.90. The detector reports:
recall@5 0.95 → 0.667 CI [-0.417, -0.150] significant drop
faithfulness 0.90 → 0.90 CI [-0.005, +0.005] unchanged
*** ALARM *** SILENT REGRESSION
Retrieval dropped while generation held steady — your dashboard won't show this.
And the control case — a current run where only 2 of 60 queries flip, pure
jitter:
recall@5 CI [-0.083, +0.000] includes zero → flat
No significant change; within noise.
Same machinery, no false alarm. That control case matters more than the alarm
case: a regression detector you can't trust to stay quiet is worse than no
detector at all.
How to wire it in
The point isn't to run this once. It's to run it on every meaningful change —
a re-index, an embedder swap, a chunking tweak — comparing against your last
known-good baseline, before the change reaches users. It's a regression gate,
the retrieval-stage equivalent of a test that fails CI when generation metrics
alone would have stayed green.
A complete RAG eval program needs at least one retrieval-stage signal alongside
the generation-stage ones, precisely so the healthy metric can't hide the
broken one. eval-sanity is a small, dependency-free way to make that
retrieval-stage check a regression gate rather than a number nobody compares
across runs.
→ github.com/elvisyao007/eval-sanity
The detection logic, the bootstrap implementation, and the full test suite
(covering each of the four states plus the noise-rejection case) are in the
repo. All example numbers are from the package's own deterministic demo.

Top comments (0)