In my last post I found that 33 of the 100 "grounded-but-wrong" answers in my
RAG eval were a measurement artifact, not real failures. The culprit:
proportion recall, which divides hits by the number of relevant docs, silently
breaks on multi-answer datasets when k is small. A query with more relevant
docs than k can never score 1.0, no matter how good the retriever is.
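To make the failure mode concrete, here is a minimal sketch of proportion recall on a made-up query. The function name `recall_at_k` and the doc IDs are illustrative assumptions, not taken from my eval code or from the tool.

```python
def recall_at_k(retrieved, relevant, k):
    """Proportion recall: share of the relevant docs that appear in the top k."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("query has no relevant docs")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical query with 10 relevant docs, evaluated at k=3.
relevant = [f"doc{i}" for i in range(10)]
perfect_top3 = ["doc0", "doc1", "doc2"]  # every retrieved doc is relevant

score = recall_at_k(perfect_top3, relevant, k=3)
print(score)  # 0.3
```

The retriever did everything it could with three slots, yet the metric reports 0.3. Any threshold above that marks this query as a failure by construction.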
So I packaged the diagnostic into a standalone tool: eval-sanity.
pip install eval-sanity
It takes the retrieved and relevant doc IDs you already have and tells you
whether your recall metric is structurally capable of saying what you think
it says — before you trust the number on your dashboard.
What it checks:
- oracle ceiling: the best any retriever could score at your k
- threshold reachability: how many queries can never clear your threshold, regardless of retrieval quality
- hit@k vs proportion divergence: where the two metrics disagree
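The three checks above can be sketched in a few lines of plain Python. This is an illustration of the math only, not eval-sanity's actual API: every function name here is hypothetical, and "divergence" is assumed to mean "hit@k reports a success while proportion recall falls below the threshold".

```python
def oracle_ceiling(n_relevant, k):
    """Best proportion recall any retriever could reach with k slots."""
    if n_relevant <= 0:
        raise ValueError("query has no relevant docs")
    return min(k, n_relevant) / n_relevant


def hit_at_k(retrieved, relevant, k):
    """1.0 if at least one relevant doc is in the top k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0


def proportion_recall(retrieved, relevant, k):
    """Share of the relevant docs that appear in the top k."""
    relevant = set(relevant)
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def sanity_report(queries, k, threshold):
    """queries: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    ceilings = [oracle_ceiling(len(set(rel)), k) for _, rel in queries]
    # Queries whose ceiling is below the threshold can never pass.
    unreachable = [i for i, c in enumerate(ceilings) if c < threshold]
    # Assumed definition: hit@k says "success", proportion recall says "fail".
    divergent = [
        i for i, (ret, rel) in enumerate(queries)
        if hit_at_k(ret, rel, k) == 1.0
        and proportion_recall(ret, rel, k) < threshold
    ]
    return {
        "mean_oracle_ceiling": sum(ceilings) / len(ceilings),
        "unreachable": unreachable,
        "divergent": divergent,
    }


# Made-up data: three queries at k=3 with a 0.8 recall threshold.
queries = [
    (["a", "b", "x"], ["a", "b"]),                      # both relevant docs found
    (["c", "d", "e"], ["c", "d", "e", "f", "g", "h"]),  # 6 relevant, only 3 slots
    (["x", "y", "z"], ["q"]),                           # genuine miss
]
report = sanity_report(queries, k=3, threshold=0.8)
print(report["unreachable"])  # [1]
print(report["divergent"])    # [1]
```

Query 1 is the interesting one: its ceiling is 0.5, so it fails a 0.8 threshold even with a perfect top 3, while hit@k counts it as a success. Query 2 fails under both metrics, which is what a real retrieval miss looks like.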
Zero dependencies. No models. No judge calls. Pure deterministic math.
→ github.com/elvisyao007/eval-sanity
The motivation story is in the blog post that found the artifact.
