I built a tiny tool to catch the metric trap from my last post

#rag #evaluation #python #opensource

In my last post I found that 33/100 "grounded-but-wrong" answers in my RAG
eval were a measurement artifact, not real failures. The culprit: proportion
recall with a relevant-doc-count denominator silently breaks on multi-answer
datasets when k is small. A query with more relevant docs than k can never
score 1.0, however good the retriever is.
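
To make the failure concrete, here is a minimal sketch of proportion recall on a single query. It is my own illustration, not eval-sanity's code: the function name and the doc IDs are made up for the example.

```python
def proportion_recall_at_k(retrieved, relevant, k):
    """Fraction of ALL relevant docs that show up in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("query has no relevant docs")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)  # denominator = relevant-doc count


# Hypothetical multi-answer query: 5 relevant docs, but only k=3 slots.
relevant = ["d1", "d2", "d3", "d4", "d5"]
retrieved = ["d1", "d2", "d3"]  # every retrieved doc is relevant

score = proportion_recall_at_k(retrieved, relevant, k=3)
print(score)  # 0.6
```

Every slot is filled with a relevant doc, yet the score tops out at 3/5. Any threshold above 0.6 marks this query as a failure no matter what the retriever does.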

So I packaged the diagnostic into a standalone tool: eval-sanity.

pip install eval-sanity

It takes the retrieved and relevant doc IDs you already have and tells you
whether your recall metric is structurally capable of saying what you think
it says — before you trust the number on your dashboard.

What it checks:

oracle ceiling: the best any retriever could score at your k
threshold reachability: how many queries can never clear your threshold, regardless of retrieval quality
hit@k vs proportion divergence: where the two metrics disagree
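
Here is a rough sketch of how those three checks can be computed by hand. This is not the eval-sanity API: the function names, the input shape (a list of dicts with "retrieved" and "relevant" keys), and the reading of "divergence" as "hit@k passes while proportion recall falls below the threshold" are all assumptions made for illustration.

```python
# Illustrative only: names and input shape are hypothetical, not eval-sanity's.
# Assumes every query has at least one relevant doc.

def oracle_ceiling(num_relevant, k):
    """Best proportion recall any retriever could reach with k slots."""
    return min(k, num_relevant) / num_relevant


def proportion_recall(retrieved, relevant, k):
    """Share of all relevant docs found in the top-k results."""
    relevant = set(relevant)
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_at_k(retrieved, relevant, k):
    """1.0 if at least one relevant doc is in the top-k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0


def sanity_report(queries, k, threshold):
    ceilings = [oracle_ceiling(len(set(q["relevant"])), k) for q in queries]
    # Queries whose ceiling is below the threshold can never pass.
    unreachable = [i for i, c in enumerate(ceilings) if c < threshold]
    # Assumed definition of divergence: hit@k says "found it",
    # proportion recall says "below threshold".
    divergent = [
        i for i, q in enumerate(queries)
        if hit_at_k(q["retrieved"], q["relevant"], k) == 1.0
        and proportion_recall(q["retrieved"], q["relevant"], k) < threshold
    ]
    return {
        "mean_oracle_ceiling": sum(ceilings) / len(ceilings),
        "unreachable": unreachable,
        "divergent": divergent,
    }


queries = [
    {"relevant": ["a"], "retrieved": ["a", "x", "y"]},
    {"relevant": ["a", "b", "c", "d", "e"], "retrieved": ["a", "b", "c"]},
    {"relevant": ["a", "b"], "retrieved": ["x", "y", "z"]},
]
report = sanity_report(queries, k=3, threshold=0.8)
print(report["unreachable"])  # [1]
print(report["divergent"])    # [1]
```

In this toy data the second query is both unreachable and divergent: its ceiling is 3/5, so it fails a 0.8 threshold even with a perfect top-3. The third query is a genuine miss, and both metrics agree on it.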

Zero dependencies. No models. No judge calls. Pure deterministic math.

→ github.com/elvisyao007/eval-sanity

The full backstory is in my last post, the one that found the artifact.