DEV Community

elvisyao007
I built a tiny tool to catch the metric trap from my last post

In my last post I found that 33/100 "grounded-but-wrong" answers in my RAG
eval were a measurement artifact, not real failures. The culprit: proportion
recall (hits divided by the number of relevant docs) silently breaks on
multi-answer datasets when k is small. A query with more relevant docs than
k can never score 1.0, no matter how good the retriever is.
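
To make the trap concrete, here is the bound written out. The notation is my own shorthand for this post, not something taken from the tool: R is the set of relevant docs for a query and top_k is the set of the first k retrieved docs. The numbers in the second line are illustrative, not from my eval.

```latex
\mathrm{recall@}k \;=\; \frac{|\mathrm{top}_k \cap R|}{|R|}
\;\le\; \frac{\min(k,\,|R|)}{|R|}

% Illustrative case: |R| = 5 relevant docs, k = 2
\frac{\min(2,\,5)}{5} \;=\; \frac{2}{5} \;=\; 0.4
```

With a pass threshold of 0.5, that query fails even under perfect retrieval, which is the kind of "failure" that says nothing about the retriever.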

So I packaged the diagnostic into a standalone tool: eval-sanity.

pip install eval-sanity

It takes the retrieved and relevant doc IDs you already have and tells you
whether your recall metric is structurally capable of saying what you think
it says, before you trust the number on your dashboard.

What it checks:

  • oracle ceiling: the best any retriever could score at your k
  • threshold reachability: how many queries can never clear your threshold, regardless of retrieval quality
  • hit@k vs proportion divergence: where the two metrics disagree
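
For a feel of what these three checks compute, here is a minimal sketch in plain Python. It is not the eval-sanity API: every function name, the `(retrieved, relevant)` tuple layout, and the sample data are hypothetical, and "divergence" is my simplified reading (hit@k passes while proportion recall falls below the threshold). Scoring queries with no relevant docs as 0 is also an assumption.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical layout: (ranked retrieved doc IDs, set of relevant doc IDs)
Query = Tuple[Sequence[str], Set[str]]


def oracle_ceiling(n_relevant: int, k: int) -> float:
    """Best proportion recall any retriever could reach with k slots."""
    if n_relevant == 0:
        return 0.0  # assumption: queries with no relevant docs score 0
    return min(k, n_relevant) / n_relevant


def proportion_recall(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant docs that appear in the top k."""
    if not relevant:
        return 0.0  # same assumption as above
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> bool:
    """True if at least one relevant doc appears in the top k."""
    return any(doc_id in relevant for doc_id in retrieved[:k])


def sanity_report(queries: List[Query], k: int, threshold: float) -> Dict[str, object]:
    """Run the three checks over a list of queries."""
    ceilings = [oracle_ceiling(len(rel), k) for _, rel in queries]
    # Queries whose ceiling is below the threshold can never pass.
    unreachable = [i for i, c in enumerate(ceilings) if c < threshold]
    # Queries where hit@k says "found it" but proportion recall says "failed".
    divergent = [
        i
        for i, (ret, rel) in enumerate(queries)
        if hit_at_k(ret, rel, k) and proportion_recall(ret, rel, k) < threshold
    ]
    mean_ceiling = sum(ceilings) / len(ceilings) if ceilings else 0.0
    return {
        "mean_oracle_ceiling": mean_ceiling,
        "unreachable": unreachable,
        "divergent": divergent,
    }


# Made-up sample data, for illustration only.
sample: List[Query] = [
    (["a", "b", "c"], {"a"}),                      # single-answer, retrieved
    (["d", "e", "x"], {"d", "e", "f", "g", "h"}),  # five answers, k too small
    (["z", "y"], {"m"}),                           # genuine miss
]
report = sanity_report(sample, k=2, threshold=0.5)
print(report["unreachable"])  # [1]
print(report["divergent"])    # [1]
```

The second query fills both of its top-2 slots with relevant docs and still lands at 0.4, so it shows up as both unreachable and divergent. The third query is a real miss and is flagged by neither check.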

Zero dependencies. No models. No judge calls. Pure deterministic math.

github.com/elvisyao007/eval-sanity

The motivation story is in the blog post that found the artifact.
