The 33 'grounded-but-wrong' answers were a metric artifact: how ID-based context recall lies on multi-answer datasets

#rag #llm #evaluation #retrieval

Correction note: This post corrects a claim I made in two earlier posts. I previously reported "33/100 grounded-but-wrong" answers in my JQaRA RAG eval and framed them as a retrieval/generation failure worth fixing with hybrid search. After decomposing the numbers, zero of those 33 were real failures — all 33 are an artifact of how I measured context recall. This post shows exactly how the metric misled me, because the failure mode is one a lot of people are exposed to without knowing it.

TL;DR

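My pipeline used an ID-based context recall: |retrieved ∩ relevant_doc_ids| / |relevant_doc_ids|. This is a real, widely used variant: it is the form RAGAS exposes as IDBasedContextRecall, and NonLLMContextRecall uses the same reference-count denominator with string matching in place of ID matching.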
I flagged answers as grounded-but-wrong when faithfulness ≥ 0.8 AND context_recall < 0.5. 33/100 queries got flagged.
When I checked hit@5 (did at least one relevant doc make it into the top-5 context?), it was 98/100. Retrieval was not failing.
The 33 flagged queries had a mean of 16 relevant documents each; 28 of 33 had more than 10.

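With k=5, a query with 16 relevant documents can score at most 5/16 ≈ 0.31 on ID-based recall, which is below the 0.5 threshold even for a perfect retriever. For every query with more than 10 relevant docs (28 of the 33), the threshold was unreachable by construction.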

The only 2 genuine retrieval misses (hit@5 = 0) scored faithfulness = 0.0 and were correctly not flagged as grounded-but-wrong. The pipeline worked; the metric definition didn't fit the dataset.

The lesson isn't "RAGAS is broken." It's that a recall metric whose denominator is the relevant-document count silently breaks when the dataset has many relevant docs per query and your k is small — and that combination is easy to walk into.

What I claimed earlier

In two earlier posts I reported a JQaRA evaluation of a local RAG stack (ruri-v3 retriever, qwen3:32b generator, gemma4:31b judge). One headline number was 33/100 grounded-but-wrong: answers the judge rated highly faithful to their retrieved context, yet whose retrieved context appeared to be missing the relevant material. I read that as "the model is confidently using incomplete context," and I lined up a hybrid (BM25 + dense) experiment to fix the retrieval side.

That story was wrong. Here's how I found out.

The gate that saved the experiment

Before running hybrid, I computed a ceiling: on a fixed 100-candidate reranking dataset like JQaRA, context recall can't exceed "how often the relevant docs are even in the candidate set." The gap between that ceiling and my measured context recall looked large (+0.20), so the gate said "continue."

But the rank distribution was suspicious. Among queries where a relevant doc was in the candidate set, the dense retriever already ranked it at p50 = 1, p90 = 2. If relevant docs are almost always at the very top, where is a +0.20 recall gap coming from?
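A rank check like this can be computed straight from the ranked candidate lists. Below is a minimal sketch, not the analysis script from the repo: the function names are hypothetical, the ranks in the usage lines are made-up illustration values, and the nearest-rank percentile convention is an assumption (other conventions interpolate and can give slightly different numbers).

```python
import math


def first_relevant_rank(ranked_ids, relevant_ids):
    """Return the 1-based rank of the first relevant doc, or None if absent."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return rank
    return None


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the value at position ceil(pct% * N) when sorted."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    position = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[position - 1]


# Made-up first-relevant ranks for ten queries (illustration only).
example_ranks = [1, 1, 1, 2, 1, 1, 3, 1, 1, 2]
p50 = nearest_rank_percentile(example_ranks, 50)
p90 = nearest_rank_percentile(example_ranks, 90)
```

Queries where `first_relevant_rank` returns `None` would be excluded before taking percentiles, which matches the "among queries where a relevant doc was in the candidate set" condition.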

So instead of running hybrid, I decomposed the gap.

The decomposition

Two numbers ended the experiment before it started.

hit@5 = 98/100. For 98 of 100 queries, at least one relevant document was in the top-5 context handed to the generator. Retrieval was essentially doing its job.

Mean relevant docs among the 33 flagged queries = 16.0, with 28 of 33 above 10 relevant docs.

Now the metric definition collides with the dataset. ID-based context recall is:

context_recall = |retrieved_doc_ids ∩ relevant_doc_ids| / |relevant_doc_ids|

With k=5 and 16 relevant docs, the best achievable value is 5/16 ≈ 0.31. The grounded-but-wrong flag fires when context_recall < 0.5. A perfect retriever scores 0.31 here and gets flagged anyway. The 0.5 threshold isn't measuring retrieval quality on these queries. For a perfect retriever it is measuring "does this query have more than 10 relevant docs" (5/n < 0.5 exactly when n > 10), which on JQaRA it usually does.
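The collision is easy to reproduce in a few lines. This is an illustrative sketch of the formula above, not the RAGAS implementation; the function names and the synthetic doc IDs are assumptions made for the example.

```python
def id_based_context_recall(retrieved_ids, relevant_ids):
    """|retrieved ∩ relevant| / |relevant|, computed on document IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("query has no relevant documents")
    return len(set(retrieved_ids) & relevant) / len(relevant)


def recall_ceiling(k, n_relevant):
    """Best recall any retriever can reach when only k docs are returned."""
    return min(k, n_relevant) / n_relevant


# Synthetic query with 16 relevant docs; a perfect retriever returns 5 of them.
relevant = [f"doc{i}" for i in range(16)]
perfect_top5 = relevant[:5]

score = id_based_context_recall(perfect_top5, relevant)
ceiling = recall_ceiling(k=5, n_relevant=16)
flagged = score < 0.5  # the recall half of the grounded-but-wrong rule
```

Here `score` and `ceiling` are both 5/16 = 0.3125, so `flagged` is true even though every retrieved document is relevant.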

Swap in hit@5 (≥1 relevant doc retrieved) as the recall signal and grounded-but-wrong drops from 33 to 0. The 2 queries that genuinely retrieved nothing relevant scored faithfulness 0.0 — the judge caught them, and they were never in the 33. The pipeline was working the whole time.
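Here is a small sketch of the two flagging rules side by side. The function names and the two example queries are hypothetical, made up for illustration rather than taken from the eval output; the thresholds are the ones used in this post.

```python
def hit_at_k(retrieved_ids, relevant_ids):
    """True if at least one retrieved doc ID is relevant."""
    relevant = set(relevant_ids)
    return any(doc_id in relevant for doc_id in retrieved_ids)


def id_recall(retrieved_ids, relevant_ids):
    """ID-based recall: share of relevant docs that were retrieved."""
    relevant = set(relevant_ids)
    return len(set(retrieved_ids) & relevant) / len(relevant)


def flagged_by_recall(faithfulness, retrieved_ids, relevant_ids,
                      faith_min=0.8, recall_min=0.5):
    """Original rule: faithful answer, but proportion recall below threshold."""
    return (faithfulness >= faith_min
            and id_recall(retrieved_ids, relevant_ids) < recall_min)


def flagged_by_hit(faithfulness, retrieved_ids, relevant_ids, faith_min=0.8):
    """Revised rule: faithful answer, but nothing relevant was retrieved."""
    return (faithfulness >= faith_min
            and not hit_at_k(retrieved_ids, relevant_ids))


# Hypothetical queries, each with 16 relevant docs and k=5.
relevant = [f"d{i}" for i in range(16)]
good_top5 = relevant[:5]                      # every retrieved doc is relevant
miss_top5 = ["x1", "x2", "x3", "x4", "x5"]    # nothing relevant retrieved

old_flag_good = flagged_by_recall(0.9, good_top5, relevant)
new_flag_good = flagged_by_hit(0.9, good_top5, relevant)
old_flag_miss = flagged_by_recall(0.0, miss_top5, relevant)
new_flag_miss = flagged_by_hit(0.0, miss_top5, relevant)
```

The well-retrieved query is flagged only under the original rule, and the genuine miss is flagged under neither rule because its faithfulness is below 0.8.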

Why this is easy to walk into

This isn't a RAGAS bug. ID-based / non-LLM context recall is a legitimate, documented metric, and on a single-answer dataset (one gold doc per query) the denominator is 1 and none of this happens. The trap is the interaction:

Denominator = relevant-doc count (not "claims in the reference answer," which is RAGAS's default LLM-based variant)
Many relevant docs per query (JQaRA averages well above 10)
Small k (I used 5)
A fixed threshold (0.5) applied uniformly across queries with wildly different denominators

Each choice is individually reasonable. Together they manufacture a "failure" rate that tracks dataset structure, not system quality. If you picked an ID-based recall because it's deterministic and cheap (I did — no judge calls, fully reproducible), this is exactly the blind spot you inherit.
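The interaction can be written as a single inequality. In the sketch below, n is the number of relevant docs for a query, k is the context size, and τ is the flag threshold; these symbols are my shorthand, and the derivation assumes 0 < τ ≤ 1.

```latex
\[
\text{context\_recall} \;\le\; \frac{\min(k, n)}{n}
\]
\[
\frac{\min(k, n)}{n} < \tau
\;\iff\;
n > \frac{k}{\tau}
\qquad (0 < \tau \le 1)
\]
\[
k = 5,\quad \tau = 0.5 \;\Longrightarrow\; n > 10
\]
```

When n ≤ k the ceiling is 1, so the threshold is always reachable. When n > k the ceiling is k/n, which falls below τ exactly when n > k/τ. With my settings that is any query with more than 10 relevant docs, no matter how good the retriever is.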

What I'd actually do

Match the metric to the dataset's answer multiplicity. For multi-answer sets, a denominator that can exceed k makes proportion-style recall uninterpretable. Use hit@k for "did we get anything relevant," and reserve proportion recall for when k ≥ typical relevant-doc count.
Make thresholds relative, not absolute. context_recall < 0.5 means something different when the ceiling is 1.0 vs. 0.31. Normalize against the achievable ceiling, or threshold on hit@k instead.
Sanity-check any "failure" cohort against an oracle. If a perfect retriever would also be flagged, the flag is about your metric, not your system. This single check would have caught it before I wrote the first post.
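The last two points can be sketched in a few lines. This is one possible normalization (dividing by the achievable ceiling min(k, n) instead of n), not a standard RAGAS metric, and the function names are hypothetical.

```python
def ceiling_normalized_recall(retrieved_ids, relevant_ids, k):
    """Relevant docs in the top k, divided by the most that could be there."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("query has no relevant documents")
    top_k = list(retrieved_ids)[:k]
    hits = len(set(top_k) & relevant)
    return hits / min(k, len(relevant))


def oracle_would_be_flagged(n_relevant, k, recall_min=0.5):
    """True if even a perfect top-k retriever scores below the threshold."""
    return min(k, n_relevant) / n_relevant < recall_min


# Synthetic query: 16 relevant docs, perfect top-5 retrieval.
relevant = [f"doc{i}" for i in range(16)]
normalized = ceiling_normalized_recall(relevant[:5], relevant, k=5)

# Oracle check across a few relevant-doc counts at k=5.
oracle_flags = {n: oracle_would_be_flagged(n, k=5) for n in (1, 5, 10, 11, 16)}
```

With this normalization the perfect top-5 scores 1.0 instead of 0.31, and the oracle check shows which relevant-doc counts make the 0.5 threshold unreachable at k=5: it is false for 1, 5, and 10, and true for 11 and 16.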

The correction

I've added update notes to the two earlier posts pointing here. To be precise about what changed:

The hybrid experiment is archived, not run — its motivation no longer exists. I'd rather publish that than run an experiment to make a flawed number look better.

All numbers are recomputed directly from the eval output JSON; the analysis script and decision log are in the repo.