description: "An on-prem JQaRA eval. Reranking nudged P@1 but the system was still wrong a third of the time. Why faithfulness alone is a trap, and what to gate on instead."
I built a small Japanese RAG system, ran it entirely on my own hardware (RTX 5090, Ollama), and evaluated it with an independent judge model instead of letting the generator grade its own homework.
Two things surprised me, and they're connected:
- Adding a reranker — the move everyone reaches for first — barely moved the needle.
- My faithfulness score looked acceptable (0.67), yet 33 out of 100 answers were grounded in the retrieved context and still factually wrong.
This post is about why those two facts are the same story, and why a faithfulness gate alone would have shipped a system that's wrong a third of the time without ever flagging it.
TL;DR
- Reranking improved P@1 by +1.3 points but lowered Recall@10. It reorders what retrieval already found; it can't retrieve what retrieval missed.
- The real bottleneck was recall (context_recall = 0.41): the evidence needed to answer often wasn't retrieved at all.
- faithfulness = 0.67 is a trap. Faithfulness measures whether an answer is consistent with the retrieved context, not whether it's correct. An answer grounded in wrong-but-retrieved context scores as faithful.
- An independent correctness judge found 33/100 "grounded-but-wrong" answers: confidently wrong, fully grounded, invisible to faithfulness.
- Lesson: faithfulness is necessary, not sufficient. Gate on answer-correctness + context_recall, and stop reaching for a reranker when recall is your problem.
The setup (so you can trust the numbers)
| Component | Choice |
|---|---|
| Benchmark | JQaRA (じゃくら) — Japanese QA-for-retrieval, built on the JAQKET quiz set |
| Retrieval eval | 1,667 queries, deterministic |
| Generation eval | 100 queries |
| Generator | qwen3:32b |
| Judge | gemma4:31b (a different model from the generator) |
| Hardware | single RTX 5090, on-prem, Ollama |
The judge being a different model matters, and I'll come back to why.
Act 1: the obvious move — add a reranker
The standard RAG upgrade path: dense retrieval is your first stage, a cross-encoder reranker is your second. So I added one and re-ran retrieval.
| Metric | Dense | Dense + rerank | Δ |
|---|---|---|---|
| P@1 | 0.8308 | 0.8440 | +0.0132 |
| Recall@10 | 0.5738 | 0.5634 | −0.0104 |
Read that carefully. The reranker did exactly what a reranker does: it sharpened the top of the list (P@1 up — the single best document lands at rank 1 more often) while slightly demoting some relevant docs out of the top 10 (Recall@10 down). That's a precision-for-recall trade, not a free win.
And here's the thing that should give you pause: if your generator reads more than the top result — top-5, top-10 — that recall drop can hurt downstream answers even as P@1 improves. The metric you celebrate isn't the metric that feeds your generator.
The deeper problem: a reranker reorders the candidate set. It cannot conjure a document that dense retrieval never surfaced. Which brings us to the number that actually mattered.
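To make the trade concrete, here is a minimal sketch of the two retrieval metrics on toy data. The document IDs and rankings are hypothetical, and this is not the harness from the repo; it only illustrates that a reranker permutes the candidates it is handed, so P@1 can rise while Recall@k falls, and a relevant document that was never retrieved (`d9` here) stays missing either way.

```python
def precision_at_1(ranked, relevant):
    """1.0 if the top-ranked document is relevant, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Hypothetical toy data: d9 is relevant, but dense retrieval never surfaced it.
relevant = {"d2", "d4", "d9"}
dense = ["d1", "d2", "d4", "d5", "d6"]

# A reranker only reorders the candidates it was given.
reranked = ["d2", "d1", "d5", "d4", "d6"]
assert sorted(reranked) == sorted(dense)

for name, ranking in [("dense", dense), ("dense + rerank", reranked)]:
    p1 = precision_at_1(ranking, relevant)
    r3 = recall_at_k(ranking, relevant, k=3)
    print(f"{name}: P@1={p1:.3f} Recall@3={r3:.3f}")
# dense:          P@1=0.000 Recall@3=0.667
# dense + rerank: P@1=1.000 Recall@3=0.333
```

In this toy case the reranker fixes rank 1 and pushes `d4` out of the top 3, which is the same shape as the P@1 up, Recall@10 down result in the table.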
Act 2: the metric I trusted too much
I moved to generation eval expecting faithfulness to be the headline. It came back at 0.6662. Mediocre, but the kind of number you squint at and think "okay-ish, ship the next iteration."
That instinct is the trap.
| Metric | Value | What it actually tells you |
|---|---|---|
| faithfulness | 0.6662 | "Looks okay" — and is dangerously incomplete |
| faithfulness spread | 0.0500 | Non-zero → the judge is discriminating, not rubber-stamping |
| context_recall | 0.4062 | The real bottleneck — evidence often wasn't retrieved |
| grounded-but-wrong | 33 / 100 | The failures faithfulness structurally cannot see |
Faithfulness measures consistency with the retrieved context, not correctness against ground truth. An answer that faithfully reports a wrong-but-retrieved passage is, by definition, faithful. So a grounded-but-wrong answer doesn't lower your faithfulness score — it sits in the "good" portion of it. Optimize for faithfulness and you are partly optimizing toward confident, well-grounded, wrong answers.
To catch this you need a separate question: is the answer actually correct? I ran that as an independent correctness check against JQaRA's gold answers. The essence:
# Not "is the answer supported by the context?" (faithfulness)
# But "is the answer correct vs the gold answer?" (correctness)
judge(question, model_answer, gold_answer) -> {correct | incorrect}
grounded_but_wrong = faithful(answer) AND NOT correct(answer)
Result: 33 of 100 answers were faithful and wrong at the same time. A faithfulness gate would have waved every one of them through.
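As a rough sketch of how that count could be computed from per-answer judge outputs: the record fields, the sample rows, and the 0.5 cut-off that turns a faithfulness score into a yes/no are all assumptions for illustration, not values from the actual run.

```python
# Assumption: a faithfulness score at or above this counts as "faithful".
FAITHFUL_THRESHOLD = 0.5


def grounded_but_wrong(rows, threshold=FAITHFUL_THRESHOLD):
    """Return the answers the judge scored as faithful AND marked incorrect."""
    return [
        row for row in rows
        if row["faithfulness"] >= threshold and not row["correct"]
    ]


# Hypothetical judge outputs, one record per generated answer.
rows = [
    {"query_id": "q1", "faithfulness": 1.0, "correct": True},   # fine
    {"query_id": "q2", "faithfulness": 1.0, "correct": False},  # grounded-but-wrong
    {"query_id": "q3", "faithfulness": 0.2, "correct": False},  # caught by faithfulness
    {"query_id": "q4", "faithfulness": 0.8, "correct": False},  # grounded-but-wrong
]

flagged = grounded_but_wrong(rows)
print([row["query_id"] for row in flagged])  # ['q2', 'q4']
print(f"grounded-but-wrong rate: {len(flagged) / len(rows):.2f}")  # 0.50
```

Only `q3` would be stopped by a faithfulness gate; `q2` and `q4` need the correctness verdict to be visible at all.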
Why this happened: recall was the leak
The three numbers line up into one causal chain:
- context_recall = 0.41 → for most queries, the passage that actually answers the question wasn't in the retrieved context.
- The generator answers anyway, grounding itself in whatever was retrieved, confidently and fluently.
- That answer is faithful (grounded in retrieved text) and wrong (the retrieved text didn't contain the answer). → grounded-but-wrong.
So context_recall is the leading indicator, grounded-but-wrong is the lagging confirmation, and faithfulness is the misleading number in the middle that papers over both.
And now Act 1 and Act 2 close into the same loop: I reached for a reranker, but reranking optimizes the wrong stage when recall is your bottleneck. No amount of reordering fixes a document that was never retrieved. The right lever was upstream — chunking, embedding model, hybrid (lexical + dense) retrieval, query expansion — not a cross-encoder polishing a list that's missing the answer.
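For the hybrid option, one common way to merge a lexical and a dense ranking is reciprocal rank fusion (RRF). The sketch below is an illustration under assumptions: the post doesn't say which fusion method will be tried, the document IDs are made up, and `k=60` is just a commonly used default. The point is that fusion can widen the candidate pool, which reranking alone cannot.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by doc id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings: the lexical retriever finds d9, which dense missed.
dense = ["d1", "d2", "d4"]
lexical = ["d9", "d2", "d7"]

fused = reciprocal_rank_fusion([dense, lexical])
print(fused)  # ['d2', 'd1', 'd9', 'd4', 'd7']
```

`d2` rises because both retrievers agree on it, and `d9` enters the pool even though dense retrieval never returned it.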
A note on judge independence (why the spread matters)
If you let a model grade its own outputs, it tends to like them — LLM-as-judge has a well-documented self-preference bias, and a self-judging setup often produces near-1.0 scores with almost no variance. That near-zero spread is the tell.
My judge (gemma4:31b) is a different model from the generator (qwen3:32b), and the faithfulness spread came back at 0.05: non-zero. Small, but it's evidence that the judge is actually discriminating between good and bad answers rather than rubber-stamping. If you take one process habit from this post, take this one: never let the model that wrote the answer be the model that scores it.
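A quick way to watch for that tell is to look at the mean and spread of the per-answer judge scores together. This sketch assumes "spread" means the population standard deviation of those scores, which may not be how the 0.05 above was computed; the thresholds and sample scores are hypothetical.

```python
import statistics


def judge_spread(scores):
    """Population standard deviation of per-answer judge scores."""
    return statistics.pstdev(scores)


def looks_rubber_stamped(scores, min_spread=0.01, high_mean=0.95):
    """Flag a judge whose scores are both near-perfect and nearly constant.

    min_spread and high_mean are illustrative thresholds, not tuned values.
    """
    return statistics.mean(scores) >= high_mean and judge_spread(scores) < min_spread


# Hypothetical score lists for the same answers under two judging setups.
self_judged = [1.0, 1.0, 0.99, 1.0, 1.0]
independent = [1.0, 0.5, 0.75, 0.0, 1.0]

print(looks_rubber_stamped(self_judged))  # True
print(looks_rubber_stamped(independent))  # False
```

A non-zero spread doesn't show the judge is right, only that it is making distinctions; checking a sample of its verdicts by hand is still worthwhile.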
What I'd actually gate a production RAG on
Most "RAG eval" stops at faithfulness because it's the easiest to compute. That's exactly why it's the wrong place to stop. The gate I'd ship behind:
- Answer-correctness vs ground truth — the metric that actually catches grounded-but-wrong. Non-negotiable.
- context_recall — your leading indicator. If this is low, fix retrieval before you touch the generator or reach for a reranker.
- faithfulness — keep it, but only as a hallucination guard on top of correctness, never as a stand-in for it.
- An independent judge — different model, and watch the score variance to confirm it isn't rubber-stamping.
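The four checks above could be wired into a single gate along these lines. This is a sketch, not the repo's implementation: the threshold values and metric key names are hypothetical and would need to be set per product. The example run uses the three numbers reported in this post; overall answer-correctness isn't reported here, so it is left out and the gate flags it as missing.

```python
# Hypothetical minimums; choose real values per product and risk tolerance.
GATE = {
    "answer_correctness": 0.85,
    "context_recall": 0.70,
    "faithfulness": 0.80,
    "judge_spread": 0.01,
}


def evaluate_gate(metrics, judge_model, generator_model, gate=GATE):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    if judge_model == generator_model:
        failures.append("judge is not independent of the generator")
    for name, minimum in gate.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.4f} < {minimum:.2f}")
    return failures


# Values reported in this post; answer_correctness is not reported, so omitted.
run = {"context_recall": 0.4062, "faithfulness": 0.6662, "judge_spread": 0.05}

for failure in evaluate_gate(run, judge_model="gemma4:31b", generator_model="qwen3:32b"):
    print("FAIL", failure)
# FAIL answer_correctness: missing
# FAIL context_recall: 0.4062 < 0.70
# FAIL faithfulness: 0.6662 < 0.80
```

Treating a missing metric as a failure is deliberate: a gate that silently skips correctness is the faithfulness-only gate again.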
A demo proves the happy path works. A system you'd put in front of a business has to know — and prove with numbers — how often it's confidently wrong. The gap between those two is exactly this eval discipline.
Next
Code, the eval harness, and the raw run are here: github.com/elvisyao007/eval-driven-llm. Next I'm going after that context_recall = 0.41 with hybrid retrieval and chunking experiments, measured the same way. Follow along as I build in public.
If you run RAG eval and only look at faithfulness, go check your grounded-but-wrong rate. I'd bet it's not zero.


