elvisyao007

Posted on Jun 7

faithfulness spread = 0.000: what self-grading RAG eval actually looks like

#ai #llm #machinelearning #rag

Update (2026-06): The grounded-but-wrong counts in this post (48/100
self-eval, 33/100 independent judge) are affected by a metric-definition
issue I found later — see blog-03 for the full analysis.
Short version: the 0.5 threshold on ID-based context recall is structurally
unreachable on multi-answer queries with k=5, so those absolute counts reflect
dataset structure more than system quality. The self-eval vs independent-judge
methodology point still stands; only the absolute numbers need this caveat.
Original text unchanged below.

description: "I ran my RAG eval twice — once with the same model grading itself, once with an independent judge from a different family. Here's what changed, and why spread = 0.000 is the tell."

Last post I claimed something specific: faithfulness scored 0.67, but an independent judge found 33 of 100 answers were grounded in context and still factually wrong.

A fair question: why trust that judge?

I have a concrete answer, because I ran the eval twice. The first run used the same model for both generation and judging — self-grading. The second run used a completely different model family as the judge. Here are the numbers from both.

The before and after

Metric	Self-judge (qwenj, same model)	Independent judge (gemma4:31b)
faithfulness mean	0.7751	0.6662
faithfulness spread	0.0000	0.0500
grounded-but-wrong	48 / 100	33 / 100

Read the spread row. The self-judge returned a spread of exactly 0.0000 — not "near zero," literally zero. Every query returned an identical faithfulness distribution. The judge was not reading the answers. It was rubber-stamping.

The independent judge returned a spread of 0.05. Small, but non-zero: the judge was actually discriminating between better and worse answers.

Everything else follows from that single difference.

Why spread = 0.000 is the tell

A judge that is genuinely evaluating will find some answers more faithful than others — it will disagree with itself across queries. A judge that has collapsed into rubber-stamping gives the same score to everything, because it has stopped reading. The variance goes flat.

Non-zero spread is necessary but not sufficient evidence of a good judge. A random judge also has spread. The spread check rules out the worst case — the complete collapse of judgment — not all cases. The gold standard is still human-label agreement on a sampled subset. But zero spread is an immediate red flag that something is wrong.

The self-judge gave faithfulness 0.7751. That number is almost certainly inflated. When the same model generates an answer and then evaluates it, it tends to recognize its own phrasing and reward it. The technical term is self-enhancement bias — a documented effect that scales with model capability and persists even when authorship is hidden.

What inflated faithfulness does downstream

Faithfulness inflation doesn't just change one number. It cascades.

The self-judge scored more answers as "faithful" (inflated 0.7751 vs 0.6662). A larger faithful pool means more opportunities to be grounded-but-wrong. That's why the self-judge found 48 grounded-but-wrong answers while the independent judge found 33: the self-judge was counting answers as "grounded" that the independent judge correctly did not. False positives in faithfulness create false positives in grounded-but-wrong.

The independent judge, being more accurate about faithfulness, shrank both numbers toward reality.

How I built the independent judge

Three things that matter:

Cross-family split. My generator is qwen3:32b (Qwen, Alibaba). My judge is gemma4:31b (Gemma, Google). Different model, different family, different training lineage. Self-preference bias leaks across a model family, not just an exact checkpoint — using a different Qwen checkpoint as the judge would still be suspect. The key is the family boundary.

Ground-truth anchor. Self-preference bites hardest on subjective tasks where there's no right answer to compare against. JQaRA ships gold answers. My correctness check asks the judge to compare the model's answer against the gold answer — not to issue a free-floating opinion. Anchoring on a reference shrinks the surface where bias can hide.

The on-prem cost. On a single RTX 5090 with 32 GB VRAM, qwen3:32b (20 GB) and gemma4:31b (19 GB) can't both be resident at the same time. I had to build a two-pass architecture: all generation first, then explicit VRAM unload, then all judging. This also required routing around the OpenAI-compat endpoint — thinking-capable models exhaust max_tokens with reasoning tokens before emitting content, so I used Ollama's native /api/chat with think=false. None of this is hard, but it's the operational reality of doing this properly on-prem, and it's the kind of friction that makes most people default to self-judging in a single pass.

Being honest about the limits

Non-zero spread rules out rubber-stamping. It doesn't prove the judge is calibrated. For that, you need to hand-label a sample — grade 30–50 answers yourself and measure how often the judge agrees. I haven't published that calibration for this run yet. The spread check is a fast sanity gate, not the finish line.

What to gate RAG eval on

An independent judge — different family, not just different checkpoint. Self-judging numbers are theater.
Ground truth where it exists. A reference answer reduces the bias surface more than any prompting trick.
Spread as a sanity check. Report it alongside the mean. Zero spread = stop, something is wrong.
Human-label calibration on a sample before you trust the judge in production.

The self-judging run gave a clean-looking 0.77 faithfulness with zero spread. The independent run gave 0.67 with 0.05 spread, and found 15 fewer grounded-but-wrong answers. The real system was worse than the self-judge claimed and better-characterized than the inflated number suggested. The 0.67 is more credible precisely because it's lower.

The full run — both phases, infrastructure fixes, raw scores — is here: github.com/elvisyao007/eval-driven-llm. Next I'm going after context_recall = 0.41 with hybrid retrieval, judged by the same independent setup. Following the build in public.

Top comments (1)

elvisyao007 • Jun 8 • Edited

Follow-up / correction: I dug into the "grounded-but-wrong" number from
this post. It turned out to be a measurement artifact — ID-based context recall
with a relevant-doc-count denominator, on a multi-answer dataset (JQaRA,
~16 relevant docs/query) at k=5, can't clear a 0.5 threshold even with perfect
retrieval. Actual hit@5 was 98/100. Full writeup: [dev.to/elvisyao007/the-33-grounded...]. Original
post left unchanged.