<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: elvisyao007</title>
    <description>The latest articles on DEV Community by elvisyao007 (@elvisyao007).</description>
    <link>https://dev.to/elvisyao007</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3964875%2Fbeb8b912-fc43-4b63-ac46-6a08efe481d9.jpg</url>
      <title>DEV Community: elvisyao007</title>
      <link>https://dev.to/elvisyao007</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/elvisyao007"/>
    <language>en</language>
    <item>
      <title>faithfulness spread = 0.000: what self-grading RAG eval actually looks like</title>
      <dc:creator>elvisyao007</dc:creator>
      <pubDate>Sun, 07 Jun 2026 18:22:53 +0000</pubDate>
      <link>https://dev.to/elvisyao007/faithfulness-spread-0000-what-self-grading-rag-eval-actually-looks-like-35mj</link>
      <guid>https://dev.to/elvisyao007/faithfulness-spread-0000-what-self-grading-rag-eval-actually-looks-like-35mj</guid>
      <description>&lt;p&gt;description: "I ran my RAG eval twice — once with the same model grading itself, once with an independent judge from a different family. Here's what changed, and why spread = 0.000 is the tell."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/elvisyao007"&gt;Last post&lt;/a&gt; I claimed something specific: faithfulness scored 0.67, but an independent judge found 33 of 100 answers were grounded in context and still factually wrong.&lt;/p&gt;

&lt;p&gt;A fair question: why trust that judge?&lt;/p&gt;

&lt;p&gt;I have a concrete answer, because I ran the eval twice. The first run used the same model for both generation and judging — self-grading. The second run used a completely different model family as the judge. Here are the numbers from both.&lt;/p&gt;

&lt;h2&gt;
  
  
  The before and after
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz80cett7nope4mb2f85y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz80cett7nope4mb2f85y.png" alt=" " width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqne7iblbr3jmau4oe4x8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqne7iblbr3jmau4oe4x8.png" alt=" " width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Self-judge (qwenj, same model)&lt;/th&gt;
&lt;th&gt;Independent judge (gemma4:31b)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;faithfulness mean&lt;/td&gt;
&lt;td&gt;0.7751&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.6662&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;faithfulness spread&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.0000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.0500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;grounded-but-wrong&lt;/td&gt;
&lt;td&gt;48 / 100&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;33 / 100&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Read the spread row. The self-judge returned a spread of exactly 0.0000 — not "near zero," literally zero. Every query returned an identical faithfulness distribution. The judge was not reading the answers. It was rubber-stamping.&lt;/p&gt;

&lt;p&gt;The independent judge returned a spread of 0.05. Small, but non-zero: the judge was actually discriminating between better and worse answers.&lt;/p&gt;

&lt;p&gt;Everything else follows from that single difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why spread = 0.000 is the tell
&lt;/h2&gt;

&lt;p&gt;A judge that is genuinely evaluating will find some answers more faithful than others — it will disagree with itself across queries. A judge that has collapsed into rubber-stamping gives the same score to everything, because it has stopped reading. The variance goes flat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-zero spread is necessary but not sufficient&lt;/strong&gt; evidence of a good judge. A random judge also has spread. The spread check rules out the worst case — the complete collapse of judgment — not all cases. The gold standard is still human-label agreement on a sampled subset. But zero spread is an immediate red flag that something is wrong.&lt;/p&gt;

&lt;p&gt;The self-judge gave faithfulness 0.7751. That number is almost certainly inflated. When the same model generates an answer and then evaluates it, it tends to recognize its own phrasing and reward it. The technical term is self-enhancement bias — a documented effect that scales with model capability and persists even when authorship is hidden.&lt;/p&gt;

&lt;h2&gt;
  
  
  What inflated faithfulness does downstream
&lt;/h2&gt;

&lt;p&gt;Faithfulness inflation doesn't just change one number. It cascades.&lt;/p&gt;

&lt;p&gt;The self-judge scored more answers as "faithful" (inflated 0.7751 vs 0.6662). A larger faithful pool means more opportunities to be grounded-but-wrong. That's why the self-judge found 48 grounded-but-wrong answers while the independent judge found 33: the self-judge was counting answers as "grounded" that the independent judge correctly did not. False positives in faithfulness create false positives in grounded-but-wrong.&lt;/p&gt;

&lt;p&gt;The independent judge, being more accurate about faithfulness, shrank both numbers toward reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I built the independent judge
&lt;/h2&gt;

&lt;p&gt;Three things that matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-family split.&lt;/strong&gt; My generator is &lt;code&gt;qwen3:32b&lt;/code&gt; (Qwen, Alibaba). My judge is &lt;code&gt;gemma4:31b&lt;/code&gt; (Gemma, Google). Different model, different family, different training lineage. Self-preference bias leaks across a model family, not just an exact checkpoint — using a different Qwen checkpoint as the judge would still be suspect. The key is the family boundary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ground-truth anchor.&lt;/strong&gt; Self-preference bites hardest on subjective tasks where there's no right answer to compare against. JQaRA ships gold answers. My correctness check asks the judge to compare the model's answer against the gold answer — not to issue a free-floating opinion. Anchoring on a reference shrinks the surface where bias can hide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The on-prem cost.&lt;/strong&gt; On a single RTX 5090 with 32 GB VRAM, &lt;code&gt;qwen3:32b&lt;/code&gt; (20 GB) and &lt;code&gt;gemma4:31b&lt;/code&gt; (19 GB) can't both be resident at the same time. I had to build a two-pass architecture: all generation first, then explicit VRAM unload, then all judging. This also required routing around the OpenAI-compat endpoint — thinking-capable models exhaust &lt;code&gt;max_tokens&lt;/code&gt; with reasoning tokens before emitting content, so I used Ollama's native &lt;code&gt;/api/chat&lt;/code&gt; with &lt;code&gt;think=false&lt;/code&gt;. None of this is hard, but it's the operational reality of doing this properly on-prem, and it's the kind of friction that makes most people default to self-judging in a single pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  Being honest about the limits
&lt;/h2&gt;

&lt;p&gt;Non-zero spread rules out rubber-stamping. It doesn't prove the judge is calibrated. For that, you need to hand-label a sample — grade 30–50 answers yourself and measure how often the judge agrees. I haven't published that calibration for this run yet. The spread check is a fast sanity gate, not the finish line.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to gate RAG eval on
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;An independent judge — different family, not just different checkpoint.&lt;/strong&gt; Self-judging numbers are theater.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ground truth where it exists.&lt;/strong&gt; A reference answer reduces the bias surface more than any prompting trick.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spread as a sanity check.&lt;/strong&gt; Report it alongside the mean. Zero spread = stop, something is wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-label calibration on a sample&lt;/strong&gt; before you trust the judge in production.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The self-judging run gave a clean-looking 0.77 faithfulness with zero spread. The independent run gave 0.67 with 0.05 spread, and found 15 fewer grounded-but-wrong answers. The real system was worse than the self-judge claimed and better-characterized than the inflated number suggested. The 0.67 is more credible precisely because it's lower.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next
&lt;/h2&gt;

&lt;p&gt;The full run — both phases, infrastructure fixes, raw scores — is here: &lt;strong&gt;&lt;a href="https://github.com/elvisyao007/eval-driven-llm" rel="noopener noreferrer"&gt;github.com/elvisyao007/eval-driven-llm&lt;/a&gt;&lt;/strong&gt;. Next I'm going after &lt;code&gt;context_recall = 0.41&lt;/code&gt; with hybrid retrieval, judged by the same independent setup. Following the build in public.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>My RAG's faithfulness was 0.67. 1 in 3 answers were still wrong.</title>
      <dc:creator>elvisyao007</dc:creator>
      <pubDate>Sun, 07 Jun 2026 17:02:51 +0000</pubDate>
      <link>https://dev.to/elvisyao007/my-rags-faithfulness-was-067-1-in-3-answers-were-still-wrong-31f3</link>
      <guid>https://dev.to/elvisyao007/my-rags-faithfulness-was-067-1-in-3-answers-were-still-wrong-31f3</guid>
      <description>&lt;h2&gt;
  
  
  description: "An on-prem JQaRA eval. Reranking nudged P@1 but the system was still wrong a third of the time. Why faithfulness alone is a trap, and what to gate on instead."
&lt;/h2&gt;

&lt;p&gt;I built a small Japanese RAG system, ran it entirely on my own hardware (RTX 5090, Ollama), and evaluated it with an &lt;strong&gt;independent judge model&lt;/strong&gt; instead of letting the generator grade its own homework.&lt;/p&gt;

&lt;p&gt;Two things surprised me, and they're connected:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Adding a reranker — the move everyone reaches for first — barely moved the needle.&lt;/li&gt;
&lt;li&gt;My faithfulness score looked acceptable (0.67), yet &lt;strong&gt;33 out of 100 answers were grounded in the retrieved context and still factually wrong&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This post is about why those two facts are the same story, and why a faithfulness gate alone would have shipped a system that's wrong a third of the time without ever flagging it.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reranking improved P@1 by +1.3 points but &lt;em&gt;lowered&lt;/em&gt; Recall@10.&lt;/strong&gt; It reorders what retrieval already found; it can't retrieve what retrieval missed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The real bottleneck was recall&lt;/strong&gt; (&lt;code&gt;context_recall = 0.41&lt;/code&gt;): the evidence needed to answer often wasn't retrieved at all.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;faithfulness = 0.67&lt;/code&gt; is a trap.&lt;/strong&gt; Faithfulness measures whether an answer is consistent with the retrieved context — &lt;em&gt;not&lt;/em&gt; whether it's correct. An answer grounded in wrong-but-retrieved context scores as faithful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An independent correctness judge found 33/100 "grounded-but-wrong" answers&lt;/strong&gt; — confidently wrong, fully grounded, invisible to faithfulness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lesson:&lt;/strong&gt; faithfulness is necessary, not sufficient. Gate on answer-correctness + context_recall, and stop reaching for a reranker when recall is your problem.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The setup (so you can trust the numbers)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Benchmark&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://huggingface.co/datasets/hotchpotch/JQaRA" rel="noopener noreferrer"&gt;JQaRA&lt;/a&gt; (じゃくら) — Japanese QA-for-retrieval, built on the JAQKET quiz set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval eval&lt;/td&gt;
&lt;td&gt;1,667 queries, deterministic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generation eval&lt;/td&gt;
&lt;td&gt;100 queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generator&lt;/td&gt;
&lt;td&gt;&lt;code&gt;qwen3:32b&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Judge&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;gemma4:31b&lt;/code&gt; — &lt;strong&gt;a different model from the generator&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware&lt;/td&gt;
&lt;td&gt;single RTX 5090, on-prem, Ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwq1mjj56q3c2anhol4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwq1mjj56q3c2anhol4r.png" alt=" " width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The judge being a &lt;em&gt;different&lt;/em&gt; model matters, and I'll come back to why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Act 1: the obvious move — add a reranker
&lt;/h2&gt;

&lt;p&gt;The standard RAG upgrade path: dense retrieval is your first stage, a cross-encoder reranker is your second. So I added one and re-ran retrieval.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Dense&lt;/th&gt;
&lt;th&gt;Dense + rerank&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P@1&lt;/td&gt;
&lt;td&gt;0.8308&lt;/td&gt;
&lt;td&gt;0.8440&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.0132&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall@10&lt;/td&gt;
&lt;td&gt;0.5738&lt;/td&gt;
&lt;td&gt;0.5634&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−0.0104&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9tsovscvj2y4dbzs99l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9tsovscvj2y4dbzs99l.png" alt=" " width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Read that carefully. The reranker did exactly what a reranker does: it &lt;strong&gt;sharpened the top of the list&lt;/strong&gt; (P@1 up — the single best document lands at rank 1 more often) while &lt;strong&gt;slightly demoting some relevant docs out of the top 10&lt;/strong&gt; (Recall@10 down). That's a precision-for-recall trade, not a free win.&lt;/p&gt;

&lt;p&gt;And here's the thing that should give you pause: if your generator reads more than the top result — top-5, top-10 — that recall drop can &lt;em&gt;hurt&lt;/em&gt; downstream answers even as P@1 improves. The metric you celebrate isn't the metric that feeds your generator.&lt;/p&gt;

&lt;p&gt;The deeper problem: &lt;strong&gt;a reranker reorders the candidate set. It cannot conjure a document that dense retrieval never surfaced.&lt;/strong&gt; Which brings us to the number that actually mattered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Act 2: the metric I trusted too much
&lt;/h2&gt;

&lt;p&gt;I moved to generation eval expecting faithfulness to be the headline. It came back at &lt;strong&gt;0.6662&lt;/strong&gt;. Mediocre, but the kind of number you squint at and think "okay-ish, ship the next iteration."&lt;/p&gt;

&lt;p&gt;That instinct is the trap.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;What it actually tells you&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;faithfulness&lt;/td&gt;
&lt;td&gt;0.6662&lt;/td&gt;
&lt;td&gt;"Looks okay" — and is dangerously incomplete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;faithfulness spread&lt;/td&gt;
&lt;td&gt;0.0500&lt;/td&gt;
&lt;td&gt;Non-zero → the judge is discriminating, not rubber-stamping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;context_recall&lt;/td&gt;
&lt;td&gt;0.4062&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;The real bottleneck — evidence often wasn't retrieved&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;grounded-but-wrong&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;33 / 100&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The failures faithfulness structurally cannot see&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptk8lav9uvgtqzngdfwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptk8lav9uvgtqzngdfwj.png" alt=" " width="800" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Faithfulness measures consistency with the retrieved context, not correctness against ground truth.&lt;/strong&gt; An answer that faithfully reports a wrong-but-retrieved passage is, by definition, &lt;em&gt;faithful&lt;/em&gt;. So a grounded-but-wrong answer doesn't lower your faithfulness score — it sits in the "good" portion of it. Optimize for faithfulness and you are partly optimizing toward confident, well-grounded, wrong answers.&lt;/p&gt;

&lt;p&gt;To catch this you need a separate question: &lt;em&gt;is the answer actually correct?&lt;/em&gt; I ran that as an independent correctness check against JQaRA's gold answers. The essence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Not "is the answer supported by the context?" (faithfulness)
# But "is the answer correct vs the gold answer?" (correctness)

judge(question, model_answer, gold_answer) -&amp;gt; {correct | incorrect}
grounded_but_wrong = faithful(answer) AND NOT correct(answer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: &lt;strong&gt;33 of 100 answers were faithful and wrong at the same time.&lt;/strong&gt; A faithfulness gate would have waved every one of them through.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this happened: recall was the leak
&lt;/h2&gt;

&lt;p&gt;The three numbers line up into one causal chain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;context_recall = 0.41&lt;/code&gt; → for most queries, the passage that actually answers the question &lt;strong&gt;wasn't in the retrieved context&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The generator answers anyway, grounding itself in whatever &lt;em&gt;was&lt;/em&gt; retrieved — confidently, fluently.&lt;/li&gt;
&lt;li&gt;That answer is faithful (grounded in retrieved text) and wrong (the retrieved text didn't contain the answer). → &lt;strong&gt;grounded-but-wrong&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So &lt;code&gt;context_recall&lt;/code&gt; is the leading indicator, &lt;code&gt;grounded-but-wrong&lt;/code&gt; is the lagging confirmation, and &lt;code&gt;faithfulness&lt;/code&gt; is the misleading number in the middle that papers over both.&lt;/p&gt;

&lt;p&gt;And now Act 1 and Act 2 close into the same loop: &lt;strong&gt;I reached for a reranker, but reranking optimizes the wrong stage when recall is your bottleneck.&lt;/strong&gt; No amount of reordering fixes a document that was never retrieved. The right lever was upstream — chunking, embedding model, hybrid (lexical + dense) retrieval, query expansion — not a cross-encoder polishing a list that's missing the answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  A note on judge independence (why the spread matters)
&lt;/h2&gt;

&lt;p&gt;If you let a model grade its own outputs, it tends to like them — LLM-as-judge has a well-documented self-preference bias, and a self-judging setup often produces near-1.0 scores with almost no variance. That near-zero spread is the tell.&lt;/p&gt;

&lt;p&gt;My judge (&lt;code&gt;gemma4:31b&lt;/code&gt;) is a different model from the generator (&lt;code&gt;qwen3:32b&lt;/code&gt;), and the faithfulness spread came back at &lt;strong&gt;0.05 — non-zero&lt;/strong&gt;. Small, but it's the proof that the judge is actually discriminating between good and bad answers rather than rubber-stamping. If you take one process habit from this post, take this one: &lt;strong&gt;never let the model that wrote the answer be the model that scores it.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd actually gate a production RAG on
&lt;/h2&gt;

&lt;p&gt;Most "RAG eval" stops at faithfulness because it's the easiest to compute. That's exactly why it's the wrong place to stop. The gate I'd ship behind:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Answer-correctness vs ground truth&lt;/strong&gt; — the metric that actually catches grounded-but-wrong. Non-negotiable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;context_recall&lt;/strong&gt; — your leading indicator. If this is low, fix retrieval before you touch the generator or reach for a reranker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;faithfulness&lt;/strong&gt; — keep it, but only as a hallucination guard &lt;em&gt;on top of&lt;/em&gt; correctness, never as a stand-in for it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An independent judge&lt;/strong&gt; — different model, and watch the score variance to confirm it isn't rubber-stamping.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A demo proves the happy path works. A system you'd put in front of a business has to know — and &lt;em&gt;prove with numbers&lt;/em&gt; — how often it's confidently wrong. The gap between those two is exactly this eval discipline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next
&lt;/h2&gt;

&lt;p&gt;Code, the eval harness, and the raw run are here: &lt;strong&gt;&lt;a href="https://github.com/elvisyao007/eval-driven-llm" rel="noopener noreferrer"&gt;github.com/elvisyao007/eval-driven-llm&lt;/a&gt;&lt;/strong&gt;. Next I'm going after that &lt;code&gt;context_recall = 0.41&lt;/code&gt; — hybrid retrieval and chunking experiments, measured the same way. Following the build in public.&lt;/p&gt;

&lt;p&gt;If you run RAG eval and &lt;em&gt;only&lt;/em&gt; look at faithfulness, go check your grounded-but-wrong rate. I'd bet it's not zero.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
