<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Melissa D. Ellison</title>
    <description>The latest articles on DEV Community by Melissa D. Ellison (@monongahelahellbender).</description>
    <link>https://dev.to/monongahelahellbender</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4012903%2F88cd3694-42d4-4110-88d1-866174ab6244.png</url>
      <title>DEV Community: Melissa D. Ellison</title>
      <link>https://dev.to/monongahelahellbender</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/monongahelahellbender"/>
    <language>en</language>
    <item>
      <title>A RAG evaluator that admits what it can't judge</title>
      <dc:creator>Melissa D. Ellison</dc:creator>
      <pubDate>Fri, 03 Jul 2026 02:08:43 +0000</pubDate>
      <link>https://dev.to/monongahelahellbender/a-rag-evaluator-that-admits-what-it-cant-judge-ad2</link>
      <guid>https://dev.to/monongahelahellbender/a-rag-evaluator-that-admits-what-it-cant-judge-ad2</guid>
      <description>&lt;p&gt;&lt;em&gt;Fail-closed groundedness, deterministic corroborators, and a self-test — because an evaluator should be more trustworthy than the thing it grades.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;The quiet flaw in "LLM-as-judge" evals&lt;/h2&gt;

&lt;p&gt;Most tools that score AI output use one LLM to grade another, and they report every number in the same confident voice, verified and guessed alike. For evaluation, that's backwards. An evaluator's whole job is to be &lt;em&gt;more&lt;/em&gt; trustworthy than the model it grades, not equally credulous.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;rag-triad&lt;/code&gt; is a small local evaluator for retrieval-augmented answers built on one rule: &lt;strong&gt;lean on a deterministic check wherever one exists, and abstain — out loud — wherever one doesn't.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Localizing the failure, not just scoring it&lt;/h2&gt;

&lt;p&gt;A RAG answer can fail in three different places (bad retrieval, hallucinated generation, or an off-topic reply), and each needs a different fix. A single quality score can't tell them apart. The triad checks three legs separately, so a failed leg points at its own fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;context relevance ✗ → &lt;strong&gt;retrieval miss&lt;/strong&gt; (fix chunking / embeddings / top-k)&lt;/li&gt;
&lt;li&gt;groundedness ✗ → &lt;strong&gt;hallucination&lt;/strong&gt; (fix generation, or enforce cite-and-verify)&lt;/li&gt;
&lt;li&gt;answer relevance ✗ → &lt;strong&gt;off-topic&lt;/strong&gt; (fix the prompt)&lt;/li&gt;
&lt;/ul&gt;
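
&lt;p&gt;The mapping above can be sketched in a few lines. This is a minimal illustration that assumes each leg reports a plain pass/fail; &lt;code&gt;TriadResult&lt;/code&gt; and &lt;code&gt;diagnose&lt;/code&gt; are hypothetical names, not taken from &lt;code&gt;rag-triad&lt;/code&gt;'s source.&lt;/p&gt;

```python
from dataclasses import dataclass


# Hypothetical names for illustration; not rag-triad's actual API.
@dataclass
class TriadResult:
    context_relevance: bool  # did retrieval return passages about the question?
    groundedness: bool       # is the answer supported by the retrieved context?
    answer_relevance: bool   # does the answer address the question asked?


def diagnose(result):
    """Return one diagnosis per failed leg, in pipeline order."""
    findings = []
    if not result.context_relevance:
        findings.append("retrieval miss: fix chunking / embeddings / top-k")
    if not result.groundedness:
        findings.append("hallucination: fix generation, or enforce cite-and-verify")
    if not result.answer_relevance:
        findings.append("off-topic: fix the prompt")
    return findings


# Retrieval and topicality pass, groundedness fails.
print(diagnose(TriadResult(True, False, True)))
```

&lt;p&gt;Because each leg is reported separately, an answer that fails two legs gets two findings instead of one blended score.&lt;/p&gt;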

&lt;h2&gt;The discipline (this is the actual contribution)&lt;/h2&gt;

&lt;p&gt;The triad framing is standard (TruLens, RAGAS). What's different:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fail-closed groundedness&lt;/strong&gt; — the judge must cite a quote; &lt;em&gt;code&lt;/em&gt; verifies it's in the context, so a fabricated citation can't pass. Worst case is an honest DEFER.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A deterministic corroborator matched to each leg's failure mode&lt;/strong&gt; — an embedding-similarity floor for context relevance; an answer-&lt;em&gt;type&lt;/em&gt; gate for answer relevance (reusing the embedding trick on the answer leg would backfire — cosine rewards topical-but-evasive answers). &lt;em&gt;The signal has to fit the failure.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Judges abstain instead of bluffing&lt;/strong&gt; — sample N times; disagreement → ABSTAIN, not a fake score.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate the validator&lt;/strong&gt; — &lt;code&gt;--selftest&lt;/code&gt; runs planted failures it must catch before you trust it.&lt;/li&gt;
&lt;/ol&gt;
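
&lt;p&gt;Items 1 and 3 are small enough to sketch. The version below is an illustration under stated assumptions, not &lt;code&gt;rag-triad&lt;/code&gt;'s implementation: the function names, the &lt;code&gt;SUPPORTED&lt;/code&gt; / &lt;code&gt;UNSUPPORTED&lt;/code&gt; labels, the whitespace-and-case normalization, and the unanimous-agreement default are all assumed. Only &lt;code&gt;DEFER&lt;/code&gt; and &lt;code&gt;ABSTAIN&lt;/code&gt; come from the description above.&lt;/p&gt;

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting can't hide a match."""
    return " ".join(text.lower().split())


def verify_groundedness(verdict, quote, context):
    """Fail closed: a SUPPORTED verdict counts only if its quote is in the context."""
    if verdict != "SUPPORTED":
        return verdict
    needle = normalize(quote or "")
    # An empty quote would trivially match any context, so reject it explicitly.
    if needle and needle in normalize(context):
        return "SUPPORTED"
    return "DEFER"  # missing or fabricated citation: refuse to pass it


def aggregate(verdicts, min_agreement=1.0):
    """Return the most common verdict only when enough samples agree, else ABSTAIN."""
    if not verdicts:
        return "ABSTAIN"
    label, count = Counter(verdicts).most_common(1)[0]
    return label if count / len(verdicts) >= min_agreement else "ABSTAIN"


context = "The cache is invalidated after every write to the primary."

# A real quote passes; an invented one is downgraded to DEFER.
print(verify_groundedness("SUPPORTED", "invalidated after every write", context))
print(verify_groundedness("SUPPORTED", "invalidated once per hour", context))

# Three samples that disagree produce ABSTAIN, not an averaged score.
print(aggregate(["SUPPORTED", "UNSUPPORTED", "SUPPORTED"]))
```

&lt;p&gt;The substring check is deliberately strict: a paraphrased quote ends in &lt;code&gt;DEFER&lt;/code&gt;, which costs some coverage but never lets an unverifiable citation through.&lt;/p&gt;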

&lt;h2&gt;Why calibration is the point&lt;/h2&gt;

&lt;p&gt;The property that makes downstream AI safe isn't raw capability — it's calibration. A more capable model that's &lt;em&gt;confidently wrong&lt;/em&gt; is more dangerous than a weaker one that abstains. (I've watched a newer model generation shift on hard computational questions from confident-wrong to honestly-inconclusive — exactly the move an evaluator should reward and a naive scorer misses.) So &lt;code&gt;rag-triad&lt;/code&gt; prizes the honest "I can't tell" over the confident guess.&lt;/p&gt;

&lt;p&gt;Code and a one-command demo: &lt;strong&gt;github.com/MonongahelaHellbender/rag-triad&lt;/strong&gt;. It runs locally on Ollama and is MIT-licensed.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>rag</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
