<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: João Paulo Traguetta Rufino</title>
    <description>The latest articles on DEV Community by João Paulo Traguetta Rufino (@joaopaulotr).</description>
    <link>https://dev.to/joaopaulotr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3895939%2F426259bf-899f-4f0f-94f7-b26ee937f320.jpeg</url>
      <title>DEV Community: João Paulo Traguetta Rufino</title>
      <link>https://dev.to/joaopaulotr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joaopaulotr"/>
    <language>en</language>
    <item>
      <title>Building an Evaluation Harness for Financial RAG: What I Learned About LLM-as-Judge Calibration</title>
      <dc:creator>João Paulo Traguetta Rufino</dc:creator>
      <pubDate>Tue, 19 May 2026 22:12:31 +0000</pubDate>
      <link>https://dev.to/joaopaulotr/building-an-evaluation-harness-for-financial-rag-what-i-learned-about-llm-as-judge-calibration-5030</link>
      <guid>https://dev.to/joaopaulotr/building-an-evaluation-harness-for-financial-rag-what-i-learned-about-llm-as-judge-calibration-5030</guid>
      <description>&lt;p&gt;I built a RAG system for financial document Q&amp;amp;A. It answers questions about SEC filings (revenue, margins, debt ratios) using 84 public company documents from the &lt;a href="https://github.com/patronus-ai/financebench" rel="noopener noreferrer"&gt;FinanceBench&lt;/a&gt; benchmark.&lt;/p&gt;

&lt;p&gt;After running 100 queries, my LLM judge said 74% of answers were correct. On a hand-labeled sample of 30, the actual number was 27%.&lt;/p&gt;

&lt;p&gt;This post is about how I found that gap, why it exists, and what I did about it.&lt;/p&gt;

&lt;h2&gt;The setup&lt;/h2&gt;

&lt;p&gt;The pipeline is straightforward: embed 84 SEC filings (10-K, 10-Q, earnings reports) into Qdrant with &lt;code&gt;text-embedding-3-small&lt;/code&gt;, retrieve the top 6 chunks per query, and generate answers with GPT-4o-mini.&lt;/p&gt;

&lt;p&gt;FinanceBench gives you 150 expert-annotated Q&amp;amp;A pairs with ground truth answers and source documents. I used 100 of them as my eval set.&lt;/p&gt;

&lt;p&gt;I measured quality in two tiers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1 — Retrieval.&lt;/strong&gt; Did the system find the right document? I tracked Recall@6, Precision@6, and MRR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2 — Generation.&lt;/strong&gt; Is the answer any good? I used an LLM judge (GPT-4o-mini scoring 1-5) to evaluate Context Relevance, Answer Faithfulness, and Answer Relevance.&lt;/p&gt;

&lt;h2&gt;Retrieval: decent but not great&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Recall@6&lt;/td&gt;
&lt;td&gt;0.830&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision@6&lt;/td&gt;
&lt;td&gt;0.422&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MRR&lt;/td&gt;
&lt;td&gt;0.646&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;83 out of 100 queries retrieved the correct source document. Not bad for vanilla semantic search with zero filtering.&lt;/p&gt;
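
&lt;p&gt;For reference, here is a minimal sketch of how these three numbers can be computed at the document level. It assumes each query has exactly one gold source document and that every retrieved chunk carries the filename it came from. The function names and sample filenames are illustrative, not taken from the repo.&lt;/p&gt;

```python
from statistics import mean


def query_metrics(retrieved_sources: list[str], gold_source: str, k: int = 6) -> dict[str, float]:
    """Score one query. `retrieved_sources` holds the source document of each
    retrieved chunk, in rank order. Assumes a single gold document per query."""
    hits = [source == gold_source for source in retrieved_sources[:k]]
    # Rank (1-based) of the first chunk that comes from the gold document, if any.
    first_hit = next((rank for rank, hit in enumerate(hits, start=1) if hit), None)
    return {
        "recall": 1.0 if first_hit else 0.0,  # gold document appears in the top k
        "precision": sum(hits) / k,           # share of the k slots filled by the gold document
        "reciprocal_rank": 1.0 / first_hit if first_hit else 0.0,
    }


def aggregate(per_query: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric over all queries (the mean reciprocal rank is MRR)."""
    return {name: mean(q[name] for q in per_query) for name in per_query[0]}


# Hypothetical example: two queries, filenames made up for illustration.
runs = [
    query_metrics(["B.pdf", "A.pdf", "A.pdf", "C.pdf", "D.pdf", "E.pdf"], "A.pdf"),
    query_metrics(["C.pdf", "D.pdf", "E.pdf", "F.pdf", "G.pdf", "H.pdf"], "A.pdf"),
]
print(aggregate(runs))
```

&lt;p&gt;Under the single-gold-document assumption, Recall@6 is simply the share of queries where the right document shows up at all, which is why it reads directly as "83 out of 100."&lt;/p&gt;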

&lt;p&gt;The 17 misses were concentrated: Johnson &amp;amp; Johnson (9 misses across different doc types) and Adobe (5 misses). Together, 14 out of 17 failures came from just two companies.&lt;/p&gt;

&lt;p&gt;Why? SEC filings use nearly identical language across companies. "Net revenues increased," "operating income was impacted by" — these phrases appear in every single 10-K. Embeddings can't reliably tell 3M's filing from Coca-Cola's when the language is this similar.&lt;/p&gt;

&lt;p&gt;I confirmed that metadata filtering fixes this. When I manually filtered Qdrant to return only chunks from the correct PDF, retrieval hit 100%. Automatic filtering, where an LLM extracts the company from the query and the filter is applied before retrieval, is the planned fix.&lt;/p&gt;

&lt;h2&gt;The judge lies&lt;/h2&gt;

&lt;p&gt;Here's where things got interesting.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Avg Score (1-5)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context Relevance (C|Q)&lt;/td&gt;
&lt;td&gt;3.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer Faithfulness (A|C)&lt;/td&gt;
&lt;td&gt;3.36&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer Relevance (A|Q)&lt;/td&gt;
&lt;td&gt;3.96&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Answer Relevance judge classified 74 out of 100 answers as correct (score &amp;gt;= 4).&lt;/p&gt;

&lt;p&gt;That felt too good for a system I knew was struggling. So I calibrated.&lt;/p&gt;

&lt;h2&gt;Calibration: the part nobody does&lt;/h2&gt;

&lt;p&gt;I took 30 query-answer pairs and manually compared them against FinanceBench's ground truth. By my own labels, accuracy was 27%: only 8 out of 30 answers were actually correct.&lt;/p&gt;

&lt;p&gt;Then I checked the judge against my labels:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TPR (sensitivity)&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TNR (specificity)&lt;/td&gt;
&lt;td&gt;0.55&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TPR 1.00 means when an answer is correct, the judge always catches it. Good.&lt;/p&gt;

&lt;p&gt;TNR 0.55 means when an answer is wrong, the judge catches it only 55% of the time. The other 45% of wrong answers pass as correct.&lt;/p&gt;
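
&lt;p&gt;A minimal sketch of that check, assuming the human labels and judge verdicts are stored as two parallel lists of booleans (a judge "pass" being a score of 4 or more). The function name is illustrative, and the per-answer labels below are synthetic: they are chosen only to be consistent with the rounded rates in the table, not copied from my label file.&lt;/p&gt;

```python
def judge_rates(human_correct: list[bool], judge_pass: list[bool]) -> dict[str, float]:
    """Compare judge verdicts with human labels, pair by pair.
    TPR: share of truly correct answers that the judge passes.
    TNR: share of truly wrong answers that the judge rejects.
    Assumes at least one correct and one wrong answer in the sample."""
    pairs = list(zip(human_correct, judge_pass))
    tp = sum(1 for human, judge in pairs if human and judge)
    fn = sum(1 for human, judge in pairs if human and not judge)
    tn = sum(1 for human, judge in pairs if not human and not judge)
    fp = sum(1 for human, judge in pairs if not human and judge)
    return {"tpr": tp / (tp + fn), "tnr": tn / (tn + fp)}


# Synthetic labels: 8 correct answers (all passed by the judge) and
# 22 wrong answers (12 rejected, 10 passed).
human = [True] * 8 + [False] * 22
judge = [True] * 8 + [False] * 12 + [True] * 10
rates = judge_rates(human, judge)
print(f"TPR={rates['tpr']:.2f} TNR={rates['tnr']:.2f}")
```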

&lt;p&gt;Real example: the judge gave 5/5 to an answer saying "$1,608M" when the ground truth was "$2,018M." The response was well-structured, cited a source, used proper financial language. It just had the wrong number.&lt;/p&gt;

&lt;p&gt;This is the core problem: &lt;strong&gt;the judge evaluates fluency, not factual accuracy.&lt;/strong&gt; It can't verify numbers because it doesn't have the ground truth to compare against.&lt;/p&gt;

&lt;h2&gt;The fix: give the judge the answer key&lt;/h2&gt;

&lt;p&gt;I added a fourth metric — Answer Correctness (A|GT) — where the judge prompt includes the expected answer from FinanceBench alongside the model's response. Now the judge can actually check if "$1,608M" matches "$2,018M."&lt;/p&gt;
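
&lt;p&gt;Roughly, the idea looks like the sketch below. The prompt wording, the function names, and the pass threshold of 4 are assumptions made for illustration (the threshold mirrors the cutoff used for the A|Q judge), and the call to the judge model itself is left out. The prompt actually used lives in the repo and may differ.&lt;/p&gt;

```python
import re

# Hypothetical prompt wording, written for this example only.
CORRECTNESS_PROMPT = """You are grading a financial question-answering system.
Question: {question}
Expected answer (ground truth): {expected}
Model answer: {answer}

Compare the figures in the model answer with the expected answer.
Reply with a single integer from 1 (contradicts the ground truth)
to 5 (matches the ground truth, allowing for rounding and units)."""


def build_correctness_prompt(question: str, expected: str, answer: str) -> str:
    """Fill the template so the judge sees the answer key next to the response."""
    return CORRECTNESS_PROMPT.format(question=question, expected=expected, answer=answer)


def parse_score(judge_reply: str) -> int:
    """Pull the first digit 1-5 out of the judge's reply; raise if none is found."""
    match = re.search(r"[1-5]", judge_reply)
    if match is None:
        raise ValueError(f"no score found in: {judge_reply!r}")
    return int(match.group())


def is_correct(judge_reply: str, threshold: int = 4) -> bool:
    """Assumed cutoff: a score of `threshold` or more counts as correct."""
    return parse_score(judge_reply) >= threshold


# The question text is a placeholder; the two figures are the ones from the example above.
prompt = build_correctness_prompt("What was the reported figure?", "$2,018M", "$1,608M")
print(prompt)
```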

&lt;p&gt;After adding A|GT:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TPR&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TNR&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.86&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TNR went from 0.55 to 0.86. The judge now catches 86% of wrong answers.&lt;/p&gt;

&lt;p&gt;With this calibrated judge, 53 out of 100 answers were scored as correct. Not 74.&lt;/p&gt;
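
&lt;p&gt;Because the judge still lets some wrong answers through (TNR below 1), its pass count is likely an overestimate. One standard way to adjust for that, which I did not apply in this post, is a Rogan-Gladen style correction that uses the measured TPR and TNR. The sketch below is a worked example under that assumption; the function name is made up.&lt;/p&gt;

```python
def corrected_pass_rate(observed: float, tpr: float, tnr: float) -> float:
    """Estimate the true share of correct answers from the judge's observed
    pass rate, using: observed = tpr * true + (1 - tnr) * (1 - true),
    solved for `true`."""
    denominator = tpr + tnr - 1.0
    if not denominator > 0:
        raise ValueError("judge is no better than chance; cannot correct")
    estimate = (observed + tnr - 1.0) / denominator
    return min(1.0, max(0.0, estimate))  # clamp to a valid proportion


# Figures from this post: 53 of 100 passed, TPR 1.00, TNR 0.86.
print(round(corrected_pass_rate(0.53, tpr=1.00, tnr=0.86), 2))
```

&lt;p&gt;With those figures the sketch returns about 0.45. Treat that as a rough indication only: the TPR and TNR come from just 30 labeled examples, so the uncertainty around them is wide.&lt;/p&gt;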

&lt;h2&gt;Two judges, two purposes&lt;/h2&gt;

&lt;p&gt;This isn't about one being better. They measure different things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A|Q (no ground truth)&lt;/strong&gt; simulates production. In a live system, you don't have the right answer — that's why the user is asking. This judge tells you if the response is coherent and relevant. Good for monitoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A|GT (with ground truth)&lt;/strong&gt; is for development. When you have labeled data, you use it. This tells you if your pipeline is actually improving or if you're just getting more fluent wrong answers.&lt;/p&gt;

&lt;p&gt;The mistake is using only A|Q during development and trusting the numbers. My pipeline looked like 74%. It was 53%.&lt;/p&gt;

&lt;h2&gt;What didn't work&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Automatic metadata filtering via exact match.&lt;/strong&gt; I tried extracting the company name with the LLM and filtering Qdrant by source filename. The problem: Qdrant's match filter does exact string matching, and "Johnson &amp;amp; Johnson" doesn't match &lt;code&gt;JOHNSON_JOHNSON_2022_10K.pdf&lt;/code&gt;. This needs fuzzy or substring matching, which I deferred to the next phase.&lt;/p&gt;
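
&lt;p&gt;One possible direction, sketched here and not yet implemented, is to normalize the extracted company name and the filenames into the same form before comparing, so that the filter can stay exact. This assumes filenames start with the company name, as in the example above. The helper names and the second filename are hypothetical.&lt;/p&gt;

```python
import re


def company_key(name: str) -> str:
    """Normalize a name: keep alphanumeric runs, uppercase them, join with underscores."""
    return "_".join(token.upper() for token in re.findall(r"[A-Za-z0-9]+", name))


def matching_sources(company: str, filenames: list[str]) -> list[str]:
    """Return the filenames whose normalized form starts with the company key."""
    key = company_key(company)
    return [f for f in filenames if company_key(f).startswith(key + "_")]


# "\u0026" is the ampersand, so this is the name as an LLM might extract it.
extracted = "Johnson \u0026 Johnson"
# The first filename is from the post; the second is made up for illustration.
files = ["JOHNSON_JOHNSON_2022_10K.pdf", "ADOBE_2022_10K.pdf"]
print(matching_sources(extracted, files))
```

&lt;p&gt;The matched filenames could then be handed to an exact-match filter. Prefix matching can still collide when one company's key is a prefix of another's, so the candidates are worth checking before trusting them.&lt;/p&gt;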

&lt;p&gt;&lt;strong&gt;Framework default judge prompts.&lt;/strong&gt; Most RAG eval tools ship generic prompts that work for "does this make sense?" but fail for "is this number right?" If your domain requires factual precision, you need custom prompts and you need to calibrate them against human labels. There's no shortcut here.&lt;/p&gt;

&lt;h2&gt;Where things stand&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval Recall@6&lt;/td&gt;
&lt;td&gt;0.830&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy (calibrated)&lt;/td&gt;
&lt;td&gt;53/100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Judge TPR&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Judge TNR&lt;/td&gt;
&lt;td&gt;0.86&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pipeline retrieves the right document 83% of the time, but the calibrated judge accepts the answer only 53% of the time. Of the 47 rejected answers, at most 17 can be blamed on retrieval misses; the remaining 30 or more are generation errors on correctly retrieved documents.&lt;/p&gt;

&lt;p&gt;Next: systematic error analysis. Categorize every failure, pick the top 2 modes, fix them, measure impact.&lt;/p&gt;
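
&lt;p&gt;The tallying step is simple once each failure has a label. A minimal sketch, with category names that are placeholders (the real ones will come from reading the failures):&lt;/p&gt;

```python
from collections import Counter

# Hypothetical failure labels, one per failed query.
failure_labels = [
    "wrong_document", "wrong_number", "wrong_number",
    "wrong_period", "wrong_document", "wrong_number",
]


def top_failure_modes(labels: list[str], n: int = 2) -> list[tuple[str, int]]:
    """Return the n most frequent failure categories with their counts."""
    return Counter(labels).most_common(n)


print(top_failure_modes(failure_labels))
```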

&lt;p&gt;Repo: &lt;a href="https://github.com/joaopaulotr/financebench-rag-eval" rel="noopener noreferrer"&gt;financebench-rag-eval&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2311.11944" rel="noopener noreferrer"&gt;FinanceBench&lt;/a&gt; — Patronus AI&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://jxnl.co" rel="noopener noreferrer"&gt;6 RAG Evals&lt;/a&gt; — Jason Liu&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hamel.dev/blog/posts/evals-faq/" rel="noopener noreferrer"&gt;LLM Evals FAQ&lt;/a&gt; — Hamel Husain&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>agents</category>
      <category>python</category>
    </item>
  </channel>
</rss>
