<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shiva Shrestha</title>
    <description>The latest articles on DEV Community by Shiva Shrestha (@shiva_shrestha_1b37675aab).</description>
    <link>https://dev.to/shiva_shrestha_1b37675aab</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3913866%2Fc21d6de0-d5b8-4219-bbe1-02f329bba992.png</url>
      <title>DEV Community: Shiva Shrestha</title>
      <link>https://dev.to/shiva_shrestha_1b37675aab</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shiva_shrestha_1b37675aab"/>
    <language>en</language>
    <item>
      <title>Building a RAG Evaluation Harness That Actually Catches Problems</title>
      <dc:creator>Shiva Shrestha</dc:creator>
      <pubDate>Tue, 05 May 2026 13:10:37 +0000</pubDate>
      <link>https://dev.to/shiva_shrestha_1b37675aab/building-a-rag-evaluation-harness-that-actually-catches-problems-198i</link>
      <guid>https://dev.to/shiva_shrestha_1b37675aab/building-a-rag-evaluation-harness-that-actually-catches-problems-198i</guid>
      <description>&lt;p&gt;Most "chat with your website" projects ship without any measurement. Mine did too. The live demo was up, answers looked plausible, and I moved on. Then I built a proper evaluation harness and found out exactly how wrong "looks plausible" is as a quality signal.&lt;/p&gt;

&lt;p&gt;This post covers the eval design, the bugs it caught, the prompt changes that fixed most of them, and the two metrics that still don't pass threshold after all the fixes. The failures are the interesting part.&lt;/p&gt;




&lt;h2&gt;
  
  
  The System
&lt;/h2&gt;

&lt;p&gt;Web Intelligence is a RAG pipeline that turns any public URL into a queryable knowledge base. You give it a URL, it crawls up to 50 pages, chunks and embeds the text with Pinecone's &lt;code&gt;multilingual-e5-large&lt;/code&gt;, and stores vectors in a serverless Pinecone index. At query time, top-k chunks are retrieved and passed to an LLM (Gemini 2.0 Flash or a local Ollama model) with a strict context-only prompt.&lt;/p&gt;

&lt;p&gt;Nothing exotic. The evaluation harness is the part I want to talk about.&lt;/p&gt;




&lt;h2&gt;
  
  
  Eval Design: The Answerable/Unanswerable Split
&lt;/h2&gt;

&lt;p&gt;The most important design decision comes before you write a single metric: split the question bank into two tracks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;All eval questions
├── Answerable    → Hit@k · MRR · Faithfulness · Hallucination · Ctx Coverage
└── Unanswerable  → Rejection Rate (did the system correctly refuse?)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters because they measure fundamentally different behaviours. An unanswerable question where the system correctly refuses should not contribute &lt;code&gt;Hit@1 = 0&lt;/code&gt; to your retrieval average. Before I introduced the split, three out-of-scope questions were dragging down the Hit@k numbers, and there was no metric at all for whether the refusals were happening. The system was getting credit for nothing and penalised for things it was doing right.&lt;/p&gt;

&lt;p&gt;The baseline: &lt;code&gt;aboutamazon.com&lt;/code&gt;, 5 answerable questions + 3 unanswerable questions, &lt;code&gt;top_k=5&lt;/code&gt;. Small sample - I'll address that.&lt;/p&gt;
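As a sketch of how the split routes into different scoring paths (the question schema and refusal-string check here are my illustration, not the repo's exact code):

```python
# Sketch of split-aware scoring. The `kind` field and REFUSAL constant are
# illustrative assumptions; the harness's actual schema may differ.
REFUSAL = "Sorry, I couldn't find this information. Please try another question."

def hit_at_k(ranked_hits, k):
    """1.0 if any of the top-k retrieved chunks is relevant, else 0.0."""
    return 1.0 if any(ranked_hits[:k]) else 0.0

def mrr_at_k(ranked_hits, k):
    """Reciprocal rank of the first relevant chunk in the top k, else 0.0."""
    for rank, hit in enumerate(ranked_hits[:k], start=1):
        if hit:
            return 1.0 / rank
    return 0.0

def score_question(question, ranked_hits, answer):
    if question["kind"] == "unanswerable":
        # Only the rejection check applies - no retrieval credit or penalty.
        return {"rejected": answer.strip() == REFUSAL}
    return {
        "hit@1": hit_at_k(ranked_hits, 1),
        "hit@5": hit_at_k(ranked_hits, 5),
        "mrr@5": mrr_at_k(ranked_hits, 5),
    }
```

The point of the routing is visible in the last function: an unanswerable question can never touch the retrieval averages, and an answerable one can never touch the rejection rate.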




&lt;h2&gt;
  
  
  Issue 1: Hit@1 Was 60% for the Wrong Reason
&lt;/h2&gt;

&lt;p&gt;Two of five questions scored Hit@1 = 0. For Q01 ("What does Amazon do?"), the top-ranked chunk by cosine similarity (0.857) was Amazon's mission statement, which is clearly relevant. But my ground-truth keyword was &lt;code&gt;"ecommerce"&lt;/code&gt; and the chunk text used &lt;code&gt;"e-commerce"&lt;/code&gt; with a hyphen.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Original - breaks on surface-form variants
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_hit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Fixed — normalise before comparison
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_norm_kw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[\s\-_]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_hit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;norm_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_norm_kw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_norm_kw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;norm_text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: Hit@1 60% → &lt;strong&gt;80%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Q03 had a harder problem alongside the normalisation bug: the top chunk genuinely addressed Amazon's mission rather than the business lines the question asked about. That's a ranking problem, not an embedding problem - the mission statement is semantically related to "what Amazon does", so the bi-encoder ranks it highly - but a cross-encoder re-ranker scoring (query, chunk) pairs jointly would promote the more task-relevant chunk. That fix is still pending.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle34d5bnbih389zwiu9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle34d5bnbih389zwiu9h.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Issue 2: Hallucination Was 41% but the Metric Was Partly Lying
&lt;/h2&gt;

&lt;p&gt;Before the prompt fix, hallucination averaged 41%. After the fix, it dropped to 28%. But the story of &lt;em&gt;why&lt;/em&gt; it was 41% is more useful than the number.&lt;/p&gt;

&lt;p&gt;The hallucination metric is &lt;code&gt;1 - ctx_coverage&lt;/code&gt;, where:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ctx_coverage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;answer_tokens&lt;/span&gt; &lt;span class="err"&gt;∩&lt;/span&gt; &lt;span class="n"&gt;context_tokens&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;answer_tokens&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;with NLTK stopwords removed from both sides. The problem: &lt;strong&gt;verbosity inflates this metric without representing actual fabrication.&lt;/strong&gt;&lt;/p&gt;
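The formula amounts to a set overlap over answer tokens. A minimal, dependency-free sketch (hard-coding a tiny stopword set where the harness uses NLTK's full list):

```python
import re

# Stand-in for the NLTK stopword list, just to keep the sketch self-contained.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "it"}

def tokens(text):
    """Lowercased word set with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOPWORDS}

def ctx_coverage(answer, context):
    """Fraction of the answer's tokens that also appear in the context."""
    ans = tokens(answer)
    if not ans:
        return 0.0
    return len(ans & tokens(context)) / len(ans)

# hallucination (raw) = 1 - ctx_coverage(answer, context)
```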

&lt;p&gt;With my original prompt (&lt;code&gt;"Prioritise the provided context"&lt;/code&gt;, &lt;code&gt;"Under 400 words"&lt;/code&gt;), answers averaged 219 words. The LLM produced long, connector-heavy responses. Words like &lt;code&gt;"Overall"&lt;/code&gt;, &lt;code&gt;"As a result"&lt;/code&gt;, &lt;code&gt;"combining"&lt;/code&gt;, &lt;code&gt;"leveraging"&lt;/code&gt; don't appear in the retrieved chunks — but they're not factual claims either. They counted as hallucinated tokens.&lt;/p&gt;

&lt;p&gt;I separated these two failure modes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Factual Risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM knowledge leakage&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;"Career Choice"&lt;/code&gt;, &lt;code&gt;"The Climate Pledge"&lt;/code&gt; inserted from training&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Connector expansion&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;"Overall, Amazon combines…"&lt;/code&gt;, &lt;code&gt;"As a result…"&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The fix: a &lt;code&gt;hallucination_cw&lt;/code&gt; metric that counts only content words of at least five characters and treats common connectors (&lt;code&gt;"overall"&lt;/code&gt;, &lt;code&gt;"result"&lt;/code&gt;, &lt;code&gt;"based"&lt;/code&gt;) as stopwords, so they no longer register as hallucinated tokens. The &lt;code&gt;verbosity_score&lt;/code&gt; field (&lt;code&gt;max(0, (words − 150) / 150)&lt;/code&gt;) quantifies how much of the raw metric is inflation.&lt;/p&gt;
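A sketch of what such a metric could look like (the connector list and exact filtering here are my illustration; the harness's version may differ):

```python
import re

# Connector/discourse words treated as non-content even when they pass the
# length filter. This particular list is illustrative, not the repo's.
CONNECTORS = {"overall", "result", "based", "combining", "leveraging"}

def hallucination_cw(answer, context, min_len=5):
    """Fraction of the answer's content words (>= min_len chars, non-connector)
    that never appear in the retrieved context."""
    ctx = set(re.findall(r"[a-z]+", context.lower()))
    content = [w for w in re.findall(r"[a-z]+", answer.lower())
               if len(w) >= min_len and w not in CONNECTORS]
    if not content:
        return 0.0
    return sum(w not in ctx for w in content) / len(content)

def verbosity_score(answer, target=150):
    """How far past the target word count the answer runs, as a fraction."""
    return max(0.0, (len(answer.split()) - target) / target)
```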




&lt;h2&gt;
  
  
  Issue 3: The Prompt Was Too Soft
&lt;/h2&gt;

&lt;p&gt;The original prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a website content assistant. 
Prioritise the provided context when answering.
Under 400 words.

CONTEXT:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

QUESTION:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;"Prioritise"&lt;/code&gt; is not a constraint. The LLM treated it as a suggestion. On Amazon-specific questions, it injected training knowledge: product names, operational statistics, initiatives that weren't in any retrieved chunk.&lt;/p&gt;

&lt;p&gt;The fixed prompt (current &lt;code&gt;rag.py&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a website content assistant. Answer ONLY using the text in the CONTEXT section below.

Rules:
- ONLY use information explicitly present in the CONTEXT. Do not add facts, names, or details from your training knowledge.
- If the context has nothing relevant, respond exactly: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sorry, I couldn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t find this information. Please try another question.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
- Be concise and specific. No filler, no elaboration beyond what the context states.
- Under 150 words. If the question genuinely requires more, cap at 200 words maximum.

CONTEXT:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

QUESTION:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

ANSWER (cite only what the CONTEXT states):&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before/after:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Avg words&lt;/td&gt;
&lt;td&gt;219&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;97&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≤ 150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination (raw)&lt;/td&gt;
&lt;td&gt;~41%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;27%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination (CW) ★&lt;/td&gt;
&lt;td&gt;~41%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≤ 25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ctx Coverage&lt;/td&gt;
&lt;td&gt;59%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≥ 65%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Two Metrics That Still Fail
&lt;/h2&gt;

&lt;p&gt;Honest reporting: two checks are still red after all the fixes.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Hallucination (CW) 28% vs 25% threshold
&lt;/h3&gt;

&lt;p&gt;Three points off the threshold. The verbosity fix eliminated most of the noise; what remains is genuine leakage - 2 to 3 content words per answer that came from training knowledge rather than retrieved chunks. The 150-word cap reduced it but didn't eliminate it. The next step is LLM-as-judge faithfulness (RAGAS-style claim decomposition) to measure actual factual support rather than surface-form token overlap.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. KW Overlap 53% vs 75% threshold
&lt;/h3&gt;

&lt;p&gt;This one is partly self-inflicted. Before the word-cap fix, KW overlap was 83% - answers were long enough to sweep in every expected keyword. After the 150-word cap, correct answers are shorter and naturally drop some expected keywords; the keyword set was calibrated for 200-word answers, so the metric now penalises the very conciseness the prompt demands. Two options: tighten to 2–3 high-signal keywords per question, or weight by TF-IDF importance so that high-information terms count more.&lt;/p&gt;
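The TF-IDF-weighted option could look like this (the `idf` mapping and the function name are hypothetical - in practice the IDF values would be computed over the crawled corpus):

```python
def weighted_kw_overlap(answer, keywords, idf):
    """Keyword overlap where rare (high-IDF) keywords count more.
    `idf` maps keyword -> inverse document frequency; unseen keywords
    default to a neutral weight of 1.0."""
    answer_l = answer.lower()
    total = sum(idf.get(kw, 1.0) for kw in keywords)
    if total == 0:
        return 0.0
    found = sum(idf.get(kw, 1.0) for kw in keywords if kw.lower() in answer_l)
    return found / total
```

Under this scheme a short answer that hits the one high-information term ("logistics") scores well even if it drops a generic term ("company").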




&lt;h2&gt;
  
  
  Full Results Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Track&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Answerable&lt;/td&gt;
&lt;td&gt;Hit@1&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≥ 80%&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answerable&lt;/td&gt;
&lt;td&gt;Hit@5&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≥ 95%&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answerable&lt;/td&gt;
&lt;td&gt;MRR@5&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.883&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≥ 0.75&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answerable&lt;/td&gt;
&lt;td&gt;Hallucination (CW)&lt;/td&gt;
&lt;td&gt;~41%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≤ 25%&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answerable&lt;/td&gt;
&lt;td&gt;Ctx Coverage&lt;/td&gt;
&lt;td&gt;59%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≥ 65%&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answerable&lt;/td&gt;
&lt;td&gt;KW Overlap&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≥ 75%&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answerable&lt;/td&gt;
&lt;td&gt;Avg Words&lt;/td&gt;
&lt;td&gt;219&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;97&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≤ 150&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unanswerable&lt;/td&gt;
&lt;td&gt;Rejection Rate&lt;/td&gt;
&lt;td&gt;unmeasured&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≥ 90%&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Scope note: one site, 8 questions. These are directional signals, not a production-grade benchmark.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cross-encoder re-ranking&lt;/strong&gt; - replace bi-encoder-only ranking with a &lt;code&gt;ms-marco-MiniLM-L-6-v2&lt;/code&gt; cross-encoder as a second-pass re-ranker. Expected Hit@1 improvement: 80% → 90%+.&lt;/p&gt;
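The second pass is a small wrapper around any joint scorer. A sketch with the scorer left pluggable - in practice `score_fn` could be `CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2').predict` from sentence-transformers:

```python
def rerank(query, chunks, score_fn, top_n=5):
    """Second-pass re-rank: score each (query, chunk) pair jointly and
    re-sort the bi-encoder's candidate list. `score_fn` takes a list of
    (query, chunk) pairs and returns one relevance score per pair."""
    scores = score_fn([(query, chunk) for chunk in chunks])
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in order[:top_n]]
```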

&lt;p&gt;&lt;strong&gt;LLM-as-judge faithfulness&lt;/strong&gt; - RAGAS-style: decompose each answer into atomic claims and verify each claim against retrieved chunks. Slower and costs tokens but measures actual correctness instead of token overlap.&lt;/p&gt;
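The claim-decomposition loop could look like this, with `llm` as any prompt-to-text callable - both prompt templates are illustrative, not RAGAS's exact ones:

```python
def faithfulness(answer, context, llm):
    """RAGAS-style sketch: split the answer into atomic claims, then ask a
    judge model whether each claim is supported by the retrieved context.
    `llm` is any callable mapping a prompt string to a response string."""
    claims = [c.strip()
              for c in llm(f"List each atomic factual claim in:\n{answer}").splitlines()
              if c.strip()]
    if not claims:
        return 1.0
    supported = sum(
        llm(f"Context:\n{context}\n\nClaim: {claim}\nSupported? Answer YES or NO.")
        .strip().upper().startswith("YES")
        for claim in claims
    )
    return supported / len(claims)
```

Faithfulness is then the fraction of supported claims, so a 200-word answer with one fabricated fact scores the same as a 50-word answer with one fabricated fact - exactly the length-invariance the token-overlap metric lacks.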

&lt;p&gt;&lt;strong&gt;Answer-length calibration&lt;/strong&gt; - run the eval at word caps of 100/125/150/175 and plot hallucination (CW) vs KW overlap. Find the Pareto-optimal cap where both pass threshold simultaneously.&lt;/p&gt;
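The sweep itself is a small loop; `run_eval` here is a hypothetical stand-in for re-running the notebook at a given cap:

```python
def calibration_sweep(run_eval, caps=(100, 125, 150, 175)):
    """Re-run the eval at each word cap and collect the two competing
    metrics, ready for plotting hallucination (CW) vs KW overlap."""
    rows = []
    for cap in caps:
        metrics = run_eval(cap)  # -> {"hallucination_cw": ..., "kw_overlap": ...}
        rows.append((cap, metrics["hallucination_cw"], metrics["kw_overlap"]))
    return rows
```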

&lt;p&gt;&lt;strong&gt;Keyword set recalibration&lt;/strong&gt; - reduce to 2–3 high-signal terms per question, or adopt TF-IDF weighting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Code and Demo
&lt;/h2&gt;

&lt;p&gt;GitHub repo: &lt;a href="https://github.com/shivashrestha/web-intelligence" rel="noopener noreferrer"&gt;web-intelligence&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Live demo: &lt;a href="https://web-intelligence-red.vercel.app" rel="noopener noreferrer"&gt;web-intelligence-red.vercel.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The eval notebook is at &lt;code&gt;backend/rag_eval_single.ipynb&lt;/code&gt;. Results JSON written to &lt;code&gt;data/eval_single_&amp;lt;site&amp;gt;_&amp;lt;date&amp;gt;.json&lt;/code&gt; on each run.&lt;/p&gt;

&lt;p&gt;If you've built RAG eval harnesses and hit similar issues, especially the verbosity/hallucination conflation, I'd like to hear how you handled it ☺️.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
