<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rajat Sharma</title>
    <description>The latest articles on DEV Community by Rajat Sharma (@rajat_sharma_370a62b67b15).</description>
    <link>https://dev.to/rajat_sharma_370a62b67b15</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3465732%2Fd3c69abe-a4c7-41f1-b1e0-d8b444e8509f.jpg</url>
      <title>DEV Community: Rajat Sharma</title>
      <link>https://dev.to/rajat_sharma_370a62b67b15</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rajat_sharma_370a62b67b15"/>
    <language>en</language>
    <item>
      <title>The 3 test framework I use for MCP servers</title>
      <dc:creator>Rajat Sharma</dc:creator>
      <pubDate>Thu, 05 Mar 2026 09:06:15 +0000</pubDate>
      <link>https://dev.to/rajat_sharma_370a62b67b15/the-3-test-framework-i-use-for-mcp-servers-1c98</link>
      <guid>https://dev.to/rajat_sharma_370a62b67b15/the-3-test-framework-i-use-for-mcp-servers-1c98</guid>
      <description>&lt;p&gt;MCP servers are easy to wire up. You export a tool, define a schema, connect a client, done. Then Claude generates the wrong arguments, the handler silently misroutes an edge case, and the judge returns 0.9 for an empty string response.&lt;/p&gt;

&lt;p&gt;The protocol layer rarely gets tested properly. Here's the 3-test framework I use. Each one catches a different class of failure.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Schema Contract — Did someone rename a field?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; A developer changes &lt;code&gt;groundTruth&lt;/code&gt; to &lt;code&gt;ground_truth&lt;/code&gt; in the tool schema. Every MCP caller breaks silently. No type error at runtime, no warning, just wrong behaviour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; Import the exported schema object directly and assert on its shape. No server boot, no network call, no mocks. Pure property assertions on the contract.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llm_response&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;groundTruth&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;criteria&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;array&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;required&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llm_response&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;required&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;groundTruth&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;required&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;criteria&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a field is renamed or retyped, this fails before any integration test wastes time booting. It's the cheapest regression guard in the stack.&lt;/p&gt;
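&lt;p&gt;For reference, here's a sketch of a tool schema that would satisfy those assertions. The descriptions and any extra fields are up to your server; only the shape is what the contract test pins down:&lt;/p&gt;

```typescript
// Hypothetical sketch of the exported tool input schema that the
// assertions above run against. Only the shape matters here.
const inputSchema = {
  type: "object",
  properties: {
    llm_response: { type: "string" },
    groundTruth: { type: "string" },
    criteria: { type: "array", items: { type: "string" } },
  },
  required: ["llm_response", "groundTruth", "criteria"],
};

// The contract test is pure property assertions on this object:
if (inputSchema.type !== "object") throw new Error("schema.type changed");
for (const field of inputSchema.required) {
  if (!(field in inputSchema.properties)) {
    throw new Error(`required field "${field}" missing from properties`);
  }
}
```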

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdy9nnwughi0gf2z63lq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdy9nnwughi0gf2z63lq.png" alt="Schema Contract Test Results" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Tool Behaviour — Does the protocol layer handle edge cases?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; The MCP server handler has routing or edge-case bugs independent of the LLM. An empty string input shouldn't score 0.9. A judge crash shouldn't return garbage — it should surface as a structured error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; Wire MCP Client → Server through &lt;code&gt;InMemoryTransport&lt;/code&gt; (no network). Replace the real Claude judge with a &lt;code&gt;vi.fn()&lt;/code&gt; mock. This isolates the protocol layer completely from LLM variability.&lt;br&gt;
&lt;/p&gt;
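&lt;p&gt;Stripped of transport plumbing, the handler logic under test reduces to two guards. This is a hypothetical sketch, not the server's actual code; the score shape and judge signature are assumptions for illustration:&lt;/p&gt;

```typescript
// Hypothetical handler sketch: the two edge cases the suite exercises.
// The score shape and judge signature are assumptions for illustration.
type JudgeResult = { overall: number };

function handleAnalyze(
  llmResponse: string,
  judge: (response: string) => JudgeResult,
) {
  // Guard 1: an empty response scores near zero without reaching the judge.
  if (llmResponse.trim() === "") {
    return { overall: 0 };
  }
  // Guard 2: a judge crash surfaces as a structured error, not a score.
  try {
    return judge(llmResponse);
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { error: message, isError: true };
  }
}
```

&lt;p&gt;In the real suite the &lt;code&gt;judge&lt;/code&gt; argument is the &lt;code&gt;vi.fn()&lt;/code&gt; mock, so both guards run without an API call.&lt;/p&gt;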

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Test 1 — empty string input
llm_response: ""
Expected: score &amp;lt; 0.2
Got:      overall: 0.9  ✗ FAIL

Root cause: assertion threshold was &amp;gt; 1 — impossible on a 0–1 scale.
Fix: expect(result.overall).toBeLessThan(0.2)

Test 2 — judge throws "Claude API is down"
Expected: { error: "Claude API is down", isError: true }
Got:      { error: "Claude API is down", isError: true }  ✓ PASS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The failing test here isn't a protocol bug. It's a broken assertion — the threshold was set to &lt;code&gt;&amp;gt; 1&lt;/code&gt;, which no valid score can ever satisfy. This test suite found it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fippz5w402zluxczjpouf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fippz5w402zluxczjpouf.png" alt="Tool Behavior Test Results" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. LLM in Loop — Does Claude actually generate the right arguments?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; The schema tests pass, the transport tests pass, but Claude generates wrong tool arguments when it sees the real tool list. Or the judge is miscalibrated and scores a clearly wrong answer at 0.9. The only way to catch this is to run the full round-trip with real inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; Send Claude a message via the Anthropic API. Claude reads the tool list from the MCP server, decides to call &lt;code&gt;analyze_response_quality&lt;/code&gt;, and generates its own input arguments. The test doesn't control what Claude sends. Then assert on the scores.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Known-good: "Photosynthesis is the process by which plants use sunlight,
             water, and CO2 to produce glucose and oxygen."
→ overall: 0.95  (accuracy: 1.0, relevance: 0.9)  assert ≥ 0.8  ✓

Known-bad: "Photosynthesis is when plants absorb soil nutrients to grow."
→ overall: 0.05  (accuracy: 0.0, relevance: 0.1)  assert ≤ 0.4  ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full path: user message → Claude → &lt;code&gt;tool_use: analyze_response_quality&lt;/code&gt; → MCP Client → InMemoryTransport → MCP Server → &lt;code&gt;callClaudeJudge&lt;/code&gt; → score back to Claude.&lt;/p&gt;

&lt;p&gt;This catches two things at once: whether Claude generates valid arguments, and whether the judge is calibrated correctly.&lt;/p&gt;
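&lt;p&gt;The closing assertions reduce to a calibration band per fixture. A sketch, using the thresholds from the transcript above (they're choices for this suite, not universal constants):&lt;/p&gt;

```typescript
// Sketch of the calibration assertions: a known-good answer must clear
// a floor, a known-bad answer must stay under a ceiling. The 0.8 / 0.4
// thresholds are choices for this suite, not universal constants.
type Fixture = { kind: "good" | "bad"; overall: number };

function isCalibrated(fixtures: Fixture[]): boolean {
  return fixtures.every((f) =>
    f.kind === "good" ? f.overall >= 0.8 : 0.4 >= f.overall,
  );
}
```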

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F614anyrkwiutpp2md810.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F614anyrkwiutpp2md810.png" alt="LLM in Loop Test Results" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  I ran this on a real MCP server. Here's what came back.
&lt;/h2&gt;

&lt;p&gt;9 tests total. 1 failed. 89% pass rate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Schema Contract&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Behaviour&lt;/td&gt;
&lt;td&gt;1/2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM in Loop&lt;/td&gt;
&lt;td&gt;2/2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The single failure is in Tool Behaviour, and it's not a server bug. The assertion for the empty string case expects &lt;code&gt;overall &amp;gt; 1&lt;/code&gt; — a threshold that's impossible to satisfy on a 0–1 scale. The mock judge returned 0.9, which is actually wrong behavior (empty input should score near zero), but the test would have failed regardless of what score came back. Two bugs in one: an over-optimistic mock and a broken assertion.&lt;/p&gt;

&lt;p&gt;The fix is two lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// wrong&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;overall&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeGreaterThan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// right — empty input should score low&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;overall&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeLessThan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why three separate layers?
&lt;/h2&gt;

&lt;p&gt;Each test catches a different class of bug. Run all three before shipping.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it tests&lt;/th&gt;
&lt;th&gt;Needs API?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Schema Contract&lt;/td&gt;
&lt;td&gt;Tool definition shape&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Behaviour&lt;/td&gt;
&lt;td&gt;MCP protocol + handler&lt;/td&gt;
&lt;td&gt;No (mock judge)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM in Loop&lt;/td&gt;
&lt;td&gt;Claude's argument generation + judge calibration&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Schema fails fast and cheap. Tool Behaviour catches protocol bugs without burning API credits. LLM in Loop is the only one that validates actual end-to-end behavior — Claude reading the tool list, generating arguments, and getting a meaningful score back.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>mcp</category>
      <category>testing</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Your RAG pipeline is probably lying to you (here's how to test it)</title>
      <dc:creator>Rajat Sharma</dc:creator>
      <pubDate>Tue, 03 Mar 2026 11:43:26 +0000</pubDate>
      <link>https://dev.to/rajat_sharma_370a62b67b15/your-rag-pipeline-is-probably-lying-to-you-heres-how-to-test-it-51nf</link>
      <guid>https://dev.to/rajat_sharma_370a62b67b15/your-rag-pipeline-is-probably-lying-to-you-heres-how-to-test-it-51nf</guid>
      <description>&lt;p&gt;I've seen a lot of RAG setups that look great in demos. Clean answers, fast responses, confident tone. Then something breaks silently and nobody notices until a user gets a wrong answer served with full confidence.&lt;/p&gt;

&lt;p&gt;The problem isn't the LLM. It's that the RAG layer rarely gets tested properly.&lt;/p&gt;

&lt;p&gt;Here's the 5-test framework I use. Each one catches a different class of failure.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Faithfulness — Is the answer grounded in your docs?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; The model answers correctly... but pulls from its training memory, not your knowledge base. You can't audit it. You can't trust it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; Use an LLM-as-judge to map every claim in the answer back to a retrieved chunk. Score = &lt;code&gt;supported_claims / total_claims&lt;/code&gt;. Set a threshold and fail anything below it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Answer: "Minimum password length is 16 characters"
Chunk:  "Passwords must be a minimum of 16 characters..."
→ Claim supported ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the answer says "12 characters (industry standard)" and your doc says 16, that's hallucination. Even if 12 is technically reasonable.&lt;/p&gt;
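&lt;p&gt;Once the judge has labelled each claim, the score itself is a one-liner. A sketch (the hard part, claim extraction and the support check, lives in the judge prompt, not here):&lt;/p&gt;

```typescript
// Sketch of the faithfulness score: the LLM judge labels each claim in
// the answer as supported-by-a-chunk or not; the score is the fraction.
function faithfulness(claimSupported: boolean[]): number {
  if (claimSupported.length === 0) return 0;
  const supported = claimSupported.filter(Boolean).length;
  return supported / claimSupported.length;
}

// Gate on a threshold of your choosing, e.g. 0.9:
function passesFaithfulness(claimSupported: boolean[], threshold: number): boolean {
  return faithfulness(claimSupported) >= threshold;
}
```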

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnt15mr1cnnv923w9cnz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnt15mr1cnnv923w9cnz.png" alt="Test Result Faithfulness" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Context Precision — Are you even retrieving the right chunks?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; Garbage in, garbage out. Your retriever pulls irrelevant docs and the LLM does its best with bad context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; For each query, score the relevance of every retrieved chunk (0 to 1) using an LLM judge. If fewer than 2 of the top 3 chunks score above your threshold, your embedding/retrieval setup has alignment issues.&lt;/p&gt;

&lt;p&gt;This one catches problems that faithfulness testing misses. It checks &lt;em&gt;before&lt;/em&gt; the answer is generated.&lt;/p&gt;
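&lt;p&gt;The pass rule is a few lines once the judge has scored each chunk. A sketch, with the "2 of 3" quorum as a parameter:&lt;/p&gt;

```typescript
// Sketch of the context-precision gate: count retrieved chunks whose
// judge-assigned relevance clears the threshold, and require a quorum.
function precisionPasses(
  chunkScores: number[],
  threshold: number,
  minRelevant: number,
): boolean {
  const relevant = chunkScores.filter((s) => s >= threshold).length;
  return relevant >= minRelevant;
}
```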

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh181gd3t9svi6cmsutq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh181gd3t9svi6cmsutq2.png" alt="Test Result Precision" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Negative Testing — Does it know what it &lt;em&gt;doesn't&lt;/em&gt; know?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; Someone asks about something not in your KB. The model fills the gap with training data and answers confidently. In compliance, legal, or medical contexts, this is genuinely dangerous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; Write a list of questions that are deliberately &lt;em&gt;outside&lt;/em&gt; your knowledge base. For each one, check whether the response contains a refusal phrase like "I don't have information about...". If it doesn't, you've caught a live hallucination.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query: "How many vacation days do employees get?" (not in KB)

✓ Pass: "I don't have that information in the knowledge base."
✗ Fail: "Employees typically receive 15 days per year."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple string matching. No LLM needed. Fast and deterministic.&lt;/p&gt;
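&lt;p&gt;A sketch of the check. The refusal phrases here are examples; match whatever wording your system prompt actually instructs the model to produce:&lt;/p&gt;

```typescript
// Sketch of the negative test: deterministic substring matching against
// known refusal phrases. The phrases are examples, not a standard; use
// the wording your system prompt tells the model to say for KB misses.
const REFUSAL_PHRASES = [
  "i don't have information",
  "i don't have that information",
  "not in the knowledge base",
];

function isRefusal(answer: string): boolean {
  const lower = answer.toLowerCase();
  return REFUSAL_PHRASES.some((phrase) => lower.includes(phrase));
}
```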

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4mwqifj8xgu30ycwkd9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4mwqifj8xgu30ycwkd9.png" alt="Test Result Negative Case" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Retrieval Unit Test — Did your re-index break anything?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; You swap your embedding model, change chunk size, or re-index. Now a query that used to find the right doc doesn't anymore. No errors thrown. Pipeline looks fine. It's just wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; Maintain a ground-truth lookup: &lt;code&gt;query -&amp;gt; expected doc ID&lt;/code&gt;. After every infrastructure change, run this. Check that the expected doc ID appears somewhere in your top-K results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query: "What is the minimum password length?"
Expected: doc-003-security in top-3
Got: [doc-003-security (0.94), doc-001 (0.41), doc-002 (0.38)] ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No LLM needed. Pure regression testing for your retrieval layer.&lt;/p&gt;
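&lt;p&gt;The whole check is a membership test on the top-K IDs. A sketch, with an illustrative ground-truth map (the queries and doc IDs are examples, not a real index):&lt;/p&gt;

```typescript
// Sketch of the retrieval regression check: for each ground-truth pair,
// the expected doc ID must appear in the top-K result IDs. The queries
// and doc IDs here are illustrative.
const groundTruth: { [query: string]: string } = {
  "What is the minimum password length?": "doc-003-security",
};

function retrievalPasses(query: string, topKIds: string[]): boolean {
  const expected = groundTruth[query];
  return expected !== undefined ? topKIds.includes(expected) : false;
}
```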

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1836vgzmpihbixho9y6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1836vgzmpihbixho9y6.png" alt="Test Result Regression" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Stale Data — Did your update actually propagate?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; You update a policy doc and re-ingest it. But unless you delete or upsert by ID, many vector stores don't replace old embeddings. They &lt;em&gt;add&lt;/em&gt; the new one alongside the old one. Now both exist. Queries return either version non-deterministically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; Two phases.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1:&lt;/strong&gt; Ingest v1, query, confirm old value ("90 days") is returned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2:&lt;/strong&gt; Clear the collection, ingest v2, query, confirm new value ("60 days") is returned AND old value is absent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you skip the clear step in phase 2, you'll reproduce the bug right in your test suite.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;v2Answer&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b&lt;/span&gt;&lt;span class="sr"&gt;60&lt;/span&gt;&lt;span class="se"&gt;\b&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;not&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RegExp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;b&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;stale&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;assertion&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;b`&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjpse4pa0m8n77r5zj4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjpse4pa0m8n77r5zj4n.png" alt="Test Result Stale" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  I ran this on a real KB. Here's what came back.
&lt;/h2&gt;

&lt;p&gt;16 tests total. 3 failed. 81% pass rate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Faithfulness&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Precision&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Negative&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval Unit&lt;/td&gt;
&lt;td&gt;4/4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stale Data&lt;/td&gt;
&lt;td&gt;1/1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here's what the failures actually look like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context precision failed all 3 tests with the same pattern.&lt;/strong&gt; Every query retrieved exactly 1 relevant chunk out of 3. The right document always ranked first, but the similarity scores were too close together. For a password policy query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doc-003-security      0.374  relevant
doc-002-remote-work   0.209  irrelevant
doc-005-reimbursement 0.191  irrelevant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That 0.165 gap between the right doc and the wrong ones isn't confidence, it's noise. The retriever is finding the right doc by a slim margin. The answers look fine because the LLM is working around weak retrieval. That won't hold as the KB grows.&lt;/p&gt;




&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;Most RAG failures aren't spectacular crashes. They're silent. Confident wrong answers. Stale policy info. Hallucinated numbers. A retriever that quietly regressed after a model swap.&lt;/p&gt;

&lt;p&gt;These 5 tests cover the whole pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generation&lt;/td&gt;
&lt;td&gt;Faithfulness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval quality&lt;/td&gt;
&lt;td&gt;Context Precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Out-of-scope handling&lt;/td&gt;
&lt;td&gt;Negative&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval regression&lt;/td&gt;
&lt;td&gt;Unit Test&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Update propagation&lt;/td&gt;
&lt;td&gt;Stale Data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
    </item>
  </channel>
</rss>
