<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kaltofen</title>
    <description>The latest articles on DEV Community by Kaltofen (@coldoven).</description>
    <link>https://dev.to/coldoven</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869877%2F34aac0c8-6ee5-4425-892e-96077ce2d0b7.jpg</url>
      <title>DEV Community: Kaltofen</title>
      <link>https://dev.to/coldoven</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/coldoven"/>
    <language>en</language>
    <item>
      <title>Why I debug my RAG pipeline stage by stage, not end to end</title>
      <dc:creator>Kaltofen</dc:creator>
      <pubDate>Thu, 09 Apr 2026 13:24:29 +0000</pubDate>
      <link>https://dev.to/coldoven/why-i-debug-my-rag-pipeline-stage-by-stage-not-end-to-end-1faf</link>
      <guid>https://dev.to/coldoven/why-i-debug-my-rag-pipeline-stage-by-stage-not-end-to-end-1faf</guid>
      <description>&lt;h2&gt;The problem with end-to-end RAG eval&lt;/h2&gt;

&lt;p&gt;I had a working document retrieval pipeline. Fixed-size chunking, TF-IDF embeddings, FAISS index. Recall@10 was 0.82 on SciFact. Good enough.&lt;/p&gt;

&lt;p&gt;Then I made one change: I swapped fixed-size chunking for sentence-based chunking. Recall dropped to 0.68.&lt;/p&gt;

&lt;p&gt;My first instinct was to roll back. But I wanted to understand &lt;em&gt;why&lt;/em&gt;. End-to-end eval only told me "retrieval is worse." It couldn't tell me which stage was responsible.&lt;/p&gt;

&lt;h2&gt;The debugging approach&lt;/h2&gt;

&lt;p&gt;I restructured the pipeline so each stage can be evaluated independently. The pipeline is expressed as a string feature chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mloda.user&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlodaAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PluginCollector&lt;/span&gt;

&lt;span class="c1"&gt;# The full pipeline: each __ is a stage boundary
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlodaAPI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs__pii_redacted__chunked__deduped__embedded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stop at chunking? &lt;code&gt;"docs__pii_redacted__chunked"&lt;/code&gt;. &lt;br&gt;
Skip dedup? &lt;code&gt;"docs__pii_redacted__chunked__embedded"&lt;/code&gt;. &lt;br&gt;
Add evaluation? &lt;code&gt;"docs__pii_redacted__chunked__deduped__embedded__evaluation"&lt;/code&gt;.&lt;/p&gt;
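
&lt;p&gt;The chain convention is plain string manipulation: splitting a feature name on the double underscore recovers the ordered stage list, and a truncated chain is just a prefix of the full one. A minimal sketch of that idea (no mloda dependency; the &lt;code&gt;stages&lt;/code&gt; helper is my own illustration, not part of the API):&lt;/p&gt;

```python
# Hypothetical helper: recover the ordered stage list from a feature-chain
# string, where each "__" marks a stage boundary.
def stages(feature):
    return feature.split("__")

full = "docs__pii_redacted__chunked__deduped__embedded"
print(stages(full))

# Stopping the pipeline early is just dropping trailing segments:
# the truncated chain is a prefix of the full stage list.
print(stages("docs__pii_redacted__chunked"))
```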

&lt;p&gt;Each stage is a self-contained plugin. Here's what debugging looked like:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Inspect chunking output directly. Sentence chunks averaged 45 tokens vs. 512 for fixed-size. Looked reasonable. Not the problem.&lt;/p&gt;
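
&lt;p&gt;Inspecting a stage in isolation can be as simple as computing token statistics over its output. A toy sketch of the kind of check I mean (a naive regex sentence splitter and whitespace tokens stand in for the real chunker and tokenizer):&lt;/p&gt;

```python
import re
import statistics

def sentence_chunks(text):
    # Naive splitter on sentence-ending punctuation; real pipelines use better.
    parts = re.split(r"[.!?]+\s+", text)
    return [p.strip() for p in parts if p.strip()]

def avg_tokens(chunks):
    # Whitespace tokens as a cheap proxy for real tokenizer counts.
    return statistics.mean(len(c.split()) for c in chunks)

doc = "First sentence here. Second one is a bit longer than the first. Third."
chunks = sentence_chunks(doc)
print(len(chunks), round(avg_tokens(chunks), 1))
```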

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Check dedup. Shorter chunks meant more near-duplicates. Exact hash dedup only catches identical chunks, so near-duplicates passed through.&lt;/p&gt;
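
&lt;p&gt;This failure mode is easy to reproduce outside the pipeline: two chunks that differ by one word hash differently, yet overlap almost completely at the n-gram level. A sketch (md5 stands in for exact-hash dedup, word-trigram Jaccard for the near-duplicate signal; the threshold is illustrative):&lt;/p&gt;

```python
import hashlib

def exact_key(chunk):
    # Exact-hash dedup: only byte-identical chunks collide.
    return hashlib.md5(chunk.encode()).hexdigest()

def ngrams(chunk, n=3):
    toks = chunk.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a, b):
    ga, gb = ngrams(a), ngrams(b)
    return len(ga.intersection(gb)) / len(ga.union(gb))

a = "the model retrieves the top ten passages for each query"
b = "the model retrieves the top ten passages for every query"

print(exact_key(a) == exact_key(b))  # exact hash misses the near-duplicate
print(round(jaccard(a, b), 2))       # but trigram overlap is high
```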

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Swap dedup method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rag_integration.feature_groups.rag_pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NgramDeduplicator&lt;/span&gt;

&lt;span class="n"&gt;providers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;...,&lt;/span&gt;
    &lt;span class="n"&gt;NgramDeduplicator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# was ExactHashDeduplicator
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlodaAPI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs__pii_redacted__chunked__deduped__embedded__evaluation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;compute_frameworks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PythonDictFramework&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;plugin_collector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PluginCollector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enabled_feature_groups&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;providers&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Recall went back to 0.81
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The root cause was never the chunker. The chunker's output exposed a weakness in the downstream dedup stage.&lt;/p&gt;

&lt;h2&gt;Why this matters&lt;/h2&gt;

&lt;p&gt;RAG pipelines are a chain of dependent stages. Changing one stage can break a different stage for reasons that are invisible in end-to-end metrics.&lt;/p&gt;

&lt;p&gt;Stage-by-stage eval turns debugging from "something is wrong somewhere" into "this specific stage degrades here."&lt;/p&gt;

&lt;h2&gt;What the pipeline supports&lt;/h2&gt;

&lt;p&gt;The pipeline makes every stage swappable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Options&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PII redaction&lt;/td&gt;
&lt;td&gt;regex, presidio, custom patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunking&lt;/td&gt;
&lt;td&gt;fixed-size, sentence, paragraph, semantic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deduplication&lt;/td&gt;
&lt;td&gt;exact hash, normalized, n-gram&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding&lt;/td&gt;
&lt;td&gt;TF-IDF, sentence-transformers, hash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector index&lt;/td&gt;
&lt;td&gt;FAISS flat, IVF, HNSW&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Built-in evaluation metrics: Recall@K, Precision, NDCG, and MAP, computed against BEIR benchmarks.&lt;/p&gt;
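
&lt;p&gt;For reference, Recall@K is just the fraction of the relevant documents that appear in the top K retrieved results. A minimal sketch of the metric itself (the doc IDs are made up, and this is my own illustration, not the pipeline's implementation):&lt;/p&gt;

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of the relevant set found among the top-k retrieved IDs.
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]).intersection(relevant))
    return hits / len(relevant)

retrieved = ["d3", "d7", "d1", "d9", "d2"]  # ranked retrieval output
relevant = {"d1", "d2", "d4"}               # ground-truth relevant docs
print(recall_at_k(retrieved, relevant, 5))  # 2 of 3 relevant docs in top 5
```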

&lt;p&gt;There's also an image pipeline with the same structure: PII redaction (blur/pixelate/fill), perceptual hash dedup, and CLIP embeddings.&lt;/p&gt;
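
&lt;p&gt;Perceptual hashing deserves a line of intuition: instead of hashing bytes, you hash coarse visual structure, so near-identical images land a small Hamming distance apart. A pure-Python average-hash sketch on toy grayscale grids (real pipelines would use a library such as imagehash on downscaled images):&lt;/p&gt;

```python
def average_hash(pixels):
    # Average hash: one bit per pixel, set when the pixel is above the mean.
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(h1, h2):
    # Number of differing bits between two hashes of equal length.
    return sum(1 for a, b in zip(h1, h2) if a != b)

img = [[10, 200], [220, 30]]      # toy 2x2 grayscale "image"
almost = [[12, 198], [221, 29]]   # same image with tiny pixel noise
print(hamming(average_hash(img), average_hash(almost)))  # 0: same structure
```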

&lt;h2&gt;Try it&lt;/h2&gt;

&lt;p&gt;Not everything described here works yet, but most of it does. We are still gauging whether this direction is interesting or not worth pursuing.&lt;/p&gt;

&lt;p&gt;Open source under Apache 2.0: &lt;a href="https://github.com/mloda-ai/rag_integration" rel="noopener noreferrer"&gt;https://github.com/mloda-ai/rag_integration&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've hit the "swap one component, break something else" problem in your own pipelines, I'd be curious to hear how you approached debugging it.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>python</category>
      <category>rag</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
