<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aditya Raut</title>
    <description>The latest articles on DEV Community by Aditya Raut (@rautaditya2606).</description>
    <link>https://dev.to/rautaditya2606</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2747140%2F696730cd-b32e-4fc6-ad1b-d6c8e7bf9df7.png</url>
      <title>DEV Community: Aditya Raut</title>
      <link>https://dev.to/rautaditya2606</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rautaditya2606"/>
    <language>en</language>
    <item>
      <title>I Got Tired of Debugging Haystack RAG Pipelines Blind, So I Built a Diagnostics Engine</title>
      <dc:creator>Aditya Raut</dc:creator>
      <pubDate>Sat, 27 Jun 2026 11:40:13 +0000</pubDate>
      <link>https://dev.to/rautaditya2606/i-got-tired-of-debugging-haystack-rag-pipelines-blind-so-i-built-a-diagnostics-engine-2g0o</link>
      <guid>https://dev.to/rautaditya2606/i-got-tired-of-debugging-haystack-rag-pipelines-blind-so-i-built-a-diagnostics-engine-2g0o</guid>
      <description>&lt;p&gt;RAG pipelines fail in quiet ways.&lt;/p&gt;

&lt;p&gt;Retrieval drops. Documents go missing. Metadata gets corrupted somewhere between ingestion and query time. Your generator starts hallucinating and you don't know if it's the retriever, the document store, or something upstream.&lt;/p&gt;

&lt;p&gt;The debugging loop is always the same: check traces, grep logs, write a one-off script to inspect the document store, try to diff two runs manually. It works, but it's slow and it doesn't scale.&lt;/p&gt;

&lt;p&gt;I hit this enough times while working on a Haystack 2.x pipeline at my internship that I started building something to systematize it.&lt;/p&gt;

&lt;p&gt;That became &lt;a href="https://github.com/rautaditya2606/haystack-diagnostics" rel="noopener noreferrer"&gt;Haystack Diagnostics Engine&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;What it actually does&lt;/h2&gt;

&lt;p&gt;Four things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document store validation&lt;/strong&gt; — checks your vector store for duplicate chunks, missing metadata fields, and short/malformed documents before they silently degrade retrieval quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipeline introspection&lt;/strong&gt; — inspects your Haystack pipeline structure, flags misconfigurations, and can visualize the component graph. Useful when you're inheriting a pipeline someone else built.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval failure classification&lt;/strong&gt; — when a query returns garbage, this tells you &lt;em&gt;why&lt;/em&gt;. Six failure classes: empty results, low-score results, metadata filter mismatch, reranker collapse, score inversion, and retriever timeout. Each has a different fix. Uses Haystack's &lt;code&gt;include_outputs_from&lt;/code&gt; for single-pass retriever/reranker diagnostics without re-running the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debug bundle capture and diffing&lt;/strong&gt; — this is the one I've gotten the most feedback on.&lt;/p&gt;




&lt;h2&gt;Debug bundles: the part that actually changed my workflow&lt;/h2&gt;

&lt;p&gt;The typical production debugging scenario: something worked last week, it doesn't work now, and you have no idea what changed.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;collect_debug_bundle(pipeline, query, ...)&lt;/code&gt; captures the full state of a single query execution as a structured JSON file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline graph, Haystack version, component &lt;code&gt;init_parameters&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Raw retriever top-k (pre-reranker) — scores, metadata, content previews&lt;/li&gt;
&lt;li&gt;Reranked top-k when a reranker is detected&lt;/li&gt;
&lt;li&gt;Prompt snapshot and generated answer&lt;/li&gt;
&lt;li&gt;Failure classification result&lt;/li&gt;
&lt;li&gt;Corpus health checks scoped to only the retrieved document IDs, not the full store&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bundle filenames follow the pattern &lt;code&gt;{query_slug}_{timestamp}.json&lt;/code&gt;, so they are human-readable and sort naturally across runs of the same query. The UUID lives inside the JSON, not in the filename.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;diff_debug_bundles(bundle_a, bundle_b)&lt;/code&gt; compares two persisted bundles and reports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Score deltas per document&lt;/li&gt;
&lt;li&gt;Docs that appeared or disappeared between runs&lt;/li&gt;
&lt;li&gt;Component config changes between the two pipeline states&lt;/li&gt;
&lt;li&gt;Character-level answer diff via &lt;code&gt;difflib&lt;/code&gt; (no tokenizer dependency)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Config diffs filter out known volatile fields by default — things like &lt;code&gt;InMemoryDocumentStore.index&lt;/code&gt;, which regenerates as a random UUID on instantiation and would create false positives on every diff. You can pass &lt;code&gt;ignore_config_paths=set()&lt;/code&gt; to disable filtering or extend the defaults with component-specific paths like &lt;code&gt;"retriever.session_id"&lt;/code&gt;.&lt;/p&gt;
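
&lt;p&gt;To make the idea of path-based filtering concrete, here is a minimal sketch. It is not the project's actual implementation: the helper names &lt;code&gt;flatten&lt;/code&gt; and &lt;code&gt;diff_configs&lt;/code&gt;, the &lt;code&gt;ignore_paths&lt;/code&gt; argument, and the sample configs are all assumptions made for illustration. It only shows how dotted paths like &lt;code&gt;"retriever.session_id"&lt;/code&gt; could be used to skip volatile fields.&lt;/p&gt;

```python
from typing import Any


def flatten(config: dict, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into {"a.b.c": value} pairs."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(old: dict, new: dict, ignore_paths: set[str]) -> dict[str, tuple]:
    """Return {path: (old_value, new_value)} for every changed, non-ignored path."""
    flat_old, flat_new = flatten(old), flatten(new)
    changes = {}
    for path in sorted(flat_old.keys() | flat_new.keys()):
        if path in ignore_paths:
            continue  # volatile field, skip to avoid a false positive
        before, after = flat_old.get(path), flat_new.get(path)
        if before != after:
            changes[path] = (before, after)
    return changes


# Hypothetical configs from two runs: top_k changed and a volatile id regenerated.
run_a = {"retriever": {"top_k": 5, "session_id": "abc"}}
run_b = {"retriever": {"top_k": 10, "session_id": "xyz"}}

changes = diff_configs(run_a, run_b, ignore_paths={"retriever.session_id"})
print(changes)
```

&lt;p&gt;With the volatile path ignored, only the &lt;code&gt;top_k&lt;/code&gt; change is reported. Passing an empty set reports both differences.&lt;/p&gt;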

&lt;p&gt;There's also a CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; diagnostics.debug_bundler diff bundle_a.json bundle_b.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The workflow this enables: run a query, persist the bundle, deploy a change, run the same query again, diff the two bundles. You get an exact record of what shifted — scores, docs, config, answer — without relying on memory or logs.&lt;/p&gt;




&lt;h2&gt;What it found on a real deployment&lt;/h2&gt;

&lt;p&gt;I ran the validator against a live Weaviate-backed RAG instance with 823 chunks and OpenAI embeddings.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;195 duplicate chunks (23.7% of the corpus)&lt;/li&gt;
&lt;li&gt;14 documents missing required metadata keys&lt;/li&gt;
&lt;li&gt;8 anomalous short chunks under the minimum threshold&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these were obvious from the outside. The pipeline was running, queries were returning results, everything looked fine. The duplicates were inflating retrieval scores for certain topics. The missing metadata was breaking a filter that wasn't catching the error gracefully.&lt;/p&gt;
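
&lt;p&gt;The three checks behind those numbers can be approximated in a few lines. The sketch below is an illustration under stated assumptions, not the engine's code: documents are plain dicts with &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;content&lt;/code&gt;, and &lt;code&gt;meta&lt;/code&gt; keys, duplicates are exact matches after whitespace and case normalization, and the function name &lt;code&gt;validate_documents&lt;/code&gt; and the &lt;code&gt;min_chars&lt;/code&gt; threshold are hypothetical.&lt;/p&gt;

```python
import hashlib


def validate_documents(docs, required_meta, min_chars=50):
    """Flag exact-duplicate chunks, missing metadata keys, and very short chunks.

    `docs` is a list of {"id", "content", "meta"} dicts (an assumed shape).
    """
    seen = {}  # content hash -> id of the first document with that content
    report = {"duplicates": [], "missing_meta": [], "too_short": []}
    for doc in docs:
        # Normalize whitespace and case so trivially different copies still match.
        text = " ".join(doc["content"].split()).lower()
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            report["duplicates"].append((doc["id"], seen[digest]))
        else:
            seen[digest] = doc["id"]
        missing = [key for key in required_meta if key not in doc.get("meta", {})]
        if missing:
            report["missing_meta"].append((doc["id"], missing))
        if min_chars > len(text):
            report["too_short"].append(doc["id"])
    return report


docs = [
    {"id": "a", "content": "Haystack pipelines connect components.", "meta": {"source": "docs"}},
    {"id": "b", "content": "Haystack  pipelines connect components.", "meta": {}},
    {"id": "c", "content": "ok", "meta": {"source": "faq"}},
]
report = validate_documents(docs, required_meta=["source"], min_chars=10)
print(report)
```

&lt;p&gt;Hashing normalized text only catches exact copies. Near-duplicates would need a similarity measure, which this sketch deliberately leaves out.&lt;/p&gt;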

&lt;p&gt;On the performance side, the MCP server benchmarks at ~0.95s for a batch of 15 concurrent graph-inspection requests.&lt;/p&gt;




&lt;h2&gt;Why MCP&lt;/h2&gt;

&lt;p&gt;I wanted this to be composable, not just another CLI tool you run once and forget.&lt;/p&gt;

&lt;p&gt;Wrapping it as an MCP server means you can call &lt;code&gt;validate_document_store&lt;/code&gt;, &lt;code&gt;inspect_pipeline&lt;/code&gt;, &lt;code&gt;diagnose_retrieval_failure&lt;/code&gt;, or &lt;code&gt;collect_debug_bundle&lt;/code&gt; directly from Claude Desktop or any MCP-compatible client during a debugging session. The context stays in one place instead of jumping between terminals and notebooks.&lt;/p&gt;




&lt;h2&gt;Current state&lt;/h2&gt;

&lt;p&gt;Weaviate support is solid. Qdrant and Pinecone support is in progress. The project has been cloned by 90+ developers since I published it, which was surprising for something this niche.&lt;/p&gt;

&lt;p&gt;If you're using a different document store and want to add a backend, contributions are open.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/rautaditya2606/haystack-diagnostics" rel="noopener noreferrer"&gt;https://github.com/rautaditya2606/haystack-diagnostics&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feedback welcome, especially if you hit a retrieval failure mode the engine doesn't classify correctly yet.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>python</category>
      <category>opensource</category>
      <category>genai</category>
    </item>
    <item>
      <title>Building a Voice-Controlled Local AI Agent on a 4GB GPU</title>
      <dc:creator>Aditya Raut</dc:creator>
      <pubDate>Sun, 12 Apr 2026 20:57:55 +0000</pubDate>
      <link>https://dev.to/rautaditya2606/building-a-voice-controlled-local-ai-agent-on-a-4gb-gpu-emc</link>
      <guid>https://dev.to/rautaditya2606/building-a-voice-controlled-local-ai-agent-on-a-4gb-gpu-emc</guid>
      <description>&lt;p&gt;&lt;strong&gt;What I Built&lt;/strong&gt;&lt;br&gt;
I built a voice-controlled local AI agent that transcribes &lt;br&gt;
audio, classifies intent, and executes local tools — all &lt;br&gt;
visible through a transparent pipeline trace in a Gradio UI.&lt;br&gt;
The agent supports four intents: create file, write code, &lt;br&gt;
summarize text, and general chat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;br&gt;
STT layer: Groq Whisper-large-v3 handles transcription via API. I chose Groq over local Whisper because my RTX 3050 (4GB VRAM) cannot run STT and an LLM simultaneously without OOM errors. At ~300ms, Groq's API is also faster than local whisper-small would have been.&lt;/p&gt;

&lt;p&gt;Intent layer: Ollama serves qwen2.5-coder:1.5b locally. The LLM returns a structured JSON intent that the tool router uses to decide which action to take.&lt;/p&gt;

&lt;p&gt;Tool layer: four tools (create_file, write_code, summarize, general_chat). All file writes are sandboxed to output/.&lt;/p&gt;
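
&lt;p&gt;One way to enforce that sandbox is sketched below. This is a minimal example assuming &lt;code&gt;pathlib&lt;/code&gt;; the names &lt;code&gt;OUTPUT_DIR&lt;/code&gt; and &lt;code&gt;safe_output_path&lt;/code&gt; are hypothetical and not taken from the project.&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()  # assumed sandbox root


def safe_output_path(user_path: str) -> Path:
    """Resolve user_path under OUTPUT_DIR, rejecting anything that escapes it."""
    candidate = (OUTPUT_DIR / user_path).resolve()
    try:
        # relative_to raises ValueError when candidate is not inside OUTPUT_DIR.
        candidate.relative_to(OUTPUT_DIR)
    except ValueError:
        raise ValueError(f"path escapes sandbox: {user_path!r}") from None
    return candidate


print(safe_output_path("notes/todo.txt"))

try:
    safe_output_path("../../etc/passwd")
except ValueError as err:
    print(err)
```

&lt;p&gt;Comparing whole path components rather than raw string prefixes also rejects sibling directories such as output_backup/, which a plain prefix check on the string would let through.&lt;/p&gt;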

&lt;p&gt;UI layer: Gradio displays transcription, detected intent, action taken, and a full pipeline trace with per-stage latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware Constraints and Decisions&lt;/strong&gt;&lt;br&gt;
My machine: Intel i5-12500H, RTX 3050 (4GB VRAM), 15GB RAM.&lt;/p&gt;

&lt;p&gt;The core constraint: 4GB VRAM cannot hold both a Whisper model and an LLM simultaneously.&lt;/p&gt;

&lt;p&gt;Decision 1: STT via Groq API&lt;br&gt;
Running whisper-small locally uses ~1.5GB VRAM. That leaves only 2.5GB for the LLM, which isn't enough for a useful model. Offloading STT to Groq frees the entire 4GB for the LLM and actually improves latency.&lt;/p&gt;

&lt;p&gt;Decision 2: qwen2.5-coder:1.5b via Ollama&lt;br&gt;
A 1.5B model at Q4 quantization fits comfortably in ~1.5GB VRAM. I initially tried the 7b variant, but it exceeded available VRAM and caused Ollama to offload to RAM, significantly slowing inference.&lt;/p&gt;

&lt;p&gt;Decision 3: Sequential pipeline&lt;br&gt;
STT completes before Ollama is called. This keeps peak VRAM usage under 2GB at any given time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenges I Faced&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;VRAM management&lt;br&gt;
Loading two models simultaneously caused OOM errors. Solved by switching STT to Groq and keeping only the LLM local.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Intent JSON parsing&lt;br&gt;
Ollama sometimes returns malformed JSON or wraps it in markdown code fences. Solved with a robust parser that strips fences and falls back to keyword matching if JSON parsing fails entirely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Output sandboxing&lt;br&gt;
Naive file creation allowed path traversal (e.g. ../../etc/passwd). Solved with path normalization and checking that the resolved path starts with the output/ directory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gradio mic input format&lt;br&gt;
Gradio returns audio as a tuple (sample_rate, numpy_array), not a file path. I had to write it to a temp file before passing it to the Groq API.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
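
&lt;p&gt;The fence-stripping parser with a keyword fallback from the second challenge could look roughly like this. It is a sketch, not the project's parser: the &lt;code&gt;KEYWORDS&lt;/code&gt; table, the &lt;code&gt;parse_intent&lt;/code&gt; name, and the &lt;code&gt;"intent"&lt;/code&gt; and &lt;code&gt;"source"&lt;/code&gt; keys are assumptions chosen for the example.&lt;/p&gt;

```python
import json
import re

# Hypothetical keyword cues for the fallback; the real project may use different ones.
KEYWORDS = {
    "create_file": ("create", "new file"),
    "write_code": ("code", "function", "script"),
    "summarize": ("summarize", "summary"),
}

# Matches an opening fence (optionally tagged, e.g. json) or a closing fence.
FENCE = re.compile(r"^`{3}[a-zA-Z]*\s*|\s*`{3}$")


def parse_intent(raw: str) -> dict:
    """Extract an intent dict from LLM output, tolerating fences and bad JSON."""
    cleaned = FENCE.sub("", raw.strip())
    try:
        parsed = json.loads(cleaned)
        if isinstance(parsed, dict) and "intent" in parsed:
            return parsed
    except json.JSONDecodeError:
        pass  # fall through to keyword matching
    lowered = raw.lower()
    for intent, cues in KEYWORDS.items():
        if any(cue in lowered for cue in cues):
            return {"intent": intent, "source": "keyword_fallback"}
    return {"intent": "general_chat", "source": "keyword_fallback"}


fence = "`" * 3  # built at runtime so the example contains no literal fence
fenced = f'{fence}json\n{{"intent": "write_code"}}\n{fence}'
print(parse_intent(fenced))
print(parse_intent("Sure, I will summarize that text"))
```

&lt;p&gt;Defaulting to general_chat when nothing matches keeps the pipeline moving instead of failing on an unparseable response.&lt;/p&gt;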

&lt;p&gt;&lt;strong&gt;What I'd Do Differently at Scale&lt;/strong&gt;&lt;br&gt;
For a production version of this system, I would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace Ollama with Triton Inference Server for proper 
model serving with batching and metrics endpoints.&lt;/li&gt;
&lt;li&gt;Add a message queue (Redis or RabbitMQ) between the UI 
and pipeline so multiple users don't block each other.&lt;/li&gt;
&lt;li&gt;Replace the flat logger with structured JSON logs shipped 
to an observability stack (Grafana + Loki).&lt;/li&gt;
&lt;li&gt;Add model versioning — config.yaml currently hardcodes 
model names. A proper MLOps setup uses a model registry.&lt;/li&gt;
&lt;li&gt;Containerize STT locally using a sidecar so the pipeline 
has no external API dependency in production.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Model Benchmarking&lt;/h2&gt;

&lt;p&gt;I added a benchmarking tab: set the models, prompt, and number of iterations, and get a latency table back.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5-coder:1.5b&lt;/td&gt;
&lt;td&gt;~3.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5-coder:7b&lt;/td&gt;
&lt;td&gt;~11.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For structured JSON intent extraction, the 1.5b model is 3-4x faster with no meaningful accuracy difference. For a constrained task like this, bigger isn't better.&lt;/p&gt;
&lt;h2&gt;Persistent Memory&lt;/h2&gt;

&lt;p&gt;Every pipeline run is stored in SQLite: transcription, intent, action, output, and trace. The history surfaces in the UI as a recent-runs panel.&lt;/p&gt;

&lt;p&gt;This matters for two reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Debugging&lt;/strong&gt; — if intent classification goes wrong, you
can see exactly what transcription and JSON the LLM
returned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditability&lt;/strong&gt; — every file written has a corresponding
memory entry with the voice command that triggered it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simple schema, append-only, no ORM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="n"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;EXISTS&lt;/span&gt; &lt;span class="nf"&gt;runs &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;        &lt;span class="n"&gt;INTEGER&lt;/span&gt; &lt;span class="n"&gt;PRIMARY&lt;/span&gt; &lt;span class="n"&gt;KEY&lt;/span&gt; &lt;span class="n"&gt;AUTOINCREMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt;    &lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;    &lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;    &lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trace&lt;/span&gt;     &lt;span class="n"&gt;TEXT&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
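
&lt;p&gt;Appending to and reading from that table needs nothing beyond the standard library. The sketch below reuses the schema above with Python's &lt;code&gt;sqlite3&lt;/code&gt; module. The helper names &lt;code&gt;log_run&lt;/code&gt; and &lt;code&gt;recent_runs&lt;/code&gt;, the choice to store the trace as a JSON string, and the sample rows are assumptions for illustration.&lt;/p&gt;

```python
import json
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp  TEXT,
    transcript TEXT,
    intent     TEXT,
    action     TEXT,
    output     TEXT,
    trace      TEXT
)
"""


def log_run(conn, transcript, intent, action, output, trace):
    """Append one pipeline run; the trace dict is stored as a JSON string."""
    conn.execute(
        "INSERT INTO runs (timestamp, transcript, intent, action, output, trace) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            transcript,
            intent,
            action,
            output,
            json.dumps(trace),
        ),
    )
    conn.commit()


def recent_runs(conn, limit=5):
    """Return the newest runs first, as (id, transcript, intent) tuples."""
    rows = conn.execute(
        "SELECT id, transcript, intent FROM runs ORDER BY id DESC LIMIT ?", (limit,)
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database, so the example leaves no file behind
conn.execute(SCHEMA)
log_run(conn, "create a file called notes", "create_file", "wrote output/notes.txt", "ok", {"stt_ms": 300})
log_run(conn, "hello", "general_chat", "replied", "hi", {"stt_ms": 280})
print(recent_runs(conn))
```

&lt;p&gt;Parameterized queries keep transcribed speech from being interpreted as SQL, which matters when the input is free-form voice commands.&lt;/p&gt;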



&lt;p&gt;&lt;strong&gt;Links&lt;/strong&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/rautaditya2606/Aditya_Raut_Mem0_AI" rel="noopener noreferrer"&gt;https://github.com/rautaditya2606/Aditya_Raut_Mem0_AI&lt;/a&gt;&lt;br&gt;
Demo: &lt;a href="https://youtu.be/rhGIQvi4Y74" rel="noopener noreferrer"&gt;https://youtu.be/rhGIQvi4Y74&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
