I Got Tired of Debugging Haystack RAG Pipelines Blind, So I Built a Diagnostics Engine

#rag #python #opensource #genai

RAG pipelines fail in quiet ways.

Retrieval drops. Documents go missing. Metadata gets corrupted somewhere between ingestion and query time. Your generator starts hallucinating and you don't know if it's the retriever, the document store, or something upstream.

The debugging loop is always the same: check traces, grep logs, write a one-off script to inspect the document store, try to diff two runs manually. It works, but it's slow and it doesn't scale.

I hit this enough times while working on a Haystack 2.x pipeline at my internship that I started building something to systematize it.

That became Haystack Diagnostics Engine.

What it actually does

Four things:

Document store validation — checks your vector store for duplicate chunks, missing metadata fields, and short/malformed documents before they silently degrade retrieval quality.

Pipeline introspection — inspects your Haystack pipeline structure, flags misconfigurations, and can visualize the component graph. Useful when you're inheriting a pipeline someone else built.

Retrieval failure classification — when a query returns garbage, this tells you why. Six failure classes: empty results, low-score results, metadata filter mismatch, reranker collapse, score inversion, and retriever timeout. Each has a different fix. Uses Haystack's include_outputs_from for single-pass retriever/reranker diagnostics without re-running the pipeline.

Debug bundle capture and diffing — this is the one I've gotten the most feedback on.

Debug bundles: the part that actually changed my workflow

The typical production debugging scenario: something worked last week, it doesn't work now, and you have no idea what changed.

collect_debug_bundle(pipeline, query, ...) captures the full state of a single query execution as a structured JSON file:

Pipeline graph, Haystack version, component init_parameters
Raw retriever top-k (pre-reranker) — scores, metadata, content previews
Reranked top-k when a reranker is detected
Prompt snapshot and generated answer
Failure classification result
Corpus health checks scoped to only the retrieved document IDs, not the full store

Bundle filenames are {query_slug}_{timestamp}.json — human-readable, sort naturally across runs of the same query. The UUID lives inside the JSON, not in the filename.

diff_debug_bundles(bundle_a, bundle_b) compares two persisted bundles and reports:

Score deltas per document
Docs that appeared or disappeared between runs
Component config changes between the two pipeline states
Character-level answer diff via difflib (no tokenizer dependency)

Config diffs filter out known volatile fields by default — things like InMemoryDocumentStore.index, which regenerates as a random UUID on instantiation and would create false positives on every diff. You can pass ignore_config_paths=set() to disable filtering or extend the defaults with component-specific paths like "retriever.session_id".

There's also a CLI:

python -m diagnostics.debug_bundler diff bundle_a.json bundle_b.json

The workflow this enables: run a query, persist the bundle, deploy a change, run the same query again, diff the two bundles. You get an exact record of what shifted — scores, docs, config, answer — without relying on memory or logs.

What it found on a real deployment

I ran the validator against a live Weaviate-backed RAG instance with 823 chunks and OpenAI embeddings.

195 duplicate chunks (23.7% of the corpus)
14 documents missing required metadata keys
8 anomalous short chunks under the minimum threshold

None of these were obvious from the outside. The pipeline was running, queries were returning results, everything looked fine. The duplicates were inflating retrieval scores for certain topics. The missing metadata was breaking a filter that wasn't catching the error gracefully.

The MCP server benchmarks at ~0.95s for 15 concurrent graph-inspection requests.

Why MCP

I wanted this to be composable, not just another CLI tool you run once and forget.

Wrapping it as an MCP server means you can call validate_document_store, inspect_pipeline, diagnose_retrieval_failure, or collect_debug_bundle directly from Claude Desktop or any MCP-compatible client during a debugging session. The context stays in one place instead of jumping between terminals and notebooks.

Current state

Weaviate support is solid. Qdrant and Pinecone support is in progress. The project has been cloned by 90+ developers since I published it, which was surprising for something this niche.

If you're using a different document store and want to add a backend, contributions are open.

GitHub: https://github.com/rautaditya2606/haystack-diagnostics

Feedback welcome, especially if you hit a retrieval failure mode the engine doesn't classify correctly yet.