## The problem with end-to-end RAG eval
I had a working document retrieval pipeline. Fixed-size chunking, TF-IDF embeddings, FAISS index. Recall@10 was 0.82 on SciFact. Good enough.
Then I made one change: I swapped fixed-size chunking for sentence-based chunking. Recall dropped to 0.68.
My first instinct was to roll back. But I wanted to understand why. End-to-end eval only told me "retrieval is worse." It couldn't tell me which stage was responsible.
## The debugging approach
I restructured the pipeline so each stage can be evaluated independently. The pipeline is expressed as a string feature chain:
```python
from mloda.user import mlodaAPI, PluginCollector

# The full pipeline: each __ is a stage boundary
results = mlodaAPI.run_all(
    features=["docs__pii_redacted__chunked__deduped__embedded"],
    ...
)
```
Stop at chunking? `"docs__pii_redacted__chunked"`.
Skip dedup? `"docs__pii_redacted__chunked__embedded"`.
Add evaluation? `"docs__pii_redacted__chunked__deduped__embedded__evaluation"`.
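The truncation trick above follows directly from the string layout. As a rough mental model (these helper names are mine, not mloda's internals), the chain can be parsed and cut like this:

```python
def parse_feature_chain(feature: str) -> list[str]:
    """Split a feature string like "docs__pii_redacted__chunked"
    into its ordered pipeline stages."""
    return feature.split("__")

def truncate_chain(feature: str, last_stage: str) -> str:
    """Return the chain cut off after `last_stage`, i.e. the feature
    string that runs the pipeline only up to that stage."""
    stages = parse_feature_chain(feature)
    return "__".join(stages[: stages.index(last_stage) + 1])

full = "docs__pii_redacted__chunked__deduped__embedded"
print(truncate_chain(full, "chunked"))  # docs__pii_redacted__chunked
```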
Each stage is a self-contained plugin. Here's what debugging looked like:
**Step 1:** Inspect chunking output directly. Sentence chunks averaged 45 tokens vs. 512 for fixed-size. Looked reasonable. Not the problem.
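The inspection itself is cheap once the stage output is available on its own. A minimal sketch, assuming a naive regex sentence splitter and whitespace tokens (the real chunker and tokenizer are likely more robust):

```python
import re

def sentence_chunks(text: str) -> list[str]:
    """Naive sentence-based chunking: split on sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def avg_tokens(chunks: list[str]) -> float:
    """Average chunk length in whitespace tokens."""
    return sum(len(c.split()) for c in chunks) / len(chunks)

doc = "Retrieval works well. Chunking changes everything. Metrics can mislead you."
chunks = sentence_chunks(doc)
print(chunks)              # one chunk per sentence
print(avg_tokens(chunks))  # average token count per chunk
```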
**Step 2:** Check dedup. Shorter chunks meant more near-duplicates, and exact hash dedup only catches identical chunks, so near-duplicates passed through.
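The failure mode is easy to reproduce in isolation. A minimal sketch of exact-hash dedup (not the library's `ExactHashDeduplicator`, just the general technique) showing a near-duplicate slipping through:

```python
import hashlib

def exact_hash_dedup(chunks: list[str]) -> list[str]:
    """Keep only the first chunk per exact content hash."""
    seen, kept = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept

chunks = [
    "The model was trained on SciFact.",
    "The model was trained on SciFact.",  # exact duplicate: removed
    "The model was trained on SciFact!",  # near-duplicate: survives
]
print(exact_hash_dedup(chunks))  # 2 chunks remain, near-duplicate included
```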
**Step 3:** Swap the dedup method.
```python
from rag_integration.feature_groups.rag_pipeline import NgramDeduplicator

providers = {
    ...,
    NgramDeduplicator,  # was ExactHashDeduplicator
}

results = mlodaAPI.run_all(
    features=["docs__pii_redacted__chunked__deduped__embedded__evaluation"],
    compute_frameworks={PythonDictFramework},
    plugin_collector=PluginCollector.enabled_feature_groups(providers),
)
# Recall went back to 0.81
```
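The exact algorithm behind `NgramDeduplicator` isn't shown here; a common way to implement n-gram dedup is word-shingle sets compared by Jaccard similarity against a threshold. A sketch under that assumption:

```python
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Word n-gram shingles of a chunk, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def ngram_dedup(chunks: list[str], threshold: float = 0.7) -> list[str]:
    """Drop any chunk whose n-gram Jaccard similarity to an
    already-kept chunk meets or exceeds the threshold."""
    kept: list[str] = []
    for chunk in chunks:
        shingles = ngrams(chunk)
        if all(jaccard(shingles, ngrams(k)) < threshold for k in kept):
            kept.append(chunk)
    return kept

chunks = [
    "the model was trained on the scifact corpus for retrieval",
    "the model was trained on the scifact corpus for retrieval tasks",
    "chunk boundaries change which duplicates survive dedup",
]
print(ngram_dedup(chunks))  # the near-duplicate second chunk is dropped
```

Note the quadratic comparison loop: a production implementation would likely use MinHash or LSH to scale, but the principle is the same.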
The root cause was never the chunker. The chunker's output exposed a weakness in the downstream dedup stage.
## Why this matters
RAG pipelines are a chain of dependent stages. Changing one stage can break a different stage for reasons that are invisible in end-to-end metrics.
Stage-by-stage eval turns debugging from "something is wrong somewhere" into "this specific stage degrades here."
## What the pipeline supports
The pipeline makes every stage swappable:
| Stage | Options |
|---|---|
| PII redaction | regex, presidio, custom patterns |
| Chunking | fixed-size, sentence, paragraph, semantic |
| Deduplication | exact hash, normalized, n-gram |
| Embedding | TF-IDF, sentence-transformers, hash |
| Vector index | FAISS flat, IVF, HNSW |
Built-in eval metrics (Recall@K, Precision, NDCG, MAP) are computed against BEIR benchmarks.
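For reference, two of these metrics are simple enough to sketch inline; this is the standard textbook formulation, not the library's specific implementation:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def average_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Mean of precision values at each rank where a relevant doc appears
    (MAP is this averaged over queries)."""
    hits, score = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, 2))        # 0.5
print(average_precision(retrieved, relevant))     # 0.5
```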
There's also an image pipeline with the same structure: PII redaction (blur/pixelate/fill), perceptual hash dedup, and CLIP embeddings.
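Perceptual hash dedup works on the same principle as n-gram dedup, but over pixels: near-identical images land within a small Hamming distance of each other. A toy average-hash (aHash) sketch, assuming the image has already been downscaled to a small grayscale grid (real pipelines use PIL/imagehash-style tooling):

```python
def average_hash(pixels: list[list[int]]) -> int:
    """aHash over a downscaled grayscale grid: each bit is 1 if the
    pixel is brighter than the grid's mean brightness."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

img_a = [[0, 0, 255, 255]] * 4   # half dark, half bright
img_b = [[0, 10, 250, 255]] * 4  # visually near-identical to img_a
img_c = [[255, 255, 0, 0]] * 4   # mirrored layout

print(hamming(average_hash(img_a), average_hash(img_b)))  # 0: near-duplicate
print(hamming(average_hash(img_a), average_hash(img_c)))  # 16: different
```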
## Try it
Not everything described here is fully working yet, but most of it is. We're still figuring out whether this approach is interesting enough to be worth reading and talking about.
Open source under Apache 2.0: https://github.com/mloda-ai/rag_integration
If you've hit the "swap one component, break something else" problem in your own pipelines, I'd be curious to hear how you approached debugging it.