Originally published on prodinit.com
Key Takeaways
- 80% of RAG failures trace back to the ingestion layer, not the LLM — fix chunking and indexing before tuning your prompts
- Chunk size alone can swing retrieval precision by 20–40%; there is no universal right answer, and the correct value depends on your document type and query pattern
- Adding a cross-encoder reranker on top of vector search typically lifts answer correctness by 15–25% with minimal latency cost
- Stale indexes are invisible in standard monitoring: a document updated 3 months ago may still be answering queries from its old content
- Teams without an eval loop discover regressions 4–8× slower than teams with automated retrieval quality checks running on every deployment
A RAG pipeline looks straightforward on paper: retrieve relevant chunks, stuff them into a prompt, get an answer. Teams wire it up in a weekend, the demo works, and they ship it. Then, weeks later, users start complaining that the system returns outdated information, misses obvious answers, or confidently cites the wrong document.
RAG pipeline debugging starts at the retrieval layer, not the LLM. The five failure modes that break production RAG systems — bad chunking, missing reranking, stale indexes, no hybrid retrieval, and no eval loop — are all fixable at the data and infrastructure layer. None require changing your model or rewriting your application.
Why RAG Fails Silently in Production
The LLM itself is almost always fine. The retrieval layer is what's broken — and most observability tooling points at the model, not the retriever. You can spend days tweaking system prompts and temperature settings while the root cause sits in how you chunked your documents three months ago. Production RAG failure leaves no stack trace.
There is no exception, no 500 error, no latency spike. The system continues to return answers. They are just wrong, or incomplete, or stale. Without an explicit eval loop tied to retrieval quality, you will not know until a user tells you.
This guide covers the five failure modes Prodinit encounters most often when auditing RAG systems in production, with diagnosis steps and fixes for each.
Failure Mode 1: Bad Chunking Strategy
Fixed-size character splitting destroys retrieval quality for anything beyond plain prose. A 512-token chunk of a legal contract may split a clause mid-sentence; a 512-token chunk of code may span four unrelated functions. Neither produces embeddings specific enough to surface the right document for a precise query.
Why it breaks
Chunking is the most consequential decision in a RAG pipeline and the one teams spend the least time on. The default in most frameworks is a fixed-size character or token split with a small overlap. This works in demos. In production, it destroys retrieval quality for anything that isn't plain prose.
The problem with fixed-size chunking:
- A 512-token chunk of a legal contract may split a clause mid-sentence, leaving neither chunk with enough context to be retrieved correctly
- A 512-token chunk of code may contain four unrelated functions, causing the entire chunk to match queries loosely but none of them precisely
- Tables, structured data, and numbered lists lose their semantics when split by character count
When your chunks are semantically incoherent, your embeddings are noisy. Noisy embeddings produce low-confidence nearest-neighbor results. The retriever returns tangentially related chunks, the LLM hallucinates to fill the gap, and the answer looks plausible but wrong.
Diagnosis
Check what your chunks actually look like:
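```python
import random


def audit_chunks(chunks: list[str], sample_size: int = 20) -> dict:
    """Summarise chunk sizes and flag chunks that end mid-sentence."""
    if not chunks:
        raise ValueError("chunks must not be empty")

    # Whitespace-separated words are used as a rough proxy for tokens.
    lengths = [len(c.split()) for c in chunks]
    sample = random.sample(chunks, min(sample_size, len(chunks)))
    # A sampled chunk counts as truncated when it does not end on sentence
    # punctuation, a closing code fence, or a closing brace.
    truncated = sum(
        1 for c in sample
        if not c.strip().endswith((".", "?", "!", "```", "}"))
    )
    return {
        "avg_tokens": sum(lengths) / len(lengths),
        "min_tokens": min(lengths),
        "max_tokens": max(lengths),
        "truncated_sentences": truncated,  # count within the sample
        "truncated_pct": round(truncated / len(sample) * 100, 1),
        "sample": sample[:3],
    }
```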
Red flags: truncated_sentences above 30% of the sampled chunks, average tokens below 100 or above 600, or chunks that end mid-code-block.
Fix
Switch to semantic chunking. For prose documents, split on sentence boundaries and merge until a semantic similarity threshold is crossed. For structured content, use document-aware splitters that respect headings, tables, and code blocks.
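```python
import tiktoken
# Newer LangChain releases ship these classes in the langchain_text_splitters package.
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Assumption: cl100k_base approximates your embedding model's tokenizer.
_encoding = tiktoken.get_encoding("cl100k_base")


def token_length(text: str) -> int:
    return len(_encoding.encode(text))


# Document-aware splitter that respects structure:
# tries paragraph breaks first, then lines, then sentences, then words
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""],
    chunk_size=600,     # tokens, because length_function counts tokens
    chunk_overlap=60,   # ~10% overlap for context continuity
    length_function=token_length,  # the default, len, would count characters
    is_separator_regex=False,
)

# For code: use language-aware splitters
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=800,     # characters here, since length_function defaults to len
    chunk_overlap=80,
)
```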
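The splitters above are structure-aware rather than semantic. The sentence-merging approach described for prose can be sketched as follows. This is a minimal illustration, not a library API: `semantic_chunks`, the `embed` callable, the 0.5 threshold, and the 400-word cap are all assumptions, and the regex sentence splitter is deliberately naive.

```python
import math
import re
from typing import Callable, List, Sequence

# Hypothetical interface: any function that maps text to an embedding vector.
Embed = Callable[[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(text: str, embed: Embed, threshold: float = 0.5,
                    max_words: int = 400) -> List[str]:
    """Merge adjacent sentences until similarity to the previous sentence drops."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []

    chunks: List[str] = []
    current = [sentences[0]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        words_so_far = sum(len(s.split()) for s in current)
        too_long = words_so_far + len(sentence.split()) > max_words
        if cosine(prev_vec, vec) < threshold or too_long:
            # Topic shift (or size cap reached): close the current chunk.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks
```

In practice `embed` would wrap whichever embedding model the index already uses, and the threshold needs tuning per corpus, since similarity scores are not comparable across models.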
There is no universal correct chunk size. Run retrieval precision benchmarks at 256, 512, and 1024 tokens against a sample of real queries. Pick the size that maximises the percentage of queries where the correct answer appears in the top-3 retrieved chunks.
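One way to run that comparison is sketched below. The names are hypothetical: `build_retriever` stands for whatever code re-indexes your corpus at a given chunk size and returns a `(query, k) -> chunks` callable, and a query counts as a hit when an expected answer snippet appears in any of the top-k chunks.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: a retriever is any callable (query, k) -> chunk texts.
Retriever = Callable[[str, int], List[str]]


def hit_rate_at_k(retrieve: Retriever,
                  labeled_queries: Sequence[Tuple[str, str]],
                  k: int = 3) -> float:
    """Fraction of queries whose expected snippet appears in the top-k chunks."""
    if not labeled_queries:
        raise ValueError("labeled_queries must not be empty")
    hits = 0
    for query, expected_snippet in labeled_queries:
        chunks = retrieve(query, k)[:k]
        if any(expected_snippet in chunk for chunk in chunks):
            hits += 1
    return hits / len(labeled_queries)


def sweep_chunk_sizes(build_retriever: Callable[[int], Retriever],
                      sizes: Sequence[int],
                      labeled_queries: Sequence[Tuple[str, str]],
                      k: int = 3) -> Dict[int, float]:
    """Re-index at each chunk size and record the top-k hit rate."""
    return {size: hit_rate_at_k(build_retriever(size), labeled_queries, k)
            for size in sizes}


def best_chunk_size(results: Dict[int, float]) -> int:
    # Highest hit rate wins; ties go to the smaller chunk size.
    return min(results, key=lambda size: (-results[size], size))
```

A call such as `sweep_chunk_sizes(build_retriever, [256, 512, 1024], labeled_queries)` returns one hit rate per size. Substring matching is a crude relevance test, so spot-check a few hits and misses by hand before trusting the numbers.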
Failure Mode 2: Missing Reranking
Vector similarity search retrieves the right candidates but ranks them poorly. The chunk with the highest cosine similarity is not always the most useful chunk for the specific query — it is the closest in embedding space, not the most relevant to the question. Without a cross-encoder reranker, you are systematically passing the wrong context to your LLM.
Why it breaks
Vector similarity search is excellent at candidate retrieval. It is poor at ranking. Cosine similarity between two high-dimensional embeddings captures semantic proximity, not answer relevance for a specific query. The top result by cosine distance is not always the most useful chunk for the question at hand.
Teams that skip reranking are essentially treating their retrieval problem as solved after the first-stage ANN search. In practice, the chunk that best answers the query is often ranked 3rd or 5th by embedding similarity — close enough to retrieve, not close enough to surface first.
If your system passes the top-1 or top-2 chunks to the LLM without reranking and truncates the rest, you are systematically dropping the best answers.
Diagnosis
Run a relevance audit on your retrieval results:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def audit_retrieval_rank(query: str, retrieved_chunks: list[str],
                         ground_truth_chunk: str) -> dict:
    scores = reranker.predict(
        [(query, chunk) for chunk in retrieved_chunks]
    )
    reranked = sorted(
        enumerate(retrieved_chunks),
        key=lambda x: scores[x[0]],
        reverse=True
    )
    vector_rank = retrieved_chunks.index(ground_truth_chunk) + 1
    reranked_rank = next(
        i + 1 for i, (orig_idx, _) in enumerate(reranked)
        if retrieved_chunks[orig_idx] == ground_truth_chunk
    )
    return {
        "query": query,
        "vector_rank": vector_rank,
        "reranked_rank": reranked_rank,
        "improved": reranked_rank < vector_rank,
    }
```
If reranked rank is better than vector rank on more than 30% of your test queries, you have a reranking gap that is actively hurting answer quality.
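To apply that 30% rule across a whole test set, the per-query audit results can be aggregated. The sketch below assumes each audit is a dict carrying `vector_rank` and `reranked_rank`; `reranking_gap` is a hypothetical helper name.

```python
from typing import Dict, List


def reranking_gap(audits: List[Dict], gap_threshold: float = 0.30) -> Dict[str, object]:
    """Summarise per-query audits that carry vector_rank and reranked_rank."""
    if not audits:
        raise ValueError("audits must not be empty")
    n = len(audits)
    improved = sum(1 for a in audits if a["reranked_rank"] < a["vector_rank"])
    worsened = sum(1 for a in audits if a["reranked_rank"] > a["vector_rank"])
    return {
        "improved_pct": round(improved / n * 100, 1),
        "worsened_pct": round(worsened / n * 100, 1),
        # Flag a gap when reranking helps on more than the threshold share.
        "has_reranking_gap": improved / n > gap_threshold,
    }
```

Tracking `worsened_pct` alongside `improved_pct` is a design choice of this sketch: a reranker that helps 35% of queries but hurts 30% is a weaker case than the headline number suggests.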
Fix
Add a cross-encoder reranker as a second-pass filter. Retrieve k=20 candidates from your vector store, rerank them, and pass the top-3 to your LLM. The cross-encoder sees the full query and each chunk together, which lets it score relevance directly rather than proximity in embedding space.
```python
from sentence_transformers import CrossEncoder
from typing import List


class RerankedRetriever:
    def __init__(self, vector_store, reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.vector_store = vector_store
        self.reranker = CrossEncoder(reranker_model)

    def retrieve(self, query: str, top_k: int = 3, candidate_k: int = 20) -> List[str]:
        # First-stage: broad vector retrieval
        candidates = self.vector_store.similarity_search(query, k=candidate_k)

        # Second-stage: cross-encoder reranking
        pairs = [(query, doc.page_content) for doc in candidates]
        scores = self.reranker.predict(pairs)
        ranked = sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True
        )
        return [doc.page_content for doc, _ in ranked[:top_k]]
```
Cross-encoder reranking adds 50–200ms of latency for a 20-candidate set. For most production RAG workloads, that is an acceptable trade for a 15–25% improvement in answer correctness.
Failure Mode 3: Stale Index
Your embedding index is a snapshot of your documents at indexing time. When a policy is updated, a product spec is revised, or a pricing page changes, the index does not update automatically — queries continue retrieving content from weeks or months ago, with no error signal to indicate the problem.
Why it breaks
Stale index is insidious because it is invisible in standard observability. Query latency is normal. Embedding lookups return results. The system appears healthy. Users are just silently receiving outdated information.
The problem compounds with time. A document indexed 6 months ago and updated 3 times since is a liability, not an asset.
Diagnosis
Implement index freshness tracking:
```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IndexedDocument:
    doc_id: str
    content_hash: str
    indexed_at: datetime         # naive UTC timestamps are assumed throughout
    source_updated_at: datetime


def audit_index_freshness(indexed_docs: list[IndexedDocument],
                          max_age_days: int = 30) -> dict:
    now = datetime.utcnow()
    stale = []
    for doc in indexed_docs:
        entry = {"id": doc.doc_id}
        age = (now - doc.indexed_at).days
        if age > max_age_days:
            entry["age_days"] = age
        if doc.source_updated_at > doc.indexed_at:
            gap = doc.source_updated_at - doc.indexed_at
            entry["reason"] = "source_updated_after_index"
            # total_seconds() includes whole days; .seconds alone would drop them
            entry["gap_hours"] = int(gap.total_seconds() // 3600)
        # One entry per document, so stale_count is never double-counted
        if len(entry) > 1:
            stale.append(entry)

    total = len(indexed_docs)
    return {
        "total_documents": total,
        "stale_count": len(stale),
        "stale_pct": round(len(stale) / total * 100, 1) if total else 0.0,
        "stale_docs": stale[:10],
    }
```
Fix
Implement incremental re-indexing on document change, not on a fixed schedule. Track content hashes. When a source document's hash changes, queue it for re-embedding immediately.
```python
import hashlib
from datetime import datetime
from typing import Callable


class IncrementalIndexer:
    # vector_store is assumed to expose delete(filter=...) and
    # add_embeddings(texts=..., embeddings=..., metadatas=...); adapt these
    # calls to your store's client. chunker is any callable that splits text.
    def __init__(self, vector_store, embedder, chunker: Callable[[str], list[str]]):
        self.vector_store = vector_store
        self.embedder = embedder
        self.chunker = chunker
        self.index_registry: dict[str, str] = {}  # doc_id -> content_hash

    def _content_hash(self, content: str) -> str:
        return hashlib.sha256(content.encode()).hexdigest()

    def upsert_document(self, doc_id: str, content: str, metadata: dict):
        new_hash = self._content_hash(content)
        if self.index_registry.get(doc_id) == new_hash:
            return  # Content unchanged, skip re-indexing

        # Remove the old chunks first so stale content cannot be retrieved
        self.vector_store.delete(filter={"doc_id": doc_id})
        chunks = self.chunker(content)
        embeddings = self.embedder.embed_documents(chunks)
        indexed_at = datetime.utcnow().isoformat()
        self.vector_store.add_embeddings(
            texts=chunks,
            embeddings=embeddings,
            metadatas=[{**metadata, "doc_id": doc_id, "indexed_at": indexed_at}
                       for _ in chunks],
        )
        self.index_registry[doc_id] = new_hash
```
Wire this to your content management system's webhook or change-data-capture stream. Every document update should trigger an upsert within minutes, not the next scheduled batch run.
Failure Mode 4: No Hybrid Retrieval (BM25 + Vector)
Pure vector search fails on exact-match queries. When a user searches for a specific error code, API endpoint, or product identifier, vector similarity often surfaces semantically related content that never contains the exact string. BM25 handles rare-term and exact-match queries precisely — hybrid retrieval combines both and consistently outperforms either approach alone.
Why it breaks
Pure vector search excels at semantic similarity. It is poor at exact matching. When a user queries for a specific product code, a person's name, an API endpoint, or an error message, vector search often surfaces semantically related but lexically different results. The chunk containing the exact string ERR_QUOTA_EXCEEDED may score lower than a chunk about "error handling" that never mentions the specific code.
BM25 (the algorithm behind classic keyword search) handles exact and rare-term matching extremely well. It rewards documents that contain the query terms, with inverse document frequency weighting meaning that rare, specific terms get boosted. What BM25 misses is paraphrase, synonym, and conceptual matching — exactly what vector search handles.
Teams that use only vector search leave a meaningful precision gap for queries with specific identifiers. Teams that use only BM25 miss semantic intent. Hybrid retrieval combines both, and on standard retrieval benchmarks it consistently outperforms either approach alone.
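The rare-term boost is easy to see in a small worked example. The sketch below is a plain-Python Okapi BM25 scorer, written for illustration only: it uses the non-negative IDF variant popularised by Lucene, common default values for `k1` and `b`, and whitespace tokenisation, so its scores will not match any particular library exactly.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each doc against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms (low df) get a larger IDF and therefore more weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average docs are penalised through the b term.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "general error handling guidance for the api",
    "the api returns err_quota_exceeded when the daily quota is used up",
    "how to handle an error in the api client",
]
scores = bm25_scores("error ERR_QUOTA_EXCEEDED", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The second document wins even though it never contains the word "error": the identifier appears in only one document, so its IDF outweighs the more common term.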
Diagnosis
Run a query set that mixes semantic queries ("how does the refund policy work?") and exact-match queries ("what is the timeout value for API_GATEWAY_CONNECT?"). Compare top-3 precision for vector-only versus BM25-only versus hybrid across both query types. If vector-only precision on exact-match queries is more than 15 percentage points lower than on semantic queries, you have a pure-vector blind spot.
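The comparison can be scripted once each test query is labelled by type. The sketch below makes several assumptions: the labels `"semantic"` and `"exact"`, the helper names, and a simplified hit-or-miss outcome per query (was a relevant chunk in the top 3?) standing in for full top-3 precision.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def hit_rate_by_query_type(outcomes: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """outcomes: (query_type, hit_in_top_3) pairs, e.g. ("exact", True)."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for query_type, hit in outcomes:
        totals[query_type] += 1
        hits[query_type] += int(hit)
    return {query_type: hits[query_type] / totals[query_type] for query_type in totals}


def has_vector_blind_spot(rates: Dict[str, float],
                          max_gap_points: float = 15.0) -> bool:
    """True when exact-match queries trail semantic ones by more than the gap."""
    gap_points = (rates["semantic"] - rates["exact"]) * 100
    return gap_points > max_gap_points
```

Run it once per retriever configuration (vector-only, BM25-only, hybrid) on the same labelled query set so the three sets of rates are directly comparable.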
Fix
Implement reciprocal rank fusion (RRF) to merge vector and BM25 rankings:
```python
from typing import List, Optional

import numpy as np
from rank_bm25 import BM25Okapi


class HybridRetriever:
    def __init__(self, vector_store, documents: List[str],
                 doc_ids: Optional[List[str]] = None,
                 rrf_k: int = 60, alpha: float = 0.5):
        self.vector_store = vector_store
        self.documents = documents
        # IDs must match the "id" metadata stored with each vector, otherwise
        # the two rankings never fuse. Defaults to doc_0, doc_1, ...
        self.doc_ids = doc_ids or [f"doc_{i}" for i in range(len(documents))]
        self.alpha = alpha  # 0 = BM25 only, 1 = vector only
        self.rrf_k = rrf_k
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def _rrf_score(self, rank: int) -> float:
        return 1.0 / (self.rrf_k + rank)

    def retrieve(self, query: str, top_k: int = 5) -> List[str]:
        vector_results = self.vector_store.similarity_search(query, k=top_k * 4)
        tokenized_query = query.lower().split()
        bm25_scores = self.bm25.get_scores(tokenized_query)
        bm25_ranked = np.argsort(bm25_scores)[::-1][:top_k * 4]

        rrf_scores: dict[str, float] = {}
        texts: dict[str, str] = {}
        # Ranks start at 1, as in standard RRF
        for rank, doc in enumerate(vector_results, start=1):
            doc_id = doc.metadata["id"]
            texts[doc_id] = doc.page_content
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + self.alpha * self._rrf_score(rank)
        for rank, idx in enumerate(bm25_ranked, start=1):
            doc_id = self.doc_ids[idx]
            texts[doc_id] = self.documents[idx]
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + (1 - self.alpha) * self._rrf_score(rank)

        # Return chunk texts, best fused score first
        sorted_ids = sorted(rrf_scores, key=rrf_scores.get, reverse=True)
        return [texts[doc_id] for doc_id in sorted_ids[:top_k]]
```
Start with alpha=0.5 (equal weight) and tune based on your query distribution. If your users ask mostly exact-product or identifier queries, shift toward alpha=0.3 to weight BM25 more heavily.
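A small worked example shows how much the weighting matters. The sketch below is a standalone version of the fusion step with hypothetical chunk IDs; it assumes ranks start at 1 and uses the same `rrf_k=60` default.

```python
from typing import Dict, List


def weighted_rrf(vector_ranking: List[str], bm25_ranking: List[str],
                 alpha: float = 0.5, rrf_k: int = 60) -> List[str]:
    """Fuse two ranked ID lists; alpha=1 is vector only, alpha=0 is BM25 only."""
    scores: Dict[str, float] = {}
    for rank, doc_id in enumerate(vector_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (rrf_k + rank)
    for rank, doc_id in enumerate(bm25_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_ranking = ["a", "b", "c"]   # hypothetical chunk IDs, best first
bm25_ranking = ["c", "a", "d"]

balanced = weighted_rrf(vector_ranking, bm25_ranking, alpha=0.5)
bm25_heavy = weighted_rrf(vector_ranking, bm25_ranking, alpha=0.3)
print(balanced)    # ['a', 'c', 'b', 'd']
print(bm25_heavy)  # ['c', 'a', 'd', 'b']
```

With equal weights, chunk `a` (1st by vector, 2nd by BM25) comes out on top. At `alpha=0.3`, chunk `c` (1st by BM25, 3rd by vector) overtakes it. Because RRF scores for neighbouring ranks differ only slightly, small changes in `alpha` can reorder the top results, which is why tuning against a labelled query set is safer than picking a value by intuition.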
Failure Mode 5: No Eval Loop
Without an automated eval loop, every regression in your RAG pipeline is invisible until a user complaint surfaces it. Teams without retrieval quality checks running on every deployment discover degradation 4–8× slower than teams that do — and by then, the root cause is typically tangled across multiple changes and hard to isolate.
Why it breaks
You cannot improve what you do not measure. RAG systems degrade over time as documents are updated, query patterns shift, and underlying model versions change. Without an automated eval loop tied to retrieval quality metrics, every one of these changes is invisible until a user complaint surfaces it.
The eval loop is not optional. It is the mechanism that keeps your RAG pipeline honest over its operational lifetime.
Diagnosis
Check whether your deployment pipeline currently runs any of these:
- Retrieval precision@k (what fraction of ground-truth relevant chunks appear in the top-k retrieved?)
- Answer faithfulness (does the generated answer stay within the retrieved context, or does it hallucinate beyond it?)
- Answer relevance (does the generated answer actually address the query?)
- Context recall (does the retrieved set contain all the information needed to answer correctly?)
If none of these are tracked per deployment, you are operating blind.
Fix
Build a retrieval eval suite using a golden query set and run it in CI on every deployment:
```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalCase:
    query: str
    expected_doc_ids: List[str]
    expected_answer_contains: Optional[str] = None


def precision_at_k(retrieved_ids: List[str], relevant_ids: List[str], k: int) -> float:
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


def run_retrieval_eval(retriever, eval_cases: List[EvalCase], k: int = 3) -> dict:
    results = []
    for case in eval_cases:
        # Assumes retriever.retrieve returns dicts that carry an "id" key
        retrieved = retriever.retrieve(case.query, top_k=k)
        retrieved_ids = [r["id"] for r in retrieved]
        precision = precision_at_k(retrieved_ids, case.expected_doc_ids, k)
        recall = sum(
            1 for doc_id in case.expected_doc_ids if doc_id in retrieved_ids
        ) / len(case.expected_doc_ids)
        results.append({
            "query": case.query,
            f"precision@{k}": precision,
            "recall": recall,
        })
    avg_precision = sum(r[f"precision@{k}"] for r in results) / len(results)
    avg_recall = sum(r["recall"] for r in results) / len(results)
    return {
        f"avg_precision@{k}": round(avg_precision, 3),
        "avg_recall": round(avg_recall, 3),
        "per_query": results,
    }


def ci_gate(current_metrics: dict, baseline_metrics: dict,
            relative_threshold: float = 0.05) -> bool:
    # Compares precision@3, so run the eval with k=3 before calling this
    baseline_p = baseline_metrics["avg_precision@3"]
    current_p = current_metrics["avg_precision@3"]
    regression = (baseline_p - current_p) / baseline_p
    if regression > relative_threshold:
        print(f"FAIL: precision@3 regressed {regression:.1%} (baseline={baseline_p:.3f}, current={current_p:.3f})")
        return False
    return True
```
Run this eval suite against a golden set of 50–200 query/relevant-document pairs on every deploy. Gate the deployment if precision@3 drops more than 5% relative to the last passing run.
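"Relative to the last passing run" implies the baseline has to be stored somewhere between deployments. One possible arrangement, sketched here with a hypothetical JSON baseline file and helper name, is to compare against the stored metrics and overwrite them only when the gate passes.

```python
import json
from pathlib import Path


def gate_against_baseline(current: dict, baseline_path: Path,
                          metric: str = "avg_precision@3",
                          relative_threshold: float = 0.05) -> int:
    """Return a process exit code: 0 to allow the deploy, 1 to block it."""
    if not baseline_path.exists():
        # First run: nothing to compare against, so record and pass.
        baseline_path.write_text(json.dumps(current))
        return 0

    baseline = json.loads(baseline_path.read_text())
    baseline_value = baseline[metric]
    # A zero baseline cannot regress further; avoid dividing by zero.
    drop = (baseline_value - current[metric]) / baseline_value if baseline_value else 0.0
    if drop > relative_threshold:
        print(f"FAIL: {metric} dropped {drop:.1%} against the stored baseline")
        return 1

    # Only a passing run becomes the new baseline.
    baseline_path.write_text(json.dumps(current))
    return 0
```

A CI step could pass the return value to `sys.exit`. Note that a rolling baseline lets quality drift downward in small steps that each stay under the threshold, so it may be worth also keeping a fixed floor for the metric.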
RAG Pipeline Debugging Checklist
Run this before spending time on prompt engineering or model tuning. These five failure modes are sequential — chunking problems corrupt every downstream step, so work top to bottom. If any item below fails, fix it before moving to the next row.
| Check | Tool / Signal | Pass Condition |
|---|---|---|
| Chunk quality | Run audit_chunks() | truncated_sentences < 30%, avg tokens 200–600 |
| Chunk strategy | Manual inspection | Chunks are semantically coherent units |
| Reranker present | Code review | Cross-encoder reranker on first-stage candidates |
| Reranker improves rank | audit_retrieval_rank() | Ground-truth rank improves in > 30% of queries |
| Index freshness | Hash comparison | No document indexed > 30 days without change check |
| CDC / webhook | Infrastructure review | Document updates trigger re-index within minutes |
| Hybrid retrieval | Code review | BM25 + vector fusion implemented |
| Hybrid alpha tuned | Precision comparison | Hybrid P@3 ≥ max(vector-only, BM25-only) P@3 |
| Eval suite exists | CI pipeline | Retrieval eval runs on every deployment |
| Regression gate | CI config | Deploy blocked if precision drops > 5% relative |