saurabh naik

Posted on May 18

Why production RAG fails — and the boring metrics that fix it

#ai #llm #rag #python

Most production RAG pipelines underperform for the same reason: the team treats retrieval as a solved vector-search problem, ships top-k embedding search, and then blames the generator when the answers are wrong. The "RAG is dead, long context replaces it" framing is the wrong fight. Long context doesn't fix retrieval — it hides retrieval failures behind a larger haystack while adding cost and latency.

This walkthrough is for engineers who already have a RAG prototype and want to know what to measure, what to fix, and in what order. By the end you'll have a minimal LangChain + FAISS + cross-encoder reranker pipeline and a clear separation between retrieval metrics and generation metrics.

The three components, in one paragraph

RAG is a hybrid: a non-parametric retriever (usually a dual-encoder over a chunked corpus, often paired with BM25) selects top-k passages from a document store, then a parametric LLM generates an answer conditioned on those passages. Three knobs, three failure surfaces. The original paper (Lewis et al., 2020 — arxiv.org/abs/2005.11401) introduced two variants — RAG-Sequence and RAG-Token — but in practice almost no production system jointly fine-tunes any of it. Teams freeze components and tune chunking, embeddings, and reranking.

That's the whole architecture. Everything below is about why each component fails and how to tell which one is failing.

The metrics most teams skip

If you only measure end-to-end answer quality, you cannot tell whether the retriever missed the right chunk or the generator ignored a chunk it was given. These are different bugs with different fixes. You need at least three numbers, scored on a synthetic eval set built on day one:

Retrieval recall@k — did the right chunk appear in the top-k? Computed against ground-truth passage IDs.
Faithfulness — does the generated answer actually follow from the retrieved chunks, or is it hallucinated?
Answer relevance — does the answer address the question, regardless of source?

RAGAS (Es et al., 2023 — arxiv.org/abs/2309.15217) gives you reference-free versions of the last two, validated on WikiEval at 0.95 agreement with human annotators for faithfulness (vs. 0.61 for naive GPT-3.5 prompting). The authors showed automated metrics can replace ~80% of human eval effort in iterative tuning.

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

# Each row: a question, the retrieved chunks, the model's answer, ground truth
data = Dataset.from_dict({
    "question": questions,
    "contexts": retrieved_chunks,   # list[list[str]]
    "answer": generated_answers,
    "ground_truth": ground_truths,
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)

The point isn't the score — it's that you now have three separable signals. When faithfulness is high but answer relevance is low, your retriever missed. When faithfulness is low, your generator is ignoring context. You stop guessing.

The four real failure modes

After enough postmortems they collapse to four. Each has a different fix.

1. Chunking splits the answer

The right information exists in your corpus but it's spread across the boundary between two chunks. Neither chunk alone contains the answer, so neither retrieves well. Fix: overlap your chunks (10–20% is a reasonable start), respect semantic boundaries (sections, paragraphs) before character counts, and for long technical docs consider hierarchical chunking with parent-doc retrieval.

2. Top-k drowns the generator

You retrieved the right chunk, but you also retrieved nine weakly-related ones, and the model attends to the wrong neighbor. Bigger k is not the answer. Add a reranker (next section). Precision matters more than recall once recall@20 is acceptable.

3. Stale or duplicated index

Documents drift. The same chunk appears under three different IDs because someone re-ingested without dedup. The retriever returns three near-identical neighbors and crowds out the actually relevant one. Fix: deduplicate by content hash at ingestion, version your index, and put a TTL on anything that changes.

4. Context-faithfulness gap

The right chunk is in the prompt and the model still hallucinates. This is a generator problem. Tighten the system prompt ("answer only from the provided context; say 'I don't know' if absent"), measure faithfulness explicitly, and consider a stronger or instruction-tuned model. This is the one that looks like "RAG doesn't work" and is actually "your generator doesn't follow instructions."

Lost in the Middle: why position matters

Even when the relevant chunk is retrieved, where it sits in the context window changes whether the model uses it. Liu et al., 2023 (arxiv.org/abs/2307.03172) showed retrieval-augmented QA accuracy drops from ~75% when the relevant doc is at position 1 to ~50% when it's placed in the middle of a 20-doc context window. A 25-percentage-point swing from position alone.

This is the actual argument against "just shove everything into long context." A bigger window doesn't help if the model under-attends to the middle. A reranker that promotes the best chunk to position 1 — or that lets you safely use a smaller k — is doing real work.

A minimal LangChain + FAISS + cross-encoder reranker

Hybrid retrieval (BM25 + dense) is the cheapest precision win. A cross-encoder reranker on top is the second. Here's a stripped-down pipeline that puts both in place:

from langchain_community.vectorstores import FAISS
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Chunk with overlap, respecting structure
splitter = RecursiveCharacterTextSplitter(
    chunk_size=600, chunk_overlap=80,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". "]
)
chunks = splitter.split_documents(raw_docs)

# 2. Dense index
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
dense = FAISS.from_documents(chunks, embeddings).as_retriever(search_kwargs={"k": 20})

# 3. Lexical index (catches exact terms dense misses — error codes, IDs)
bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 20

# 4. Hybrid
hybrid = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.4, 0.6])

# 5. Cross-encoder reranker — reorders the candidates by true query-doc relevance
ce = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
reranker = CrossEncoderReranker(model=ce, top_n=5)

retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid,
)

docs = retriever.invoke("How does the reranker change top-k recall?")

A few things to notice. The hybrid retriever pulls 20 candidates from each backend; the cross-encoder scores them properly (a true bi-input model, not a similarity proxy) and returns the top 5. Final k to the generator is small — which both fixes the Lost-in-the-Middle problem and cuts your token cost. Switching from a bi-encoder-only setup to this on a real corpus usually moves retrieval recall@5 by double-digit points without touching the generator.

Note: Cross-encoders are slow per pair — they recompute attention over the concatenated query+doc. That's why you only run them on the top-20 from the cheap retriever, not the whole corpus.

The fix order that has actually worked

If you can do one thing this week, in this order:

Build a 50–100 question synthetic eval set with ground-truth chunk IDs. Without it you're flying blind.
Add BM25 alongside your dense retriever and ensemble them. Cheapest precision gain.
Add a cross-encoder reranker. Measure recall@5 before and after.
Wire up RAGAS faithfulness + answer-relevance so you can separate retriever bugs from generator bugs.
Only then think about query rewriting, HyDE, fine-tuning embeddings, or a bigger generator.

The interesting work is at the top of that list, not the bottom. Most teams reverse it.

Wrapping up

Retrieval recall is its own metric. Measure it separately from answer quality, or you'll keep blaming the generator for the retriever's miss. Long context doesn't replace retrieval — it just hides which one of your four failure modes is the one biting you.

Two follow-ups worth a read if you want to go deeper:

Jason Liu's "systematically improving RAG" writing (jxnl.co/writing/) — the most practical eval-driven approach I've seen.
GraphRAG for cases where your corpus has real entity-relationship structure and dense retrieval keeps missing the connection.

What's the one retrieval failure that took you longest to diagnose? I'm curious how often it turned out to be chunking vs. reranking vs. the generator just ignoring the context.

Top comments (1)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.