Davide Mibelli

RAG in Production: What the Tutorials Don't Tell You

I built a RAG system that scored 91% on our internal eval suite. It retrieved the right chunks four out of five times in every benchmark we ran. We shipped it. Users thought it was broken.

The gap between "works in evaluation" and "works in production" is the thing every RAG tutorial skips. This article is what I learned closing that gap across three different production deployments — a customer support bot, an internal knowledge base, and a document Q&A tool for a legal team.

Why your evals lie to you

The typical RAG eval flow: take 50 question-answer pairs, run retrieval, score chunk relevance, measure answer quality. The benchmark looks good. Production does not.

The problem is that evaluation datasets are clean. Real user questions are not. Users ask ambiguous questions, reference context from earlier in the conversation, use company-specific jargon your embedding model has never seen, and ask questions that span multiple documents. Your 50-pair eval dataset does not cover any of this.

The more subtle problem: retrieval correctness is not the same as answer usefulness. A chunk can be semantically relevant to the query but contain outdated information, contradict another retrieved chunk, or be missing the specific number the user actually needs. Cosine similarity does not catch any of this.

Before you optimize retrieval metrics, instrument what users actually do. In my customer support deployment, the clearest signal was not retrieval recall — it was how often users rephrased their question immediately after getting an answer. That rephrasing rate was the real quality metric.

Chunking is where it actually breaks

The default chunking strategy in most tutorials: split every 512 tokens with 50-token overlap. This is almost always wrong.

The problem is that 512 tokens is an arbitrary number based on older embedding model limits, not on the structure of your documents. A 512-token chunk cut out of the middle of a legal clause or a technical procedure is often meaningless without the surrounding context.

What actually works depends on your document type:

For structured documents (FAQs, product docs, knowledge base articles): chunk by logical unit — one question-answer pair, one procedure step, one concept section. Use your document's own structure as the chunking boundary. If your docs use consistent heading patterns, split on those.

For long-form prose (contracts, reports, research papers): hierarchical chunking. Keep a parent chunk of 1000–2000 tokens for context retrieval, and child chunks of 150–300 tokens for precise matching. At retrieval time, return the child chunk for relevance scoring but pass the parent chunk to the LLM as context (a minimal sketch follows after the figure below).

For code documentation or READMEs: file-level or function-level chunks, never mid-function splits.

[Figure: three chunking strategies side by side. Fixed 512-token splitting cuts mid-sentence; logical, heading-based splitting keeps clean boundaries at section breaks; hierarchical parent-child splitting nests small retrieval chunks inside larger context chunks.]
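To make the parent-child pattern concrete, here is a minimal hand-rolled sketch. The heading regex and the word-based child size are assumptions; in a real pipeline you would count tokens and persist the parent text as metadata in your vector store.

import re
import uuid

def hierarchical_chunks(markdown_text: str, child_words: int = 200):
    """Split a doc into parent sections by heading, then split each parent
    into small child chunks that remember which parent they came from."""
    parents, children = {}, []
    # One parent per heading-delimited section
    for section in re.split(r"\n(?=#{1,3} )", markdown_text):
        if not section.strip():
            continue
        parent_id = str(uuid.uuid4())
        parents[parent_id] = section
        # Naive word-window children inside the parent (approximates tokens)
        words = section.split()
        for i in range(0, len(words), child_words):
            children.append({
                "text": " ".join(words[i:i + child_words]),
                "parent_id": parent_id,
            })
    return parents, children

# Embed and search the children; before prompting the LLM, swap each hit
# for its parent: context = parents[hit["parent_id"]]

LangChain's ParentDocumentRetriever implements roughly this pattern if you would rather not hand-roll it.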

The overlap parameter also matters more than people expect. Overlap exists to avoid losing information at chunk boundaries, but it inflates your vector store size and retrieves duplicate context. I found that semantic chunking — splitting at sentence boundaries that mark topic shifts — eliminated the need for overlap almost entirely.
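A rough sketch of that idea, assuming you have already split the document into sentences; the model name and the 0.6 cutoff are placeholders to tune per corpus.

from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences: list[str], cutoff: float = 0.6) -> list[str]:
    """Start a new chunk wherever adjacent sentences stop being similar,
    using the similarity drop as a cheap proxy for a topic shift."""
    if not sentences:
        return []
    embeddings = encoder.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(np.dot(embeddings[i - 1], embeddings[i])) < cutoff:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks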

Retrieval is rarely the problem you think it is

When RAG produces wrong answers, the instinct is to improve retrieval. Better embeddings, more chunks, higher top-k. This is usually the wrong lever.

In my experience, retrieval is failing only about 30% of the time when users report bad answers. The other 70% is one of:

  • The right chunk was retrieved but the LLM ignored it — this is a prompt engineering problem, not a retrieval problem
  • The answer requires synthesizing across multiple chunks — retrieval returned individually correct chunks but the LLM could not connect them
  • The question is genuinely unanswerable from the knowledge base — the document does not exist or is outdated

To separate these, add logging at both retrieval and generation time. Log the top-k chunks and the final answer separately. A human spot-check of 20 failure cases per week will show you very quickly which category you are actually in.
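A minimal version of that logging, one JSON line per request so you can grep failures by chunk ID later. The field names and the dict shape of each chunk are assumptions about your pipeline.

import json
import time
import uuid

def log_rag_request(query: str, chunks: list[dict], answer: str,
                    path: str = "rag_requests.jsonl") -> None:
    """Append retrieval output and the final answer as one record, so the
    weekly spot-check can tell retrieval failures from generation failures."""
    record = {
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "retrieved": [
            {"chunk_id": c["id"], "score": c["score"], "preview": c["text"][:200]}
            for c in chunks
        ],
        "answer": answer,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")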

When retrieval genuinely is the problem, the highest-leverage fix is hybrid search: dense retrieval (embeddings + cosine similarity) combined with sparse retrieval (BM25 keyword matching). Dense retrieval handles semantic similarity. BM25 handles exact matches — product names, error codes, version numbers, any term where "sounds like" is the wrong answer.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_chroma import Chroma

dense_retriever = Chroma(...).as_retriever(search_kwargs={"k": 10})
sparse_retriever = BM25Retriever.from_documents(docs, k=10)

ensemble = EnsembleRetriever(
    retrievers=[dense_retriever, sparse_retriever],
    weights=[0.6, 0.4]  # tune based on your query distribution
)

The weight split between dense and sparse depends on your documents. Technical documentation with lots of product names and version numbers benefits from higher BM25 weight. Conversational knowledge bases lean toward dense.

Reranking changes the answer quality more than anything else

If there is one thing to add to a RAG pipeline that makes the biggest difference in production answer quality, it is a cross-encoder reranker between retrieval and generation.

The retrieval step uses bi-encoder embeddings — query and document are embedded independently, similarity is a dot product. This is fast but imprecise. A cross-encoder takes the query and a candidate chunk together as a single input and scores their relevance jointly. Much more accurate, but too slow to run over your entire corpus.

The standard pattern: retrieve top-20 with the fast bi-encoder, rerank with the cross-encoder, pass top-3 or top-5 to the LLM.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, chunks), reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]

In the legal document system, adding reranking reduced hallucinations in answers by roughly 40% — not because retrieval improved, but because the LLM stopped having to sort through marginally relevant context and could focus on the actually relevant chunks.

The infrastructure issues no tutorial shows

Context window budget. Top-5 chunks at 500 tokens each is 2500 tokens before you add the system prompt and the user message. With GPT-4 this is fine. With smaller models or high-volume APIs where you want to minimize token cost, you need explicit context budgeting. Know your LLM's context window, subtract your fixed prompt overhead, and set your chunk count and size to fit.

Stale chunks. Documents in production change. Your chunk embeddings do not update automatically. You need a pipeline that detects document changes (by checksum or last-modified timestamp) and re-embeds only the changed documents. I've seen teams manually re-index their entire corpus monthly as a workaround. That is not a solution for any corpus over a few thousand documents.
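A sketch of the change-detection half, using content hashes; where you persist the hashes (a sidecar table, vector store metadata) depends on your stack.

import hashlib

def docs_needing_reembedding(documents: dict[str, str],
                             indexed_hashes: dict[str, str]) -> list[str]:
    """Compare each document's content hash against the hash recorded at
    last indexing time; only the mismatches go back through embedding."""
    changed = []
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if indexed_hashes.get(doc_id) != digest:
            changed.append(doc_id)
    return changed

# After re-embedding a changed doc, write its new hash back to indexed_hashes
# so the next run skips it.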

Conflicting information. If the same concept is documented in multiple places with different details, retrieval will return both. The LLM will either pick one arbitrarily or produce a contradictory answer. The fix is upstream — deduplicate your knowledge base and establish a single source of truth. Retrieval cannot save you from bad source data.
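One cheap way to surface those conflicts before they reach retrieval is to flag near-duplicate chunks at ingestion time and let a human pick the source of truth. A naive O(n²) sketch, fine for thousands of chunks but not millions; the model and the 0.9 threshold are assumptions.

from itertools import combinations
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def near_duplicate_pairs(chunks: list[str], threshold: float = 0.9) -> list[tuple[int, int]]:
    """Return index pairs of chunks similar enough that they probably
    document the same thing twice; route them to a human for review."""
    embeddings = encoder.encode(chunks, normalize_embeddings=True)
    return [
        (i, j)
        for i, j in combinations(range(len(chunks)), 2)
        if float(np.dot(embeddings[i], embeddings[j])) >= threshold
    ]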

[Figure: RAG pipeline with annotated failure points. Ingestion: stale chunks. Retrieval: bi-encoder imprecision. Reranking: where you recover. Generation: context budget exceeded, conflicting chunks.]

Context window management

One problem that grows slowly and then all at once: as your knowledge base expands and you increase top-k to improve recall, your context window fills up. You pass 10 chunks at 500 tokens each, plus system prompt, plus conversation history, and suddenly you are at 7000 tokens per request on a model with an 8000-token limit.

The naive fix is to keep raising top-k and hope the model attends to the right parts of an overstuffed context. The correct fix is explicit context budgeting:

MAX_CONTEXT_TOKENS = 4000  # reserve headroom for prompt + answer
TOKENS_PER_CHUNK = 400     # approximate after chunking

max_chunks = MAX_CONTEXT_TOKENS // TOKENS_PER_CHUNK  # = 10

chunks = rerank(query, retrieve(query, k=20), top_n=max_chunks)  # retrieve() stands in for your retriever call

Calculate the budget before retrieval, not after. If you are hitting the limit regularly, reduce chunk size rather than reducing top-k — smaller chunks at the same top-k give the model more coverage within the same token budget.

What to actually monitor

Stop measuring retrieval precision in isolation. In production, instrument:

  • Rephrasing rate: how often a user asks a follow-up that is essentially the same question reworded. High rate means the answer was not useful, regardless of retrieval metrics (a rough detection sketch follows this list).
  • Answer rejection rate: if your UI has thumbs-down feedback, track it. Correlate failures with retrieved chunk IDs to identify which documents produce bad answers consistently.
  • Latency by pipeline stage: retrieval, reranking, and generation each have different latency profiles and different optimization paths. Aggregate P95 latency tells you nothing useful about where to look.
  • Unanswerable rate: how often the LLM says "I don't have information on this." Below 5% and your system is probably hallucinating answers it should refuse. Above 30% and your knowledge base has coverage gaps.
  • Chunk age: track when each chunk was last re-indexed. Any chunk older than your document update frequency is potentially stale. This one takes ten minutes to add to your ingestion pipeline and saves hours of debugging mysterious wrong answers three months from now.
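For the rephrasing rate mentioned above, a crude but workable detector is to compare consecutive user questions in a session; the model and the 0.85 threshold are assumptions to calibrate against a handful of hand-labeled examples.

from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def is_rephrasing(previous_question: str, follow_up: str,
                  threshold: float = 0.85) -> bool:
    """Flag a follow-up as a rephrasing when it is near-identical in meaning
    to the previous question: a cheap proxy for 'the answer did not help'."""
    a, b = encoder.encode([previous_question, follow_up], normalize_embeddings=True)
    return float(np.dot(a, b)) >= threshold

# Rephrasing rate = flagged pairs / total consecutive question pairs per period.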

The RAG tutorial teaches you to build a pipeline. Production teaches you to instrument one. The teams I've seen succeed were instrumenting before they had users, not after.

What is the failure mode you hit first in your own RAG deployment?


Originally published on Medium.
