RAG Doesn't Hallucinate — Your Retrieval Does: Four Production Autopsies

#rag #ai #python #machinelearning

A confident wrong answer is almost never a model failure. It is a retrieval, grounding, or measurement failure wearing the model's voice.

The demo was flawless.

A small group is gathered around a laptop watching a retrieval-augmented assistant answer questions about the company's own policies. Someone from the business asks the hard one — the edge case that usually trips people up — and the assistant answers it perfectly, citing the right document. There is a pause, and then the sentence that launches a thousand doomed projects: "This is incredible. Can we have it live by end of quarter?"

Three weeks into production, the same assistant tells a customer-facing rep that a claim is covered when it is not. It is articulate. It is confident. It is wrong. And the post-incident question lands on the engineering team like a verdict: "Why is the AI hallucinating?"

Here is the reframing that has saved me more grief than any model upgrade: in a well-built system, hallucination is rarely the model making things up out of thin air. It is the model faithfully summarizing the wrong context — or no context — because something upstream of the language model failed quietly. The model is the last link in a chain, and it gets blamed for the failures of every link before it.

So let's do what the incident review should have done: stop staring at the model and autopsy the pipeline. I'll use a small, fully tested reference implementation (pure-Python retrieval core, link at the end) to make each failure concrete. Four autopsies. Each one a symptom you'll recognize, the real cause underneath it, and the fix.

Autopsy #1 — "It couldn't find the thing that was right there"

Symptom: a user searches for a specific claim code, SKU, or account number, and gets back documents that are about the right topic but don't contain the exact record. The information exists in the corpus. The system just couldn't surface it.

Cause of death: pure vector search. Embeddings are extraordinary at aboutness — "coverage for an inpatient procedure" finds "hospital admission benefits" even with no shared words. But that same blurring is fatal for exact tokens. To an embedding model, CLM-4417-B and CLM-4471-B live in nearly the same place in vector space, because semantically they are "a claim code." Dense retrieval is built to ignore the surface form — and the surface form is the entire point when someone is looking up an identifier.

The fix: hybrid retrieval. Run dense (vector) and sparse (BM25 lexical) retrieval side by side and fuse the results. BM25 nails the exact strings — codes, SKUs, account numbers — that embeddings smear; vectors catch the paraphrases that keyword search misses. The detail that makes this robust in practice is how you fuse: Reciprocal Rank Fusion combines the two lists by rank position, not by raw score. That matters more than it sounds, because a cosine similarity of 0.82 and a BM25 score of 14.3 are not on the same scale and never will be. Fusing on rank sidesteps the entire futile exercise of calibrating two incompatible score distributions against each other. You get the strengths of both retrievers without pretending their numbers mean the same thing.
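To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion, not the reference implementation's code. Each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k=60 is the value commonly used for RRF, and the document ids and both hit lists are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids using rank positions only, never raw scores."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results: the exact-match record is third for the dense
# retriever but first for BM25.
dense_hits = ["doc_topic_a", "doc_topic_b", "doc_clm_4417"]
bm25_hits = ["doc_clm_4417", "doc_topic_a", "doc_other"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
# ['doc_topic_a', 'doc_clm_4417', 'doc_topic_b', 'doc_other']
```

Notice that no cosine similarity or BM25 score appears anywhere in the function. The exact-match record climbs from third to second purely because one retriever ranked it first.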

Autopsy #2 — "The answer was in the corpus, ranked eighth"

Symptom: the correct passage was retrieved — it just wasn't retrieved high enough. It sat at position eight while the model only ever saw the top four. From the model's perspective the answer simply did not exist, so it improvised from the four mediocre passages it was handed.

Cause of death: treating first-stage retrieval as final. Fast retrieval, whether approximate nearest-neighbor (ANN) search over vectors or BM25, is tuned for recall across a huge corpus: cast a wide net, be cheap, be approximate. It is explicitly not tuned to know which of its top fifty hits is the one. And your context window is a guillotine: top-k is a hard cut, and anything below the line is invisible to the model no matter how relevant.

The fix: a reranking stage. Retrieve a deliberately wide candidate set, then run a precise, more expensive reranker over just those candidates to reorder them, so the genuinely best passage is pulled up into the narrow window the model actually reads. The mental model is two-phase: a cheap, high-recall first stage to narrow millions to dozens, then a precise, high-cost second stage to order those dozens correctly. This is also where you spend your latency budget deliberately — the reranker is the most expensive step in the pipeline, so you cap the candidate count and batch the work, rather than reranking everything and blowing your response time.
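The two-phase shape can be sketched in a few lines. This is an illustration under assumptions, not the reference implementation: first_stage and overlap_score are hypothetical stand-ins (a real system would put ANN or BM25 in the first slot and something like a cross-encoder in the second), and the tiny corpus is invented so the relevant passage starts at position eight.

```python
def retrieve_then_rerank(query, first_stage, rerank_score, candidates=50, top_k=4):
    """Cheap, wide recall first; precise, costly ordering second."""
    pool = first_stage(query)[:candidates]  # cap how much the expensive scorer sees
    ranked = sorted(pool, key=lambda doc: rerank_score(query, doc), reverse=True)
    return ranked[:top_k]


# Invented corpus: seven filler passages, then the one that matters.
DOCS = [f"general policy note {i}" for i in range(7)]
DOCS.append("claim CLM-4417-B is not covered")


def first_stage(query):
    # Stand-in for ANN/BM25: returns candidates in an imperfect order.
    return list(DOCS)


def overlap_score(query, doc):
    # Stand-in for a real reranker: counts shared lowercase tokens.
    return len(set(query.lower().split()) & set(doc.lower().split()))


top = retrieve_then_rerank("is claim CLM-4417-B covered", first_stage, overlap_score)
print(top[0])
# claim CLM-4417-B is not covered
```

The candidates parameter is where the latency budget lives: the scorer runs once per candidate, so capping the pool caps the cost.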

Autopsy #3 — "It answered when it should have shut up"

This is the one everyone calls hallucination, and it is the most preventable of all.

Symptom: the retrieval was genuinely poor — nothing relevant came back — and the model answered anyway, fluently and falsely.

Cause of death: a generator with no obligation to ground. Think about what "answer the user's question" instructs a language model to do when the context is empty: produce a plausible answer. You have literally asked it to fill the gap, and filling gaps with plausible text is precisely what it is best at. The model isn't malfunctioning; it's obeying. The failure is that nobody made grounding a requirement and nobody made silence an acceptable output.

The fix is two halves, and you need both. The first is instruction and constraint: tell the generator to answer using only the retrieved context, to cite its sources by id, and to say so explicitly when the context is insufficient — at temperature zero, so it isn't improvising stylistic flourishes either. The second, and the non-negotiable one, is a guardrail that sits after generation and enforces the rule the prompt merely requests: if the answer carries no citation to retrieved context, it does not go to the user. It is replaced with an honest "I don't have enough information to answer that."
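Here is a minimal sketch of the second half, the post-generation guardrail. It assumes citations are written as bracketed ids such as [policy-12]; that format, the id pattern, and the function name are illustrative choices, not something taken from the reference implementation. It is also slightly stricter than the rule as stated above: it refuses when any cited id was not actually retrieved.

```python
import re

REFUSAL = "I don't have enough information to answer that."

# Assumed citation format: a bracketed document id, e.g. [policy-12].
CITATION = re.compile(r"\[([A-Za-z0-9_.-]+)\]")


def enforce_grounding(answer, retrieved_ids):
    """Pass the answer through only if every citation points at retrieved context."""
    cited = set(CITATION.findall(answer))
    if not cited:
        return REFUSAL  # no citation at all: treat as ungrounded
    if not cited <= set(retrieved_ids):
        return REFUSAL  # cites something the retriever never returned
    return answer


retrieved = ["policy-12", "policy-3"]
print(enforce_grounding("The claim is not covered [policy-12].", retrieved))
print(enforce_grounding("The claim is covered.", retrieved))
print(enforce_grounding("The claim is covered [policy-99].", retrieved))
```

The first call prints the answer unchanged; the other two print the refusal. A check like this confirms that a citation exists and is real. It does not confirm that the cited passage supports the claim, which would take a separate verification step.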

The cultural shift hiding inside that mechanism is the real lesson: you have to make "I don't know" a first-class, successful outcome. In a system that touches claims, payments, or patient data, a refusal is not a failure of the product — it is the product working correctly. The most dangerous answer is not the one that says "I'm not sure." It is the confident, well-cited-looking one that is wrong, because that is the one a human will act on.

Autopsy #4 — "It was right at launch and wrong by spring"

Symptom: nothing broke, exactly. Quality just... eroded. The corpus grew, someone tweaked the prompt, the index was rebuilt with a different chunking strategy, and one quiet Tuesday the answers were measurably worse — but nobody noticed until complaints accumulated.

Cause of death: treating search quality as a vibe instead of a number. Almost every RAG system I've reviewed has zero automated measurement of retrieval quality. Teams test that the service returns 200 OK; they do not test that it returns the right documents. So regressions are invisible by construction. You cannot defend a quality you never measured.

The fix: an evaluation harness wired into CI. Build a labeled set of queries paired with their known-relevant document ids, and compute the boring, decades-old information-retrieval metrics on every change: precision@k (the fraction of the top k results that are relevant) and mean reciprocal rank (the average, across queries, of one over the rank of the first relevant result, so it rewards putting a correct answer near the top). Then put that harness in the build and gate merges on it: if a change drops MRR below threshold, the build fails, the same as a broken unit test. This is the move that turns RAG from a demo into an engineered system: search quality becomes a tested contract, not a hope. Every other autopsy in this article is something the eval harness would have caught before a customer did.
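Both metrics fit in a dozen lines. The sketch below is a minimal version with an invented two-query labeled set and a hypothetical threshold of 0.7. In a real harness the retrieved lists would come from running the pipeline, and the threshold would come from your own baseline.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k retrieved ids that are labeled relevant."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def mean_reciprocal_rank(runs):
    """Average over queries of 1 / rank of the first relevant result (0 if none)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Invented labeled set: (retrieved ids in rank order, set of relevant ids).
runs = [
    (["a", "b", "c"], {"a"}),  # first relevant hit at rank 1
    (["x", "y", "z"], {"y"}),  # first relevant hit at rank 2
]

mrr = mean_reciprocal_rank(runs)
print(mrr)  # 0.75

MRR_THRESHOLD = 0.7  # hypothetical gate value
if mrr < MRR_THRESHOLD:
    # A non-zero exit fails the CI job, just like a failing unit test.
    raise SystemExit(f"MRR {mrr:.3f} is below threshold {MRR_THRESHOLD}")
```

Run as a CI step, the script exits non-zero whenever a change pushes MRR under the gate, so the regression blocks the merge instead of reaching a customer.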

The autopsy nobody orders until it's too late: latency and graceful failure

Two more deaths worth pre-empting, because they don't show up in a demo and always show up in production.

The first is cost and latency at real scale. Ten million chunks at 768-dimension float32 embeddings is 10,000,000 × 768 × 4 bytes, or roughly thirty gigabytes of raw vectors before any index overhead. That fits in memory on a big node, but the moment you need high availability and growth you want an HNSW index in a real vector store, where query time grows roughly logarithmically with corpus size and you can hold a sub-twenty-millisecond budget for the search itself, leaving room for the reranker. None of this is exotic, but you have to do the arithmetic before you choose the architecture, not after the p99 alarms fire.
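The arithmetic is worth keeping as a small function you can rerun whenever the corpus or the embedding model changes. This sketch counts raw vector storage only. It ignores index overhead such as the HNSW graph and any metadata, and the replica count of three is an assumed example, not a recommendation.

```python
def vector_bytes(num_chunks, dims, bytes_per_value=4):
    """Raw storage for the embeddings alone (float32 uses 4 bytes per value)."""
    return num_chunks * dims * bytes_per_value


raw_bytes = vector_bytes(10_000_000, 768)
print(raw_bytes / 1e9)  # 30.72 (decimal gigabytes)

# Assumed example: three replicas for high availability.
replicas = 3
print(replicas * raw_bytes / 1e9)  # 92.16
```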

The second is what happens when a dependency dies. A serious pipeline degrades instead of collapsing: if the vector store is down, fall back to BM25-only and still return useful results; if the LLM is unavailable or you've blown the token budget, fall back to a deterministic extractive answer that stitches together the most relevant retrieved sentences with citations. The contract the user sees — a grounded answer with sources, or an honest refusal — holds even as components fail behind it. Resilience here is not redundancy; it is having a worse but still honest answer ready.
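A degradation path like this can be sketched with plain try/except. Everything here is illustrative: the function names, the (doc_id, text) hit shape, and the failing stubs are assumptions made for the example. The extractive fallback also stitches whole passages rather than selected sentences, to keep the sketch short.

```python
REFUSAL = "I don't have enough information to answer that."


def retrieve(query, dense_search, bm25_search):
    """Use both retrievers when possible; fall back to BM25 if dense search fails."""
    lexical = bm25_search(query)
    try:
        dense = dense_search(query)
    except Exception:  # real code should catch the store's specific errors
        return lexical, "bm25-only"
    # Simple order-preserving de-duplication; a real pipeline would fuse by rank.
    return list(dict.fromkeys(dense + lexical)), "hybrid"


def extractive_answer(hits, max_passages=2):
    """Deterministic fallback: quote the top passages with their citations."""
    return " ".join(f"{text} [{doc_id}]" for doc_id, text in hits[:max_passages])


def answer(query, dense_search, bm25_search, generate):
    hits, mode = retrieve(query, dense_search, bm25_search)
    if not hits:
        return REFUSAL, mode
    try:
        return generate(query, hits), mode
    except Exception:  # LLM down or budget exhausted
        return extractive_answer(hits), mode + "+extractive"


# Stubs simulating a dead vector store and a dead LLM.
def dense_down(query):
    raise ConnectionError("vector store unavailable")


def bm25_ok(query):
    return [("policy-12", "Claim CLM-4417-B is not covered.")]


def llm_down(query, hits):
    raise TimeoutError("LLM unavailable")


text, mode = answer("is CLM-4417-B covered", dense_down, bm25_ok, llm_down)
print(mode)  # bm25-only+extractive
print(text)  # Claim CLM-4417-B is not covered. [policy-12]
```

With two dependencies down, the caller still gets a cited answer, plus a mode string that makes the degraded path visible in logs and dashboards.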

What the four autopsies have in common

Read the causes of death back to back and a single pattern emerges. Not one of them is "the language model is bad." Every single failure lived upstream of generation — in how documents were found, how they were ranked, whether the answer was allowed to be ungrounded, and whether anyone was measuring. The model was the last hand to touch the work, so it took the blame for the whole assembly line.

That is the mindset shift worth keeping: a production RAG system is a retrieval and evaluation system that happens to end in a language model, not a language model with some documents bolted on. Get retrieval honest, make grounding mandatory, let the system say "I don't know," and measure quality like you mean it — and the "hallucination problem" quietly stops being one.

I built a complete, runnable reference implementation of everything above — hybrid dense+BM25 retrieval, Reciprocal Rank Fusion, reranking, grounded generation with citations, the groundedness guardrail, a LangGraph agent, and the precision@k / MRR evaluation harness — with a pure-Python core you can run and test with no ML infrastructure at all.

Clone it and run docker compose up: https://github.com/mizbamd/agentic-rag-engine

It's one of five reference implementations in an open Enterprise Platform Reference Architecture covering legacy modernization, production RAG, governed AI agents, MACH pricing, and a streaming lakehouse. I write about building platforms that are not allowed to fail — follow along.

Originally published on Medium.