Anil Prasad

Posted on Jun 12 • Originally published at open.substack.com

How I took a production RAG pipeline from 61% to 97% accuracy (6 stages, full code)

#ai #machinelearning #python #anilprasad

Six months in production on a healthcare RAG system. Four rewrites. Here is the exact pipeline, every stage, and the code. The reference implementation is open source and linked at the bottom.

TL;DR
A weekend tutorial got our retrieval system to 61% accuracy. Six months of production work got it to 97%, under 2 seconds at P99, at $0.08 per query. The gains came from six stages added in order of return, not from a better model. Here is each one with code you can drop into your own pipeline.
If you only have five minutes, here is the whole thing:

Query rewriting turns vague questions into searchable ones. +11 points. Almost free.
Hybrid retrieval runs dense + BM25 and fuses them. +9 points.
Cross-encoder reranking rescores the top candidates properly. +8 points.
Context compression strips irrelevant sentences before generation. +5 points.
Citation guard blocks any claim that is not grounded in a source.
Answer validation routes multi-hop questions to a human instead of guessing.

First, measure where you actually fail
Before writing any code, we instrumented the pipeline and traced every wrong answer to its cause. The result changed our entire roadmap.

64% of failures were retrieval. 23% were chunking. Only 13% were the generator hallucinating from good context. We had spent two months tuning prompts, which was 13% of the problem. Lesson one: measure before you optimize, because your intuition about where RAG breaks is almost always wrong.
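
Here is a minimal sketch of that tally, assuming every wrong answer has already been hand-labeled with a single root cause. The `failure_breakdown` helper and the label names are illustrative, not taken from the reference implementation, and the sample labels are simply sized to mirror the split above.

```python
from collections import Counter

def failure_breakdown(labels: list[str]) -> dict[str, float]:
    """Return each failure cause as a percentage of all labeled failures."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cause: round(100 * n / total, 1) for cause, n in counts.most_common()}

# Hypothetical labels, one per traced wrong answer.
labels = ["retrieval"] * 64 + ["chunking"] * 23 + ["generation"] * 13
print(failure_breakdown(labels))
```

The point of keeping it this simple is that the labeling is the real work. Once each failure has a cause attached, the roadmap falls out of one dictionary.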

Stage 1: query rewriting

The user's raw message is rarely a good search query. "What did it say about the dosage?" has no good match in any index, because the meaning lives in the previous turns. A small 8B model rewrites it into a standalone query first.

REWRITE_SYSTEM = """You rewrite a user's latest message into a single,
standalone search query. Resolve all pronouns and references using the
conversation. Keep it specific. Output only the rewritten query."""

def rewrite_query(history: list[dict], latest: str, llm) -> str:
    # Only the last four turns are sent, to keep the rewrite prompt short.
    convo = "\n".join(f"{m['role']}: {m['content']}" for m in history[-4:])
    prompt = f"{convo}\nuser: {latest}\n\nStandalone search query:"
    out = llm.complete(
        system=REWRITE_SYSTEM, prompt=prompt,
        model="small-8b", max_tokens=64, temperature=0.0,
    ).strip()
    # Fall back to the raw message if the model returns nothing.
    return out or latest

Cost: about $0.0001 per query. Gain: +11 points, from 61% to 72%. This is the highest-return change in the entire pipeline and the one most people skip.

Stage 2: hybrid retrieval

Embedding similarity is great at meaning and weak at exact terms. Two passages can be close in vector space and mean opposite things. Keyword search has the opposite failure mode. So run both and fuse with reciprocal rank fusion, which needs no weight tuning.

from rank_bm25 import BM25Okapi

def hybrid_search(query, dense_index, bm25: BM25Okapi, corpus, k=20):
    # Assumes dense doc_ids are positions in `corpus`, matching BM25's row order.
    dense_hits = dense_index.search(query, k=k)  # [(doc_id, score)]
    # Tokenize the query the same way the BM25 corpus was tokenized.
    bm25_scores = bm25.get_scores(query.split())
    bm25_hits = sorted(enumerate(bm25_scores),
                       key=lambda x: x[1], reverse=True)[:k]

    # Reciprocal rank fusion with 1-based ranks: only ranks matter, not scores.
    fused, C = {}, 60
    for rank, (doc_id, _) in enumerate(dense_hits, start=1):
        fused[doc_id] = fused.get(doc_id, 0) + 1 / (C + rank)
    for rank, (doc_id, _) in enumerate(bm25_hits, start=1):
        fused[doc_id] = fused.get(doc_id, 0) + 1 / (C + rank)

    ranked = sorted(fused.items(), key=lambda x: x[1], reverse=True)
    return [corpus[doc_id] for doc_id, _ in ranked[:k]]

Gain: +9 points, from 72% to 81%. Dense and sparse retrieval are not competitors. Use both.
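
To see why fusion needs no weight tuning, here is reciprocal rank fusion on two toy rankings. It uses only each document's rank in each list, so the dense and BM25 score scales never have to be reconciled. The `rrf` helper and the document IDs are illustrative. Ranks here are 1-based with the commonly used constant of 60.

```python
def rrf(rankings: list[list[str]], c: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (c + rank)."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (c + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

dense = ["a", "b", "c"]   # hypothetical dense ranking, best first
sparse = ["c", "a", "d"]  # hypothetical BM25 ranking, best first
for doc_id, score in rrf([dense, sparse]):
    print(doc_id, round(score, 5))
```

Documents that both retrievers rank highly ("a" and "c") end up ahead of documents that only one retriever found, which is the behavior you want from a hybrid.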

Stage 3: cross-encoder reranking

The retrievers in stage 2 are fast because they score the query and each document independently. A cross-encoder reads them together, which is slower and much more accurate. So you run it only on the top candidates the cheap retrievers already found.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_n=5):
    if not candidates:
        return []
    # Score each (query, passage) pair jointly; higher means more relevant.
    pairs = [(query, c.text) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]

Gain: +8 points, from 81% to 89%. The classic retrieve-then-rerank pattern, and it earns its cost because you only rerank a handful of candidates.

Stage 4: context compression

A retrieved passage can be the right document and still carry sentences that have nothing to do with the question. Each irrelevant sentence is a chance for the model to anchor on the wrong thing. So score sentences against the query and drop the ones that do not earn their place.

def compress_context(query, passages, relevance_model, threshold=0.5):
    kept = []
    for p in passages:
        sentences = split_sentences(p.text)
        scored = relevance_model.score(query, sentences)  # 0..1 per sentence
        relevant = [s for s, sc in zip(sentences, scored) if sc >= threshold]
        # Drop the passage entirely if no sentence clears the threshold.
        if relevant:
            kept.append(p.with_text(" ".join(relevant)))
    return kept

Gain: +5 points, from 89% to 94%. Bonus: it cuts your generation token bill, because you stop paying to send the model context it should ignore.
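
The function above leaves `split_sentences` and the relevance model undefined. Here is a self-contained sketch of the same idea with stand-ins for both: a naive regex splitter and a token-overlap score. Both are assumptions for illustration only. A production system would use a proper sentence segmenter and a trained relevance model, and the passage below is made up.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def overlap_score(query: str, sentence: str) -> float:
    """Stand-in relevance: fraction of query tokens found in the sentence."""
    q = set(re.findall(r"\w+", query.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s) / len(q) if q else 0.0

def compress_text(query: str, text: str, threshold: float = 0.5) -> str:
    """Keep only the sentences whose score clears the threshold."""
    kept = [s for s in split_sentences(text) if overlap_score(query, s) >= threshold]
    return " ".join(kept)

# Made-up passage: one relevant sentence, two distractors.
passage = (
    "The recommended adult dosage is 500 mg twice daily. "
    "The drug was first approved in 1998. "
    "Store tablets below 25 C."
)
print(compress_text("recommended adult dosage", passage))
```

Only the first sentence survives, because it is the only one that shares any tokens with the query. Token overlap is far too crude for real use, but the control flow is the same once you swap in a real scorer.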

The pipeline so far

Stages 1 through 4 took us from 61% to 94%. The last two stages do not chase points. They make the system honest, which in a regulated domain matters more.

Stage 5: citation guard

Before an answer ships, every claim in it has to trace back to a retrieved source. If a sentence has no supporting passage, it does not go out.

def citation_guard(answer_claims, sources, entailment_model, min_support=0.7):
    for claim in answer_claims:
        # Best support across all sources; 0.0 if nothing was retrieved.
        support = max(
            (entailment_model.entails(s.text, claim) for s in sources),
            default=0.0,
        )
        if support < min_support:
            return False, claim  # ungrounded claim, block it
    return True, None
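
To make the guard concrete, here is a runnable sketch of the same check with a stand-in scorer. `OverlapEntailment` is a hypothetical placeholder that measures token overlap. It is not real entailment and would pass contradictions, so treat it only as a way to exercise the control flow. The `Source` class, the `first_ungrounded` helper, and the example texts are likewise assumptions.

```python
import re
from dataclasses import dataclass

@dataclass
class Source:
    text: str

class OverlapEntailment:
    """Stand-in scorer: share of claim tokens found in the premise. NOT real NLI."""

    def entails(self, premise: str, claim: str) -> float:
        p = set(re.findall(r"\w+", premise.lower()))
        c = set(re.findall(r"\w+", claim.lower()))
        return len(p & c) / len(c) if c else 0.0

def first_ungrounded(claims, sources, model, min_support=0.7):
    """Return the first claim no source supports, or None if all are grounded."""
    for claim in claims:
        support = max((model.entails(s.text, claim) for s in sources), default=0.0)
        if support < min_support:
            return claim
    return None

sources = [Source("The trial enrolled 120 patients over six months.")]
claims = ["The trial enrolled 120 patients.", "Side effects were rare."]
print(first_ungrounded(claims, sources, OverlapEntailment()))
```

The first claim is fully covered by the source. The second has no support at all, so it is the one that gets flagged and the answer would be blocked.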

Stage 6: answer validation

Some questions need three or more documents synthesized together. That is where RAG quietly fails by writing a fluent, wrong answer. Detect those and route them to a human.

def validate_answer(query, answer, sources, confidence):
    if confidence < 0.6:
        return route_to_human(query, reason="low confidence")
    if requires_multi_hop(query) and len(sources) < 2:
        return route_to_human(query, reason="insufficient evidence")
    return answer
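
`requires_multi_hop` and `route_to_human` are not defined above. As one possible shape for them, here is a sketch with a keyword heuristic and a plain escalation record. Both are hypothetical stand-ins, and the cue list is an assumption. A real detector would more likely be a small classifier, and a real router would write to a review queue.

```python
# Hypothetical cue phrases that often signal a question spanning several documents.
MULTI_HOP_CUES = ("compare", "versus", " vs ", "both", "difference between", "and then")

def requires_multi_hop(query: str) -> bool:
    """Crude heuristic: flag queries containing a comparison or chaining cue."""
    padded = f" {query.lower()} "
    return any(cue in padded for cue in MULTI_HOP_CUES)

def route_to_human(query: str, reason: str) -> dict:
    """Return an escalation record instead of an answer."""
    return {"status": "escalated", "query": query, "reason": reason}

print(requires_multi_hop("Compare drug A versus drug B"))
print(requires_multi_hop("What is the dosage?"))
print(route_to_human("Compare drug A versus drug B", reason="insufficient evidence"))
```

Returning a structured record rather than a string makes it easy for the caller to tell an escalation apart from a normal answer.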

Together stages 5 and 6 took the production number from 94% to 97%. The real output is not the three points. It is the 3% the system now refuses to answer automatically. Serving an uncertain answer is not honesty. It is a liability.

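And the climb, stage by stage: 61% at baseline, 72% after query rewriting, 81% after hybrid retrieval, 89% after reranking, 94% after compression, and 97% after the two guards.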

How to adopt this
You do not need a six-month rebuild. Add stages in order of return and measure after each one, so you know which change earned which points.

Query rewriting first. A day of work, nearly free to run.
Hybrid retrieval next, because most teams run embeddings only.
Reranking third.
Compression fourth.
Build the guards last, once accuracy is where you want it.
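
Measuring after each stage can be as simple as running the same labeled cases through each pipeline variant and recording the difference. Here is a minimal sketch, assuming exact-match grading and pipelines that are plain callables. The helper names and the toy lookup-table "pipelines" are illustrative, not the benchmark harness from the repository.

```python
from typing import Callable

Pipeline = Callable[[str], str]

def accuracy(pipeline: Pipeline, cases: list[tuple[str, str]]) -> float:
    """Exact-match accuracy; swap in your own grader for real answers."""
    if not cases:
        return 0.0
    return sum(pipeline(q) == expected for q, expected in cases) / len(cases)

def stage_gains(variants: dict[str, Pipeline], cases: list[tuple[str, str]]):
    """Return (name, accuracy, gain over the previous variant) in order."""
    rows, previous = [], None
    for name, pipeline in variants.items():
        acc = accuracy(pipeline, cases)
        rows.append((name, acc, 0.0 if previous is None else acc - previous))
        previous = acc
    return rows

# Toy stand-ins: each "pipeline" is just a lookup table of canned answers.
cases = [("q1", "a"), ("q2", "b"), ("q3", "c"), ("q4", "d")]
baseline = {"q1": "a"}
with_rewrite = {"q1": "a", "q2": "b"}
variants = {
    "baseline": lambda q: baseline.get(q, ""),
    "+ query rewriting": lambda q: with_rewrite.get(q, ""),
}
for name, acc, gain in stage_gains(variants, cases):
    print(f"{name}: {acc:.0%} ({gain:+.0%})")
```

Keep the case set fixed between runs. If the test set changes while you add stages, you can no longer say which change earned which points.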

Run it yourself

The full reference implementation is open source, including every stage above, the benchmark harness that produced these numbers, and a 250-case adversarial test suite that caught the failures we did not anticipate. Clone it and run it today.

github.com/anilatambharii

I write up the production AI work in more depth, with the narrative and the failures, on my newsletter first. If the deep version is useful to you, that is where it lives: anilsprasad.substack.com

If you are running RAG in production, I would like to know one thing in the comments: what does your error breakdown look like? Retrieval, chunking, or generation? I read all of them.

#HumanWritten #ExpertiseFromField
