Bikash Das

Posted on • Originally published at dasbikash.substack.com

RAG in production is nothing like the tutorials

Every RAG tutorial follows the same script. Take some documents, split them into chunks, generate embeddings, store them in a vector database, and retrieve the top-K results when a user asks a question. Congratulations — you have a demo.

Then you deploy it to real users, and everything falls apart.

I have been building and iterating on a production RAG pipeline for months. Nearly every assumption from those tutorials turned out to be wrong at scale. This article is about what actually works — and what the tutorials leave out.

A typical tutorial RAG pipeline

The standard architecture looks something like this: a user asks a question, you embed that question, run a similarity search against your vector database, grab the top 5 results, and pass them to an LLM as context. Simple, elegant, and dangerously incomplete.

Here is what this approach gets wrong:

Not every question needs retrieval. When a user says "thank you" or "what is machine learning?", searching your knowledge base is wasted compute and often produces confusing results. Your pipeline needs to decide whether retrieval is even necessary before it goes looking.

Vector similarity alone is not enough. Semantic search is powerful for conceptual queries ("how does your return policy work?"), but it falls flat for exact matches. When someone asks for "invoice INV-2024-0847" or a specific product code, you need keyword matching. Vector search will give you semantically similar but factually wrong results.

One-shot retrieval breaks on complex questions. "Compare the pricing of Plan A and Plan B, and explain which one is better for a team of 10" requires information from multiple places. A single retrieval pass will grab chunks about one plan or the other, but rarely both.

Users cannot see retrieval confidence. The LLM will confidently present whatever context it receives, even if the retrieved chunks are barely relevant. Without an explicit confidence signal, users cannot distinguish grounded answers from educated guesses.

What a production pipeline actually looks like

After months of iteration, here is the seven-step architecture that survives contact with real users. It is not elegant in the tutorial sense — it is effective in the production sense.

Step 1: Decide before you search

Every query should first pass through a single LLM call that makes multiple decisions simultaneously: Does this query need retrieval at all? What type of query is it? Should you use hybrid search? Does it need decomposition into sub-queries? Should you rerank the results? And if decomposition is needed, what are the sub-queries?

The naive approach is running separate classifiers for each decision. That costs you multiple sequential LLM calls and adds hundreds of milliseconds of latency. A single well-prompted call can make all six decisions in one pass. The latency savings compound fast.

The query analyzer should also classify the search weight — how much to favor vector search versus keyword search. A conceptual query like "how does billing work?" gets high vector weight (alpha 0.7). A product code lookup gets high keyword weight (alpha 0.3). Legal citation searches go even lower (alpha 0.1). This should not be a static setting — it needs to adapt per query based on the detected query type.
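The single-call analyzer can be sketched as a JSON contract between your prompt and your pipeline. The schema, prompt text, and defaults below are illustrative assumptions, not the exact implementation; the important parts are that one LLM reply carries all six decisions and that malformed output degrades to a safe "retrieve everything" plan.

```python
import json
from dataclasses import dataclass, field

# Illustrative prompt: one analyzer call returns every routing decision at once.
ANALYZER_SYSTEM_PROMPT = (
    "Analyze the user query and reply with JSON containing: "
    "needs_retrieval (bool), query_type (string), use_hybrid (bool), "
    "alpha (0-1, weight on vector search), rerank (bool), "
    "sub_queries (list of strings, empty if no decomposition is needed)."
)

@dataclass
class QueryPlan:
    needs_retrieval: bool
    query_type: str
    use_hybrid: bool
    alpha: float            # weight on vector search vs. keyword search
    rerank: bool
    sub_queries: list = field(default_factory=list)

def parse_plan(raw_llm_output: str) -> QueryPlan:
    """Parse the analyzer's JSON reply, falling back to safe defaults."""
    try:
        d = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        # Malformed reply: do full hybrid retrieval rather than skip it.
        return QueryPlan(True, "unknown", True, 0.5, True)
    return QueryPlan(
        needs_retrieval=bool(d.get("needs_retrieval", True)),
        query_type=str(d.get("query_type", "unknown")),
        use_hybrid=bool(d.get("use_hybrid", True)),
        alpha=float(d.get("alpha", 0.5)),
        rerank=bool(d.get("rerank", True)),
        sub_queries=list(d.get("sub_queries", [])),
    )

# What the analyzer might return for a conceptual billing question.
plan = parse_plan('{"needs_retrieval": true, "query_type": "conceptual", '
                  '"use_hybrid": true, "alpha": 0.7, "rerank": true, "sub_queries": []}')
```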

Step 2: Hybrid search — two engines, one result

Run vector search and BM25 keyword search in parallel. Both return candidates, and you combine them using a weighted blend.

The combination is straightforward: normalize both result sets to a 0-1 scale, then blend using the alpha your query analyzer selected. A query about "how refunds work" might use 70% vector similarity and 30% keyword matching. A query for "clause 14.2(b)" flips that ratio.

The key design principle: graceful degradation. If vector search fails, fall back to BM25 only. If BM25 fails, vector only. The system should never return nothing because one component had a bad day.

Also oversample — fetch 5x the target number of results from each engine before fusion. This ensures enough candidates survive the blending process to fill the final result set with genuinely relevant chunks.
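The fusion step above is small enough to show in full. This is a minimal sketch assuming each engine returns `(chunk_id, raw_score)` pairs; the normalization is plain min-max, and the degradation rule shifts all weight to whichever engine actually returned results.

```python
def minmax(scores):
    """Normalize a list of raw scores to the 0-1 range."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(vector_hits, bm25_hits, alpha, k):
    """Blend two result sets: alpha weights vector similarity, 1-alpha keyword score.

    Each hit is a (chunk_id, raw_score) pair. Graceful degradation: if either
    engine returned nothing, the other side carries the full weight.
    """
    if not vector_hits:
        alpha = 0.0
    if not bm25_hits:
        alpha = 1.0
    v_norm = dict(zip([cid for cid, _ in vector_hits],
                      minmax([s for _, s in vector_hits])))
    b_norm = dict(zip([cid for cid, _ in bm25_hits],
                      minmax([s for _, s in bm25_hits])))
    blended = {}
    for cid in set(v_norm) | set(b_norm):
        blended[cid] = alpha * v_norm.get(cid, 0.0) + (1 - alpha) * b_norm.get(cid, 0.0)
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Oversampling in practice: fetch 5*k from each engine, then fuse down to k.
top = fuse([("a", 0.9), ("b", 0.5)], [("b", 12.0), ("c", 3.0)], alpha=0.7, k=2)
```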

In practice, I run Pinecone for vector search and an in-memory BM25 index over MongoDB. But the pattern works with any combination — Weaviate, Qdrant, Elasticsearch, whatever your stack uses.

Step 3: Reranking — the quality filter

After hybrid search produces candidates, run them through a cross-attention reranker (Cohere, Jina, or similar). This is fundamentally different from the initial retrieval — it reads each candidate alongside the original query and scores true relevance, not just surface similarity.

Rerank the top 20 candidates. With Cohere, the cost is roughly $0.001 per query for 20 documents — negligible compared to the quality improvement. The reranker catches cases where a chunk is semantically similar but factually irrelevant, something pure vector search misses constantly.
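A reranking stage can be wrapped so the scoring provider is swappable and failures degrade gracefully. In this sketch, `score_fn` is a stand-in for whatever cross-encoder you call (Cohere's or Jina's hosted rerank endpoint in production); the toy term-overlap scorer below exists only so the example runs standalone.

```python
def rerank(query, candidates, score_fn, top_n=5, max_candidates=20):
    """Rerank candidates with a cross-encoder scoring function.

    score_fn(query, docs) returns one relevance score per doc; in production
    it would wrap a hosted reranker. Graceful degradation: on any failure,
    keep the hybrid-search order instead of returning nothing.
    """
    pool = candidates[:max_candidates]   # cap cost: score at most 20 docs
    try:
        scores = score_fn(query, pool)
    except Exception:
        return pool[:top_n]
    ranked = sorted(zip(pool, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# Toy stand-in scorer: "relevance" here is just term overlap with the query.
def overlap_score(query, docs):
    q_terms = set(query.lower().split())
    return [len(q_terms & set(d.lower().split())) for d in docs]

best = rerank("refund policy",
              ["shipping times", "our refund policy explained"],
              overlap_score, top_n=1)
```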

Step 4: Document processing strategies — not all documents are equal

Here is something tutorials never mention: how you process documents before they enter the pipeline matters as much as how you search them.

There are three strategies worth implementing:

Semantic chunking is the baseline: standard chunking with embeddings. It is fast, cheap, and good for straightforward documents like FAQs and simple articles.

Contextual enrichment adds an LLM step before embedding. Prepend a generated context paragraph to each chunk — explaining where this chunk fits in the overall document. This dramatically improves retrieval for chunks that are meaningless in isolation. Think of a table row that says "Q3: $2.4M" — without context, the embedding captures almost nothing. With a prepended paragraph explaining "This is the revenue table from the 2024 annual report, Q3 column," the embedding suddenly captures the right meaning.

Hierarchical chunking uses multi-level chunking — document, section, paragraph. The parent-child relationships enable automatic sibling merging during retrieval. If two chunks from the same section both score highly, merge them with their parent context. This recovers information that strict chunk boundaries would otherwise split.

The practical implication: a simple FAQ works fine with semantic chunking. A 200-page operations manual benefits from contextual enrichment — otherwise "see section 4.2 above" chunks are useless without context. Match the processing strategy to the document complexity.
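The sibling-merge step from hierarchical chunking is easy to get concrete about. This sketch assumes each retrieved hit is a dict with `parent`, `text`, and `score` keys (an assumed shape, not a library API): high-scoring chunks sharing a parent section get concatenated so the LLM sees the section as one coherent passage.

```python
from collections import defaultdict

def merge_siblings(hits, min_score=0.5):
    """Merge retrieved chunks that share a parent section.

    When two or more chunks above min_score come from the same section,
    concatenate their text; this recovers information that strict chunk
    boundaries would otherwise split.
    """
    by_parent = defaultdict(list)
    for hit in hits:
        if hit["score"] >= min_score:
            by_parent[hit["parent"]].append(hit)
    merged = []
    for parent, group in by_parent.items():
        merged.append({
            "parent": parent,
            "text": "\n".join(g["text"] for g in group),
            "score": max(g["score"] for g in group),  # keep the group's best score
        })
    return sorted(merged, key=lambda m: m["score"], reverse=True)

merged = merge_siblings([
    {"parent": "sec-4", "text": "Refunds take 5 days.", "score": 0.9},
    {"parent": "sec-4", "text": "Refunds go to the original card.", "score": 0.8},
    {"parent": "sec-2", "text": "Shipping is free.", "score": 0.6},
    {"parent": "sec-1", "text": "Welcome.", "score": 0.3},  # below threshold, dropped
])
```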

Step 5: Query decomposition for complex questions

When the query analyzer detects a multi-part question, split it into sub-queries. "Compare Plan A and Plan B for a team of 10" becomes three searches: pricing for Plan A, pricing for Plan B, and team size limits.

Sub-queries should execute in parallel — embeddings generated simultaneously, searches run simultaneously. Merge results with deduplication, ensuring the same chunk does not appear twice from different sub-queries.

Cap the number of sub-queries (3-6 depending on your tier or plan). Unbounded decomposition on adversarial inputs can get expensive fast.
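Parallel execution with a cap and deduplication fits in a few lines of asyncio. The `search` callable and the hit shape (a dict with an `id` key) are assumptions for the sketch; the stub search exists only so the example runs standalone.

```python
import asyncio

async def run_subqueries(sub_queries, search, max_subqueries=4):
    """Execute sub-query searches concurrently and merge with deduplication.

    search(q) is an async callable returning a list of hit dicts with an
    'id' key. The cap guards against unbounded decomposition on adversarial
    inputs; dedup keeps a chunk once even if several sub-queries surface it.
    """
    tasks = [search(q) for q in sub_queries[:max_subqueries]]
    results = await asyncio.gather(*tasks)
    seen, merged = set(), []
    for hits in results:
        for hit in hits:
            if hit["id"] not in seen:
                seen.add(hit["id"])
                merged.append(hit)
    return merged

# Stub search so the sketch runs standalone: every sub-query also returns
# one "shared" chunk, which dedup should keep exactly once.
async def fake_search(q):
    return [{"id": f"{q}-1"}, {"id": "shared"}]

merged = asyncio.run(run_subqueries(["plan a pricing", "plan b pricing"], fake_search))
```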

Step 6: Two-tier caching — because RAG is expensive

Every query should check two caches before doing any real work.

Exact-match cache: MD5 hash of the query text. Same question, same answer, zero cost. Set TTL to about one hour — long enough to catch repeated questions, short enough to reflect knowledge base updates.

Semantic cache: Compare the query embedding against cached embeddings using cosine similarity. If a cached query is 92% or more similar, return the cached result. This catches rephrased versions of the same question ("how do I get a refund?" and "what is your refund process?") without re-running the entire pipeline.

In my pipeline, these two caches combined eliminate about 30-40% of query executions. That is a direct cost savings — fewer LLM calls, fewer vector searches, faster response times for common questions.
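Both cache tiers can live behind one lookup. This is a minimal in-memory sketch (a production version would sit in Redis or similar, and the embeddings would come from the same model used for retrieval); the 1-hour TTL and 0.92 similarity threshold are the values from the text.

```python
import hashlib
import math
import time

class TwoTierCache:
    """Exact-match (MD5 + TTL) cache backed by a semantic (cosine) cache."""

    def __init__(self, ttl_seconds=3600, similarity_threshold=0.92):
        self.ttl = ttl_seconds
        self.threshold = similarity_threshold
        self.entries = {}   # md5 of query -> (timestamp, embedding, answer)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def put(self, query, embedding, answer):
        key = hashlib.md5(query.encode()).hexdigest()
        self.entries[key] = (time.time(), embedding, answer)

    def get(self, query, embedding):
        now = time.time()
        # Tier 1: exact match on the query text, zero extra cost.
        key = hashlib.md5(query.encode()).hexdigest()
        hit = self.entries.get(key)
        if hit and now - hit[0] <= self.ttl:
            return hit[2]
        # Tier 2: semantic match on the query embedding (catches rephrasings).
        for ts, emb, answer in self.entries.values():
            if now - ts <= self.ttl and self._cosine(embedding, emb) >= self.threshold:
                return answer
        return None

cache = TwoTierCache()
cache.put("how do I get a refund?", [1.0, 0.0], "Refunds take 5 days.")
```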

Step 7: Confidence scoring — honesty as a feature

Every response should include a retrieval confidence rating: HIGH, MEDIUM, or LOW. Calculate this from the actual search scores, not from the LLM's self-assessment (which is notoriously unreliable).

HIGH means you found highly relevant chunks with strong similarity scores — the answer is well-grounded. MEDIUM means somewhat relevant content but the match is not definitive. LOW means barely anything relevant was found — the LLM is largely working from general knowledge.
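The mapping from search scores to a label is a simple threshold function. The thresholds below are illustrative assumptions, not values from the text, and should be tuned against your own retrieval scores (reranker relevance scores normalized to 0-1 work well as input).

```python
def retrieval_confidence(scores, high_threshold=0.8, medium_threshold=0.5):
    """Map retrieval scores to a HIGH/MEDIUM/LOW confidence label.

    Derived from actual search scores, never from the LLM's self-assessment.
    Thresholds are illustrative and should be tuned on your own data.
    """
    if not scores:
        return "LOW"        # nothing relevant found at all
    top = max(scores)
    if top >= high_threshold:
        return "HIGH"       # answer is well-grounded in retrieved chunks
    if top >= medium_threshold:
        return "MEDIUM"     # somewhat relevant, but not definitive
    return "LOW"            # the LLM is largely on general knowledge
```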

Surface this signal directly to users. When confidence is LOW, the system should say so. The alternative — presenting uncertain answers with false confidence — destroys trust permanently. An honest "I'm not sure about this" preserves it.

What I would tell you if you are building this

Do not start with the perfect pipeline. Start with basic vector search, then add hybrid search when you see keyword queries failing. Add reranking when you notice relevant chunks ranking below irrelevant ones. Add contextual enrichment when documents with internal references produce bad results.

Every layer solves a real problem you will encounter. But you will not encounter all of them on day one, and building them all upfront means building things you do not understand yet.

The unsexy truth about production RAG is that it is not one breakthrough — it is twenty small improvements that compound. Each one adds maybe 5-10% to retrieval quality. Stack enough of them and the system goes from "sometimes useful" to "actually reliable."

That reliability is what turns a demo into a product people depend on.


I built a production RAG pipeline with all seven of these layers for Cuneiform Chat, an AI agent platform.
If you are building something similar and want to compare notes, I would love to hear from you — drop a comment below.

What surprised you most about deploying RAG in production? Did you hit the same walls, or different ones?
