Diven Rastdus

What Actually Breaks When You Put RAG in Production

Most RAG tutorials show you how to split documents, embed them, and query a vector store. That part works in a weekend. The part that takes weeks is everything that breaks once real users hit it.

I've built RAG systems for code review automation, research synthesis, and data extraction pipelines. Here's what I wish someone had told me before the first deployment.

1. Chunking Strategy Is Your Biggest Lever

The default "split at 500 tokens with 50-token overlap" works for blog posts. It falls apart on structured data.

For code: chunk by function/class boundaries, not token count. A function split across two chunks loses its meaning in both. AST-aware chunking is worth the complexity.
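For Python source, a minimal AST-aware sketch needs only the stdlib `ast` module. This handles top-level definitions only; a production version would also deal with imports, module docstrings, decorators, and nested scopes:

```python
# Chunk Python source by top-level function/class boundaries
# instead of a fixed token count.
import ast


def chunk_by_definitions(source: str) -> list[str]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # end_lineno is populated by the parser on Python 3.8+
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks


code = '''\
def add(a, b):
    return a + b

class Greeter:
    def greet(self):
        return "hi"
'''
chunks = chunk_by_definitions(code)  # one chunk per definition
```

Each chunk keeps a whole definition intact, so retrieval never returns half a function.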

For legal/financial docs: chunk by section headers. A clause that spans two chunks will be retrieved partially, and partial legal text is worse than no text.
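For markdown-style documents, header-based splitting can be sketched in a few lines. This is a simplified illustration; real contracts and filings usually need format-specific parsing before anything like this applies:

```python
# Split a markdown document into one chunk per section,
# keeping each header together with its body.
import re


def chunk_by_headers(doc: str) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    for line in doc.splitlines():
        # A new header flushes the section accumulated so far.
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks


doc = "# Terms\nIntro text.\n## Clause 1\nBody A.\n## Clause 2\nBody B."
sections = chunk_by_headers(doc)  # three sections, each header + body
```

A clause and its header always travel together, so a retrieved chunk is self-describing.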

For conversations/logs: chunk by turn or time window. Overlapping chunks cause duplicate retrieval that confuses the synthesis step.

The pattern: match your chunking to the natural structure of your domain. Generic splitters are fine for prototypes, not production.

2. Retrieval Quality Degrades Silently

The scariest failure mode in RAG is when retrieval returns plausible but wrong chunks. The LLM synthesizes a confident answer from bad context, and nobody notices until a user catches it.

Three things that help:

  • Hybrid retrieval (keyword + semantic). Pure vector search misses exact matches. Pure keyword misses semantic similarity. Use both, re-rank the merged results. BM25 + cosine similarity with reciprocal rank fusion is a solid baseline.

  • Retrieval evaluation as a CI step. Build a set of 50-100 question/expected-chunk pairs. Run them on every index rebuild. If recall drops below your threshold, block the deploy. This catches embedding model changes, chunking regressions, and data quality issues.

  • Source attribution in every response. Not as a feature, as a debugging tool. When a user reports a bad answer, you need to trace it back to the specific chunks that were retrieved. Without attribution, debugging is guesswork.
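Reciprocal rank fusion, mentioned above as a baseline, is short enough to sketch in full. It merges any number of ranked ID lists without needing the underlying scores to be comparable (the constant `k = 60` is the value commonly used in the literature):

```python
# Merge several ranked lists of doc IDs with reciprocal rank fusion.
# Each doc scores sum(1 / (k + rank)) over every list it appears in.
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)


bm25_results = ["d3", "d1", "d7"]   # keyword ranking
dense_results = ["d1", "d5", "d3"]  # vector ranking
fused = reciprocal_rank_fusion([bm25_results, dense_results])
# d1 and d3 rise to the top because both retrievers rank them highly
```

Because RRF only uses ranks, you can fuse BM25 and cosine-similarity results without normalizing their score scales.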

3. Embedding Models Drift Between Versions

When your embedding provider ships a new model version, your existing vectors become incompatible. I learned this the hard way: updated the model, didn't re-embed, and retrieval quality cratered because the new embeddings lived in a different vector space than the old ones.

The fix: version your embedding model alongside your index. When the model changes, rebuild the entire index. Yes, it's expensive. Yes, it's necessary. Budget for it.

If you're running multiple embedding models (e.g., one for code, one for natural language), each gets its own index and its own versioning. Mixing embedding spaces in a single index produces garbage retrieval.
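One cheap guard is to store the embedding model identifier with the index and refuse to serve queries when they disagree. A hypothetical sketch (the `VectorIndex` type and model string are illustrative, not any particular library's API):

```python
# Fail loudly on an embedding-model/index version mismatch
# instead of silently searching the wrong vector space.
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    embedding_model: str  # e.g. "text-embedding-3-small@v1"
    vectors: dict[str, list[float]] = field(default_factory=dict)


def safe_query(index: VectorIndex, query_vec: list[float], query_model: str):
    if query_model != index.embedding_model:
        raise ValueError(
            f"Embedding model mismatch: index built with "
            f"{index.embedding_model!r}, query uses {query_model!r}. "
            "Rebuild the index before querying."
        )
    # ... similarity search would go here ...
    return []


idx = VectorIndex(embedding_model="text-embedding-3-small@v1")
```

Turning the mismatch into a hard error converts a silent quality regression into an incident you can see.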

4. Context Window Management Is an Engineering Problem

"Just stuff everything into the context" stops working when you have 20 retrieved chunks and a 128K context window. You hit three issues:

Cost: 128K input tokens per query adds up. At scale, this is your biggest line item.

Latency: Time-to-first-token scales with input size. Users notice.

Quality: Models perform worse with more context, not better. There's a well-documented "lost in the middle" effect where information in the center of long contexts gets ignored.

What works: retrieve more, then aggressively re-rank and trim. Retrieve 20 chunks, re-rank to the top 5, include only those in context. The re-ranking step (cross-encoder or LLM-based) is cheaper than stuffing everything into the generation call.
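The retrieve-then-trim step can be sketched with a pluggable scorer. Here `score_fn` stands in for the expensive re-ranker (a cross-encoder or an LLM call); the toy term-overlap scorer exists only to make the example runnable:

```python
# Over-retrieve, re-score with a more expensive scorer,
# keep only the top few chunks for the generation call.
from typing import Callable


def rerank_and_trim(
    query: str,
    chunks: list[str],
    score_fn: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    ranked = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:keep]


def overlap(query: str, chunk: str) -> float:
    # Toy scorer: fraction of query terms present in the chunk.
    # A real system would call a cross-encoder here.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / max(len(q), 1)


retrieved = ["pricing for the pro plan", "company history", "pro plan limits"]
top = rerank_and_trim("pro plan pricing", retrieved, overlap, keep=2)
```

The structure is what matters: the generation call only ever sees the trimmed list, so its cost and latency stay bounded regardless of how wide the initial retrieval was.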

5. The Ingestion Pipeline Is Half Your System

In tutorials, data lives in a folder of PDFs. In production, data arrives continuously from APIs, uploads, webhooks, and scheduled pulls. Your ingestion pipeline needs:

  • Deduplication: the same document uploaded twice shouldn't create duplicate chunks.
  • Incremental updates: changing one document shouldn't require re-embedding your entire corpus.
  • Format handling: PDFs, HTML, Markdown, DOCX, and whatever your users throw at you. Each has its own extraction quirks (PDF table extraction alone is a multi-week project).
  • Monitoring: how many documents are in the index, when was the last successful ingest, what's the error rate. If ingestion silently fails, your RAG system serves stale data and nobody knows.
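Deduplication and incremental updates both fall out of one idea: track a content hash per document and only re-embed when it changes. A minimal sketch, with a plain dict standing in for whatever metadata store you actually use:

```python
# Skip re-embedding when a document's content hash is unchanged.
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def ingest(doc_id: str, text: str, index: dict[str, str]) -> bool:
    """`index` maps doc_id -> content hash. Returns True if (re)ingested."""
    h = content_hash(text)
    if index.get(doc_id) == h:
        return False  # unchanged: skip chunking and embedding entirely
    index[doc_id] = h
    # ... chunk + embed + upsert would happen here ...
    return True


store: dict[str, str] = {}
first = ingest("doc1", "hello world", store)    # new doc: ingested
again = ingest("doc1", "hello world", store)    # same content: skipped
changed = ingest("doc1", "hello there", store)  # edited: re-ingested
```

The same hash check handles the duplicate-upload case and the "only this document changed" case with one mechanism.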

6. Evaluation Is Not Optional

"It seems to work" is not a production bar. You need:

  • Retrieval metrics: precision and recall on your test set.
  • Generation metrics: factual accuracy against known-good answers. LLM-as-judge works here if you build a good rubric.
  • End-to-end latency: p50, p95, p99. Users won't wait 30 seconds for an answer.
  • Cost per query: know your unit economics before you scale.
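The retrieval-metrics bullet is the easiest to make concrete. A recall@k check over a small test set is a few lines, and it's exactly the kind of thing to run in CI on every index rebuild (the test cases here are made up for illustration):

```python
# Recall@k over a question/expected-chunk test set, CI-style.
def recall_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    if not expected:
        return 0.0
    return len(set(retrieved[:k]) & expected) / len(expected)


test_set = [
    {"retrieved": ["c1", "c4", "c2"], "expected": {"c1", "c2"}},
    {"retrieved": ["c9", "c3", "c5"], "expected": {"c3", "c8"}},
]
scores = [recall_at_k(t["retrieved"], t["expected"], k=3) for t in test_set]
mean_recall = sum(scores) / len(scores)

THRESHOLD = 0.8
deploy_blocked = mean_recall < THRESHOLD  # gate the deploy on this
```

Wire `deploy_blocked` into a non-zero exit code and the chunking regression that would have shipped silently now fails the build instead.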

Run these on every deploy. Dashboard them. Alert on regressions. Treat your RAG system like any other production service.

The Bottom Line

RAG in production is 20% retrieval science and 80% engineering discipline. The hard parts aren't the algorithms; they're the pipelines, the monitoring, the evaluation, and the debugging. The teams that ship reliable RAG systems treat it as a production service, not a notebook experiment.


I build production AI systems, RAG pipelines, and full-stack applications. If you're working on something similar and want to compare notes, find me at astraedus.dev.
