DEV Community

Aashir As

7 Production RAG Mistakes I Made (And How to Fix Them)

I've shipped RAG systems for healthcare records, financial document search, and real estate databases. The early ones had problems that cost us weeks of debugging in production. Here are the seven mistakes I made most often and what we do now instead.

Want the full production architecture before reading this? Start here: How to Build a Production RAG System. This post is about the failure modes.


Mistake 1: Fixed-Size Chunking

I split every document into 512-token chunks with 50-token overlap. It felt clean and predictable. It broke almost immediately on real documents.

A medical discharge summary doesn't divide neatly at 512 tokens. A section heading on token 511 would be split from its content on token 513. The retriever would pull the heading without the details, or the details without the heading. The LLM got fragments and hallucinated to fill the gaps.

What we do now: Semantic chunking based on document structure. For medical documents, we split at section boundaries (Chief Complaint, History, Medications, etc.) using regex + NLP, not fixed token counts. For PDFs without clear structure, we use recursive character text splitting with overlap tuned per document type, not a global default.
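The regex side of that splitter is simple enough to sketch. The heading list below is hypothetical (the real list is built per document type), but the shape is the point: every chunk keeps its heading attached to its content.

```python
import re

# Hypothetical section headings for a discharge summary; the real list
# is maintained per document type, not hardcoded like this.
SECTION_HEADINGS = [
    "Chief Complaint", "History of Present Illness", "Medications",
    "Allergies", "Discharge Instructions",
]

def chunk_by_sections(text: str) -> list[dict]:
    """Split a document at known section boundaries, not fixed token counts."""
    pattern = re.compile(
        r"^(%s):?\s*$" % "|".join(re.escape(h) for h in SECTION_HEADINGS),
        re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        start = m.start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append({
            "section": m.group(1),
            # The heading stays inside the chunk with its own content.
            "text": text[start:end].strip(),
        })
    return chunks
```

A chunk produced this way can never be an orphaned heading or a context-free block of details, because the boundary is the section, not the token count.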

If your chunks have orphaned headings or context-free numbers, your chunking strategy is wrong. Audit 20 random chunks before moving on.


Mistake 2: Vector Search Only

My first instinct was: embed everything, use cosine similarity, done. It worked on my test set. It failed on production queries.

A radiologist asked: "Show me all reports mentioning 'ground glass opacity' from Q3." Semantic search retrieved reports about lung pathology in general. The specific phrase wasn't matched because its vector representation was similar to dozens of other lung terms. Keyword search would've nailed it instantly.

What we do now: Hybrid retrieval. BM25 for exact term matching, vector search for semantic similarity, then Reciprocal Rank Fusion to merge the results. For our RadShifts deployment, hybrid retrieval cut false negative rate by 34% compared to vector-only. That's not a small number when the output is a clinical document.
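The fusion step is small enough to show in full. This is a generic RRF merge over ranked lists of document IDs, not our production code; the `k=60` constant comes from the original RRF paper.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists (e.g. one from BM25, one from vector search).

    Each input is a list of doc IDs, best first. A document's fused score
    is the sum of 1 / (k + rank) across every list it appears in, so a
    document ranked well by both retrievers beats one ranked well by either.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, you never have to normalize BM25 scores against cosine similarities, which is the usual failure point when people try to merge the two by hand.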


Mistake 3: No Retrieval Validation Before Generation

Here's a failure mode I didn't anticipate: the retriever returns results, but they're all irrelevant. The LLM doesn't know that. It generates a confident, detailed, completely wrong answer from the retrieved garbage.

On one healthcare client, a query about "aspirin contraindications" was returning cardiology records about aspirin therapy. Technically related. Completely wrong context. The LLM synthesized a response that mixed treatment instructions with contraindications. Nobody caught it for two weeks.

What we do now: A retrieval validation layer between the retriever and the generator. We score each retrieved chunk against the query (cross-encoder reranking), set a minimum relevance threshold, and route low-confidence retrievals to a fallback. The fallback either returns a "not found" response or escalates to a human reviewer. The LLM never sees irrelevant context.
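A sketch of the validation gate. Here `rerank_score` stands in for whatever cross-encoder you call (e.g. a sentence-transformers `CrossEncoder`), and the 0.5 threshold is illustrative; tune it on your own query set.

```python
RELEVANCE_THRESHOLD = 0.5  # illustrative; calibrate per deployment

def validate_retrieval(query, chunks, rerank_score, threshold=RELEVANCE_THRESHOLD):
    """Keep only chunks the reranker scores above threshold.

    If nothing survives, signal the fallback path ("not found" response
    or human review) instead of handing the LLM irrelevant context.
    """
    scored = [(chunk, rerank_score(query, chunk)) for chunk in chunks]
    relevant = [chunk for chunk, score in scored if score >= threshold]
    if not relevant:
        return {"status": "fallback", "chunks": []}
    return {"status": "ok", "chunks": relevant}
```

The key property: the generator is only ever invoked on the `"ok"` branch, so "retriever returned garbage" becomes a visible routing decision instead of a confident wrong answer.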


Mistake 4: Embedding Drift

Six months after launch, our semantic search accuracy dropped noticeably. The documents hadn't changed. The queries hadn't changed. What changed: the embedding model provider had pushed an update.

The new model produced slightly different vector representations for the same text. Our indexed embeddings from six months ago were now in a different vector space than fresh query embeddings. Retrieval quality degraded because we were comparing apples to oranges at the vector level.

What we do now: We pin embedding model versions by name and hash. When a new model version is available, we re-embed the entire corpus before switching. We also track embedding drift as a metric in production: periodically re-embed a sample of documents and compare cosine similarity against the original embeddings. If drift exceeds a threshold, we trigger a full re-index.
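The drift check itself is cheap. A sketch, assuming you keep the original embeddings for a fixed sample of documents and periodically re-embed the same texts; the 0.98 cutoff is illustrative.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def drift_exceeded(baseline, fresh, threshold=0.98):
    """Compare stored sample embeddings against fresh re-embeddings of the same text.

    Returns True when the mean cosine similarity falls below the threshold,
    which is the signal to re-embed the full corpus before retrieval
    quality silently degrades.
    """
    sims = [cosine_similarity(old, new) for old, new in zip(baseline, fresh)]
    return sum(sims) / len(sims) < threshold
```

Run it on a schedule and alert on it; drift that only shows up as "search feels worse" six months in is exactly what this catches.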


Mistake 5: No Document Versioning

Legal documents change. Medical protocols get updated. Product specifications get revised. Our RAG systems didn't know that.

We'd index version 1 of a contract. The client would update the contract. We'd index version 2. Now the corpus contained contradictory information from the same document. The LLM would retrieve both versions and synthesize an answer that was partly right and partly wrong, with no way for the user to know which version was used.

What we do now: Document versioning as a first-class citizen. Every document gets a version field and a validity period. When a new version is ingested, the old version is either archived (removed from active retrieval) or marked as superseded. Queries against an archived document return the new version with a note that the source was updated.


Mistake 6: Treating Permissions As an Afterthought

This one almost cost us a client relationship. I won't name them, but the domain was financial services.

Our RAG system was deployed across multiple user roles. A junior analyst ran a query. The retriever pulled a highly relevant document. The document was a board-level financial projection not cleared for analyst access. We'd implemented permissions at the UI layer. The retrieval layer didn't know about permissions at all.

What we do now: Permissions applied at query time, at the retrieval layer, not at display time. Every document in the index carries metadata about who can access it. Every query is tagged with the requesting user's role and permissions. The retriever filters candidates by permission before ranking. The LLM never sees a document the user isn't authorized to read.
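The filter itself is trivial, which is the point: it belongs in the retrieval path, not the UI. A sketch assuming each indexed document carries an `allowed_roles` set in its metadata (the field name is ours).

```python
def filter_by_permission(candidates: list[dict], user_roles: list[str]) -> list[dict]:
    """Drop documents the requesting user cannot access, before ranking runs.

    A candidate survives only if its allowed_roles set intersects the
    user's roles, so unauthorized documents never reach the ranker or the LLM.
    """
    roles = set(user_roles)
    return [doc for doc in candidates if doc["allowed_roles"] & roles]
```

Because this runs before ranking, a permission bug fails closed (the document is simply absent from results) rather than leaking a board-level file into an analyst's answer.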


Mistake 7: Flying Blind in Production

We shipped. We got positive feedback. We assumed the system was working.

Three months in, a client mentioned that questions about a specific topic were getting vague answers. We looked at the logs. Query-to-retrieval latency had doubled. Retrieval quality had degraded across a whole document category. We hadn't noticed any of it because we weren't tracking it.

What we do now: Four metrics in production for every RAG system:

  • Retrieval precision: Are the top-k results actually relevant to the query?
  • Answer groundedness: Is the generated answer supported by the retrieved context? (Use an LLM-as-judge to score this)
  • Query latency (p95): Not average. p95. Averages hide the tail.
  • Retrieval gap rate: What percentage of queries return no results or low-confidence results?

Set up alerts on all four before launch. Not after.
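Two of those four can be computed straight from a query log. A sketch with illustrative field names and a 0.5 low-confidence cutoff; the nearest-rank p95 shows why averages hide the tail.

```python
import math

def p95_latency(latencies_ms: list[float]) -> float:
    """Nearest-rank p95: the latency 95% of queries beat. Alert on this, not the mean."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]

def retrieval_gap_rate(query_log: list[dict]) -> float:
    """Fraction of queries returning no results or only low-confidence results.

    Assumes each log entry has a `results` list and, when results exist,
    a `top_score` from the reranker; both field names are illustrative.
    """
    gaps = sum(
        1 for q in query_log
        if not q["results"] or q.get("top_score", 0.0) < 0.5
    )
    return gaps / len(query_log)
```

Retrieval precision and answer groundedness need labeled samples or an LLM-as-judge, so they run as offline jobs; these two are cheap enough to compute on every request and alert on immediately.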


Summary Table

| Mistake | Symptom | Fix |
| --- | --- | --- |
| Fixed-size chunking | Fragmented, context-free retrievals | Semantic chunking by document structure |
| Vector search only | Exact terms missed | Hybrid BM25 + vector + RRF |
| No retrieval validation | Confident wrong answers | Cross-encoder reranking + minimum relevance threshold |
| Embedding drift | Silent accuracy degradation | Pin model versions, monitor vector drift |
| No document versioning | Contradictory answers | Version field + validity period + archive on update |
| Permissions afterthought | Unauthorized content retrieved | Filter by permissions at retrieval layer, not display layer |
| No production monitoring | Problems found by users, not by you | 4 core metrics: precision, groundedness, latency, gap rate |

One More Thing

These are mistakes I made on real systems. Most RAG tutorials show you a working demo. They don't show you what breaks at 3am six months after launch.

If you're building a RAG system for production, the architecture decisions that matter most aren't which vector database to use. They're the ones above.

Full production RAG architecture guide, including chunking strategy, index design, and the seven failure points in order of impact: Building a Production RAG System


Aashir Tariq is the CEO of Afnexis, an AI development company that's shipped 50+ production AI systems for healthcare, fintech, and real estate clients.
