Most RAG tutorials get you to a prototype in 30 minutes. Most production RAG systems fail in ways those tutorials never prepare you for. After building several RAG pipelines, I've collected the real problems and the fixes that actually work.
## The demo problem
The basic RAG loop looks simple:
- Chunk documents → embed chunks → store in vector DB
- At query time: embed query → find similar chunks → stuff into prompt
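The whole loop fits in a few lines. Here is a toy sketch using a bag-of-words counter as a stand-in for a real embedding model and a list as a stand-in for a vector DB (every name and the sample chunks are illustrative):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words term counts
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Index time: chunk -> vector
chunks = [
    "go to settings then billing then cancel subscription",
    "plans include monthly and annual billing options",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Query time: embed the query, rank chunks by similarity
    query_vec = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Stuff the best match into the prompt
context = retrieve("how do I cancel my subscription")[0]
prompt = f"Context:\n{context}\n\nQuestion: how do I cancel my subscription"
```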
This works great on the demo dataset. It fails in production because:
- Chunk boundaries cut context in half
- Retrieval returns semantically similar but contextually wrong chunks
- The LLM hallucinates when retrieved context is insufficient
- Performance degrades as the knowledge base grows
## Problem 1: Naive chunking destroys context
The default `CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)` pattern blindly splits on character count. It will cut a code example in half, split a numbered list between items 3 and 4, and separate a table header from its rows.
### Better: structure-aware chunking
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=[
        "\n## ",   # Split on H2 headers first
        "\n### ",  # Then H3
        "\n\n",    # Then paragraphs
        "\n",      # Then lines
        ". ",      # Then sentences
        " ",       # Last resort: spaces
    ],
    length_function=len,
)
```
### Better: preserve document structure as metadata
```python
def chunk_with_metadata(document: str, source: str) -> list[dict]:
    chunks = splitter.split_text(document)
    return [
        {
            "content": chunk,
            "metadata": {
                "source": source,
                "chunk_index": i,
                "total_chunks": len(chunks),
                # extract_current_section: helper that maps a chunk
                # back to the heading it falls under
                "section": extract_current_section(chunk, document),
            },
        }
        for i, chunk in enumerate(chunks)
    ]
```
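`extract_current_section` is left undefined above. A minimal sketch, assuming markdown-style `#` headings; the 50-character prefix lookup is a shortcut that can misfire when chunk overlap makes prefixes ambiguous:

```python
def extract_current_section(chunk: str, document: str) -> str:
    # Locate the chunk in the source, then walk backwards to the
    # nearest markdown heading above it
    position = document.find(chunk[:50])
    if position == -1:
        return ""
    for line in reversed(document[:position].splitlines()):
        if line.startswith("#"):
            return line.lstrip("#").strip()
    return ""
```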
## Problem 2: Vector similarity isn't enough
Pure cosine similarity retrieval has a well-known failure mode: it finds chunks that are topically similar but not the ones that answer the question.
```
Query:            "How do I cancel my subscription?"
Top vector match: "Subscription plans include monthly and annual billing options"
Actual answer:    "To cancel, go to Settings → Billing → Cancel subscription"
```
The relevant chunk scores lower because it uses different vocabulary.
### Fix: hybrid retrieval (vector + BM25)
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# docs: the list of Document objects from your chunking step
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 4

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Ensemble combines rankings with reciprocal rank fusion
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6],  # Tune these per your domain
)
```
BM25 finds exact keyword matches. Vector search finds semantic matches. The ensemble finds both.
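The fusion step itself is simple enough to own. A sketch of reciprocal rank fusion outside LangChain, in case you need it elsewhere in the pipeline (the function name and the conventional `k=60` smoothing constant are my choices, not a library API):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is an ordered list of chunk ids from one retriever.
    # A chunk's fused score is the sum of 1/(k + rank) across rankings,
    # so items ranked highly by several retrievers float to the top.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```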
### Fix: re-ranking
After retrieval, re-rank chunks with a cross-encoder:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)
    # Sort by score only; sorting the raw tuples would fall back to
    # comparing chunk text whenever two scores tie
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```
This is cheap (small model, fast inference) and dramatically improves precision.
## Problem 3: No feedback loop
Your RAG system is blind without measurement. At minimum, track:
```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RAGTrace:
    query: str
    retrieved_chunks: list[str]
    retrieval_scores: list[float]
    response: str
    latency_ms: int
    user_feedback: Optional[bool] = None  # thumbs up/down
    timestamp: float = field(default_factory=time.time)
```
Then analyze:
- Average retrieval score for queries that got negative feedback
- Chunks retrieved often but never producing positive feedback (stale or wrong)
- Query types that consistently fail
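A sketch of the first analysis, with `RAGTrace` redefined so the snippet runs standalone. A low value here points at retrieval, not generation, as the weak link:

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RAGTrace:
    query: str
    retrieved_chunks: list[str]
    retrieval_scores: list[float]
    response: str
    latency_ms: int
    user_feedback: Optional[bool] = None
    timestamp: float = field(default_factory=time.time)

def avg_score_on_negative_feedback(traces: list[RAGTrace]) -> float:
    # Pool the retrieval scores of every thumbs-down query
    scores = [score
              for trace in traces if trace.user_feedback is False
              for score in trace.retrieval_scores]
    return sum(scores) / len(scores) if scores else 0.0
```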
## Problem 4: Embedding model drift
If you update your embedding model, old vectors are incompatible. Track the model used per chunk:
```json
{
  "content": "...",
  "embedding": [...],
  "metadata": {
    "embedding_model": "text-embedding-3-small",
    "embedding_model_version": "1",
    "indexed_at": "2026-04-07T18:00:00Z"
  }
}
```
On model upgrade, re-embed in a new namespace and run both in parallel until confidence builds.
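A sketch of the parallel-namespace idea, using plain dicts in place of a real vector store; the model names and function names are illustrative:

```python
# Two namespaces live side by side during migration
namespaces: dict[str, list[dict]] = {
    "text-embedding-3-small": [],  # old index, still serving traffic
    "text-embedding-3-large": [],  # new index, being backfilled
}

def reembed(chunks: list[dict], new_model: str, embed_fn) -> None:
    # Backfill: write re-embedded chunks into the new namespace only
    for chunk in chunks:
        namespaces[new_model].append({
            "content": chunk["content"],
            "embedding": embed_fn(chunk["content"]),
            "metadata": {**chunk.get("metadata", {}),
                         "embedding_model": new_model},
        })

def active_index(cutover_done: bool) -> list[dict]:
    # Flip reads to the new namespace only after evals confirm parity
    return namespaces["text-embedding-3-large" if cutover_done
                      else "text-embedding-3-small"]
```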
## Problem 5: Context stuffing without ordering
Naively concatenating top-k chunks fails when chunks contradict each other or are from different sections. Use map-reduce for large retrievals:
```python
# Map: extract relevant info from each chunk independently
# Reduce: synthesize the extracted pieces into a coherent answer

map_prompt = """Given this excerpt, extract information relevant to: {question}

Excerpt: {docs}

Relevant info (or "NOT RELEVANT"):"""

reduce_prompt = """Answer this question: {question}

Relevant excerpts:
{doc_summaries}

Comprehensive answer (say "insufficient information" if needed):"""
```
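Wiring the two prompts together is one loop plus a final call. A sketch with `llm` as any prompt-to-text callable (swap in your model client); the prompts are repeated so the snippet is self-contained:

```python
MAP_PROMPT = """Given this excerpt, extract information relevant to: {question}

Excerpt: {docs}

Relevant info (or "NOT RELEVANT"):"""

REDUCE_PROMPT = """Answer this question: {question}

Relevant excerpts:
{doc_summaries}

Comprehensive answer (say "insufficient information" if needed):"""

def map_reduce_answer(question: str, chunks: list[str], llm) -> str:
    # Map: one independent extraction call per chunk
    extracted = []
    for chunk in chunks:
        info = llm(MAP_PROMPT.format(question=question, docs=chunk))
        if info.strip().upper() != "NOT RELEVANT":
            extracted.append(info)
    # Reduce: one synthesis call over the surviving extractions
    return llm(REDUCE_PROMPT.format(question=question,
                                    doc_summaries="\n\n".join(extracted)))
```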
## The production RAG checklist
- [ ] Semantic/structural chunking, not naive character split
- [ ] Overlap preserves sentence boundaries
- [ ] Hybrid retrieval (vector + BM25)
- [ ] Re-ranking with cross-encoder
- [ ] Chunk metadata: source, section, timestamp
- [ ] Retrieval tracing logged per query
- [ ] User feedback loop (even just thumbs up/down)
- [ ] Embedding model versioned per chunk
- [ ] Stale document removal pipeline
- [ ] Evaluation set with ground-truth Q&A pairs
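For that last item, even a tiny harness pays off. A sketch assuming ground-truth pairs of (question, id of the chunk that answers it) and a `retrieve` function returning ranked chunk ids:

```python
def retrieval_hit_rate(eval_set: list[tuple[str, str]],
                       retrieve, k: int = 4) -> float:
    # A "hit" means the gold chunk id appears in the top-k results
    hits = sum(1 for question, gold_id in eval_set
               if gold_id in retrieve(question)[:k])
    return hits / len(eval_set)
```

Run it on every retrieval change: chunking tweaks, weight tuning, and model upgrades all show up as hit-rate deltas before users see them.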
Building AI features into your SaaS? The AI SaaS Starter Kit at whoffagents.com includes a pre-built RAG pattern with pgvector, Next.js API routes, and streaming responses — so you skip the infrastructure and ship the feature.