Most RAG (Retrieval-Augmented Generation) tutorials show you how to throw documents into a vector store, retrieve the top-K results, and send them to an LLM.
That works for a demo. It falls apart in production.
Here's what actually matters when you're building a RAG system that real users depend on — and the patterns I've settled on after 20 years of building ML systems.
## 1. Chunking Strategy Is Everything
The default approach — splitting text into fixed-size chunks — is the worst option for most use cases.
Why it fails: Fixed chunking splits sentences mid-thought, breaks paragraphs apart, and creates chunks with no semantic coherence. Your retrieval quality tanks because the embeddings represent fragments, not ideas.
What to do instead:
Use recursive chunking — split on natural boundaries first (double newlines → single newlines → sentences → words), and only fall back to fixed splitting when a section is still too large.
```python
def recursive_chunks(text, chunk_size=512, overlap=50):
    separators = ["\n\n\n", "\n\n", "\n", ". ", " "]

    def fixed_chunks(text, chunk_size):
        # Last resort: sliding window over words, with overlap between windows.
        words = text.split()
        step = max(chunk_size - overlap, 1)
        return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

    def split(text, sep_index=0):
        if len(text.split()) <= chunk_size:
            return [text] if text.strip() else []
        if sep_index >= len(separators):
            return fixed_chunks(text, chunk_size)
        parts = text.split(separators[sep_index])
        chunks, current = [], ""
        for part in parts:
            candidate = f"{current}{separators[sep_index]}{part}" if current else part
            if len(candidate.split()) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.extend(split(current, sep_index + 1))
                current = part
        if current:
            chunks.extend(split(current, sep_index + 1))
        return chunks

    return split(text)
```
For documents with clear structure (legal docs, technical manuals), semantic chunking — splitting on paragraph boundaries and merging small paragraphs — works even better.
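A minimal sketch of that paragraph-merging approach, assuming paragraphs are separated by blank lines (the function name, `min_words`, and `max_words` are illustrative, not from any library):

```python
def semantic_chunks(text, min_words=50, max_words=512):
    # Split on blank lines (paragraph boundaries), then merge paragraphs
    # that are too short to stand alone as a chunk.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate.split()) <= max_words:
            current = candidate
            # Emit the chunk once it has accumulated a useful amount of text.
            if len(current.split()) >= min_words:
                chunks.append(current)
                current = ""
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Note that a single oversized paragraph is kept whole here; in practice you'd hand it off to the recursive splitter above.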
## 2. Similarity Threshold > Top-K
Most implementations just return the top-K results. This is a mistake.
If a user asks a question that has nothing to do with your documents, top-K will still return K results — they'll just be irrelevant. Your LLM then hallucinates an answer based on unrelated context.
Fix: Apply a similarity threshold. Only return results above a minimum score.
```python
def search(self, query, top_k=5, threshold=0.7):
    results = self.vector_store.search(query, top_k=top_k)
    return [r for r in results if r["score"] >= threshold]
```
This one change dramatically reduces hallucinations. If nothing passes the threshold, tell the user you don't have enough information — that's a better outcome than a confident wrong answer.
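Wiring that refusal into the answer path might look like this sketch, where `retriever` and `generate_answer` are placeholders for your own retrieval and LLM calls:

```python
NO_CONTEXT_REPLY = (
    "I don't have enough information in my documents to answer that. "
    "Try rephrasing, or ask about a topic the knowledge base covers."
)

def answer(query, retriever, generate_answer, threshold=0.7):
    results = retriever(query)
    relevant = [r for r in results if r["score"] >= threshold]
    if not relevant:
        # Refuse rather than let the LLM improvise from unrelated chunks.
        return NO_CONTEXT_REPLY
    context = "\n\n".join(r["content"] for r in relevant)
    return generate_answer(query, context)
```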
## 3. Re-Ranking Is a Force Multiplier
Embedding-based retrieval is fast but imprecise. It finds semantically similar content, but "similar" doesn't always mean "relevant to the specific question."
The pattern: Over-fetch (3x your target), then re-rank.
A full cross-encoder re-ranker (like ms-marco-MiniLM) gives the best results, but even a simple term-overlap re-ranker helps:
```python
def rerank(query, results, top_k=3):
    query_terms = set(query.lower().split())
    for r in results:
        content_terms = set(r["content"].lower().split())
        overlap = len(query_terms & content_terms)
        density = overlap / max(len(content_terms), 1)
        # Blend the original embedding score with term-overlap density.
        r["rerank_score"] = (r["score"] * 0.7) + (density * 0.3)
    results.sort(key=lambda x: x["rerank_score"], reverse=True)
    return results[:top_k]
```
Retrieve 15, re-rank to 5. Your answer quality jumps significantly.
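Putting the over-fetch and re-rank steps together, a sketch of the full pattern, where `vector_search` stands in for your store's query method and the score blend mirrors the term-overlap re-ranker above:

```python
def overfetch_and_rerank(query, vector_search, final_k=5, multiplier=3):
    # Pull multiplier x more candidates than needed, then let the
    # re-ranker pick the best final_k.
    candidates = vector_search(query, top_k=final_k * multiplier)
    query_terms = set(query.lower().split())
    for r in candidates:
        content_terms = set(r["content"].lower().split())
        density = len(query_terms & content_terms) / max(len(content_terms), 1)
        r["rerank_score"] = r["score"] * 0.7 + density * 0.3
    candidates.sort(key=lambda r: r["rerank_score"], reverse=True)
    return candidates[:final_k]
```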
## 4. Stream Your Responses
If your RAG system takes 3-5 seconds to respond (embedding + retrieval + LLM generation), users will think it's broken.
Streaming sends tokens as they're generated. The user sees the answer forming in real-time, which feels fast even if the total time is the same.
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def stream_response(query, context):
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            # Server-sent events format: one event per token delta.
            yield f"data: {chunk.choices[0].delta.content}\n\n"
```
## 5. Configuration via Environment, Not Code
Hardcoding your chunk size, model name, similarity threshold, and vector store choice is fine for a prototype. In production, you need to tune these without redeploying.
Use Pydantic Settings:
```python
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    chunk_size: int = 512
    chunk_overlap: int = 50
    embedding_model: str = "BAAI/bge-small-en-v1.5"
    vector_store_type: str = "chroma"  # or "faiss"
    top_k: int = 5
    similarity_threshold: float = 0.7
    rerank: bool = True
    llm_model: str = "gpt-4o-mini"

    model_config = {"env_file": ".env", "env_prefix": "RAG_"}
```
Change any parameter by editing .env or setting an environment variable: `RAG_TOP_K=10` overrides `top_k`, thanks to the `env_prefix`. No code changes, no redeployment.
## Putting It All Together
A production RAG system isn't much more code than a tutorial one — it's just better code in the right places:
- Recursive chunking instead of fixed splitting
- Similarity thresholds to prevent hallucinations
- Re-ranking to improve relevance
- Streaming for perceived performance
- Environment-based configuration for operational flexibility
I packaged all of these patterns (and more — Docker configs, file upload endpoints, multiple vector store backends) into a ready-to-use template.
If you want to skip the boilerplate and start with production-quality code: Production AI/ML Toolkit — 4 Ready-to-Ship Templates
It includes a complete RAG system plus templates for LLM fine-tuning, model serving with A/B testing, and an AI agent framework. $39.
What patterns have you found essential for production RAG? Drop them in the comments.