Everyone's building RAG (Retrieval-Augmented Generation) systems. Most won't survive production. Here's what works.
Why RAG Breaks in Production
Tutorials make it easy: chunk, embed, prompt. Real data breaks everything.
Common failures:
- Chunking destroys context — tables and references split across chunks
- Embedding drift — new docs don't align with old embeddings
- Retrieval-generation gap — LLM answers confidently from the wrong chunk
Patterns That Work
Hierarchical Chunking
Don't chunk by token count. Use document structure:
```python
def smart_chunk(document):
    # Index at multiple granularities so retrieval can match
    # broad sections, focused paragraphs, or single key sentences
    sections = split_by_headers(document)
    paragraphs = flatten_paragraphs(sections)
    sentences = extract_key_sentences(paragraphs)
    return sections + paragraphs + sentences
```
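A minimal runnable sketch of the first step, `split_by_headers`, assuming markdown-style `#` headers (the helper names above are the article's pseudocode; adapt the pattern to your document format):

```python
import re

def split_by_headers(document: str) -> list[str]:
    # Split on markdown-style headers so each chunk keeps its section
    # heading as context. Assumes "#"-prefixed headers.
    parts = re.split(r"(?m)^(?=#{1,6} )", document)
    return [p.strip() for p in parts if p.strip()]

doc = "# Intro\nRAG overview.\n\n## Setup\nInstall deps.\n\n## Usage\nQuery the index."
chunks = split_by_headers(doc)
# Each chunk starts with its own header, e.g. "## Setup\nInstall deps."
```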
Re-ranking After Retrieval
Vector similarity is a rough filter. Add cross-encoder re-ranking:
```python
# Over-retrieve cheaply, then re-score query/chunk pairs with a
# cross-encoder and keep only the best few for the prompt
candidates = vector_store.search(query, top_k=20)
reranked = cross_encoder.rank(query, candidates)
context = reranked[:5]
```
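A self-contained sketch of the retrieve-then-rerank flow. The token-overlap scorer here is a stand-in for a real cross-encoder (in production you would score each pair with a model such as sentence-transformers' `CrossEncoder`); `candidates` plays the role of the vector store's top-k results:

```python
def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # Score each (query, candidate) pair and keep the best top_n.
    # Stand-in scorer: token overlap; a real system calls a cross-encoder here.
    q_tokens = set(query.lower().split())
    def score(text: str) -> float:
        t_tokens = set(text.lower().split())
        return len(q_tokens & t_tokens) / (len(t_tokens) or 1)
    return sorted(candidates, key=score, reverse=True)[:top_n]

candidates = [
    "Embedding drift happens when new docs misalign with old vectors",
    "The cafeteria menu changes weekly",
    "Re-ranking candidates after retrieval improves relevance",
]
top = rerank("why does re-ranking after retrieval help", candidates, top_n=2)
```

The design point survives the toy scorer: retrieval casts a wide net, re-ranking pays a higher per-pair cost only on the shortlist.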
Citation Tracking
Force the model to cite [Source: document, section] for every claim. This builds user trust and makes debugging straightforward.
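Enforcement is simple to automate. A minimal guard, using the article's `[Source: document, section]` tag format and assuming citations are placed before the sentence-ending period:

```python
import re

CITE = re.compile(r"\[Source: [^\]]+\]")

def uncited_sentences(answer: str) -> list[str]:
    # Flag sentences the model produced without a [Source: ...] tag,
    # so every claim can be traced back to a retrieved chunk.
    sentences = [s.strip() for s in re.split(r"(?<=\.)\s+", answer) if s.strip()]
    return [s for s in sentences if not CITE.search(s)]

answer = ("Chunking by structure preserves context [Source: handbook, section 2]. "
          "Embedding drift degrades retrieval over time.")
missing = uncited_sentences(answer)  # second sentence lacks a citation
```

Run this on every response and retry (or flag) when `missing` is non-empty.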
Monitor From Day One
Track: retrieval relevance (human review weekly), answer faithfulness (LLM-as-judge), user feedback ratio, P95 latency.
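Of these, P95 latency is the easiest to compute in-process. A nearest-rank percentile sketch (for heavy traffic you would use a streaming estimator instead of storing every sample):

```python
import math

def p95_latency(samples_ms: list[float]) -> float:
    # Nearest-rank method: the smallest sample such that at least
    # 95% of all samples are <= it.
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

latencies = [120.0] * 95 + [900.0] * 5  # 95 fast responses, 5 slow tail
p95 = p95_latency(latencies)
```

P95 (not the mean) is what your slowest regular users actually experience, which is why it is the latency number worth alerting on.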
When NOT to Use RAG
- Small stable knowledge base → fine-tune instead
- Need sub-200ms responses → pre-compute answers
- 100% accuracy required → use structured search
RAG is powerful but demanding. If you're building RAG systems for production, invest in the boring infrastructure first — your users will thank you.