Everyone's building RAG (Retrieval-Augmented Generation) systems. Most won't survive production. Here's what works.
Why RAG Breaks in Production
Tutorials make it easy: chunk, embed, prompt. Real data breaks everything.
Common failures:
- Chunking destroys context — tables and references split across chunks
- Embedding drift — new docs don't align with old embeddings
- Retrieval-generation gap — LLM answers confidently from the wrong chunk
Patterns That Work
Hierarchical Chunking
Don't chunk by token count. Use document structure:
```python
def smart_chunk(document):
    # Index at multiple granularities so retrieval can match
    # broad sections, focused paragraphs, or single key sentences
    sections = split_by_headers(document)
    paragraphs = flatten_paragraphs(sections)
    sentences = extract_key_sentences(paragraphs)
    return sections + paragraphs + sentences
```
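A minimal runnable sketch of the first step, `split_by_headers`, assuming markdown-style `#` headers (the helper names above are the article's pseudocode; adapt the pattern to your document format):

```python
import re

def split_by_headers(document: str) -> list[str]:
    # Split on markdown-style headers so each chunk keeps its section
    # heading as context. Assumes "#"-prefixed headers.
    parts = re.split(r"(?m)^(?=#{1,6} )", document)
    return [p.strip() for p in parts if p.strip()]

doc = "# Intro\nRAG overview.\n\n## Setup\nInstall deps.\n\n## Usage\nQuery the index."
chunks = split_by_headers(doc)
# Each chunk starts with its own header, e.g. "## Setup\nInstall deps."
```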
Re-ranking After Retrieval
Vector similarity is a rough filter. Add cross-encoder re-ranking:
```python
# Over-retrieve cheaply, then re-score query/chunk pairs with a
# cross-encoder and keep only the best few for the prompt
candidates = vector_store.search(query, top_k=20)
reranked = cross_encoder.rank(query, candidates)
context = reranked[:5]
```
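A self-contained sketch of the retrieve-then-rerank flow. The token-overlap scorer here is a stand-in for a real cross-encoder (in production you would score each pair with a model such as sentence-transformers' `CrossEncoder`); `candidates` plays the role of the vector store's top-k results:

```python
def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # Score each (query, candidate) pair and keep the best top_n.
    # Stand-in scorer: token overlap; a real system calls a cross-encoder here.
    q_tokens = set(query.lower().split())
    def score(text: str) -> float:
        t_tokens = set(text.lower().split())
        return len(q_tokens & t_tokens) / (len(t_tokens) or 1)
    return sorted(candidates, key=score, reverse=True)[:top_n]

candidates = [
    "Embedding drift happens when new docs misalign with old vectors",
    "The cafeteria menu changes weekly",
    "Re-ranking candidates after retrieval improves relevance",
]
top = rerank("why does re-ranking after retrieval help", candidates, top_n=2)
```

The design point survives the toy scorer: retrieval casts a wide net, re-ranking pays a higher per-pair cost only on the shortlist.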
Citation Tracking
Force the model to cite [Source: document, section] for every claim. This builds user trust and makes debugging straightforward.
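Enforcement is simple to automate. A minimal guard, using the article's `[Source: document, section]` tag format and assuming citations are placed before the sentence-ending period:

```python
import re

CITE = re.compile(r"\[Source: [^\]]+\]")

def uncited_sentences(answer: str) -> list[str]:
    # Flag sentences the model produced without a [Source: ...] tag,
    # so every claim can be traced back to a retrieved chunk.
    sentences = [s.strip() for s in re.split(r"(?<=\.)\s+", answer) if s.strip()]
    return [s for s in sentences if not CITE.search(s)]

answer = ("Chunking by structure preserves context [Source: handbook, section 2]. "
          "Embedding drift degrades retrieval over time.")
missing = uncited_sentences(answer)  # second sentence lacks a citation
```

Run this on every response and retry (or flag) when `missing` is non-empty.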
Monitor From Day One
Track: retrieval relevance (human review weekly), answer faithfulness (LLM-as-judge), user feedback ratio, P95 latency.
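Of these, P95 latency is the easiest to compute in-process. A nearest-rank percentile sketch (for heavy traffic you would use a streaming estimator instead of storing every sample):

```python
import math

def p95_latency(samples_ms: list[float]) -> float:
    # Nearest-rank method: the smallest sample such that at least
    # 95% of all samples are <= it.
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

latencies = [120.0] * 95 + [900.0] * 5  # 95 fast responses, 5 slow tail
p95 = p95_latency(latencies)
```

P95 (not the mean) is what your slowest regular users actually experience, which is why it is the latency number worth alerting on.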
When NOT to Use RAG
- Small stable knowledge base → fine-tune instead
- Need sub-200ms responses → pre-compute answers
- 100% accuracy required → use structured search
RAG is powerful but demanding. If you're building RAG systems for production, invest in the boring infrastructure first — your users will thank you.