Shift AI
Production RAG: Lessons from Real Deployments

Everyone's building RAG (Retrieval-Augmented Generation) systems. Most won't survive production. Here's what works.

Why RAG Breaks in Production

Tutorials make it easy: chunk, embed, prompt. Real data breaks everything.

Common failures:

  1. Chunking destroys context — tables and references split across chunks
  2. Embedding drift — documents embedded after a model or preprocessing change no longer share a space with the old vectors
  3. Retrieval-generation gap — LLM answers confidently from the wrong chunk
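The first failure is easy to reproduce. A fixed-size chunker (a minimal sketch; `naive_chunk` and the 40-character size are illustrative) happily slices a table mid-row:

```python
def naive_chunk(text, size=40):
    # Fixed-size chunking ignores document structure entirely
    return [text[i:i + size] for i in range(0, len(text), size)]

table = "| region | revenue |\n| EU | 4.2M |\n| US | 9.1M |"
chunks = naive_chunk(table)
# The last row is split across two chunks: neither chunk contains the
# full "| US | 9.1M |" line, so retrieving either chunk alone loses
# the pairing of region and revenue.
```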

Patterns That Work

Hierarchical Chunking

Don't chunk by token count. Use document structure:

import re

def smart_chunk(document):
    # Coarse: split on markdown headers so each section stays intact
    sections = [s.strip() for s in re.split(r"\n(?=#{1,6} )", document) if s.strip()]
    # Medium: paragraphs within each section
    paragraphs = [p.strip() for s in sections for p in s.split("\n\n") if p.strip()]
    # Fine: individual sentences for precise matching
    sentences = [t.strip() for p in paragraphs for t in re.split(r"(?<=[.!?])\s+", p) if t.strip()]
    return sections + paragraphs + sentences

Re-ranking After Retrieval

Vector similarity is a rough filter. Add cross-encoder re-ranking:

# Cheap vector search over-retrieves; the cross-encoder re-scores each
# (query, candidate) pair jointly, which is slower but far more precise
candidates = vector_store.search(query, top_k=20)
reranked = cross_encoder.rank(query, candidates)
context = reranked[:5]  # only the best five chunks reach the prompt
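The same pattern as a self-contained sketch. The token-overlap scorer below is a toy stand-in for a real cross-encoder model; the shape of the pipeline (over-retrieve, re-score pairs, keep the top few) is the point:

```python
def rerank(query, candidates, score_fn, keep=5):
    # Re-score every (query, candidate) pair, best first
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:keep]

def overlap_score(query, doc):
    # Toy stand-in for a cross-encoder: fraction of query tokens found in the doc
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / max(len(q), 1)

candidates = ["reset your password via settings",
              "billing cycles run monthly",
              "password rules require 12 characters"]
top = rerank("how do I reset my password", candidates, overlap_score, keep=2)
```

In production, `overlap_score` would be replaced by a cross-encoder's pair score; everything else stays the same.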

Citation Tracking

Force the model to cite [Source: document, section] for every claim. This builds user trust and makes debugging straightforward.
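Citations are also easy to validate mechanically. A minimal checker, assuming the `[Source: document, section]` format above (the regex and sentence split are simplifications):

```python
import re

CITATION = re.compile(r"\[Source: [^,\]]+, [^\]]+\]")

def uncited_claims(answer):
    # Return sentences that carry no [Source: document, section] tag
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not CITATION.search(s)]

answer = ("Refunds take 5 days [Source: billing-faq, refunds]. "
          "Fees are never waived.")
```

Running the checker on a response before returning it lets you flag or retry answers with unsourced claims.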

Monitor From Day One

Track:

  • Retrieval relevance — weekly human review
  • Answer faithfulness — LLM-as-judge
  • User feedback ratio
  • P95 latency
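Latency is the easiest of these to compute yourself. A nearest-rank P95 (a sketch; the aggregation window and units are up to you):

```python
import math

def p95(latencies_ms):
    # Nearest-rank percentile: value at position ceil(0.95 * n), 1-indexed
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]

samples = [120, 80, 95, 300, 110, 105, 90, 250, 100, 85]
```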

When NOT to Use RAG

  • Small stable knowledge base → fine-tune instead
  • Need sub-200ms responses → pre-compute answers
  • 100% accuracy required → use structured search

RAG is powerful but demanding. If you're building RAG systems for production, invest in the boring infrastructure first — your users will thank you.
