RAG Is Not a Feature. It's a System, and These Are the Parts Nobody Demos.

#ai #machinelearning #architecture #backend

Retrieval-Augmented Generation demos beautifully. Embed your documents, run a similarity search, drop the results into the prompt, and the model answers questions over your data. Ship it, and it works right up until real users ask real questions, at which point the answers get subtly, confidently wrong. The demo hid every decision that actually determines quality. Here are the parts that separate a RAG demo from a RAG system.

Chunking is a design decision, not a default

Splitting documents on a fixed token count is the default in every tutorial, and it quietly destroys quality. Fixed windows cut tables in half, separate a clause from the sentence that qualifies it, and orphan headings from the text they describe. Structure-aware chunking, which splits on boundaries the document already has (sections, list items, function definitions), usually does better because each chunk stays a self-contained unit. The right chunk size is empirical. Measure it; do not inherit it.
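Here is a minimal sketch of structure-aware chunking, assuming Markdown-style input where headings are lines starting with `#` and paragraphs are separated by blank lines. The name `chunk_by_heading` and the `max_chars` budget are illustrative, not taken from any library. Note that it would also treat a `#` comment inside a code sample as a heading, so real documents need a proper parser.

```python
def chunk_by_heading(text, max_chars=1200):
    """Split Markdown-like text without separating a heading from its body.

    Assumptions: headings start with '#', paragraphs are separated by blank
    lines. max_chars is a soft budget to tune against your own eval set.
    """
    # Pass 1: group lines into (heading, body) sections.
    sections, heading, body = [], "", []
    for line in text.splitlines():
        if line.startswith("#"):
            if heading or any(l.strip() for l in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.strip(), []
        else:
            body.append(line)
    if heading or any(l.strip() for l in body):
        sections.append((heading, "\n".join(body).strip()))

    # Pass 2: split oversized sections on paragraph boundaries only,
    # repeating the heading on every chunk so none is orphaned.
    chunks = []
    for section_heading, body_text in sections:
        paragraphs = [p.strip() for p in body_text.split("\n\n") if p.strip()]
        current = []
        for para in paragraphs:
            candidate = "\n\n".join(current + [para])
            if current and len(candidate) > max_chars:
                chunks.append({"heading": section_heading,
                               "text": "\n\n".join(current)})
                current = [para]
            else:
                current.append(para)
        if current or section_heading:
            chunks.append({"heading": section_heading,
                           "text": "\n\n".join(current)})
    return chunks
```

A single paragraph longer than `max_chars` is kept whole rather than cut mid-sentence. Whether that trade-off is right for your corpus is exactly the kind of thing to measure.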

Pure vector search is not enough

Embeddings are great at "find me something similar" and surprisingly bad at "find the document containing error code E-4021." Identifiers, product codes, and rare terms are exactly where semantic search whiffs. Hybrid retrieval fixes most of this: run dense vector search and a sparse keyword index (BM25) together, merge the two result lists, then rerank the merged set. The keyword half catches the literal matches the vectors miss.
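One common way to merge the two result lists is reciprocal rank fusion (RRF). The sketch below is an illustration of that technique, not a claim about what any particular stack does. The document IDs and both input lists are made up, and a learned reranker could still run on the merged output afterwards.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search output
sparse_hits = ["doc_e4021", "doc_a"]       # hypothetical BM25 output
merged = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF only needs ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. A document found by both retrievers rises to the top, and the keyword-only hit still lands ahead of lower-ranked dense results.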

Grounding is the difference between answer and hallucination

If your model can produce an answer without citing which retrieved chunk supports it, you have no way to detect hallucination in production. Force citations, then validate them.

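def validate_grounding(answer, citations, retrieved_chunks):
    """Return True only if every citation points at a chunk we retrieved."""
    # retrieved_chunks: assumed to be a set of chunk IDs (or a dict keyed by ID).
    for cid in citations:
        if cid not in retrieved_chunks:
            return False           # model invented a source
    # makes_factual_claim is a placeholder for your own claim detector.
    if not citations and makes_factual_claim(answer):
        return False               # unsupported claim
    return True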

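This runs on every request and turns an invisible failure into a catchable one. It only proves that the cited chunks exist; whether a cited chunk actually supports the claim is a faithfulness question for your evaluation set.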
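The validator above takes `citations` as an input, so something has to pull them out of the answer first. A rough sketch, assuming the prompt tells the model to cite chunk IDs in square brackets like `[c2]`, placed before the sentence's final punctuation. The marker format and the helper names are assumptions for illustration.

```python
import re

# Assumed marker format: a chunk ID in square brackets, e.g. [c2].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def extract_citations(answer):
    """Return the chunk IDs cited in the answer, in order of appearance."""
    return CITATION.findall(answer)


def uncited_sentences(answer):
    """Return sentences that carry no citation marker at all."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not CITATION.search(s)]


def check_answer(answer, retrieved_ids):
    """Report invented citations and sentences with no citation."""
    cited = extract_citations(answer)
    return {
        "invented": [c for c in cited if c not in retrieved_ids],
        "uncited": uncited_sentences(answer),
    }
```

Not every uncited sentence is a problem ("Contact support." makes no factual claim), so treat the `uncited` list as input for a claim detector rather than an automatic failure.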

Retrieval respects permissions or it leaks

The moment your corpus contains anything access-controlled, retrieval becomes a security surface. Filtering results after the search is fragile. Scope the query itself with the requesting user's permissions so unauthorized chunks are never candidates for the context window in the first place. A retrieval bug that surfaces the wrong customer's data is not a quality issue; it is an incident.
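A toy in-memory sketch of the difference, with a made-up `Chunk` type and a precomputed `score` standing in for similarity. In a real vector store the same idea is usually expressed as a metadata filter passed along with the query. The point is only the ordering: the permission check comes before ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset  # groups permitted to read this chunk
    score: float               # stand-in for a similarity score


def scoped_search(chunks, user_groups, top_k=3):
    """Rank only the chunks the requesting user may read.

    Filtering happens before ranking, so an unauthorized chunk can never
    reach the top_k, however well it matches the query.
    """
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]


corpus = [
    Chunk("a", "Acme contract terms", frozenset({"acme"}), 0.90),
    Chunk("b", "Globex contract terms", frozenset({"globex"}), 0.99),
    Chunk("c", "Shared onboarding guide", frozenset({"acme", "globex"}), 0.50),
]
hits = scoped_search(corpus, frozenset({"acme"}), top_k=2)
```

Chunk `b` is the best match by score and still never appears for an Acme user. With post-filtering, it would have taken one of the `top_k` slots first and relied on a later step to remove it.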

Evaluation, or you are flying blind

Every knob above (chunk size, retrieval mix, reranking, prompt) changes output quality in ways you cannot eyeball. You need a versioned evaluation set that answers "did that change help or hurt?" on every adjustment. A few dozen well-chosen cases with faithfulness and retrieval-hit metrics catch a startling number of regressions. Without this, every improvement is a guess and quality drifts.
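A minimal sketch of the retrieval-hit half of such a harness. The case format, the `retrieve` callable signature, and the fake retriever are all assumptions for illustration; faithfulness scoring is a separate metric and not shown here.

```python
def retrieval_hit_rate(eval_cases, retrieve, k=5):
    """Fraction of cases where at least one expected chunk is in the top k.

    eval_cases: list of {"question": str, "expected_ids": set of chunk IDs}
    retrieve:   callable(question, k) -> list of chunk IDs, best first
    """
    if not eval_cases:
        raise ValueError("evaluation set is empty")
    hits = 0
    for case in eval_cases:
        returned = set(retrieve(case["question"], k))
        if returned & set(case["expected_ids"]):
            hits += 1
    return hits / len(eval_cases)


# Hypothetical eval set and retriever, just to show the shape.
cases = [
    {"question": "q1", "expected_ids": {"c1"}},
    {"question": "q2", "expected_ids": {"c7"}},
]


def fake_retrieve(question, k):
    canned = {"q1": ["c1", "c2"], "q2": ["c3"]}
    return canned[question][:k]


rate = retrieval_hit_rate(cases, fake_retrieve, k=2)
```

In CI, the useful version of this compares `rate` against the value recorded for the previous commit and fails the build when it drops, so a chunking or retrieval change cannot regress quietly.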

It compounds into real architecture

Put these together and RAG stops looking like a feature and starts looking like a subsystem: ingestion and chunking, a hybrid index, a retrieval layer with per-user scoping, a grounding validator, and an evaluation harness in CI. That is a lot more than "embed and search," and it is why teams running RAG in production treat retrieval as core infrastructure rather than a wrapper around a vector database.

The takeaway

The gap between a RAG demo and a RAG system is measured in chunking strategy, hybrid retrieval, grounding, access control, and evaluation. None of it shows up in the five-minute demo, and all of it determines whether the thing is trustworthy in production. If you are building this into a product rather than a prototype, give it the same rigor as any other production system, because that is exactly what it is.

What broke first in your RAG system? Mine was chunking. Trade stories below.