aashna mahajan

Posted on Jun 15

RAG in 8 Layers: The Production Mental Model Most Tutorials Skip

#ai #python #vectordatabase #rag

Have you ever shipped something built on RAG and a week later watched it break in a way no tutorial prepared you for? Confident answers, real citations, wrong conclusions, and nothing obviously wrong in your logs.

Most RAG tutorials teach you how to build a demo. This article is about what breaks after the demo works.

I’ve hit that wall enough times to stop blaming the LLM first and start inspecting each layer of the pipeline.

To make this concrete, I'll use the same running example throughout: an on-call AI assistant that helps engineers debug incidents in real time. It ingests runbooks, past incident postmortems, internal architecture docs, and recent alerts. When something pages you at 3am, you can ask, "redis_p99 latency spiking, what do I check first?" and get an actually useful answer grounded in your team's docs.

Layer 1: Tokenization — Your Text Isn't Text to the Model

The first time tokenization broke a RAG pipeline I'd built, I'd indexed about 3,000 documents and couldn't figure out why retrieval was getting worse the longer the document was. My embedding model's tokenizer was silently clipping the last ~15% of every long chunk. I had to re-ingest everything.

That's when I learned: before a single word enters an embedding model, it gets converted to tokens. And tokenization shapes everything downstream — how your chunks behave and how the model reads a sentence.

Think of it like Scrabble tiles. English has ~170,000 words, but you don't get one tile per word. You get a fixed set of tiles covering common letter patterns. "unbelievable" might be three tiles: "un", "believ", "able". Technical vocabulary often fragments more aggressively because it appears less often in training data.

Most modern tokenizers work by repeatedly merging common character patterns into reusable tokens. For RAG, the exact tokenizer algorithm matters less than the operational consequence: your chunk size is measured in model tokens, not human words. If you don’t measure that, your retrieval corpus may already be damaged before indexing begins.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Retrieval-augmented generation uses vector embeddings"
tokens = enc.encode(text)
decoded = [enc.decode([t]) for t in tokens]

print(f"Token count: {len(tokens)}")
print(decoded)
# Illustrative output; the exact splits depend on the tokenizer's vocabulary:
# ['Ret', 'riev', 'al', '-', 'augmented', ' generation', ' uses', ' vector', ' embed', 'dings']

A "512-token chunk" is not 512 words. It's ~350–400 words of normal prose, far fewer for code or domain jargon. If your embedding model has a 512-token limit and you feed it 600-token chunks, it may silently truncate the tail. No error. Just missing context.

Runbooks are a worst-case scenario for this. Error codes tokenize cleanly, but inline shell commands fragment into 8+ tokens each. A runbook chunk that looks small in word count can blow past your token limit fast—and the part that gets clipped is usually the fix.

In any production RAG pipeline, this is the layer that quietly decides whether your long documents survive intact. Log your token counts during ingestion — every time.

⚡ Production takeaway: Always log token counts during ingestion. Silent truncation can quietly remove the exact context your model needs.
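
Here is a minimal sketch of that ingestion check, under a few assumptions: `audit_chunks` and `whitespace_count` are hypothetical helper names, the 512-token limit is just the example figure from above, and the whitespace counter is only a stand-in for your embedding model's real tokenizer.

```python
from typing import Callable, List, Tuple

def audit_chunks(
    chunks: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,  # assumed limit; check your embedding model's docs
) -> List[Tuple[int, int]]:
    """Return (chunk_index, token_count) for every chunk over the limit."""
    oversized = []
    for i, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > max_tokens:
            oversized.append((i, n))
    return oversized

def whitespace_count(text: str) -> int:
    # Demo-only stand-in. It undercounts real model tokens, so swap in the
    # tokenizer that ships with your embedding model.
    return len(text.split())

chunks = ["short runbook step", "word " * 600]
flagged = audit_chunks(chunks, whitespace_count, max_tokens=512)
print(flagged)  # [(1, 600)]
```

Failing the ingestion job when `flagged` is non-empty is one option; logging and re-splitting the offending chunks is another.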

Tokenization sets the floor. The next decision determines whether your text becomes searchable at all: how you cut documents into chunks.

Layer 2: Chunking — The Decision That Breaks Pipelines Before Retrieval

Here's something I wish someone had told me early: none of the other layers matter if your chunks are bad. I've seen teams spend two weeks tuning embedding models and rerankers when the real problem was that their chunks were splitting tables in half. Get chunking right first. Everything else is optimization on top of that foundation.

Think of it like cutting a textbook into flashcards. Too big and each card is overwhelming. Too small and "it refers to the previous entity" on a flashcard tells you nothing.

Fixed-size chunking

The example below uses LangChain only to show the idea; the same chunking strategy applies in custom ingestion jobs too.

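# Newer LangChain releases ship splitters in the langchain-text-splitters package;
# older releases used `from langchain.text_splitter import TokenTextSplitter`.
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=32)  # sizes are in tokens
chunks = splitter.split_text(document)  # `document` is the raw text of one file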

For most document-style RAG pipelines, overlap is safer than hard cuts. Without it, sentences straddling boundaries get their meaning split across two chunks—neither is retrieved correctly.

Fixed-size chunking is simple and predictable, but it can cut through tables, lists, procedures, or paragraphs without understanding the document structure.
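
If you are writing your own ingestion job, the sliding window is small enough to sketch by hand. This is an illustration rather than what any library does internally: `chunk_tokens` is a hypothetical helper, and it assumes you have already tokenized the document into a list.

```python
from typing import List

def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Fixed-size windows where each chunk repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
for c in chunks:
    print(c)
# ['t0', 't1', 't2', 't3']
# ['t3', 't4', 't5', 't6']
# ['t6', 't7', 't8', 't9']
```

Notice that `t3` and `t6` appear in two chunks each. That repetition is what keeps a sentence straddling a boundary retrievable from at least one side.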

Recursive chunking

Try to split on paragraph breaks first, then sentences, then words, then characters. This produces reasonably coherent chunks without the cost of embedding every sentence, which makes it a solid default for most document RAG pipelines. Another option is semantic chunking, where meaning itself decides where to split.
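
The fallback order is easier to see in code. The sketch below is a simplified illustration with assumptions baked in: `recursive_split` and `SEPARATORS` are hypothetical names, and it measures length in characters to stay dependency-free, where a real pipeline would measure model tokens.

```python
from typing import List

SEPARATORS = ["\n\n", ". ", " "]  # paragraphs, then sentences, then words

def recursive_split(text: str, max_len: int, separators: List[str] = SEPARATORS) -> List[str]:
    """Split text into pieces of at most max_len characters, preferring coarse separators."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard character cuts.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # This piece alone is too big: retry it with the next, finer separator.
            chunks.extend(recursive_split(piece, max_len, finer))
    if current:
        chunks.append(current)
    return chunks

text = "Check Redis memory. Restart the replica.\n\nEscalate if latency stays high."
print(recursive_split(text, max_len=45))  # splits on the paragraph break only
print(recursive_split(text, max_len=25))  # falls back to sentences, then words
```

One simplification to be aware of: the separator is dropped wherever a chunk boundary lands, so a sentence that ends a chunk can lose its trailing period.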

Semantic chunking

Embed every sentence. Find where cosine similarity between adjacent sentences drops sharply — that's a topic transition. Split there.
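
A minimal sketch of that split rule, with some assumptions: `semantic_split` is a hypothetical helper, the vectors are toy 2-D values standing in for real sentence embeddings, and a fixed threshold stands in for "drops sharply" (a percentile over all adjacent similarities is another common choice).

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_split(
    sentences: List[str],
    vectors: List[Sequence[float]],
    threshold: float,
) -> List[List[str]]:
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append([])  # topic transition: open a new chunk
        chunks[-1].append(sentences[i])
    return chunks

sentences = ["Redis memory is full.", "Evict keys or scale up.", "Billing runs nightly."]
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]  # toy vectors, not real embeddings
chunks = semantic_split(sentences, vectors, threshold=0.5)
print(chunks)
# [['Redis memory is full.', 'Evict keys or scale up.'], ['Billing runs nightly.']]
```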

Semantic chunking sounds better than it performs in most real projects. The quality gain is real, but every sentence needs its own embedding call at ingestion, and that cost surprises teams every time. I'd reach for recursive splitting first, ship it, measure retrieval quality, and only upgrade if the metrics tell you to. The pattern I usually prefer for production document RAG is parent-child chunking.

Parent-Child chunking: my default production recommendation

Index small chunks (128 tokens) for retrieval precision. Return the large parent chunk (512 tokens) to the LLM for generation context.

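# Search over the small child chunks, then swap in the larger parent for generation.
small_chunk = vectorstore.similarity_search(query, k=5)[0]
parent_id = small_chunk.metadata["parent_id"]  # attached to each child at ingestion time
full_context = parent_store.get(parent_id)     # parent_store: any key-value store of parent text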

Small chunks match precisely. Larger parent chunks give the LLM enough context to answer coherently. It is a strong default before reaching for anything fancier.

The tradeoff is that parent chunks can add extra context, so you still need to keep parent size bounded.

For our on-call agent, this is perfect for step-by-step runbooks. Index each step as a small chunk so retrieval can pinpoint the exact action, but return the full runbook section so the engineer has the surrounding context for why that action matters.
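
To make the runbook case concrete, here is a rough sketch of the ingestion side and the parent lookup. Everything in it is illustrative: `build_parent_child`, `expand_to_parents`, and the dict-based stores are hypothetical stand-ins for whatever vector store and key-value store you actually use.

```python
from typing import Dict, List, Tuple

def build_parent_child(sections: Dict[str, List[str]]) -> Tuple[Dict[str, str], List[dict]]:
    """sections maps a parent_id (one runbook section) to its ordered steps."""
    parent_store = {pid: "\n".join(steps) for pid, steps in sections.items()}
    children = [
        {"text": step, "metadata": {"parent_id": pid, "step": i}}
        for pid, steps in sections.items()
        for i, step in enumerate(steps)
    ]
    return parent_store, children  # embed and index `children`; keep parents for lookup

def expand_to_parents(hits: List[dict], parent_store: Dict[str, str]) -> List[str]:
    """Map retrieved children to parent text, deduplicated, keeping rank order."""
    seen, parents = set(), []
    for hit in hits:
        pid = hit["metadata"]["parent_id"]
        if pid not in seen:
            seen.add(pid)
            parents.append(parent_store[pid])
    return parents

sections = {
    "redis-latency": ["Check memory usage.", "Check the slow log."],
    "db-pool": ["Check pool size."],
}
parent_store, children = build_parent_child(sections)
hits = [children[1], children[0], children[2]]  # pretend these came back from vector search
contexts = expand_to_parents(hits, parent_store)
print(contexts)
```

The deduplication matters: two steps from the same section often match the same query, and sending that parent twice only spends context budget.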

⚡ Production takeaway: Bad chunks break retrieval before your embedding model, vector DB, or reranker ever gets a fair chance.

Now you have well-formed chunks. The next layer turns them into something searchable.

Layer 3: Embeddings — Coordinates in Meaning Space

An embedding turns text into a vector: hundreds or thousands of floating-point numbers that represent meaning. Semantically similar texts end up geometrically close in that space. The classic intuition is king - man + woman ≈ queen — embeddings can encode semantic relationships as geometric patterns, even if real embedding spaces are much messier than the example suggests.

But retrieval representations are not all the same. In practice, there are three patterns worth understanding.

Sparse lexical retrieval: BM25 and TF-IDF use vocabulary-sized vectors where most values are zero. Only the words that actually appear in the text are non-zero.

BM25 is a ranking function that scores documents based on how often query terms appear, how rare those terms are, and how long the document is. It is widely used in search engines because it is fast, interpretable, and devastatingly good at exact keyword matching.

It also fails the moment the user and the document use different words for the same idea.
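
Both behaviors show up in a few lines of code. The sketch below implements one common BM25 variant (the one with `+ 1` inside the log so IDF never goes negative); `bm25_scores` is a hypothetical helper, the `k1` and `b` values are conventional defaults rather than tuned ones, and whitespace splitting stands in for a real analyzer.

```python
import math
from collections import Counter
from typing import List

def bm25_scores(query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized doc against the tokenized query."""
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # no lexical overlap, no contribution
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)  # rarer terms weigh more
            length_norm = k1 * (1 - b + b * len(d) / avgdl)              # longer docs are penalized
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "connection pool exhaustion in the orders database".split(),
    "redis returned ECONNREFUSED on connect".split(),
]
exact = bm25_scores(["ECONNREFUSED"], docs)
synonym = bm25_scores("queries timing out".split(), docs)
print(exact)    # first doc scores 0.0, second scores above 0
print(synonym)  # [0.0, 0.0]: the right runbook exists, but no words overlap
```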

I went through a stretch where almost every retrieval bug I debugged came down to a BM25 vs dense retrieval mismatch. The user used a synonym, BM25 returned nothing, and dense retrieval surfaced something semantically related but not precise enough. Once you know what each method does well, you stop guessing.

This matters acutely for our on-call agent. Exact error codes like ECONNREFUSED, OOMKilled, and 503 need lexical search to surface fast. Symptom descriptions like “database queries timing out under load” need dense embeddings to match the runbook that says “connection pool exhaustion.” Neither alone covers the queries you would actually type at 3am.

Dense embeddings (BERT, Sentence Transformers) produce a single compact vector per text — every dimension non-zero, semantic meaning packed in. This is what most people mean when they say "embedding model."

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The automobile industry is shifting to electric vehicles",
    "Car manufacturers are investing in battery technology",
    "BM25 is a keyword-based retrieval algorithm",
]

query = "battery-powered auto makers"

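# Normalizing makes the dot product below equal to cosine similarity.
doc_vectors = model.encode(docs, normalize_embeddings=True)
query_vector = model.encode([query], normalize_embeddings=True)

similarities = np.dot(doc_vectors, query_vector.T).flatten()
print(similarities)
# Illustrative output (exact values vary by model version): [0.721, 0.683, 0.201]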

Even when the wording changes, dense embeddings can still surface the right documents. BM25 is much more brittle here because it depends on lexical overlap.

The weakness goes the other way: dense models can miss exact matches. Ask for GPT-4o, and a dense retriever might surface documents about GPT-3 because they are semantically close, even though the exact version matters.

Late interaction (ColBERT) keeps a separate vector per token instead of compressing the whole text into one vector. At query time, each query token finds its best-matching document token, and those matches are combined into a final score. This can improve precision on long documents because the model does not lose as much token-level detail. The tradeoff is storage and serving complexity.
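
The scoring step is simpler than it sounds. Here is a toy sketch of that "best match per query token, then sum" rule, often called MaxSim; `maxsim_score` is a hypothetical helper and the 2-D vectors are made-up stand-ins for real per-token embeddings.

```python
from typing import List, Sequence

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vecs: List[Sequence[float]], doc_vecs: List[Sequence[float]]) -> float:
    """For each query token, take its best-matching doc token, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query_vecs = [[1.0, 0.0], [0.0, 1.0]]            # two query tokens
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]     # has a strong match for each query token
doc_b = [[0.7, 0.7]]                             # one blended vector, closer to a pooled embedding

score_a = maxsim_score(query_vecs, doc_a)
score_b = maxsim_score(query_vecs, doc_b)
print(score_a, score_b)  # doc_a scores higher
```

The storage cost follows directly from the data shape: `doc_a` keeps one vector per token, where a dense index would keep one vector per chunk.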

I evaluated ColBERT on a project where retrieval quality was the bottleneck. The benchmarks were genuinely better — about 8% improvement on our test set. The storage cost was 14x. We shipped dense retrieval plus a cross-encoder reranker instead, got within 2% of ColBERT’s quality, and kept the infrastructure bill sane. Worth knowing it exists. It's also worth being honest that dense retrieval plus reranking covers many production use cases at a fraction of the complexity.

The model choice matters, but the bigger lesson is this: benchmark on your own queries. Leaderboards do not know your documents, your users, or your failure modes.

⚡ Production takeaway: Dense embeddings are great for meaning. BM25 is great for exact terms. Production systems often need both.

You can now turn chunks into vectors. The next problem is finding the right ones in under 100ms when you have millions of them.

Layer 4: Vector Storage & Indexing — How ANN Search Actually Works

A pipeline I shipped last year worked great in testing—sub-second responses and accurate answers. Six months later, after we’d ingested about 5x more documents, queries were taking 7–8 seconds. Nothing in our code had changed. I spent two days assuming the embedding model had degraded somehow before realizing the vector index was the bottleneck.

I’d never bothered to understand what was actually inside the vector database I was calling—and when it broke, I had no framework to debug it.

Most RAG tutorials hand-wave this layer: “Just use Pinecone.” That’s fine until your pipeline slows down, retrieves the wrong context, or leaves the model to guess because recall is too low.

With millions of embeddings, brute-force comparison quickly becomes too slow for interactive systems. Production systems usually need Approximate Nearest Neighbor (ANN) indexing. ANN means: instead of checking every vector, the system finds a near-best match much faster, trading a small amount of recall for a large speedup.

HNSW (Hierarchical Navigable Small World) is one of the most common index types used by vector databases. It builds a multi-layer graph: the top layer has fewer nodes with long-distance connections, like highways, and the bottom layer has denser connections, like local streets. A query enters at the top, greedily navigates toward the answer, then descends layer by layer for finer resolution.

What "tuning the index" actually means

When teams say "tune the index," they usually mean tuning the tradeoff between recall, latency, memory, and build time. The exact knobs depend on the indexing algorithm your vector database uses.

For an HNSW-based index, common knobs include:

M / m — how connected each vector is in the graph. Higher values usually improve recall but use more memory.
ef_construction / efConstruction — how much effort the index spends building the graph. Higher values can improve index quality but make indexing slower.
ef_search / efSearch / hnsw_ef — how many candidates the search explores at query time. Higher values usually improve recall but increase latency.

import hnswlib

index = hnswlib.Index(space="cosine", dim=384)
index.init_index(max_elements=1_000_000, ef_construction=200, M=16)
index.set_ef(50)  # Tune at query time for recall vs latency tradeoff

Other index types expose different knobs. IVF (Inverted File Index) splits the vector space into clusters and only searches the nearest ones — it tunes how many clusters to probe with settings like nprobe. Quantized indexes tune compression settings. Some managed vector databases hide most of these details and expose higher-level settings instead.

In production, you also need to think about metadata filters — searching only runbooks for the affected service, region, environment, or time window — because filtering can change both latency and recall.

The important idea is not the parameter name. It’s the tradeoff: better recall usually costs more memory, more indexing time, or more query latency.

In plain English: vector indexes are shortcuts. They avoid comparing your query to every single vector by searching only the most promising parts of the space. The more shortcuts you take, the faster the search becomes, but the more likely you are to miss the best result. Index tuning is about deciding how much accuracy you're willing to trade for speed.
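
To know how much accuracy you are actually trading away, compare the index's results against brute-force search on a sample of queries. A minimal sketch follows; `exact_top_k` and `recall_at_k` are hypothetical helpers, and `approx_ids` is a made-up result standing in for whatever your ANN index returns.

```python
from typing import List, Sequence

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int) -> List[int]:
    """Brute-force ground truth: score every vector, return the k best ids."""
    ranked = sorted(range(len(vectors)), key=lambda i: -dot(query, vectors[i]))
    return ranked[:k]

def recall_at_k(approx_ids: List[int], exact_ids: List[int], k: int) -> float:
    """Fraction of the true top-k that the approximate search also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 1.0
    return len(truth & set(approx_ids[:k])) / len(truth)

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
query = [1.0, 0.0]

exact_ids = exact_top_k(query, vectors, k=3)
approx_ids = [0, 3, 2]  # hypothetical ANN output for the same query
recall = recall_at_k(approx_ids, exact_ids, k=3)
print(exact_ids, round(recall, 2))  # [0, 1, 3] 0.67
```

Brute force is too slow to serve traffic, but it is fine for a few hundred sampled queries offline. Track recall and latency together as you change the index settings.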

For our on-call agent, this is non-negotiable. Retrieval needs to be fast enough that the full assistant still feels interactive. If vector search alone takes 8 seconds, the engineer has already opened the runbook manually.

⚡ Production takeaway: Vector indexes are speed-recall tradeoffs. Tune them only after measuring latency and retrieval quality.

You can now retrieve the right vectors quickly. The next question is: which retrieval strategy?

Layer 5: Retrieval Strategies — Hybrid Search and MMR

The first production issue I hit on my second pipeline was a user typing “PTO policy” against a document that said “annual leave entitlement.” Dense retrieval found it, but buried it at rank 4. BM25 had nothing useful to contribute because the words did not overlap.

Technically, the system worked. Practically, it failed — nobody scrolls through five RAG citations during a real task.

Hybrid search moved the right document into the visible results, and that was the moment I stopped trusting either retrieval method alone.

Hybrid Search via Reciprocal Rank Fusion

Run BM25 and dense retrieval in parallel. Merge with Reciprocal Rank Fusion (RRF) — a simple algorithm that combines multiple ranked lists into one by rewarding documents that appear high in any list:

def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results  = ["doc3", "doc1", "doc5"]
dense_results = ["doc1", "doc2", "doc3"]
fused = reciprocal_rank_fusion([bm25_results, dense_results])
# ["doc1", "doc3", "doc2", "doc5"]

A document that ranks high in both lists beats one that ranks first in only one. BM25 catches exact matches dense misses. Dense catches synonyms BM25 ignores. The combined ranking often beats either alone, especially when your queries mix exact terms and semantic descriptions.

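Picture it for our on-call agent: an engineer types "redis_p99 latency spiking" against a runbook titled "Redis tail latency investigation." Dense gets close (rank 4). BM25 gets only a weak match on the shared word "latency," since a tokenizer that keeps "redis_p99" as a single term will never match it to "Redis." Neither method is confident on its own, but hybrid can lift the runbook toward the top because both methods voted for it from different angles.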

MMR (Maximal Marginal Relevance)

MMR is a ranking strategy that picks results which are both relevant to the query and dissimilar from each other.

In practice, it's most useful when your corpus contains repeated versions of the same concept — multiple postmortems on the same incident, similar runbooks, duplicated wiki pages, copied troubleshooting steps. Without MMR, your top-5 ends up being five near-duplicates of the same chunk.

retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.7},
)

lambda_mult is the dial — closer to 1 prioritizes relevance, closer to 0 prioritizes diversity. 0.7 is a sensible default.
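
If you want to see what that dial does without a framework, the selection loop is short. This is a plain-Python sketch of the usual MMR formulation, not LangChain's implementation; `mmr` is a hypothetical helper and the vectors are toy values.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int, lambda_mult: float = 0.7) -> List[int]:
    """Greedily pick k doc indices, trading relevance against similarity to earlier picks."""
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]  # doc 1 is a near-duplicate of doc 0

relevance_heavy = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.7)
diversity_heavy = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.3)
print(relevance_heavy, diversity_heavy)  # [0, 1] [0, 2]
```

In this toy case 0.7 still lets the near-duplicate through, and it takes a lower value to skip it. How low you need to go depends on how duplicated your corpus is, so check the retrieved set rather than trusting the default.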

⚡ Production takeaway: Hybrid search and MMR help your system retrieve both relevant and diverse context, instead of five near-duplicates of the same chunk.

Hybrid search gives you a good top-20. The next layer turns that into a great top-5.

Layer 6: Re-Ranking — The Precision Pass

This is the layer that gave me the biggest single quality jump in any RAG pipeline I’ve built.

Before reranking, our RAGAS faithfulness score was 0.71. After adding a cross-encoder reranker over a top-20 candidate set, it jumped to 0.86 — bigger than the gain we got from switching to a larger embedding model and tuning chunk sizes combined. This is the one change I'd push for first in any pipeline review.

RAGAS is an evaluation framework for RAG systems; I’ll come back to it in Layer 8. For now, the important point is that reranking changed the quality of what we sent to the LLM.

Your bi-encoder search computes query and document vectors independently. That makes it fast, but imprecise, because the model never sees the query and document together.

A cross-encoder is a model that takes the query and a document as a single input and scores how well they match — slower than independent encoding, but far more accurate because it can attend to the interaction between them directly.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
candidates = [chunk1, chunk2, chunk3, ...]  # Top-20 from Layer 5
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: -x[1])

The tradeoff is latency: you can't run a cross-encoder over millions of documents, only over the top-20 you already retrieved. Retrieve broadly, rerank precisely.

For our on-call agent, this is the difference between "here are 5 things that might be related" and "here's exactly what to check first." Reranking is what makes the assistant feel useful instead of like a fancy search bar.

⚡ Production takeaway: Retrieve broadly, rerank precisely, and only send the best chunks to the LLM.

Layer 7: Query Transformation — HyDE and RAG Fusion

Sometimes retrieval fails not because your chunks, embeddings, or index are bad, but because the user asks the question in different vocabulary than your documents use.

Your document might say “Q3 churn reduction tactics.” Your user might ask, “How should we approach customer retention?” Your runbook might say “Redis memory pressure investigation.” Your on-call engineer might type, “Is our cache dying?”

Query transformation tries to close that vocabulary gap before retrieval happens.

RAG Fusion

RAG Fusion generates several reformulations of the original query, retrieves results for each one, and fuses the rankings into a single result list.

The goal is simple: do not let one bad phrasing sink retrieval.

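# Ask for one reformulation per line, then drop blank lines from the response.
response = llm.invoke(
    f"Generate 4 reformulations of: {original}\nReturn one per line, with no numbering."
)
rewrites = [line.strip() for line in response.content.split("\n") if line.strip()]
queries = [original] + rewrites
all_results = [retrieve(q) for q in queries]  # each result is a ranked list of doc ids
fused = reciprocal_rank_fusion(all_results)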

One reformulation often gets closer to the document’s vocabulary. Another may preserve the user’s original wording. Fusing them gives retrieval multiple chances to find the right context.

HyDE (Hypothetical Document Embeddings)

HyDE is a technique where the LLM first generates a hypothetical answer to the user’s question. You then embed that hypothetical answer and search with it instead of the raw question.

It sounds backwards. It often works because the hypothetical answer may use language closer to your documents than the original query does.

I’ll be honest: HyDE felt like a gimmick when I first read about it. Then I had a project where users asked questions like “how should we approach customer retention?” against a corpus of operational documents that said things like “Q3 churn reduction tactics include…” The vocabulary mismatch was killing retrieval.

HyDE generated a hypothetical answer in operational language, embedded that, and suddenly the right documents started surfacing.

It is not always the answer. But when query vocabulary diverges sharply from document vocabulary, it is one of the cleanest fixes I know.

The on-call agent version: an engineer types, “Is our cache dying?” That does not match anything in your runbooks. But HyDE might generate a hypothetical answer like, “Redis memory pressure investigation steps include…” and suddenly the right runbook surfaces.

The panicked, conversational query becomes a technical one.
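
The whole flow fits in one small function. The sketch below wires it up with stubs so it runs anywhere: `hyde_search` is a hypothetical helper, and `fake_generate`, `fake_embed`, and `fake_search` are toy stand-ins for your LLM call, embedding model, and vector search (the "embedding" here is just a set of lowercase words).

```python
from typing import Callable, List, Set

def hyde_search(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Set[str]],
    search: Callable[[Set[str], int], List[str]],
    k: int = 5,
) -> List[str]:
    """Search with the embedding of a hypothetical answer instead of the raw question."""
    hypothetical = generate(f"Write a short passage that answers: {question}")
    return search(embed(hypothetical), k)

corpus = {
    "runbook-db-pool": "Connection pool exhaustion checklist",
    "runbook-redis-memory": "Redis memory pressure investigation steps",
}

def fake_generate(prompt: str) -> str:
    # Stand-in for an LLM call; a real model would write this passage itself.
    return "Redis memory pressure investigation steps include checking eviction"

def fake_embed(text: str) -> Set[str]:
    return set(text.lower().split())

def fake_search(query_terms: Set[str], k: int) -> List[str]:
    overlap = {doc_id: len(query_terms & fake_embed(text)) for doc_id, text in corpus.items()}
    ranked = sorted((d for d in overlap if overlap[d] > 0), key=lambda d: -overlap[d])
    return ranked[:k]

raw_hits = fake_search(fake_embed("Is our cache dying?"), 1)
hyde_hits = hyde_search("Is our cache dying?", fake_generate, fake_embed, fake_search, k=1)
print(raw_hits, hyde_hits)  # [] ['runbook-redis-memory']
```

The hypothetical answer does not need to be correct. It only needs to sound like your documents, which is why the generated text is used for retrieval and never shown to the user.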

The tradeoff is cost and latency. Query transformation means extra LLM calls before retrieval even runs. I wouldn't add it until simpler retrieval improvements (hybrid search, reranking) have stopped moving the metrics — and on an on-call agent, the added 200–500ms can matter.

⚡ Production takeaway: Query rewriting helps when users ask questions in different words than your documents use. Measure the latency cost before making it the default.

At this point, you have a sophisticated retrieval pipeline. The final question is whether you can prove it actually works.

Layer 8: Evaluation & Failure Modes — Knowing If It Actually Works

I've seen plenty of teams skip evaluation until something breaks. Don’t make that bet.

Fifty hand-answered questions and a basic RAGAS evaluation loop will tell you more about your pipeline than weeks of intuition. RAGAS is an open-source evaluation framework for RAG systems that helps measure whether your answers are grounded, relevant, and supported by the retrieved context.

Three RAGAS metrics matter most here:

Faithfulness — does the answer match the retrieved context?
Answer relevancy — does it actually address the question?
Context recall — did retrieval surface the right chunks?

Your eval set should mix easy questions, ambiguous ones, exact-keyword lookups, synonym-heavy phrasings, and questions where the right answer is “I don’t know.” A pipeline that only handles the easy cases is not ready.
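
Before wiring up a full framework, you can get a first signal from a retrieval-only loop over that hand-built set. This sketch is a simplified, ID-based stand-in and not RAGAS's own implementation: `mean_context_recall`, the `expected_ids` field, and `fake_retrieve` are all hypothetical names for illustration.

```python
from typing import Callable, Dict, List

def mean_context_recall(
    eval_set: List[Dict],
    retrieve: Callable[[str], List[str]],
    k: int = 5,
) -> float:
    """Average, over questions, of the share of expected chunk ids found in the top-k."""
    per_question = []
    for item in eval_set:
        expected = set(item["expected_ids"])
        if not expected:
            continue  # "I don't know" questions need an answer-level check instead
        retrieved = set(retrieve(item["question"])[:k])
        per_question.append(len(expected & retrieved) / len(expected))
    return sum(per_question) / len(per_question) if per_question else 0.0

eval_set = [
    {"question": "redis_p99 latency spiking, what do I check first?", "expected_ids": ["a"]},
    {"question": "Is our cache dying?", "expected_ids": ["y", "z"]},
    {"question": "What is the CEO's favorite color?", "expected_ids": []},
]

canned = {
    eval_set[0]["question"]: ["a", "b", "c"],
    eval_set[1]["question"]: ["x", "y"],
}

def fake_retrieve(question: str) -> List[str]:
    # Stand-in for your real retrieval pipeline.
    return canned.get(question, [])

score = mean_context_recall(eval_set, fake_retrieve, k=5)
print(score)  # 0.75
```

Run it before and after every pipeline change. A number that moves is far more useful than a feeling that retrieval "seems better."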

But before you can fix what evaluation surfaces, you need to recognize the failure modes. Here are the five I’ve personally hit, formatted as a diagnostic checklist.

Failure mode 1: Retrieval looks right, but the answer is wrong

Symptom: The right chunks appear in your retrieval logs. The model ignores them.
Fix: Reorder context so the highest-relevance chunks are at positions 1 and N — not buried in the middle.

I spent three days on this exact bug. Retrieval metrics were fine. The right chunks came back. Answers were still wrong on a specific category of questions. Eventually I added logging that printed the exact context being sent to the LLM — and the relevant chunk was always position 4 or 5 out of 6. Dead center of the context window.

This is often called the “lost in the middle” problem: models tend to use information at the beginning and end of long contexts more reliably than information buried in the middle. The fix can be almost embarrassingly simple. One reorder, problem mostly gone.
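
The reorder itself can look like the sketch below, assuming the chunks arrive sorted best-first. `reorder_for_long_context` is a hypothetical helper; it alternates chunks between the front and the back so the weakest ones land in the middle.

```python
from typing import List

def reorder_for_long_context(chunks_by_relevance: List[str]) -> List[str]:
    """Input is sorted best-first. Output puts the best chunks at both ends."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["c1", "c2", "c3", "c4", "c5", "c6"]  # c1 is the most relevant
ordered = reorder_for_long_context(ranked)
print(ordered)  # ['c1', 'c3', 'c5', 'c6', 'c4', 'c2']
```

The best chunk sits at position 1, the second-best at position N, and the two weakest end up in the middle.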

Failure mode 2: Silent truncation

Symptom: Long chunks retrieve worse than short ones. No error in your logs.
Fix: Log token counts during ingestion. Cap chunks below your embedding model's limit.

Failure mode 3: Wrong embedding model at query time

Symptom: Similarity scores look reasonable but retrieval quality is random.
Fix: Store the embedding model name in vectorstore metadata. Assert it matches at startup.
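
A startup guard for this can be a few lines. The sketch assumes you wrote an `embedding_model` key into your index metadata at ingestion time; that key name and the `assert_same_embedding_model` helper are hypothetical, so adapt them to however your vector store exposes metadata.

```python
def assert_same_embedding_model(index_metadata: dict, query_model_name: str) -> None:
    """Fail fast at startup if the query-time model differs from the one used to build the index."""
    indexed = index_metadata.get("embedding_model")
    if indexed is None:
        raise RuntimeError("Index has no embedding_model recorded; backfill metadata or re-ingest.")
    if indexed != query_model_name:
        raise RuntimeError(
            f"Index was built with {indexed!r} but queries use {query_model_name!r}."
        )

index_metadata = {"embedding_model": "all-MiniLM-L6-v2", "dim": 384}

assert_same_embedding_model(index_metadata, "all-MiniLM-L6-v2")  # passes silently

try:
    assert_same_embedding_model(index_metadata, "some-other-model")
    mismatch_caught = False
except RuntimeError as err:
    mismatch_caught = True
    print(err)
```

A crash at startup is much cheaper than a week of plausible-looking similarity scores that mean nothing.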

Failure mode 4: Chunk boundary severing context

Symptom: Specific facts the user asks about are never retrieved.
Fix: Increase chunk overlap, or move to semantic / parent-child chunking.

Failure mode 5: Stale index

Symptom: Users report outdated information you've already fixed in the source docs.
Fix: Hash each document on ingestion. Re-embed only when the hash changes.
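
A minimal sketch of that check, using only the standard library. `content_hash` and `docs_to_reembed` are hypothetical helpers, and the plain dict stands in for wherever you persist the hashes from the previous ingestion run.

```python
import hashlib
from typing import Dict, List

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reembed(docs: Dict[str, str], stored_hashes: Dict[str, str]) -> List[str]:
    """Return ids of documents that are new or whose content changed since the last run."""
    return [
        doc_id
        for doc_id, text in docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]

stored_hashes = {"runbook-1": content_hash("Restart the replica.")}

docs = {
    "runbook-1": "Restart the replica, then check lag.",  # edited since last ingestion
    "runbook-2": "Check pool size.",                       # brand new
}
stale = docs_to_reembed(docs, stored_hashes)
print(stale)  # ['runbook-1', 'runbook-2']
```

This only catches new and changed documents. Deleted source docs need a separate pass that removes ids present in the stored hashes but missing from the current set.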

None of these are exotic edge cases. All five have hit pipelines I've shipped, and all five are silent, which is why you need evaluation in place before you trust your logs.

And on an on-call RAG agent, they're worse than usual: when you're paged at 3am, you don't have time to fact-check the answer before acting on it.

⚡ Production takeaway: Without evaluation, every RAG improvement is just a guess.

The Debugging Order I Use Now

When a RAG system gives bad answers, I no longer start by blaming the LLM. I debug the pipeline in this order — most failures show up in the first few:

1. Are chunks being silently truncated? Log token counts during ingestion.
2. Are chunk boundaries preserving enough context? Check that key information isn't split across chunks.
3. Is the embedding model appropriate for the domain? Don't trust general benchmarks for niche corpora.
4. Is retrieval using both semantic and exact-match signals? If you're not doing hybrid search, start there.
5. Is the vector index trading away too much recall for speed? Measure both before tuning either.
6. Are we reranking before sending context to the LLM? A cross-encoder pass over top-20 is the highest-ROI change in most pipelines.
7. Are the highest-relevance chunks placed where the model can use them? Position 1 and position N, not the middle.
8. Is the index stale? Hash documents on ingest and re-embed when content changes.
9. Do we have an evaluation set that proves the system improved? Fifty hand-answered questions + RAGAS is enough to start.

RAG doesn't usually fail in one dramatic place. It fails through small losses across layers — a 10% loss here, a 5% loss there, none of them visible in logs. Production RAG is the work of finding those losses before your users do.

What This All Adds Up To

Eight layers. Each one a decision point. Most short tutorials skip seven of them.

If you are starting from a basic RAG pipeline and wondering why it is not good enough, start with three places: Layer 2, Layer 5, and Layer 8.

Layer 2 tells you whether your chunks preserve the right context. Layer 5 tells you whether retrieval can handle both exact terms and semantic meaning. Layer 8 tells you whether any of your changes actually improved the system.

Fix those three before touching anything else.

The first pipeline I built ignored all eight layers. The second got a few of them right. The third one — the one I would put my name on — got every layer right enough that when it broke, I knew where to look.

That is the actual difference between a demo and a production RAG system.

Not the framework.
Not the model.
Understanding what each layer is doing.
