DEV Community: aashna mahajan

RAG in 8 Layers: The Production Mental Model Most Tutorials Skip

aashna mahajan — Mon, 15 Jun 2026 01:21:55 +0000

Have you ever shipped something built on RAG and a week later watched it break in a way no tutorial prepared you for? Confident answers, real citations, wrong conclusions, and nothing obviously wrong in your logs.

Most RAG tutorials teach you how to build a demo. This article is about what breaks after the demo works.

I’ve hit that wall enough times to stop blaming the LLM first and start inspecting each layer of the pipeline.

To make this concrete, I'll use the same running example throughout: an on-call AI assistant that helps engineers debug incidents in real time. It ingests runbooks, past incident postmortems, internal architecture docs, and recent alerts. When something pages you at 3am, you can ask, "redis_p99 latency spiking, what do I check first?" and get an actually useful answer grounded in your team's docs.

Layer 1: Tokenization — Your Text Isn't Text to the Model

The first time tokenization broke a RAG pipeline I'd built, I'd indexed about 3,000 documents and couldn't figure out why retrieval was getting worse the longer the document was. My embedding model's tokenizer was silently clipping the last ~15% of every long chunk. I had to re-ingest everything.

That's when I learned: before a single word enters an embedding model, it gets converted to tokens. And tokenization shapes everything downstream — how your chunks behave and how the model reads a sentence.

Think of it like Scrabble tiles. English has ~170,000 words, but you don't get one tile per word. You get a fixed set of tiles covering common letter patterns. "unbelievable" might be three tiles: "un", "believ", "able". Technical vocabulary often fragments more aggressively because it appears less often in training data.

Most modern tokenizers work by repeatedly merging common character patterns into reusable tokens. For RAG, the exact tokenizer algorithm matters less than the operational consequence: your chunk size is measured in model tokens, not human words. If you don’t measure that, your retrieval corpus may already be damaged before indexing begins.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Retrieval-augmented generation uses vector embeddings"
tokens = enc.encode(text)
decoded = [enc.decode([t]) for t in tokens]

print(f"Token count: {len(tokens)}")
print(decoded)
# Illustrative output; the exact splits depend on the encoding:
# ['Ret', 'riev', 'al', '-', 'augmented', ' generation', ' uses', ' vector', ' embed', 'dings']

A "512-token chunk" is not 512 words. It's ~350–400 words of normal prose, far fewer for code or domain jargon. If your embedding model has a 512-token limit and you feed it 600-token chunks, it may silently truncate the tail. No error. Just missing context.

Runbooks are a worst-case scenario for this. Error codes tokenize cleanly, but inline shell commands fragment into 8+ tokens each. A runbook chunk that looks small in word count can blow past your token limit fast—and the part that gets clipped is usually the fix.

In any production RAG pipeline, this is the layer that quietly decides whether your long documents survive intact. Log your token counts during ingestion — every time.
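A minimal sketch of what that logging could look like. The `audit_chunks` helper and the whitespace counter are illustrative assumptions, not part of any library; in a real pipeline you would pass in a counter backed by your embedding model's own tokenizer, which may differ from the LLM's.

```python
from typing import Callable, Iterable

def audit_chunks(
    chunks: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> list[tuple[int, int]]:
    """Return (chunk_index, token_count) for every chunk over the limit."""
    oversized = []
    for i, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > max_tokens:
            oversized.append((i, n))
    return oversized

# Stand-in counter for illustration only: swap in the tokenizer that
# belongs to your embedding model before trusting the numbers.
def whitespace_count(text: str) -> int:
    return len(text.split())

chunks = ["short chunk", "word " * 600]
for index, n_tokens in audit_chunks(chunks, whitespace_count, max_tokens=512):
    print(f"chunk {index}: {n_tokens} tokens exceeds the 512-token limit")
```

Running a check like this at ingestion time turns silent truncation into a visible log line you can alert on.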

⚡ Production takeaway: Always log token counts during ingestion. Silent truncation can quietly remove the exact context your model needs.

Tokenization sets the floor. The next decision determines whether your text becomes searchable at all: how you cut documents into chunks.

Layer 2: Chunking — The Decision That Breaks Pipelines Before Retrieval

Here's something I wish someone had told me early: none of the other layers matter if your chunks are bad. I've seen teams spend two weeks tuning embedding models and rerankers when the real problem was that their chunks were splitting tables in half. Get chunking right first. Everything else is optimization on top of that foundation.

Think of it like cutting a textbook into flashcards. Too big and each card is overwhelming. Too small and "it refers to the previous entity" on a flashcard tells you nothing.

Fixed-size chunking

The example below uses LangChain only to show the idea; the same chunking strategy applies in custom ingestion jobs too.

from langchain_text_splitters import TokenTextSplitter

# `document` is the raw text of one source file, loaded elsewhere
splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=32)
chunks = splitter.split_text(document)

For most document-style RAG pipelines, overlap is safer than hard cuts. Without it, sentences straddling boundaries get their meaning split across two chunks—neither is retrieved correctly.

Fixed-size chunking is simple and predictable, but it can cut through tables, lists, procedures, or paragraphs without understanding the document structure.

Recursive chunking

Try to split on paragraph breaks first, then sentences, then words, then characters. Produces semantically coherent chunks without embedding every sentence. Solid default for any RAG pipeline. Another option is semantic chunking, where meaning itself decides where to split.
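To show the idea without a framework, here is a rough sketch of recursive splitting. The function name `recursive_split` and the separator list are my own assumptions, and it measures length in characters for simplicity; a production version would measure tokens and would usually keep the separators rather than dropping them.

```python
def recursive_split(text, max_len=200, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first; recurse only on oversized pieces."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # Last resort: hard cut by characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # This piece is still too big: retry with a finer separator.
            chunks.extend(recursive_split(piece, max_len, finer))
    if current:
        chunks.append(current)
    return chunks

text = "alpha beta gamma.\n\ndelta epsilon. zeta eta theta iota kappa"
print(recursive_split(text, max_len=20))
# ['alpha beta gamma.', 'delta epsilon', 'zeta eta theta iota', 'kappa']
```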

Semantic chunking

Embed every sentence. Find where cosine similarity between adjacent sentences drops sharply — that's a topic transition. Split there.
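A toy sketch of that loop, under loud assumptions: `embed` here is a bag-of-words stand-in for a real embedding model, and a fixed `threshold` replaces the "drops sharply" detection that real implementations usually derive from the distribution of similarities.

```python
import math
from collections import Counter

def embed(sentence):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk wherever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(nxt)) < threshold:
            chunks.append(current)  # topic transition: close the chunk
            current = []
        current.append(nxt)
    chunks.append(current)
    return chunks

sentences = [
    "redis memory is high",
    "redis memory eviction is slow",
    "deploys run on friday",
    "deploys need approval on friday",
]
print(semantic_chunks(sentences))
```

With these four sentences the split lands between the Redis pair and the deploy pair, because the middle two share no words.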

Semantic chunking sounds better than it performs in most real projects. The quality gain is real but the ingestion cost surprises teams every time. I'd reach for recursive splitting first, ship it, measure retrieval quality, and only upgrade if the metrics tell you to. The pattern I usually prefer for production document RAG is parent-child chunking.

Parent-Child chunking: my default production recommendation

Index small chunks (128 tokens) for retrieval precision. Return the large parent chunk (512 tokens) to the LLM for generation context.

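# Search the small chunks, then swap each hit for its parent (deduplicated, order kept)
hits = vectorstore.similarity_search(query, k=5)
parent_ids = list(dict.fromkeys(hit.metadata["parent_id"] for hit in hits))
full_context = [parent_store.get(pid) for pid in parent_ids]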

Small chunks match precisely. Larger parent chunks give the LLM enough context to answer coherently. It is a strong default before reaching for anything fancier.

The tradeoff is that parent chunks can add extra context, so you still need to keep parent size bounded.
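The ingestion side of this pattern might look like the sketch below. `build_parent_child` is a hypothetical helper, the dict-based `parent_store` is an assumption, and word counts stand in for token counts to keep the example dependency-free.

```python
def build_parent_child(words, parent_size=512, child_size=128):
    """Cut text into parents, then cut each parent into children that point back to it."""
    parent_store = {}  # parent_id -> parent text (what the LLM will read)
    children = []      # small chunks to embed and index for retrieval
    for p_start in range(0, len(words), parent_size):
        parent_id = f"parent-{p_start // parent_size}"
        parent_words = words[p_start:p_start + parent_size]
        parent_store[parent_id] = " ".join(parent_words)
        for c_start in range(0, len(parent_words), child_size):
            children.append({
                "text": " ".join(parent_words[c_start:c_start + child_size]),
                "metadata": {"parent_id": parent_id},
            })
    return parent_store, children

words = [f"w{i}" for i in range(1000)]
parent_store, children = build_parent_child(words)
print(len(parent_store), len(children))  # 2 8
```

Because children never cross a parent boundary, every retrieved child maps to exactly one parent, which keeps the returned context bounded.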

For our on-call agent, this is perfect for step-by-step runbooks. Index each step as a small chunk so retrieval can pinpoint the exact action, but return the full runbook section so the engineer has the surrounding context for why that action matters.

⚡ Production takeaway: Bad chunks break retrieval before your embedding model, vector DB, or reranker ever gets a fair chance.

Now you have well-formed chunks. The next layer turns them into something searchable.

Layer 3: Embeddings — Coordinates in Meaning Space

An embedding turns text into a vector: hundreds or thousands of floating-point numbers that represent meaning. Semantically similar texts end up geometrically close in that space. The classic intuition is king - man + woman ≈ queen — embeddings can encode semantic relationships as geometric patterns, even if real embedding spaces are much messier than the example suggests.

But retrieval representations are not all the same. In practice, there are three patterns worth understanding.

Sparse lexical retrieval: BM25 and TF-IDF use vocabulary-sized vectors where most values are zero. Only the words that actually appear in the text are non-zero.

BM25 is a ranking function that scores documents based on how often query terms appear, how rare those terms are, and how long the document is. It is widely used in search engines because it is fast, interpretable, and devastatingly good at exact keyword matching.

It also fails the moment the user and the document use different words for the same idea.
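For intuition, here is a compact sketch of BM25 scoring. It assumes whitespace tokenization and the common "+1 inside the log" IDF variant; real search engines add stemming, stopword handling, and their own parameter defaults. The documents are made-up examples.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no lexical overlap, no contribution
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / length_norm
        scores.append(score)
    return scores

docs = [
    "pod restarted after OOMKilled event",
    "connection pool exhaustion under load",
    "ECONNREFUSED when redis is down",
]
print(bm25_scores("OOMKilled", docs))      # only the first document scores above zero
print(bm25_scores("out of memory", docs))  # [0.0, 0.0, 0.0]
```

The second query shows the failure described above: "out of memory" and "OOMKilled" mean the same thing to an engineer, but share no tokens.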

I went through a stretch where almost every retrieval bug I debugged came down to a BM25 vs dense retrieval mismatch. The user used a synonym, BM25 returned nothing, and dense retrieval surfaced something semantically related but not precise enough. Once you know what each method does well, you stop guessing.

This matters acutely for our on-call agent. Exact error codes like ECONNREFUSED, OOMKilled, and 503 need lexical search to surface fast. Symptom descriptions like “database queries timing out under load” need dense embeddings to match the runbook that says “connection pool exhaustion.” Neither alone covers the queries you would actually type at 3am.

Dense embeddings (BERT, Sentence Transformers) produce a single compact vector per text — every dimension non-zero, semantic meaning packed in. This is what most people mean when they say "embedding model."

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The automobile industry is shifting to electric vehicles",
    "Car manufacturers are investing in battery technology",
    "BM25 is a keyword-based retrieval algorithm",
]

query = "battery-powered auto makers"

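# Normalizing makes the dot product below equal to cosine similarity
doc_vectors = model.encode(docs, normalize_embeddings=True)
query_vector = model.encode([query], normalize_embeddings=True)

similarities = np.dot(doc_vectors, query_vector.T).flatten()
print(similarities)
# Illustrative output (exact values depend on the model version): [0.721, 0.683, 0.201]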

Even when the wording changes, dense embeddings can still surface the right documents. BM25 is much more brittle here because it depends on lexical overlap.

The weakness goes the other way: dense models can miss exact matches. Ask for GPT-4o, and a dense retriever might surface documents about GPT-3 because they are semantically close, even though the exact version matters.

Late interaction (ColBERT) keeps a separate vector per token instead of compressing the whole text into one vector. At query time, each query token finds its best-matching document token, and those matches are combined into a final score. This can improve precision on long documents because the model does not lose as much token-level detail. The tradeoff is storage and serving complexity.
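The scoring step described here is often called MaxSim. A minimal sketch, assuming the per-token vectors already exist (the 2-D vectors below are toy values, not real ColBERT output):

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """For each query token, take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query_vecs = [[1.0, 0.0], [0.0, 1.0]]          # two query tokens
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # has a strong match for both tokens
doc_b = [[1.0, 0.0], [0.6, 0.8]]               # strong match for one, partial for the other

print(maxsim_score(query_vecs, doc_a))  # 2.0
print(maxsim_score(query_vecs, doc_b))  # roughly 1.8
```

The storage cost follows directly from the data shape: one vector per token instead of one vector per chunk.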

I evaluated ColBERT on a project where retrieval quality was the bottleneck. The benchmarks were genuinely better — about 8% improvement on our test set. The storage cost was 14x. We shipped dense retrieval plus a cross-encoder reranker instead, got within 2% of ColBERT’s quality, and kept the infrastructure bill sane. Worth knowing it exists. It's also worth being honest that dense retrieval plus reranking covers many production use cases at a fraction of the complexity.

The model choice matters, but the bigger lesson is this: benchmark on your own queries. Leaderboards do not know your documents, your users, or your failure modes.

⚡ Production takeaway: Dense embeddings are great for meaning. BM25 is great for exact terms. Production systems often need both.

You can now turn chunks into vectors. The next problem is finding the right ones in under 100ms when you have millions of them.

Layer 4: Vector Storage & Indexing — How ANN Search Actually Works

A pipeline I shipped last year worked great in testing—sub-second responses and accurate answers. Six months later, after we’d ingested about 5x more documents, queries were taking 7–8 seconds. Nothing in our code had changed. I spent two days assuming the embedding model had degraded somehow before realizing the vector index was the bottleneck.

I’d never bothered to understand what was actually inside the vector database I was calling—and when it broke, I had no framework to debug it.

Most RAG tutorials hand-wave this layer: “Just use Pinecone.” That’s fine until your pipeline slows down, retrieves the wrong context, or leaves the model to guess because recall is too low.

With millions of embeddings, brute-force comparison quickly becomes too slow for interactive systems. Production systems usually need Approximate Nearest Neighbor (ANN) indexing. ANN means: instead of checking every vector, the system finds a near-best match much faster, trading a small amount of recall for a large speedup.

HNSW (Hierarchical Navigable Small World) is one of the most common index types used by vector databases. It builds a multi-layer graph: the top layer has fewer nodes with long-distance connections, like highways, and the bottom layer has denser connections, like local streets. A query enters at the top, greedily navigates toward the answer, then descends layer by layer for finer resolution.

What "tuning the index" actually means

When teams say "tune the index," they usually mean tuning the tradeoff between recall, latency, memory, and build time. The exact knobs depend on the indexing algorithm your vector database uses.

For an HNSW-based index, common knobs include:

M / m — how connected each vector is in the graph. Higher values usually improve recall but use more memory.
ef_construction / efConstruction — how much effort the index spends building the graph. Higher values can improve index quality but make indexing slower.
ef_search / efSearch / hnsw_ef — how many candidates the search explores at query time. Higher values usually improve recall but increase latency.

import hnswlib

index = hnswlib.Index(space="cosine", dim=384)
index.init_index(max_elements=1_000_000, ef_construction=200, M=16)
index.set_ef(50)  # Tune at query time for recall vs latency tradeoff

Other index types expose different knobs. IVF (Inverted File Index) splits the vector space into clusters and only searches the nearest ones — it tunes how many clusters to probe with settings like nprobe. Quantized indexes tune compression settings. Some managed vector databases hide most of these details and expose higher-level settings instead.

In production, you also need to think about metadata filters — searching only runbooks for the affected service, region, environment, or time window — because filtering can change both latency and recall.

The important idea is not the parameter name. It’s the tradeoff: better recall usually costs more memory, more indexing time, or more query latency.

In plain English: vector indexes are shortcuts. They avoid comparing your query to every single vector by searching only the most promising parts of the space. The more shortcuts you take, the faster the search becomes, but the more likely you are to miss the best result. Index tuning is about deciding how much accuracy you're willing to trade for speed.
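One way to put a number on that trade is to compare the index's answers with a brute-force baseline on a sample of queries. The sketch below assumes you can get id lists out of your ANN index; `approx_ids` is a made-up result standing in for that call.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def exact_top_k(query, vectors, k):
    """Brute-force baseline: score every vector and keep the k best ids."""
    ranked = sorted(vectors, key=lambda vid: -dot(query, vectors[vid]))
    return ranked[:k]

def recall_at_k(approx_ids, exact_ids):
    """Share of the true top-k that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
query = [1.0, 0.0]

exact_ids = exact_top_k(query, vectors, k=2)  # ['a', 'b']
approx_ids = ["a", "d"]                       # hypothetical ANN result
print(recall_at_k(approx_ids, exact_ids))     # 0.5
```

Tracking this number next to query latency while you change index settings shows what each speedup actually costs.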

For our on-call agent, this is non-negotiable. Retrieval needs to be fast enough that the full assistant still feels interactive. If vector search alone takes 8 seconds, the engineer has already opened the runbook manually.

⚡ Production takeaway: Vector indexes are speed-recall tradeoffs. Tune them only after measuring latency and retrieval quality.

You can now retrieve the right vectors quickly. The next question is: which retrieval strategy?

Layer 5: Retrieval Strategies — Hybrid Search and MMR

The first production issue I hit on my second pipeline was a user typing “PTO policy” against a document that said “annual leave entitlement.” Dense retrieval found it, but buried it at rank 4. BM25 had nothing useful to contribute because the words did not overlap.

Technically, the system worked. Practically, it failed — nobody scrolls through five RAG citations during a real task.

Hybrid search moved the right document into the visible results, and that was the moment I stopped trusting either retrieval method alone.

Hybrid Search via Reciprocal Rank Fusion

Run BM25 and dense retrieval in parallel. Merge with Reciprocal Rank Fusion (RRF) — a simple algorithm that combines multiple ranked lists into one by rewarding documents that appear high in any list:

def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results  = ["doc3", "doc1", "doc5"]
dense_results = ["doc1", "doc2", "doc3"]
fused = reciprocal_rank_fusion([bm25_results, dense_results])
# ["doc1", "doc3", "doc2", "doc5"]

A document that ranks high in both lists beats one that ranks first in only one. BM25 catches exact matches dense misses. Dense catches synonyms BM25 ignores. The combined ranking often beats either alone, especially when your queries mix exact terms and semantic descriptions.

Picture it for our on-call agent: an engineer types "redis_p99 latency spiking" against a runbook titled "Redis tail latency investigation." Dense gets close (rank 4). BM25 ranks it too, because "latency" appears in both the query and the title, even if "redis_p99" fails to match "Redis" depending on how your tokenizer splits identifiers. Hybrid lifts it toward the top because both methods voted for it from different angles.

MMR (Maximal Marginal Relevance)

MMR is a ranking strategy that picks results which are both relevant to the query and dissimilar from each other.

In practice, it's most useful when your corpus contains repeated versions of the same concept — multiple postmortems on the same incident, similar runbooks, duplicated wiki pages, copied troubleshooting steps. Without MMR, your top-5 ends up being five near-duplicates of the same chunk.

retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.7},
)

lambda_mult is the dial — closer to 1 prioritizes relevance, closer to 0 prioritizes diversity. 0.7 is a sensible default.
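Under the hood, MMR is a greedy loop. The sketch below is a plain-Python illustration of that loop rather than LangChain's implementation, and the 2-D vectors are toy values chosen so that documents 0 and 1 are near-duplicates.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.7):
    """Greedily pick documents that are relevant but not redundant with earlier picks."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [1.0, -0.2]]  # docs 0 and 1 are near-duplicates

print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # [0, 1]: pure relevance
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=0.7))  # [0, 2]: the duplicate is skipped
```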

⚡ Production takeaway: Hybrid search and MMR help your system retrieve both relevant and diverse context, instead of five near-duplicates of the same chunk.

Hybrid search gives you a good top-20. The next layer turns that into a great top-5.

Layer 6: Re-Ranking — The Precision Pass

This is the layer that gave me the biggest single quality jump in any RAG pipeline I’ve built.

Before reranking, our RAGAS faithfulness score was 0.71. After adding a cross-encoder reranker over a top-20 candidate set, it jumped to 0.86 — bigger than the gain we got from switching to a larger embedding model and tuning chunk sizes combined. This is the one change I'd push for first in any pipeline review.

RAGAS is an evaluation framework for RAG systems; I’ll come back to it in Layer 8. For now, the important point is that reranking changed the quality of what we sent to the LLM.

Your bi-encoder search computes query and document vectors independently. That makes it fast, but imprecise, because the model never sees the query and document together.

A cross-encoder is a model that takes the query and a document as a single input and scores how well they match — slower than independent encoding, but far more accurate because it can attend to the interaction between them directly.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
candidates = [chunk1, chunk2, chunk3, ...]  # Top-20 from Layer 5
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: -x[1])

The tradeoff is latency: you can't run a cross-encoder over millions of documents, only over the top-20 you already retrieved. Retrieve broadly, rerank precisely.

For our on-call agent, this is the difference between "here are 5 things that might be related" and "here's exactly what to check first." Reranking is what makes the assistant feel useful instead of like a fancy search bar.

⚡ Production takeaway: Retrieve broadly, rerank precisely, and only send the best chunks to the LLM.

Layer 7: Query Transformation — HyDE and RAG Fusion

Sometimes retrieval fails not because your chunks, embeddings, or index are bad, but because the user asks the question in different vocabulary than your documents use.

Your document might say “Q3 churn reduction tactics.” Your user might ask, “How should we approach customer retention?” Your runbook might say “Redis memory pressure investigation.” Your on-call engineer might type, “Is our cache dying?”

Query transformation tries to close that vocabulary gap before retrieval happens.

RAG Fusion

RAG Fusion generates several reformulations of the original query, retrieves results for each one, and fuses the rankings into a single result list.

The goal is simple: do not let one bad phrasing sink retrieval.

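response = llm.invoke(f"Generate 4 reformulations of: {original}. Return one per line.")
# Drop blank lines so empty strings are never sent to the retriever
reformulations = [line.strip() for line in response.content.split("\n") if line.strip()]
queries = [original] + reformulations
all_results = [retrieve(q) for q in queries]
fused = reciprocal_rank_fusion(all_results)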

One reformulation often gets closer to the document’s vocabulary. Another may preserve the user’s original wording. Fusing them gives retrieval multiple chances to find the right context.

HyDE (Hypothetical Document Embeddings)

HyDE is a technique where the LLM first generates a hypothetical answer to the user’s question. You then embed that hypothetical answer and search with it instead of the raw question.

It sounds backwards. It often works because the hypothetical answer may use language closer to your documents than the original query does.

I’ll be honest: HyDE felt like a gimmick when I first read about it. Then I had a project where users asked questions like “how should we approach customer retention?” against a corpus of operational documents that said things like “Q3 churn reduction tactics include…” The vocabulary mismatch was killing retrieval.

HyDE generated a hypothetical answer in operational language, embedded that, and suddenly the right documents started surfacing.

It is not always the answer. But when query vocabulary diverges sharply from document vocabulary, it is one of the cleanest fixes I know.

The on-call agent version: an engineer types, “Is our cache dying?” That does not match anything in your runbooks. But HyDE might generate a hypothetical answer like, “Redis memory pressure investigation steps include…” and suddenly the right runbook surfaces.

The panicked, conversational query becomes a technical one.
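In code, the whole technique is a three-step wrapper around your existing retriever. This is a sketch with every dependency injected: `generate`, `embed`, and `search` are placeholders for your LLM call, embedding model, and vector search, and the fake versions below exist only so the example runs.

```python
from typing import Callable, Sequence

def hyde_search(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], list],
    k: int = 5,
) -> list:
    """Search with the embedding of a hypothetical answer instead of the raw question."""
    hypothetical = generate(
        f"Write a short passage that answers the question: {question}"
    )
    return search(embed(hypothetical), k)

# Fake components for illustration only.
def fake_generate(prompt):
    return "Redis memory pressure investigation steps include checking eviction rates."

def fake_embed(text):
    return [float(len(text))]

def fake_search(vector, k):
    return [f"top-{k} results for vector {vector}"]

print(hyde_search("Is our cache dying?", fake_generate, fake_embed, fake_search, k=3))
```

The hypothetical answer is never shown to the user; it is only a better search key.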

The tradeoff is cost and latency. Query transformation means extra LLM calls before retrieval even runs. I wouldn't add it until simpler retrieval improvements (hybrid search, reranking) have stopped moving the metrics — and on an on-call agent, the added 200–500ms can matter.

⚡ Production takeaway: Query rewriting helps when users ask questions in different terms than your documents use, but measure the latency cost before making it the default.

At this point, you have a sophisticated retrieval pipeline. The final question is whether you can prove it actually works.

Layer 8: Evaluation & Failure Modes — Knowing If It Actually Works

I've seen plenty of teams skip evaluation until something breaks. Don’t make that bet.

Fifty hand-answered questions and a basic RAGAS evaluation loop will tell you more about your pipeline than weeks of intuition. RAGAS is an open-source evaluation framework for RAG systems that helps measure whether your answers are grounded, relevant, and supported by the retrieved context.

RAGAS measures three things that matter:

Faithfulness — does the answer match the retrieved context?
Answer relevancy — does it actually address the question?
Context recall — did retrieval surface the right chunks?

Your eval set should mix easy questions, ambiguous ones, exact-keyword lookups, synonym-heavy phrasings, and questions where the right answer is “I don’t know.” A pipeline that only handles the easy cases is not ready.
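Before wiring up RAGAS, a retrieval-only check over that eval set is cheap to build. The sketch below is not RAGAS; it is a simpler hit-rate measure, and the document ids and `fake_retrieve` function are invented for the example.

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose expected document appears in the top k results."""
    hits = 0
    for example in eval_set:
        top_ids = retrieve(example["question"])[:k]
        if example["expected_doc_id"] in top_ids:
            hits += 1
    return hits / len(eval_set)

eval_set = [
    {"question": "redis_p99 latency spiking, what do I check first?",
     "expected_doc_id": "runbook-redis-latency"},
    {"question": "Is our cache dying?",
     "expected_doc_id": "runbook-redis-memory"},
]

# Stand-in retriever returning canned document ids.
def fake_retrieve(question):
    canned = {
        "redis_p99 latency spiking, what do I check first?": ["runbook-redis-latency", "postmortem-42"],
        "Is our cache dying?": ["postmortem-42", "runbook-deploys"],
    }
    return canned[question]

print(hit_rate_at_k(eval_set, fake_retrieve, k=5))  # 0.5
```

A number like this will not tell you whether answers are faithful, but it does tell you whether a retrieval change helped before any LLM call is involved.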

But before you can fix what evaluation surfaces, you need to recognize the failure modes. Here are the five I’ve personally hit, formatted as a diagnostic checklist.

Failure mode 1: Retrieval looks right, but the answer is wrong

Symptom: The right chunks appear in your retrieval logs. The model ignores them.
Fix: Reorder context so the highest-relevance chunks are at positions 1 and N — not buried in the middle.

I spent three days on this exact bug. Retrieval metrics were fine. The right chunks came back. Answers were still wrong on a specific category of questions. Eventually I added logging that printed the exact context being sent to the LLM — and the relevant chunk was always position 4 or 5 out of 6. Dead center of the context window.

This is often called the “lost in the middle” problem: models tend to use information at the beginning and end of long contexts more reliably than information buried in the middle. The fix can be almost embarrassingly simple. One reorder, problem mostly gone.
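A sketch of one possible reorder, assuming the input list is already sorted from most to least relevant. The function name is my own; some frameworks ship a similar utility.

```python
def reorder_for_long_context(chunks_by_relevance):
    """Alternate chunks between the front and the back so the weakest land in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ranked = ["c1", "c2", "c3", "c4", "c5", "c6"]  # c1 is the most relevant
print(reorder_for_long_context(ranked))
# ['c1', 'c3', 'c5', 'c6', 'c4', 'c2']
```

The best chunk ends up first, the second best last, and the least relevant chunks sit in the middle where they do the least harm.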

Failure mode 2: Silent truncation

Symptom: Long chunks retrieve worse than short ones. No error in your logs.
Fix: Log token counts during ingestion. Cap chunks below your embedding model's limit.

Failure mode 3: Wrong embedding model at query time

Symptom: Similarity scores look reasonable but retrieval quality is random.
Fix: Store the embedding model name in vectorstore metadata. Assert it matches at startup.
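A minimal version of that startup check. The metadata key `embedding_model` and the plain dict are assumptions; use whatever metadata mechanism your vector store provides.

```python
def assert_same_embedding_model(index_metadata, query_model_name):
    """Fail fast if queries would be embedded with a different model than the index."""
    indexed_with = index_metadata.get("embedding_model")
    if indexed_with != query_model_name:
        raise RuntimeError(
            f"Index was built with {indexed_with!r} "
            f"but queries use {query_model_name!r}; re-embed or fix the config."
        )

index_metadata = {"embedding_model": "all-MiniLM-L6-v2"}
assert_same_embedding_model(index_metadata, "all-MiniLM-L6-v2")  # passes silently
```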

Failure mode 4: Chunk boundary severing context

Symptom: Specific facts the user asks about are never retrieved.
Fix: Increase chunk overlap, or move to semantic / parent-child chunking.

Failure mode 5: Stale index

Symptom: Users report outdated information you've already fixed in the source docs.
Fix: Hash each document on ingestion. Re-embed only when the hash changes.
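A sketch of the hashing step, assuming you keep a mapping of document id to hash from the previous ingestion run. `plan_reindex` is a hypothetical helper name; it also reports documents that disappeared from the source, since those need to be deleted from the index too.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(docs, stored_hashes):
    """Return (ids to re-embed, ids to delete) by comparing current and stored hashes."""
    changed = [
        doc_id for doc_id, text in docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)  # new or modified
    ]
    deleted = [doc_id for doc_id in stored_hashes if doc_id not in docs]
    return changed, deleted

docs = {"runbook-a": "restart the pod", "runbook-b": "flush the cache, then restart"}
stored_hashes = {
    "runbook-a": content_hash("restart the pod"),
    "runbook-b": content_hash("flush the cache"),
    "runbook-c": content_hash("retired runbook"),
}
print(plan_reindex(docs, stored_hashes))  # (['runbook-b'], ['runbook-c'])
```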

None of these are exotic edge cases. All five have hit pipelines I've shipped, and all five are silent, which is why you need evaluation in place before you trust your logs.

And on an on-call RAG agent, they're worse than usual: when you're paged at 3am, you don't have time to fact-check the answer before acting on it.

⚡ Production takeaway: Without evaluation, every RAG improvement is just a guess.

The Debugging Order I Use Now

When a RAG system gives bad answers, I no longer start by blaming the LLM. I debug the pipeline in this order — most failures show up in the first few:

Are chunks being silently truncated? Log token counts during ingestion.
Are chunk boundaries preserving enough context? Check that key information isn't split across chunks.
Is the embedding model appropriate for the domain? Don't trust general benchmarks for niche corpora.
Is retrieval using both semantic and exact-match signals? If you're not doing hybrid search, start there.
Is the vector index trading away too much recall for speed? Measure both before tuning either.
Are we reranking before sending context to the LLM? A cross-encoder pass over top-20 is the highest-ROI change in most pipelines.
Are the highest-relevance chunks placed where the model can use them? Position 1 and position N, not the middle.
Is the index stale? Hash documents on ingest and re-embed when content changes.
Do we have an evaluation set that proves the system improved? Fifty hand-answered questions + RAGAS is enough to start.

RAG doesn't usually fail in one dramatic place. It fails through small losses across layers — a 10% loss here, a 5% loss there, none of them visible in logs. Production RAG is the work of finding those losses before your users do.

What This All Adds Up To

Eight layers. Each one a decision point. Most short tutorials skip seven of them.

If you are starting from a basic RAG pipeline and wondering why it is not good enough, start with three places: Layer 2, Layer 5, and Layer 8.

Layer 2 tells you whether your chunks preserve the right context. Layer 5 tells you whether retrieval can handle both exact terms and semantic meaning. Layer 8 tells you whether any of your changes actually improved the system.

Fix those three before touching anything else.

The first pipeline I built ignored all eight layers. The second got a few of them right. The third one — the one I would put my name on — got every layer right enough that when it broke, I knew where to look.

That is the actual difference between a demo and a production RAG system.

Not the framework.
Not the model.
Understanding what each layer is doing.

RAG Explained for Beginners: How AI Assistants Stop Making Things Up

aashna mahajan — Sun, 31 May 2026 00:41:19 +0000

I once submitted an essay with three citations that I hadn't personally verified. The AI had suggested them, and they sounded right.

None of them existed.

That's not a quirk or a bug — it's exactly how LLMs work. And once you understand why, a technique called RAG starts to make a lot of sense.

AI assistants are remarkably good at sounding right. The model isn't lying — it's doing its best with what it knows. The problem is that what it knows has limits, and it doesn't always know where those limits are. Ask one about a recent event, a niche regulation, or anything from a source it's never seen — and it fills the gap anyway. Confidently.

That's the gap RAG was built to close. Once you understand how it works, you'll have a much clearer picture of why some AI tools are genuinely reliable and others are just very convincing guessers.

Here's what's actually going on.

First, What's the Problem?

Large language models (LLMs)—the technology powering AI assistants like ChatGPT and Claude—are trained on vast amounts of data from across the internet. That training gives them a remarkable ability to reason, summarize, and generate content. But it also comes with some real limitations:

They have a knowledge cutoff. An LLM trained last year doesn't know what happened last month.
They can hallucinate. When they don't know something, they don't say "I don't know"—they generate a confident-sounding answer anyway. Wrong facts, fake statistics, invented sources. All delivered with a straight face.
They don't know your specific sources. Think of a software engineer asking an AI assistant about their company's internal API documentation, deployment runbooks, or architecture decisions. None of that is in the training data. The model has never seen it — and it will still try to answer.

The model isn't lying — it's generating the most plausible answer it can. It just has no way to know when it's wrong.

So, what do you do when you need an AI that's accurate, current, and knows your specific domain? That's the problem RAG was designed to solve.

What Is RAG?

RAG stands for Retrieval-Augmented Generation.

Here's the plain-English version: Instead of relying purely on what an LLM memorized during training, RAG looks things up first—then uses what it found to answer your question.

Think of it like the difference between two types of students taking a test:

Student A (plain LLM): Studied everything months ago and answers purely from memory.
Student B (RAG): Gets to bring a set of reference documents to the exam and reads the relevant parts before answering.

Student B is going to be a lot more accurate — especially on recent or niche topics.

Same student, same question — completely different results depending on whether they can consult real sources.

Put it another way: RAG = looking up answers in a book + writing your own answer using what you found.

One thing worth saying upfront: RAG doesn't make an AI system magically correct. It gives the model better material to work with. If the retrieved documents are wrong, outdated, or irrelevant, the answer can still be wrong. The quality of the output is only as good as the quality of the sources.

How RAG Works, Step by Step

Here's the basic flow:

User Question → Retriever → Relevant Documents → Prompt + Context → LLM → Answer

Each step is simpler than it sounds.

Step 1: User Asks a Question

Simple enough. A user types something like, "What's the refund policy for orders over $100?"

Step 2: The Question Gets Turned Into a "Meaning Fingerprint"

Before the system can search anything, it needs to understand what the question means — not just the exact words. So it runs the question through an embedding model, which converts it into a list of numbers called a vector (or embedding).

Think of it as a meaning fingerprint: similar ideas produce similar vectors, even if they're phrased differently. This is how the system can match "refund policy" to a document that says "return and reimbursement guidelines"—same concept, different words.

Different words, nearly identical vectors. That's what lets the retriever find the right document even when the user's phrasing doesn't match exactly.

Step 3: The System Retrieves Relevant Information

That vector gets compared against a vector database—a collection of pre-processed document chunks, each already converted into their own meaning fingerprints. The system finds the chunks that are closest in meaning to your question and pulls them up.

The result: a handful of the most relevant text snippets from your knowledge base.
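The lookup itself can be sketched as a top-k search. This is only a toy stand-in for a vector database: the chunk texts and their vectors are invented, and a real system would use an approximate nearest-neighbour index rather than scoring every chunk.

```python
import heapq
import math

def normalize(v):
    """Scale a vector to length 1 so a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def top_k(query_vec, index, k=2):
    """Return the k (score, text) pairs whose vectors are closest to query_vec."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), text)
        for text, vec in index
    )
    return heapq.nlargest(k, scored)

# Stand-in for a vector database: (chunk text, hypothetical embedding) pairs.
index = [
    ("Refunds over $100 require manager approval.", [0.9, 0.1, 0.1]),
    ("Returns are accepted within 30 days.",        [0.8, 0.2, 0.1]),
    ("Standard shipping takes 3-5 business days.",  [0.1, 0.9, 0.2]),
]

query_vec = [0.85, 0.15, 0.1]  # pretend embedding of the refund question
results = top_k(query_vec, index)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The shipping chunk scores far below the other two, so it is the one left out when k is 2.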

Step 4: The Retrieved Context Gets Added to the Prompt

The system packages the user's question and the retrieved text together into a single prompt:

"Using the following information, answer the user's question. If the answer isn't in the context, say you don't know. Information: [retrieved document text]. Question: What's the refund policy for orders over $100?"

Step 5: The LLM Generates an Answer

Now the LLM responds, grounded in the actual documents rather than only its training data. When retrieval finds the right chunks, the answer is typically more accurate, more specific, and far less likely to be hallucinated.
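The prompt assembly from Step 4 can be sketched as one small function. `build_prompt` is a hypothetical helper, not part of any library, and the chunk strings are placeholders for whatever the retriever returned. The string it produces is what gets sent to the model in Step 5.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one grounded prompt."""
    # Number each chunk so the model (and a human debugging) can tell them apart.
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Using the following information, answer the user's question. "
        "If the answer isn't in the context, say you don't know.\n\n"
        f"Information:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "What's the refund policy for orders over $100?",
    ["<retrieved chunk 1>", "<retrieved chunk 2>"],  # placeholders, not real documents
)
print(prompt)
```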

Don't code yet? Skip straight to the concrete example below—you'll understand how RAG works without needing this.

If you do write Python, here's what all five steps look like—the actual library you use (LangChain, LlamaIndex, or plain OpenAI SDK) slots into the same shape:

# Setup (done once, before any question arrives): load your documents,
# chunk them, convert the chunks to vectors, and store them.
# load_and_chunk, embed_and_store, and llm are placeholders for whatever
# your chosen library provides.
chunks = load_and_chunk("support_docs/")
vector_db = embed_and_store(chunks)

# Steps 1-3: the user asks a question; embed it and find the most relevant chunks
query = "Does AcmeSoft support two-factor authentication?"
relevant_chunks = vector_db.search(query, top_k=3)

# Steps 4-5: build a grounded prompt, send it to the LLM
# (assumes search returns a list of text strings)
context = "\n\n".join(relevant_chunks)
prompt = f"""
Answer using only the context below.
If the answer isn't there, say you don't know.

Context: {context}
Question: {query}
"""
answer = llm.generate(prompt)

# → e.g. "Yes, AcmeSoft supports 2FA for enterprise accounts via the Security tab..."

The shape is always the same: load → embed → retrieve → prompt → answer. The library you pick just fills in the blanks.

A Concrete Example

Let's make this tangible.

User asks, "Does AcmeSoft support two-factor authentication for enterprise accounts?"

Retrieved document snippet (from AcmeSoft's internal support docs):

"Enterprise accounts on AcmeSoft can enable two-factor authentication (2FA) through the Security tab in Account Settings. Both TOTP apps (like Google Authenticator) and SMS-based verification are supported."

Prompt sent to the LLM:

"Using the following information, answer the user's question. If the answer isn't here, say you don't know. Information: [snippet above]. Question: Does AcmeSoft support two-factor authentication for enterprise accounts?"

LLM's answer:

"Yes! AcmeSoft supports two-factor authentication for enterprise accounts. You can enable it from the Security tab in your Account Settings. They support both authenticator apps (like Google Authenticator) and SMS verification."

That answer is accurate, grounded in real documentation, and actually useful. Without RAG, the LLM would have no idea what AcmeSoft's features are.

Ask → Retrieve → Answer. The robot isn't guessing — it's reading the filing cabinet first.

The Tools That Make RAG Happen

The good news: you don't have to build any of this from scratch. Several popular libraries handle the heavy lifting:

LangChain — A popular Python and JavaScript framework for building RAG pipelines.
LlamaIndex — Connects LLMs to your private data; great for document-heavy use cases.
Haystack — An open-source framework built specifically for search and question-answering systems.
FAISS — A fast vector search library from Meta, often used for local or custom setups.
Chroma — A lightweight vector database that's beginner-friendly for small projects.
Pinecone / Weaviate — Cloud-hosted vector databases commonly used for production-scale RAG systems.

If you're just starting out, LangChain or LlamaIndex are the most beginner-friendly—the others become relevant as you scale.

The RAG toolbox—pick the pieces that match your use case. You rarely need all of them at once.

Real-World Use Cases

RAG is already quietly powering some very practical tools across industries:

Customer support, healthcare, legal, education, engineering, research — the same pattern works across all of them.

Customer support bots — A chatbot that answers product questions using your actual support documentation, not guesses.
Company knowledge assistants — An internal AI that lets employees search HR policies, engineering wikis, or onboarding guides through natural conversation.
Research assistants — Tools that help academics or analysts quickly find and synthesize information from large document libraries.
Legal and compliance Q&A — AI that answers questions about contracts or regulations while citing the exact clause it's drawing from.
Healthcare knowledge bases — Systems that help clinical staff query medical literature or hospital protocols accurately.
Educational tutoring — Q&A bots that answer student questions directly from course textbooks and materials.

In every case: bring in domain-specific knowledge, ground the AI's answers in it, and dramatically reduce the risk of wrong or outdated responses.

RAG Is Powerful — But Not Perfect

RAG works best when:

Your documents are accurate, well-organised, and up to date.
The question clearly maps to something in the knowledge base.
You want transparent, source-grounded answers.

RAG can still struggle when:

The source documents are bad. Garbage in, garbage out — if your knowledge base has outdated or incorrect information, the LLM will use it anyway.
The retriever misses the mark. If the system can't find the right chunks, the LLM has nothing useful to work with and may still hallucinate.
Too much irrelevant context gets retrieved. Noisy or off-topic chunks can confuse the LLM and dilute the answer.

Feed it bad documents, and you get bad answers, confidently delivered. RAG doesn't fix bad data; it amplifies it.

Knowing the failure modes is half the battle. A well-built RAG system spends just as much effort on clean data and good retrieval as it does on the LLM itself.
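One common guard against the last two failure modes is a relevance threshold: drop weak matches, and abstain when nothing clears the bar. The sketch below assumes the retriever returns (score, text) pairs; the 0.75 cut-off and the three-chunk cap are arbitrary illustrative values that would need tuning on real data.

```python
def select_context(scored_chunks, min_score=0.75, max_chunks=3):
    """Keep only chunks that clear a relevance bar, best first."""
    relevant = [(score, text) for score, text in scored_chunks if score >= min_score]
    relevant.sort(reverse=True)
    return [text for _, text in relevant[:max_chunks]]

def answer_or_abstain(scored_chunks):
    context = select_context(scored_chunks)
    if not context:
        # Nothing relevant was retrieved: abstain instead of letting the model guess.
        return "I don't know. Nothing in the knowledge base covers this."
    return f"(send {len(context)} chunk(s) to the LLM)"

print(answer_or_abstain([(0.91, "refund policy"), (0.40, "office hours")]))
print(answer_or_abstain([(0.20, "office hours")]))
```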

Next Steps: Want to Build Your Own?

You don't need to start big. A few entry points depending on how comfortable you are with code:

Try a no-code tool first. Platforms like Dify or Flowise let you build a basic RAG chatbot with drag-and-drop interfaces — no coding required.
Follow a LangChain RAG tutorial. The official docs have beginner walkthroughs that are surprisingly approachable if you know a bit of Python.
Experiment with LlamaIndex. Their starter tutorial gets a simple Q&A system running over your own documents in under 30 lines of code.
Start small. Pick a single topic — your own notes, a product FAQ, a short research paper — and build a basic question-answering tool over it. The concepts will click fast once you see it working.
Understand the infrastructure underneath. RAG systems rely on the same distributed data concepts that power production backends—vector databases, caching, and scaling decisions. If those feel unfamiliar, this system design primer is a good place to close that gap.

Once you understand how RAG works—retrieve, augment, generate—you'll start seeing it everywhere.

And now you know what it actually means.

Found this useful? I write about AI, system design, and real engineering. Follow along—more coming.

5 System Design Concepts Every Software Engineer Should Understand Before Interviews

aashna mahajan — Sun, 24 May 2026 22:37:04 +0000

I failed three system design interviews in a row.

Not because I didn't know the concepts. I knew them cold. Caching, sharding, consistent hashing, CAP theorem, message queues — I could define every one.

What I couldn't do: answer what came next.

"What happens when the cache gets stale?"
"Why are you sharding this?"
"So you'd ignore partition tolerance?"

Every time, I had the surface answer. Every time, the follow-up question exposed that I'd never thought one level deeper.

These are the 5 gaps that cost me.

Each section follows the same pattern: the surface answer, the hidden follow-up, and the trade-off that actually matters.

1. Everyone Adds a Cache. Almost Nobody Thinks About What Comes Next.

I used to say "add Redis" like it was a complete answer.

Performance problem? Redis. Slow API? Redis. Interviewer asks about read load? Redis.

Then one interviewer asked: "What happens when the user updates their profile?"

I stared. I knew how to add a cache. I hadn't thought once about what happens when the data behind it changes.

That's the trap. A cache is your phone saving images from apps so it doesn't re-download them every time — simple, fast, invisible. In backend systems, Redis stores hot data entirely in RAM instead of hitting a database on every request. Think of it as a database that never sleeps. Responses in under a millisecond.

Most read traffic is for the same hot data — the same user profiles, the same popular posts. A cache catches all of that before the database ever has to wake up.

Here's how the cache-aside pattern looks in code — the most common one you'll see:

import json

# `redis` is assumed to be a redis-py client and `db` a MongoDB-style collection.
def get_user(user_id: str):
    # Step 1: Check cache first
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)        # Cache hit ✓

    # Step 2: Cache miss, so go to the database
    user = db.find_one({"_id": user_id})

    # Step 3: Store result for next time (expires in 1 hour)
    # redis-py's signature is setex(name, time, value)
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))
    return user

# ⚠️ The hidden danger: what if the user updates their profile?
# If you forget: redis.delete(f"user:{user_id}")
# ...they'll see stale data for up to an hour.

Three strategies (cache-aside, write-through, write-behind): each is right in one situation and quietly catastrophic in the wrong one.

I once watched a team spend three days debugging incorrect pricing at checkout. The cache populated correctly on product creation — but silently failed to invalidate when the price changed. Wrong prices served for six weeks. Code looked fine. Tests passed.

A cache without an invalidation strategy isn't a performance win. It's a time bomb with a clean interface.

💬 Beginner answer: "Add Redis."
Strong answer: Add Redis with a clear invalidation strategy — delete the cache key on write, set a TTL as a safety net, and know which pattern (cache-aside, write-through, write-behind) fits your consistency requirements.
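A minimal sketch of "delete on write, TTL as a safety net" follows. To keep it runnable without a server, `TTLCache` is a tiny in-memory stand-in for Redis and `database` is a plain dict; both are assumptions for illustration, not the real clients.

```python
import time

class TTLCache:
    """Tiny in-memory stand-in for Redis, so the sketch runs without a server."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]          # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def delete(self, key):
        self._data.pop(key, None)

cache = TTLCache()
database = {"42": {"name": "Ada"}}       # hypothetical user table

def get_user(user_id):
    cached = cache.get(f"user:{user_id}")
    if cached is not None:
        return cached
    user = database[user_id]
    cache.set(f"user:{user_id}", dict(user), ttl_seconds=3600)  # TTL is the safety net
    return user

def update_user(user_id, fields):
    database[user_id].update(fields)     # 1. write to the source of truth
    cache.delete(f"user:{user_id}")      # 2. invalidate so the next read refetches

get_user("42")                           # warms the cache
update_user("42", {"name": "Grace"})
print(get_user("42"))                    # {'name': 'Grace'}
```

Remove the `cache.delete` line and the final read returns the old name until the TTL expires, which is exactly the stale-profile bug described above.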

2. Sharding: Impressive on a Whiteboard, Painful in Production

After making the same mistake myself, I later watched another candidate repeat it almost exactly.

He spent 25 minutes designing a sharding strategy for a system serving 3,000 daily users.

Custom shard keys. Cross-shard routing. Resharding logic. The interviewer stopped him mid-sentence.

"Why are you sharding this?"

He didn't have an answer. He didn't get an offer.

Over-engineering isn't ambition. It's anxiety wearing the mask of thoroughness.

Sharding splits your database across multiple servers — each one owns a slice of the data. It's genuinely powerful, and genuinely hard: joins across shards become painful, transactions need distributed coordination, and debugging requires knowing which shard holds your data.

The right order before you even think about sharding: (1) scale vertically, (2) add read replicas, (3) add caching, and only then (4) shard.

Exhaust every step before moving to the next. Most systems never need to go past step 2.

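import hashlib
from datetime import datetime

NUM_SHARDS = 8  # illustrative
CURRENT_MONTH_START = datetime.now().replace(day=1, hour=0, minute=0, second=0, microsecond=0)

def stable_hash(value: str) -> int:
    # Built-in hash() is randomized per process for strings, so routing would
    # change on every restart. A digest is stable across processes.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

# ❌ Bad shard key: range-based timestamp sharding creates a hot shard
def get_shard_by_time(created_at: datetime) -> int:
    # All writes in the current month go to the same shard
    if created_at >= CURRENT_MONTH_START:
        return NUM_SHARDS - 1  # every new write piles here
    # Older shards go cold. The latest shard gets hammered.
    return stable_hash(f"{created_at.year}-{created_at.month}") % (NUM_SHARDS - 1)

# ✅ Good shard key: user_id distributes evenly
def get_shard_by_user(user_id: str) -> int:
    # Load spreads evenly regardless of when writes happen.
    return stable_hash(user_id) % NUM_SHARDS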

Instagram ran on a single Postgres instance far longer than most people realize. They only sharded when simpler options genuinely couldn't keep up.

💬 Beginner answer: "Split the data across multiple servers."
Strong answer: Shard only after exhausting vertical scaling, read replicas, and caching. Choose a shard key based on access patterns — user ID distributes evenly; timestamps create hot shards.
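The hot-shard effect is easy to check with a quick simulation. This sketch hashes a month string rather than using a range rule, and the shard count, month, and user IDs are all invented for the example; the point is only how the two keys spread the same batch of writes.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # illustrative

def stable_hash(value: str) -> int:
    # hashlib gives the same result in every process, unlike built-in hash() on str.
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)

def shard_by_month(month: str) -> int:
    return stable_hash(month) % NUM_SHARDS

def shard_by_user(user_id: str) -> int:
    return stable_hash(user_id) % NUM_SHARDS

# 10,000 simulated writes, all in the same month, from 10,000 different users.
writes = [("2026-05", f"user_{i}") for i in range(10_000)]

by_month = Counter(shard_by_month(month) for month, _ in writes)
by_user = Counter(shard_by_user(user) for _, user in writes)

print("timestamp key:", dict(by_month))  # every write lands on one shard
print("user_id key:  ", dict(by_user))   # roughly 2,500 per shard
```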

3. Why Adding One Server Can Break Your Entire Cache

The short answer: naive hashing routes cache keys by key % N. Change N, and almost every key routes to a different server. Your cache goes cold instantly.

import hashlib

# ❌ Naive modulo: breaks the moment you add a server
servers = ["server_1", "server_2", "server_3"]

def get_server(key: str) -> str:
    # Use a stable digest; built-in hash() on strings varies between processes.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

get_server("user_123")   # → one of the three servers

# You add a 4th server to handle more load...
servers.append("server_4")
get_server("user_123")   # → usually a different server (about 3 in 4 keys move)

# Most keys now map somewhere new.
# Your cache just went cold. Enjoy the database stampede.

Adding one server to a cache cluster can invalidate most of your cached data at once — causing every request to hit the database simultaneously. That's not a scaling win. That's an outage.

Consistent hashing is the fix. Instead of key % N, both servers and keys are mapped onto a circular ring. Each key belongs to the nearest server clockwise.

Add a server? It takes only the keys between itself and its neighbor, roughly 1/N of the data. Everything else stays put.

Adding one node displaces ~1/N keys. With naive modulo hashing, you'd be moving almost everything.
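The difference can be measured directly. Below is a minimal sketch of a hash ring, compared against naive modulo when a fourth server joins. The virtual-node count (100 per server) and the key names are assumptions chosen for the example; production rings differ in many details.

```python
import bisect
import hashlib

def stable_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring with virtual nodes to smooth out the distribution."""
    def __init__(self, servers, vnodes=100):
        self._ring = sorted(
            (stable_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def get_server(self, key: str) -> str:
        # First ring point clockwise from the key's position, wrapping at the end.
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]

keys = [f"user_{i}" for i in range(10_000)]
old_servers = ["server_1", "server_2", "server_3"]
new_servers = old_servers + ["server_4"]

# Naive modulo: count keys whose server changes after adding a fourth server.
moved_modulo = sum(
    old_servers[stable_hash(k) % 3] != new_servers[stable_hash(k) % 4] for k in keys
)

# Consistent hashing: the same comparison on the ring.
old_ring, new_ring = HashRing(old_servers), HashRing(new_servers)
moved_ring = sum(old_ring.get_server(k) != new_ring.get_server(k) for k in keys)

print(f"modulo moved: {moved_modulo / len(keys):.0%}")  # roughly three quarters
print(f"ring moved:   {moved_ring / len(keys):.0%}")    # roughly one quarter
```

On the ring, every key that moves goes to the new server; no key shuffles between the existing three.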

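Akamai was built on this. Its founders co-wrote the original paper in 1997, designing consistent hashing specifically to solve the server-addition problem at CDN scale. Cassandra and Amazon's Dynamo use the same principle today. Redis Cluster takes a related but different route: it assigns keys to a fixed set of 16,384 hash slots and moves slots between nodes.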

💬 Beginner answer: "Consistent hashing distributes keys evenly across servers."
Strong answer: Consistent hashing minimises remapping when the cluster changes — adding a server displaces only ~1/N keys. Naive modulo hashing remaps nearly everything, which turns a scaling win into a cache stampede.

So far we've been talking about scaling reads and distributing data. But distributed systems fail in a second, harder way — not "how do you store more?" but "what happens when the parts of your system stop agreeing with each other?"

4. CAP Theorem Is Taught Wrong. Here's What It Actually Means.

I once confidently explained CAP theorem in an interview. The interviewer looked up and asked:

"So you'd consider building a system that ignores partition tolerance?"

I had nothing. The conversation got uncomfortable fast.

Here's the problem: CAP is almost always taught as "pick any two." That's misleading.

Partition tolerance isn't optional. Networks fail — servers lose the ability to talk to each other. It will happen to your system. So the real choice is:

When a network partition occurs, do you prioritize consistency or availability?

	CP (Consistency first)	AP (Availability first)
Behaviour	Refuses requests until partition heals	Keeps responding, may return stale data
Risk	Downtime during failures	Stale reads
Use when	Payments, inventory, anything financial	Social feeds, recommendations, analytics
Examples	CockroachDB, ZooKeeper, Postgres (sync replication)	Cassandra, DynamoDB, CouchDB

Facebook chose AP for their social graph — a slightly stale follower count beats an app that won't load. Systems that prioritise CP — like CockroachDB or Postgres with synchronous replication — refuse writes during a partition rather than risk inconsistent state. You'd rather reject a transaction than double-charge someone.

💬 Beginner answer: "You can pick any two of consistency, availability, and partition tolerance."
Strong answer: Partition tolerance isn't a choice — networks fail. The real decision is CP vs AP: do you return stale data or refuse requests when nodes can't communicate? Apply it directly: "This is a payments system, so I'd take CP — I'd rather reject a write than risk charging someone twice."
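The CP-versus-AP choice comes down to one branch in the read path. This is a deliberately simplified sketch: `Replica`, `PartitionError`, and the balance value are hypothetical, and real systems decide based on quorum membership rather than a boolean flag.

```python
class PartitionError(Exception):
    """Raised by a CP replica that cannot confirm it has the latest data."""

class Replica:
    def __init__(self, mode: str):
        self.mode = mode              # "CP" or "AP"
        self.value = "balance=100"    # last value this replica saw
        self.partitioned = False      # True when cut off from the other replicas

    def read(self) -> str:
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than risk a stale answer.
            raise PartitionError("cannot reach quorum; try again later")
        # Availability first (or no partition): answer with what we have.
        return self.value

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True   # simulate a network partition

print(ap.read())                         # balance=100 (possibly stale)
try:
    cp.read()
except PartitionError as err:
    print(f"CP replica refused: {err}")
```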

5. Message Queues Don't Guarantee What You Think They Guarantee

Picture a restaurant on a Friday night. Orders are flying in faster than the kitchen can handle. If every waiter walked directly to a chef and demanded immediate attention, the kitchen would collapse.

Instead: orders go on a ticket rail. Chefs work through them steadily. The kitchen stays calm no matter how busy the front gets.

That's a message queue.

Uber's trip events flow through Kafka. Netflix triggers encoding jobs through queues. Slack's notification pipeline is async.

Most candidates in interviews draw a queue, say "this decouples the services," and move on. That's not wrong. But here's what nobody mentions until they're paged at 3am:

In most queue systems, design as if any message can arrive more than once.

Retries, consumer crashes, and network timeouts can all cause duplicate processing. A user gets charged twice, an email sends twice, a report generates twice.

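import stripe

def process_payment(message: dict):
    payment_id = message["payment_id"]

    # ✅ Idempotency check: skip messages we have already processed.
    # `db` is a placeholder data-access object.
    if db.payment_already_exists(payment_id):
        print(f"Already processed {payment_id}. Skipping.")
        return

    # Process only if we haven't seen this before.
    # The idempotency key covers a crash between the charge and the database
    # write: Stripe deduplicates a retried request that carries the same key.
    # Assumes the message also carries a currency code.
    stripe.Charge.create(
        amount=message["amount"],
        currency=message["currency"],
        source=message["card_token"],
        idempotency_key=payment_id,
    )
    db.mark_payment_complete(payment_id)

# Without this: duplicate charge on retry.
# With this: second delivery is a no-op. ✓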

Queue delivers to one. Pub/Sub delivers to all. Get this wrong and you'll either starve consumers or duplicate work across all of them.
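The two delivery models can be sketched in a few lines. `WorkQueue` and `PubSubTopic` are hypothetical in-memory classes written for this example, not the API of Kafka, SQS, or any other broker.

```python
from collections import deque

class WorkQueue:
    """Point-to-point: each message is handed to exactly one consumer."""
    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def consume(self):
        return self._messages.popleft() if self._messages else None

class PubSubTopic:
    """Fan-out: every subscriber receives its own copy of each message."""
    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, message):
        for handler in self._subscribers:
            handler(message)

queue = WorkQueue()
queue.publish("resize image 1")
worker_a = queue.consume()   # "resize image 1"
worker_b = queue.consume()   # None: the job was already taken

email_log, audit_log = [], []
topic = PubSubTopic()
topic.subscribe(email_log.append)
topic.subscribe(audit_log.append)
topic.publish("order placed")  # both lists now contain "order placed"
```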

There's a word for this: idempotency. It means the second call does nothing if the first one already worked. Stripe builds it into their payment API through idempotency keys. Most queue-related incidents I've seen (duplicate charges, double emails, reports generated twice) came down to idempotency missing somewhere in the pipeline.

💬 Beginner answer: "Add Kafka or SQS to decouple services."
Strong answer: Add a queue, but design consumers to be idempotent: most queues guarantee at-least-once delivery, not exactly-once. Retries will happen; processing the same message twice should produce the same result as processing it once.
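At-least-once delivery can be simulated to see why the consumer has to absorb repeats. In this sketch the broker keeps redelivering until an acknowledgement gets through; `processed_ids` stands in for a durable table, and the function names and payment values are invented for the example.

```python
processed_ids = set()     # stands in for a durable "already processed" table
charges = []              # side effects we must not repeat

def handle(message: dict) -> None:
    """Idempotent consumer: a repeated message is detected and skipped."""
    if message["payment_id"] in processed_ids:
        return
    charges.append(message["amount"])
    processed_ids.add(message["payment_id"])

def deliver_at_least_once(message: dict, lost_acks: int) -> int:
    """Redeliver until an ack gets through; returns the number of deliveries."""
    deliveries = 0
    while True:
        handle(message)
        deliveries += 1
        if deliveries > lost_acks:   # this ack finally reached the broker
            return deliveries

attempts = deliver_at_least_once({"payment_id": "pay_1", "amount": 500}, lost_acks=2)
print(attempts, charges)   # 3 [500]
```

The message is delivered three times but charged once, which is the behaviour the strong answer asks for.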

Quick Reference: Surface Answer vs Strong Answer

Concept	What most candidates say	What lands in interviews
Caching	"Add Redis"	TTL strategy, invalidation on write, cache-aside vs write-through
Sharding	"Split the DB"	Last resort — exhaust vertical scaling, replicas, caching first
Consistent hashing	"Distributes keys evenly"	Why adding servers shouldn't remap everything
CAP theorem	"Pick any two"	CP vs AP during network partitions — with a real example applied
Message queues	"Decouples services"	At-least-once delivery, duplicate handling, idempotency

The Pattern Behind All 5 Mistakes

The mistake was never "I didn't know Redis" or "I didn't know Kafka."

The mistake was treating components as answers.

Every component in a distributed system creates a new problem:

Caches create invalidation
Shards create routing
Queues create retries
Replicas create consistency trade-offs
More servers create key remapping

The interview isn't testing whether you know the component. It's testing whether you know the consequence.

That's the only pattern worth memorising.

The Real Interview Starts Before You Pick Up the Pen

Every concept here has a surface answer and a real answer.

Surface answers get you through the definition check. Real answers are what separate candidates who studied from engineers who've built and broken these systems.

The candidate whose interviewer said "exactly right" out loud didn't know more patterns than I did. He asked one question before drawing a single box:

"What scale are we actually targeting here?"

That question. Every time. Before the pen touches the whiteboard.

Before naming any component, ask three things: What problem does it solve? What new problem does it create? What signal would tell me it's worth the trade-off?

Found this useful? I write about system design, engineering interviews, and real production systems. Follow along — more coming.