Introduction
RAG does not make your LLM smarter. It gives your LLM a reference sheet.
Retrieval-Augmented Generation is simple in concept: search your documents, inject the relevant chunks into the prompt, and let the model answer using that context. In production, it is a distributed query pipeline where chunking, embedding model choice, index freshness, re-ranking, and context window limits can each silently degrade answer quality.
Most RAG failures are not generation failures. They are retrieval failures. The model answered correctly based on the wrong context you gave it.
Why This Matters
If you have debugged a slow SQL query or a stale cache, you already understand RAG failure modes. The generation step is the visible symptom. The retrieval step is often the root cause.
A support bot that cites the wrong policy version did not necessarily hallucinate. It may have retrieved an outdated chunk from a vector index that was never re-embedded after a docs update. Your job as a backend engineer is to treat RAG like any other data pipeline: measurable, observable, and testable at every stage.
Prerequisites
This article assumes you have read Blog 001. You should understand that LLMs are probabilistic token predictors without guaranteed grounding.
The Problem
The naive pattern:
User question -> LLM -> Answer
fails for domain-specific factual questions because the model has no access to your private, current documentation at inference time.
The naive RAG pattern:
User question -> Vector search -> Stuff top-K chunks -> LLM -> Answer
often fails silently because:
- Chunks are too large or too small, splitting tables across boundaries or burying the answer in noise.
- Embeddings miss semantic intent, especially for short queries or domain jargon.
- The index is stale after documentation updates.
- Top-K without re-ranking returns plausible but wrong passages.
- Context overflow truncates the one chunk that contained the answer.
Understanding the Core Concept
RAG separates knowledge storage from language generation:
| Component | Role | Failure mode |
|---|---|---|
| Ingestion | Parse, chunk, embed documents | Bad chunks, lost structure |
| Index | Store vectors for similarity search | Stale vectors, wrong metric |
| Retrieval | Find candidate passages for a query | Low recall, wrong neighbors |
| Re-ranking | Re-order candidates by relevance | Skipped step, latency spike |
| Generation | Synthesize answer from context | Ignores context, hallucinates beyond it |
The LLM is the last mile. Search quality is the first mile.
Chunking strategies
Fixed-size chunks (for example, 512 tokens with overlap) are simple but may split sentences and tables. Semantic chunks split on paragraph or section boundaries for better coherence. Parent-child chunking retrieves small chunks for precision but injects larger parent context for generation.
Overlap, typically 10-20% of chunk size, reduces boundary artifacts where the answer spans two chunks.
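To make the overlap idea concrete, here is a minimal sketch of fixed-size chunking with a sliding window. It assumes whitespace-separated words stand in for real tokens (a production pipeline would count tokens with the embedding model's tokenizer), and the function names `chunk_tokens` and `chunk_text` are illustrative, not from any library.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    # Assumption: whitespace splitting approximates tokenization.
    return [" ".join(c) for c in chunk_tokens(text.split(), chunk_size, overlap)]
```

With `chunk_size=512` and `overlap=64` the overlap is 12.5% of the chunk, inside the 10-20% range mentioned above. Each chunk begins with the last `overlap` tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.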
Embedding model choice
The embedding model maps text to vectors where cosine similarity approximates semantic relatedness. A mismatch between embedding model and domain (legal, medical, code) hurts recall. For RAG quality, embedding choice often matters more than generator model choice.
Hybrid retrieval
Dense vector search alone struggles with exact identifiers (SKUs, error codes, function names). Combining BM25 keyword search with dense retrieval (hybrid search) typically improves recall on workloads that mix natural-language questions with exact-match tokens.
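One common way to merge a keyword ranking with a dense ranking is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one possible sketch: it assumes each retriever returns an ordered list of document IDs, and it uses the conventional constant `k=60`. The document IDs in the usage note are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

For example, fusing a hypothetical BM25 ranking `["sku-123", "faq-7", "policy-2"]` with a dense ranking `["policy-2", "sku-123", "guide-9"]` puts `sku-123` and `policy-2` ahead of documents that only one retriever found. RRF needs only ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.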
How It Works Internally (High Level)
- Offline: Documents are parsed, chunked, embedded, and stored in a vector index with optional metadata filters.
- Online: User query is embedded with the same model used at ingest time.
- Search: Approximate nearest neighbor (ANN) search returns top-K candidates.
- Re-rank (optional): A cross-encoder or lightweight reranker scores query-passage pairs.
- Prompt assembly: System instructions plus retrieved passages plus user question.
- Generation: LLM produces an answer constrained by provided context.
Step-by-Step Example
Question: "What is the refund window for annual plans?"
- Embed the query.
- Search index for top-5 chunks by cosine similarity.
- Re-rank so the passage about annual billing rises to the top.
- Assemble prompt with system rule: "Answer only from context. If unknown, say so."
- Generate with low temperature (0.1-0.3).
- Log query hash, chunk IDs, scores, and latency per stage.
If retrieval returns a chunk about monthly plans only, the model will answer confidently about monthly plans. Debug retrieval first.
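The re-rank step in this walkthrough can be sketched as a stage that takes any query-passage scoring function. In a real system the scorer would usually be a cross-encoder; the `token_overlap_score` function below is only a toy stand-in so the sketch runs without a model download, and both function names are illustrative.

```python
import re
from typing import Callable


def token_overlap_score(query: str, passage: str) -> float:
    """Toy scorer (stand-in for a cross-encoder): fraction of query words found in the passage."""
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    passage_words = set(re.findall(r"[a-z0-9]+", passage.lower()))
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float] = token_overlap_score,
    top_n: int = 3,
) -> list[tuple[str, float]]:
    """Score every (query, passage) pair and return the best top_n, highest first."""
    scored = [(passage, score_fn(query, passage)) for passage in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Because the scorer sees the query and the passage together, it can separate "annual" from "monthly" even when their embeddings sit close to each other. Keeping `score_fn` pluggable also makes it easy to log scores before and after re-ranking.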
Architecture
Highlight retrieval in your monitoring. That is where most failures happen.
Python Example
Minimal RAG retrieval loop using sentence embeddings and cosine similarity. For production, use a dedicated vector database.
    """
    Minimal RAG retrieval demo.
    Requires: pip install sentence-transformers numpy
    """
    import numpy as np
    from sentence_transformers import SentenceTransformer

    # The same model must embed both documents and queries.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    DOCUMENTS = [
        "Annual plans: refunds available within 14 days of purchase.",
        "Monthly plans: no refunds after billing cycle starts.",
        "Enterprise plans: custom refund terms per contract.",
    ]


    def embed_texts(texts: list[str]) -> np.ndarray:
        # Unit-length vectors make the dot product equal to cosine similarity.
        vectors = model.encode(texts, normalize_embeddings=True)
        return np.array(vectors)


    def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[tuple[str, float]]:
        # Demo only: documents are re-embedded on every call.
        doc_vectors = embed_texts(docs)
        query_vector = embed_texts([query])[0]
        scores = doc_vectors @ query_vector  # one cosine score per document
        ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
        return [(doc, float(score)) for doc, score in ranked[:top_k]]


    def build_prompt(query: str, contexts: list[str]) -> str:
        joined = "\n\n".join(f"- {c}" for c in contexts)
        return (
            f"Context:\n{joined}\n\n"
            f"Question: {query}\n\n"
            "Answer using only the context above."
        )


    if __name__ == "__main__":
        query = "Can I get a refund on an annual subscription?"
        hits = retrieve(query, DOCUMENTS, top_k=2)
        for doc, score in hits:
            print(f"[{score:.3f}] {doc}")
        prompt = build_prompt(query, [d for d, _ in hits])
        print("\n--- Prompt ---\n", prompt)
To move this toward production, add a chunking pipeline, metadata filters, a reranker, and an eval set with recall@K metrics.
Real-World Applications
- Internal documentation Q&A over wikis and PDFs
- Customer support bots grounded in help center articles
- Code assistants retrieving relevant files from a repository index
- Compliance workflows requiring citations to source documents
- Hybrid search combining BM25 keyword match with dense embeddings
Performance Considerations
- Latency: Retrieval adds 50-300ms depending on index size, reranker, and filters. Budget it in your SLA.
- Cost: Embedding at ingest time plus query-time embedding. Re-embedding an entire corpus on model swap is expensive.
- Freshness: Event-driven re-index on document change beats nightly batch for accuracy.
- Recall vs precision: Higher top-K improves recall but increases prompt tokens and noise.
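The recall-versus-precision tradeoff shows up at prompt assembly, where a fixed token budget decides which chunks survive. A minimal sketch of budget-aware packing follows. It assumes chunks arrive as `(text, score)` pairs and that whitespace word count approximates token count; `pack_context` is a hypothetical helper, not a library call.

```python
from typing import Callable


def pack_context(
    chunks: list[tuple[str, float]],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> tuple[list[str], list[str]]:
    """Greedily keep the highest-scoring chunks that fit; return (kept, dropped)."""
    kept: list[str] = []
    dropped: list[str] = []
    used = 0
    for text, _score in sorted(chunks, key=lambda c: c[1], reverse=True):
        cost = count_tokens(text)
        if used + cost <= max_tokens:
            kept.append(text)
            used += cost
        else:
            dropped.append(text)  # log these instead of truncating silently
    return kept, dropped
```

Returning the dropped chunks explicitly turns silent truncation into something you can log and alert on. Swap `count_tokens` for the generator model's real tokenizer before trusting the budget.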
Common Mistakes
- Debugging prompts before debugging retrieval.
- No eval set with labeled query-document pairs.
- Chunking PDFs naively, destroying tables and lists.
- Skipping re-ranking when top-K ANN results are noisy.
- Assuming the LLM will ignore irrelevant context.
Interview Questions
Q1: What is RAG in one sentence?
A: A pattern that retrieves relevant documents at query time and injects them into the LLM prompt so answers are grounded in external knowledge.
Q2: Where do most RAG failures occur?
A: In retrieval: wrong chunks, stale index, poor embeddings, or insufficient recall.
Q3: How does RAG differ from fine-tuning for knowledge?
A: RAG injects facts at query time from an updatable index. Fine-tuning changes model weights and behavior but does not reliably store volatile facts.
Q4: What metrics would you track for a RAG pipeline?
A: Recall@K, MRR, retrieval latency, rerank latency, faithfulness, citation accuracy, and end-to-end answer correctness.
Q5: Why use chunk overlap?
A: To prevent answers from being split across chunk boundaries where neither chunk alone contains the full answer.
Q6: When would you add a re-ranker?
A: When ANN search returns semantically nearby but task-irrelevant passages, especially with short queries or large corpora.
Production Checklist
Before shipping RAG to production, verify each layer:
- Ingestion idempotency: Re-running ingest on the same document produces the same chunk IDs or upserts cleanly.
- Version tags: Store embedding model name and version on every vector.
- Retrieval eval: Recall@5 above your threshold on a labeled set of at least 100 queries.
- Faithfulness eval: Answers cite only retrieved text on a held-out Q&A set.
- Latency budget: p95 retrieval under your SLA (often 200ms excluding LLM).
- Failure logging: Log empty retrieval, low scores, and truncated context.
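The first two checklist items (ingestion idempotency and version tags) can be sketched with deterministic chunk IDs. This is one possible approach under stated assumptions: a plain `dict` stands in for the vector store, IDs are derived from a SHA-256 hash of document ID, chunk position, embedding model name, and chunk text, and the names `chunk_id` and `upsert_chunks` are illustrative.

```python
import hashlib


def chunk_id(doc_id: str, chunk_index: int, text: str, embedding_model: str) -> str:
    """Derive a stable ID: same inputs always produce the same ID."""
    payload = "\x1f".join([doc_id, str(chunk_index), embedding_model, text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def upsert_chunks(index: dict, doc_id: str, chunks: list[str], embedding_model: str) -> list[str]:
    """Write chunks into a dict-backed index (stand-in for a vector store); return the IDs written."""
    ids = []
    for i, text in enumerate(chunks):
        cid = chunk_id(doc_id, i, text, embedding_model)
        # The model name is stored on every record so mixed-version vectors can be detected.
        index[cid] = {"doc_id": doc_id, "text": text, "embedding_model": embedding_model}
        ids.append(cid)
    return ids
```

Re-running ingest on unchanged content overwrites the same keys instead of creating duplicates. When a document changes, the new chunks get new IDs, so the IDs returned here should be compared with the previously stored IDs for that `doc_id` and the leftovers deleted.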
Index freshness patterns
| Pattern | Freshness | Complexity |
|---|---|---|
| Nightly batch | Hours stale | Low |
| Event-driven on doc update | Minutes | Medium |
| Write-through on publish | Near real-time | High |
For policy and pricing docs, event-driven re-index is usually worth the engineering cost.
When RAG is not enough
RAG handles lookup over static or slow-changing text. It struggles with:
- Real-time transactional data (use tools or SQL instead)
- Multi-hop reasoning across many documents (consider agent workflows or graph retrieval)
- Computations (the model may hallucinate arithmetic; use a calculator tool)
Combine RAG with MCP tools (Blog 003) when answers require live system state.
Design Tradeoffs: Chunk Size
| Chunk size | Pros | Cons |
|---|---|---|
| 128-256 tokens | Precise retrieval | May lack surrounding context |
| 512-1024 tokens | Good default for prose | Tables may split awkwardly |
| 2000+ tokens | Full section context | Lower precision, higher noise |
Tune on your eval set. Legal and API docs often need structure-aware chunking (by heading or OpenAPI operation), not fixed token windows.
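As an illustration of structure-aware chunking, the sketch below splits a document on headings instead of token windows. It assumes the source is Markdown with `#`-style headings, which the article does not specify, and it does not handle `#` lines inside code fences; `chunk_by_heading` is a hypothetical helper.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Return one chunk per heading section, keeping tables and lists intact."""
    chunks: list[dict] = []
    title, lines = "(preamble)", []

    def flush() -> None:
        # Skip sections that contain only blank lines.
        if any(ln.strip() for ln in lines):
            chunks.append({"heading": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Because a section is never cut in the middle, a table under a heading stays in one chunk. Sections longer than the embedding model's input limit would still need a secondary split, for example the fixed-size window applied within the section.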
Extended Walkthrough: Debugging a Wrong Answer
A user asks: "Do annual plans include phone support?"
- Check generation: model cited "Priority email support for all plans."
- Check retrieval logs: chunk `support_tiers_v3.md#chunk-14` scored highest.
- Open chunk: it describes monthly plans only; annual tier is chunk-22.
- Root cause: embedding confused "annual" and "monthly" in short query.
- Fix: add metadata filter `plan_type=annual` when query classifier detects billing intent; add reranker; expand eval queries.
Without retrieval logs, you would have tweaked the system prompt for days.
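The metadata-filter fix from this walkthrough can be sketched as a filter applied before ranking. The candidate shape (`id`, `score`, `metadata`) and the scores are assumptions for illustration; real vector databases expose filters through their own query APIs.

```python
def filter_then_rank(candidates: list[dict], required: dict[str, str], top_k: int = 3) -> list[dict]:
    """Keep candidates whose metadata matches every required key, then rank by score."""
    matching = [
        c for c in candidates
        if all(c.get("metadata", {}).get(key) == value for key, value in required.items())
    ]
    return sorted(matching, key=lambda c: c["score"], reverse=True)[:top_k]


# Hypothetical candidates mirroring the walkthrough: chunk-14 outscores chunk-22.
candidates = [
    {"id": "support_tiers_v3.md#chunk-14", "score": 0.81, "metadata": {"plan_type": "monthly"}},
    {"id": "support_tiers_v3.md#chunk-22", "score": 0.78, "metadata": {"plan_type": "annual"}},
]
hits = filter_then_rank(candidates, {"plan_type": "annual"})
```

With the filter in place, the higher-scoring monthly chunk never reaches the prompt. The filter is only as good as the classifier that sets `plan_type`, so misclassified queries belong in the eval set too.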
Observability Fields
Log per request:
- `query_text_hash`
- `embedding_model_version`
- `retrieved_chunk_ids` with scores
- `rerank_scores` if applicable
- `prompt_token_count`
- `answer_faithfulness_score` if automated
These fields make RAG debuggable like any distributed pipeline.
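A per-request record with these fields might look like the sketch below. The field names follow the list above; `retrieval_scores` is an added field for the "with scores" part, and the choice of SHA-256 for the query hash and JSON for serialization are assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RetrievalLog:
    query_text_hash: str
    embedding_model_version: str
    retrieved_chunk_ids: list[str]
    retrieval_scores: list[float]
    prompt_token_count: int
    rerank_scores: Optional[list[float]] = None
    answer_faithfulness_score: Optional[float] = None


def make_log(
    query: str,
    embedding_model_version: str,
    hits: list[tuple[str, float]],
    prompt_token_count: int,
) -> RetrievalLog:
    """Build a log record from (chunk_id, score) hits; the raw query is hashed, not stored."""
    return RetrievalLog(
        query_text_hash=hashlib.sha256(query.encode("utf-8")).hexdigest(),
        embedding_model_version=embedding_model_version,
        retrieved_chunk_ids=[chunk_id for chunk_id, _ in hits],
        retrieval_scores=[round(float(score), 4) for _, score in hits],
        prompt_token_count=prompt_token_count,
    )


def to_json(log: RetrievalLog) -> str:
    # One JSON object per line works with most log shippers.
    return json.dumps(asdict(log), sort_keys=True)
```

Hashing the query keeps user text out of logs while still letting you group repeated queries. The optional fields stay `None` when no reranker or faithfulness scorer runs.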
Eval Metrics Reference
| Metric | Measures | Target direction |
|---|---|---|
| Recall@K | Relevant doc in top K | Higher |
| MRR | Rank of first relevant | Higher |
| nDCG | Graded relevance ranking | Higher |
| Faithfulness | Answer supported by context | Higher |
| Answer correctness | End-to-end vs gold | Higher |
| Latency p95 | Retrieval plus generation | Lower |
Run retrieval and generation evals separately before combining. A 10% retrieval recall gain often beats switching to a more expensive LLM.
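Two of the retrieval metrics above, Recall@K and MRR, are simple enough to compute by hand. The sketch assumes each eval example is a pair of (retrieved IDs in rank order, set of relevant IDs). Recall@K is computed here as the fraction of relevant documents found in the top K, which reduces to "relevant doc in top K" when each query has a single labeled document.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant IDs that appear in the top k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant ID, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average both metrics over all labeled queries."""
    if not runs:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    n = len(runs)
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }
```

Running `evaluate` on the same labeled set before and after a chunking or embedding change gives a retrieval-only comparison that does not depend on the generator.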
Anti-Patterns in RAG Products
- Dump entire wiki: Retrieval exists but pipeline sends 50 random pages.
- No citations: User cannot verify answers; trust erodes on first error.
- Single embedding for code and prose: Split indexes or use hybrid search.
- Ignore ACLs: Vector index returns docs the user cannot access.
- Skip ACL sync on delete: Documents with revoked permissions remain retrievable until the next reindex.
Enforce document-level permissions at retrieval time, not only at the UI.
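A minimal sketch of retrieval-time permission checks follows, assuming each indexed chunk carries an `allowed_groups` list and the caller's group memberships are known at query time. Both the field name and the helper `filter_by_acl` are hypothetical.

```python
def filter_by_acl(hits: list[dict], user_groups: set[str]) -> list[dict]:
    """Drop hits the user may not read; chunks without an ACL are denied by default."""
    return [
        hit for hit in hits
        if user_groups & set(hit.get("allowed_groups", ()))
    ]
```

Filtering after the search can leave fewer than K results, so many systems push the same condition into the index query as a metadata filter instead. Either way, the check has to use current permissions, not the ones captured at ingest time.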
Summary
RAG is a search pipeline with an LLM at the end. Treat chunking, embedding, indexing, and retrieval as first-class engineering problems. Measure retrieval quality before tuning generation. Your RAG system is only as good as the search layer underneath it.
Further Reading
- Lewis et al.: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- LlamaIndex and LangChain RAG documentation
- BEIR benchmark for retrieval evaluation
