Ameer Hamza

Posted on Jul 5

Retrieval-Augmented Generation (RAG) for Backend Engineers

Introduction

RAG does not make your LLM smarter. It gives your LLM a reference sheet.

Retrieval-Augmented Generation is simple in concept: search your documents, inject the relevant chunks into the prompt, and let the model answer using that context. In production, it is a distributed query pipeline in which chunking, embedding model choice, index freshness, re-ranking, and context window limits can each silently degrade answer quality.

Most RAG failures are not generation failures. They are retrieval failures. The model answered correctly based on the wrong context you gave it.

Why This Matters

If you have debugged a slow SQL query or a stale cache, you already understand RAG failure modes. The generation step is the visible symptom. The retrieval step is often the root cause.

A support bot that cites the wrong policy version did not necessarily hallucinate. It may have retrieved an outdated chunk from a vector index that was never re-embedded after a docs update. Your job as a backend engineer is to treat RAG like any other data pipeline: measurable, observable, and testable at every stage.

Prerequisites

This article assumes you have read Blog 001. You should understand that LLMs are probabilistic token predictors without guaranteed grounding.

The Problem

The naive pattern:

User question -> LLM -> Answer

fails for domain-specific factual questions because the model has no access to your private, current documentation at inference time.

The naive RAG pattern:

User question -> Vector search -> Stuff top-K chunks -> LLM -> Answer

often fails silently because:

Chunks are too large or too small, splitting tables across boundaries or burying the answer in noise.
Embeddings miss semantic intent, especially for short queries or domain jargon.
The index is stale after documentation updates.
Top-K without re-ranking returns plausible but wrong passages.
Context overflow truncates the one chunk that contained the answer.

Understanding the Core Concept

RAG separates knowledge storage from language generation:

Component	Role	Failure mode
Ingestion	Parse, chunk, embed documents	Bad chunks, lost structure
Index	Store vectors for similarity search	Stale vectors, wrong metric
Retrieval	Find candidate passages for a query	Low recall, wrong neighbors
Re-ranking	Re-order candidates by relevance	Skipped step, latency spike
Generation	Synthesize answer from context	Ignores context, hallucinates beyond it

The LLM is the last mile. Search quality is the first mile.

Chunking strategies

Fixed-size chunks (for example, 512 tokens with overlap) are simple but may split sentences and tables. Semantic chunks split on paragraph or section boundaries for better coherence. Parent-child chunking retrieves small chunks for precision but injects larger parent context for generation.

Overlap, typically 10-20% of chunk size, reduces boundary artifacts where the answer spans two chunks.
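As a rough illustration of fixed-size chunking with overlap, the sketch below slides a window over a token list. The function name chunk_tokens and the whitespace tokenization are assumptions made for this example; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Whitespace split stands in for a real tokenizer (assumption).
words = "a b c d e f g h i j".split()
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts on the last token of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.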

Embedding model choice

The embedding model maps text to vectors where cosine similarity approximates semantic relatedness. A mismatch between embedding model and domain (legal, medical, code) hurts recall. For RAG quality, embedding choice often matters more than generator model choice.

Hybrid retrieval

Dense vector search alone struggles with exact identifiers (SKUs, error codes, function names). Combining BM25 keyword search with dense retrieval (hybrid search) improves recall on production workloads.
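One common way to combine the two result lists is reciprocal rank fusion (RRF), which merges rankings without requiring BM25 and cosine scores to share a scale. The article does not prescribe a fusion method, so treat RRF, the function name, and the document IDs below as illustrative assumptions. The constant k=60 is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists; each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical outputs of a keyword search and a dense search for the same query.
bm25_ranking = ["sku-123", "doc-a", "doc-b"]
dense_ranking = ["doc-a", "doc-c", "sku-123"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(f"{score:.4f} {doc_id}")
```

Documents that appear in both lists accumulate score from each, so an exact-identifier match found only by BM25 is not lost, and a passage that both retrievers agree on rises to the top.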

How It Works Internally (High Level)

Offline: Documents are parsed, chunked, embedded, and stored in a vector index with optional metadata filters.
Online: User query is embedded with the same model used at ingest time.
Search: Approximate nearest neighbor (ANN) search returns top-K candidates.
Re-rank (optional): A cross-encoder or lightweight reranker scores query-passage pairs.
Prompt assembly: System instructions plus retrieved passages plus user question.
Generation: LLM produces an answer constrained by provided context.

Step-by-Step Example

Question: "What is the refund window for annual plans?"

Embed the query.
Search index for top-5 chunks by cosine similarity.
Re-rank so the passage about annual billing rises to the top.
Assemble prompt with system rule: "Answer only from context. If unknown, say so."
Generate with low temperature (0.1-0.3).
Log query hash, chunk IDs, scores, and latency per stage.
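The steps above can be sketched as a small orchestration function that times each stage and returns the fields worth logging. The search, rerank, and generate callables are hypothetical stand-ins injected by the caller, and query embedding is assumed to happen inside search; nothing here is tied to a specific vector store or LLM client.

```python
import hashlib
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def timed(stage: str, timings: dict[str, float]) -> Iterator[None]:
    """Record wall-clock milliseconds spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000.0


def run_pipeline(query: str, search: Callable, rerank: Callable,
                 generate: Callable, top_k: int = 5) -> dict:
    timings: dict[str, float] = {}
    with timed("search", timings):
        candidates = search(query, top_k)  # list of (chunk_id, text, score)
    with timed("rerank", timings):
        ranked = rerank(query, candidates)
    with timed("generate", timings):
        answer = generate(query, [text for _cid, text, _score in ranked])
    return {
        "query_hash": hashlib.sha256(query.encode("utf-8")).hexdigest()[:16],
        "chunk_ids": [cid for cid, _text, _score in ranked],
        "scores": [score for _cid, _text, score in ranked],
        "latency_ms": timings,
        "answer": answer,
    }


# Stub stages for demonstration only.
def fake_search(query, top_k):
    return [("c1", "Monthly plans: no refunds.", 0.80),
            ("c2", "Annual plans: refunds within 14 days.", 0.70)][:top_k]

def fake_rerank(query, candidates):
    return sorted(candidates, key=lambda c: "Annual" in c[1], reverse=True)

def fake_generate(query, contexts):
    return contexts[0]

result = run_pipeline("What is the refund window for annual plans?",
                      fake_search, fake_rerank, fake_generate)
print(result["chunk_ids"], result["latency_ms"])
```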

If retrieval returns a chunk about monthly plans only, the model will answer confidently about monthly plans. Debug retrieval first.

Architecture

Highlight retrieval in your monitoring. That is where most failures happen.

Python Example

The following is a minimal RAG retrieval loop using sentence embeddings and cosine similarity. For production, use a dedicated vector database.

"""
Minimal RAG retrieval demo.
Requires: pip install sentence-transformers numpy
"""
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

DOCUMENTS = [
    "Annual plans: refunds available within 14 days of purchase.",
    "Monthly plans: no refunds after billing cycle starts.",
    "Enterprise plans: custom refund terms per contract.",
]

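def embed_texts(texts: list[str]) -> np.ndarray:
    # normalize_embeddings=True returns unit vectors, so a dot product equals cosine similarity.
    vectors = model.encode(texts, normalize_embeddings=True)
    return np.asarray(vectors)

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[tuple[str, float]]:
    # Demo only: documents are re-embedded on every call. Embed once at ingest time in real use.
    doc_vectors = embed_texts(docs)
    query_vector = embed_texts([query])[0]
    scores = doc_vectors @ query_vector  # one cosine similarity per document
    ranked = sorted(zip(docs, scores.tolist()), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]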

def build_prompt(query: str, contexts: list[str]) -> str:
    joined = "\n\n".join(f"- {c}" for c in contexts)
    return (
        f"Context:\n{joined}\n\n"
        f"Question: {query}\n\n"
        "Answer using only the context above."
    )

if __name__ == "__main__":
    query = "Can I get a refund on an annual subscription?"
    hits = retrieve(query, DOCUMENTS, top_k=2)
    for doc, score in hits:
        print(f"[{score:.3f}] {doc}")
    prompt = build_prompt(query, [d for d, _ in hits])
    print("\n--- Prompt ---\n", prompt)

To move this toward production, add a chunking pipeline, metadata filters, a reranker, and an eval set with recall@K metrics.
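A recall@K check over a labeled set can be quite small. The sketch below treats recall@K as the fraction of queries with at least one relevant chunk in the top K, which is one common definition; the function name, query IDs, and chunk IDs are assumptions for illustration.

```python
def recall_at_k(results: dict[str, list[str]], relevant: dict[str, set[str]], k: int) -> float:
    """Fraction of labeled queries with at least one relevant chunk in the top k."""
    if not relevant:
        return 0.0
    hits = 0
    for query_id, gold in relevant.items():
        top_k = results.get(query_id, [])[:k]
        if gold & set(top_k):
            hits += 1
    return hits / len(relevant)


# Hypothetical retrieval output (ranked chunk IDs) and gold labels.
results = {"q1": ["c3", "c1"], "q2": ["c9", "c8", "c2"]}
relevant = {"q1": {"c1"}, "q2": {"c2"}}

print(recall_at_k(results, relevant, k=2))
print(recall_at_k(results, relevant, k=3))
```

Running this on every index or chunking change turns retrieval quality into a regression test rather than an impression.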

Real-World Applications

Internal documentation Q&A over wikis and PDFs
Customer support bots grounded in help center articles
Code assistants retrieving relevant files from a repository index
Compliance workflows requiring citations to source documents
Hybrid search combining BM25 keyword match with dense embeddings

Performance Considerations

Latency: Retrieval typically adds 50-300ms, depending on index size, reranker, and filters. Budget it in your SLA.
Cost: Embedding at ingest time plus query-time embedding. Re-embedding an entire corpus on model swap is expensive.
Freshness: Event-driven re-index on document change beats nightly batch for accuracy.
Recall vs precision: Higher top-K improves recall but increases prompt tokens and noise.
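The recall-versus-precision tradeoff shows up concretely as a token budget. One way to keep context overflow from silently cutting a chunk in half is to pack whole chunks by score and report what was left out. This is a sketch under stated assumptions: pack_context is a hypothetical helper, the scores are invented, and the whitespace word count is only a crude stand-in for a real tokenizer.

```python
def pack_context(chunks: list[tuple[str, float]], max_tokens: int) -> tuple[list[str], list[str]]:
    """Keep whole chunks in score order until the budget is spent; never truncate."""
    kept: list[str] = []
    dropped: list[str] = []
    used = 0
    for text, _score in sorted(chunks, key=lambda c: c[1], reverse=True):
        cost = len(text.split())  # crude token estimate (assumption)
        if used + cost <= max_tokens:
            kept.append(text)
            used += cost
        else:
            dropped.append(text)  # log these instead of cutting them mid-sentence
    return kept, dropped


scored_chunks = [
    ("Annual plans: refunds available within 14 days of purchase.", 0.92),
    ("Monthly plans: no refunds after billing cycle starts.", 0.61),
    ("Enterprise plans: custom refund terms per contract.", 0.58),
]
kept, dropped = pack_context(scored_chunks, max_tokens=16)
print("kept:", kept)
print("dropped:", dropped)
```

A smaller, lower-ranked chunk can still fit after a larger one is skipped, and the dropped list gives you something to log when the budget is the reason an answer went missing.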

Common Mistakes

Debugging prompts before debugging retrieval.
No eval set with labeled query-document pairs.
Chunking PDFs naively, destroying tables and lists.
Skipping re-ranking when top-K ANN results are noisy.
Assuming the LLM will ignore irrelevant context.

Interview Questions

Q1: What is RAG in one sentence?

A: A pattern that retrieves relevant documents at query time and injects them into the LLM prompt so answers are grounded in external knowledge.

Q2: Where do most RAG failures occur?

A: In retrieval: wrong chunks, stale index, poor embeddings, or insufficient recall.

Q3: How does RAG differ from fine-tuning for knowledge?

A: RAG injects facts at query time from an updatable index. Fine-tuning changes model weights and behavior but does not reliably store volatile facts.

Q4: What metrics would you track for a RAG pipeline?

A: Recall@K, MRR, retrieval latency, rerank latency, faithfulness, citation accuracy, and end-to-end answer correctness.

Q5: Why use chunk overlap?

A: To prevent answers from being split across chunk boundaries where neither chunk alone contains the full answer.

Q6: When would you add a re-ranker?

A: When ANN search returns semantically nearby but task-irrelevant passages, especially with short queries or large corpora.

Production Checklist

Before shipping RAG to production, verify each layer:

Ingestion idempotency: Re-running ingest on the same document produces the same chunk IDs or upserts cleanly.
Version tags: Store embedding model name and version on every vector.
Retrieval eval: Recall@5 above your threshold on a labeled set of at least 100 queries.
Faithfulness eval: Answers cite only retrieved text on a held-out Q&A set.
Latency budget: p95 retrieval under your SLA (often 200ms excluding LLM).
Failure logging: Log empty retrieval, low scores, and truncated context.
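The first two checklist items can be illustrated with deterministic chunk IDs and a version tag stored next to each vector. This is a minimal sketch: the in-memory dict stands in for a vector store, and the helper names and the "v1"/"v2" version strings are assumptions, not part of any library.

```python
import hashlib


def make_chunk_id(doc_id: str, chunk_index: int, text: str) -> str:
    """Same document, position, and text always yield the same ID."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return f"{doc_id}#chunk-{chunk_index}-{digest}"


def upsert_chunks(store: dict, doc_id: str, chunks: list[str],
                  model_name: str, model_version: str) -> list[str]:
    """Write chunks keyed by deterministic ID, so re-running ingest overwrites in place."""
    ids = []
    for index, text in enumerate(chunks):
        chunk_id = make_chunk_id(doc_id, index, text)
        store[chunk_id] = {
            "text": text,
            "embedding_model": model_name,
            "embedding_model_version": model_version,
        }
        ids.append(chunk_id)
    return ids


def vectors_needing_reembed(store: dict, model_name: str, model_version: str) -> list[str]:
    """IDs whose stored tag differs from the model currently used for queries."""
    current = (model_name, model_version)
    return [cid for cid, rec in store.items()
            if (rec["embedding_model"], rec["embedding_model_version"]) != current]


store: dict = {}
chunks = ["Annual plans: refunds within 14 days.", "Monthly plans: no refunds."]
first_run = upsert_chunks(store, "refunds.md", chunks, "all-MiniLM-L6-v2", "v1")
second_run = upsert_chunks(store, "refunds.md", chunks, "all-MiniLM-L6-v2", "v1")
print(first_run == second_run, len(store))
```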

Index freshness patterns

Pattern	Freshness	Complexity
Nightly batch	Hours stale	Low
Event-driven on doc update	Minutes	Medium
Write-through on publish	Near real-time	High

For policy and pricing docs, event-driven re-index is usually worth the engineering cost.
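An event-driven re-index handler might look roughly like the following. It assumes a hypothetical on_document_updated hook called by your CMS or docs pipeline, an in-memory dict as the index, and an injected embed callable. The example texts are invented. Only chunks whose content changed are re-embedded, and chunks that no longer exist are removed.

```python
import hashlib
from typing import Callable


def on_document_updated(doc_id: str, new_chunks: list[str], index: dict,
                        embed: Callable[[str], list[float]]) -> dict[str, int]:
    """Sync one document's chunks into the index after an update event."""
    new_by_key = {}
    for text in new_chunks:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        new_by_key[f"{doc_id}:{digest}"] = text

    old_keys = {key for key, rec in index.items() if rec["doc_id"] == doc_id}
    stale = old_keys - new_by_key.keys()
    for key in stale:
        del index[key]  # outdated text must not stay retrievable

    added = [key for key in new_by_key if key not in index]
    for key in added:
        text = new_by_key[key]
        index[key] = {"doc_id": doc_id, "text": text, "vector": embed(text)}
    return {"added": len(added), "removed": len(stale)}


embed_calls: list[str] = []

def fake_embed(text: str) -> list[float]:
    embed_calls.append(text)  # stub: a real embedder returns a model vector
    return [float(len(text))]

index: dict = {}
first = on_document_updated("pricing.md", ["Annual: 14 days", "Monthly: none"], index, fake_embed)
second = on_document_updated("pricing.md", ["Annual: 30 days", "Monthly: none"], index, fake_embed)
print(first, second)
```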

When RAG is not enough

RAG handles lookup over static or slow-changing text. It struggles with:

Real-time transactional data (use tools or SQL instead)
Multi-hop reasoning across many documents (consider agent workflows or graph retrieval)
Computations (the model may hallucinate arithmetic; use a calculator tool)

Combine RAG with MCP tools (Blog 003) when answers require live system state.

Design Tradeoffs: Chunk Size

Chunk size	Pros	Cons
128-256 tokens	Precise retrieval	May lack surrounding context
512-1024 tokens	Good default for prose	Tables may split awkwardly
2000+ tokens	Full section context	Lower precision, higher noise

Tune on your eval set. Legal and API docs often need structure-aware chunking (by heading or OpenAPI operation), not fixed token windows.

Extended Walkthrough: Debugging a Wrong Answer

A user asks: "Do annual plans include phone support?"

Check generation: model cited "Priority email support for all plans."
Check retrieval logs: chunk support_tiers_v3.md#chunk-14 scored highest.
Open chunk: it describes monthly plans only; annual tier is chunk-22.
Root cause: the embedding did not separate "annual" from "monthly" in a short query.
Fix: add metadata filter plan_type=annual when query classifier detects billing intent; add reranker; expand eval queries.

Without retrieval logs, you would have tweaked the system prompt for days.
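The metadata-filter part of that fix could be sketched as below. The keyword-based detect_plan_type is a deliberately naive stand-in for a real query classifier, and the plan_type field and scores are assumptions about how the chunks were tagged and ranked.

```python
from typing import Optional

PLAN_TYPES = ("annual", "monthly", "enterprise")


def detect_plan_type(query: str) -> Optional[str]:
    """Naive classifier: return the first plan name mentioned in the query."""
    lowered = query.lower()
    for plan in PLAN_TYPES:
        if plan in lowered:
            return plan
    return None


def apply_plan_filter(query: str, candidates: list[dict]) -> list[dict]:
    """Restrict candidates to the detected plan; fall back to all if nothing matches."""
    plan = detect_plan_type(query)
    if plan is None:
        return candidates
    matching = [c for c in candidates if c.get("plan_type") == plan]
    return matching or candidates


candidates = [
    {"chunk_id": "support_tiers_v3.md#chunk-14", "plan_type": "monthly", "score": 0.81},
    {"chunk_id": "support_tiers_v3.md#chunk-22", "plan_type": "annual", "score": 0.78},
]
filtered = apply_plan_filter("Do annual plans include phone support?", candidates)
print([c["chunk_id"] for c in filtered])
```

The fallback keeps the filter from returning nothing when the metadata is missing, at the cost of occasionally letting the wrong plan through; which behavior you want is worth deciding on your eval set.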

Observability Fields

Log per request:

query_text_hash
embedding_model_version
retrieved_chunk_ids with scores
rerank_scores if applicable
prompt_token_count
answer_faithfulness_score if automated

These fields make RAG debuggable like any distributed pipeline.
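One way to keep these fields consistent across services is a small typed record serialized as one JSON line per request. The class name RagRequestLog, the hash_query helper, and the sample values are assumptions for this sketch.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RagRequestLog:
    query_text_hash: str
    embedding_model_version: str
    retrieved_chunk_ids: list[str]
    retrieval_scores: list[float]
    prompt_token_count: int
    rerank_scores: Optional[list[float]] = None        # only when a reranker ran
    answer_faithfulness_score: Optional[float] = None  # only when scored automatically

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def hash_query(query: str) -> str:
    """Log a hash rather than raw text so logs do not store user input verbatim."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


record = RagRequestLog(
    query_text_hash=hash_query("Do annual plans include phone support?"),
    embedding_model_version="all-MiniLM-L6-v2",
    retrieved_chunk_ids=["support_tiers_v3.md#chunk-14"],
    retrieval_scores=[0.81],
    prompt_token_count=412,
)
print(record.to_json())
```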

Eval Metrics Reference

Metric	Measures	Target direction
Recall@K	Relevant doc in top K	Higher
MRR	Rank of first relevant	Higher
nDCG	Graded relevance ranking	Higher
Faithfulness	Answer supported by context	Higher
Answer correctness	End-to-end vs gold	Higher
Latency p95	Retrieval plus generation	Lower

Run retrieval and generation evals separately before combining. A 10% retrieval recall gain often beats switching to a more expensive LLM.
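For the two ranking metrics in the table, a plain-Python sketch is enough to get started. The definitions below are the standard ones (reciprocal rank of the first relevant result, and DCG with a log2 discount normalized by the ideal ordering); the function names and sample IDs are assumptions.

```python
import math


def mean_reciprocal_rank(results: dict[str, list[str]], relevant: dict[str, set[str]]) -> float:
    """Average of 1/rank of the first relevant chunk per query (0 if none is found)."""
    if not relevant:
        return 0.0
    total = 0.0
    for query_id, gold in relevant.items():
        for rank, chunk_id in enumerate(results.get(query_id, []), start=1):
            if chunk_id in gold:
                total += 1.0 / rank
                break
    return total / len(relevant)


def ndcg_at_k(ranked_ids: list[str], gains: dict[str, float], k: int) -> float:
    """nDCG@k for one query, given a graded relevance value per chunk ID."""
    def dcg(values: list[float]) -> float:
        return sum(gain / math.log2(position + 2) for position, gain in enumerate(values))

    actual = dcg([gains.get(chunk_id, 0.0) for chunk_id in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Hypothetical ranked results and labels.
results = {"q1": ["c3", "c1"], "q2": ["c9", "c8", "c2"]}
relevant = {"q1": {"c1"}, "q2": {"c2"}}
gains = {"a": 3.0, "b": 2.0, "c": 1.0}

print(mean_reciprocal_rank(results, relevant))
print(ndcg_at_k(["a", "b", "c"], gains, k=3))
print(ndcg_at_k(["c", "b", "a"], gains, k=3))
```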

Anti-Patterns in RAG Products

Dump entire wiki: Retrieval exists but pipeline sends 50 random pages.
No citations: User cannot verify answers; trust erodes on first error.
Single embedding for code and prose: Split indexes or use hybrid search.
Ignore ACLs: Vector index returns docs the user cannot access.
Skip ACL sync on delete: Documents with revoked permissions stay retrievable until the next reindex.

Enforce document-level permissions at retrieval time, not only at the UI.
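A retrieval-time permission check can be as simple as intersecting each chunk's allowed groups with the caller's groups before anything reaches the prompt. The allowed_groups field, group names, and scores below are assumptions for illustration.

```python
def retrieve_with_acl(candidates: list[dict], user_groups: set[str], top_k: int = 5) -> list[dict]:
    """Drop chunks the user may not read, then return the best remaining ones."""
    visible = [c for c in candidates if set(c["allowed_groups"]) & user_groups]
    visible.sort(key=lambda c: c["score"], reverse=True)
    return visible[:top_k]


candidates = [
    {"chunk_id": "salaries.md#chunk-1", "score": 0.95, "allowed_groups": ["hr"]},
    {"chunk_id": "handbook.md#chunk-4", "score": 0.82, "allowed_groups": ["hr", "engineering"]},
    {"chunk_id": "oncall.md#chunk-2", "score": 0.77, "allowed_groups": ["engineering"]},
]
hits = retrieve_with_acl(candidates, user_groups={"engineering"}, top_k=2)
print([c["chunk_id"] for c in hits])
```

Filtering after the similarity search can leave fewer than top_k results; where the vector store supports metadata filters, pushing the group check into the query itself avoids that.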

Summary

RAG is a search pipeline with an LLM at the end. Treat chunking, embedding, indexing, and retrieval as first-class engineering problems. Measure retrieval quality before tuning generation. Your RAG system is only as good as the search layer underneath it.
