Suraj Sharma

RAG Explained: How Retrieval-Augmented Generation Actually Works

[Figure: RAG pipeline diagram]

The Two Phases of RAG

RAG (Retrieval-Augmented Generation) splits into two separate pipelines:

  • Ingestion pipeline — runs once (or on a schedule) to process your documents
  • Query pipeline — runs live for every user request

Why Not Just Send All Your Text to the LLM?

Three hard problems:

  1. Cost — millions of tokens per query = $$$
  2. Context limits — even 128K token windows can't hold an entire knowledge base
  3. Quality — LLMs get confused when buried in irrelevant text

RAG surgically extracts only the relevant 3–5 chunks needed for each question.


Why Store Vectors Instead of Just Doing Text Search?

Keyword search matches on the words themselves, so it misses paraphrases. Vectors capture meaning.

These three phrases share almost no words, yet a good embedding model maps them to nearby vectors:

"Refunds take 5 days"
"money-back in a week"
"reimbursement timeline: 5 business days"

They cluster close together in embedding space, which is exactly what we want.
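
To see why keyword matching struggles with these phrases, here is a small sketch that measures plain word overlap between them. The `keyword_overlap` helper is a hypothetical name, and whitespace splitting is a crude stand-in for a real search engine (no stemming, no synonyms).

```python
from itertools import combinations

phrases = [
    "Refunds take 5 days",
    "money-back in a week",
    "reimbursement timeline: 5 business days",
]

def keyword_overlap(a, b):
    """Return the lowercase words two phrases have in common."""
    return sorted(set(a.lower().split()) & set(b.lower().split()))

# Compare every pair of phrases by shared words only.
for a, b in combinations(phrases, 2):
    print(f"{a!r} vs {b!r}: {keyword_overlap(a, b)}")
```

The second phrase shares no words with either of the others, so a pure keyword match would never connect it to a question about refunds.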


The Ingestion Pipeline (Step by Step)

[Figure: RAG chunking diagram]

Why chunk? An LLM has a fixed context window (e.g. 128K tokens). Your knowledge base could be millions of tokens. You can't send it all. Chunking lets you retrieve only the 3–5 most relevant pieces and send those — keeping the prompt small and focused. Overlap prevents losing context at chunk boundaries.

Step 1 — Chunking
Split documents into ~500-token pieces with overlap so no idea gets cut off at a boundary.
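
A minimal sketch of overlapping chunking, assuming the document has already been tokenized into a list. The function name `chunk_tokens` and the 50-token overlap are illustrative choices, not values from this article, and the fake `t0, t1, ...` tokens stand in for real tokenizer output.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Placeholder tokens; a real pipeline would use the embedding model's tokenizer.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=500, overlap=50)
print([len(c) for c in chunks])      # [500, 500, 300]
print(chunks[0][-1], chunks[1][0])   # t499 t450
```

Each chunk starts 450 tokens after the previous one, so the last 50 tokens of one chunk are repeated at the start of the next.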

Step 2 — Embedding
The embedding model (e.g. text-embedding-3-small) converts each chunk into a fixed-length vector, 1536 numbers by default for that model.
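
A real embedding model is a network call or a large download, so the sketch below swaps in a toy hashed bag-of-words embedder purely to show the shape of this step: text in, fixed-length unit vector out. `toy_embed` and its 64 dimensions are assumptions for illustration; unlike a real model, it only reacts to shared words and does not capture meaning.

```python
import hashlib
import math

def toy_embed(text, dim=64):
    """Toy stand-in for an embedding model: hash each word into one of `dim` buckets."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    # Normalize to unit length so vectors are comparable regardless of text length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

chunks = ["Refunds take 5 days", "Shipping is free over $50"]
vectors = [toy_embed(c) for c in chunks]
print(len(vectors), len(vectors[0]))  # 2 64
```

In a real pipeline the body of `toy_embed` would be replaced by a call to the chosen embedding model, and the vector length would be whatever that model returns.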

Step 3 — Storage
Both the vector and the original text are stored in the vector DB together — you need the text back when it's retrieved later.
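
As a rough illustration of storing the two together, here is an in-memory stand-in for a vector DB. The `Record` and `InMemoryVectorStore` names are hypothetical, and the three-number vector is made up; real stores expose a similar upsert-style operation but with their own APIs.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    chunk_id: str
    vector: list   # the embedding, used for similarity search
    text: str      # the original chunk, returned to the LLM later

@dataclass
class InMemoryVectorStore:
    records: dict = field(default_factory=dict)

    def upsert(self, chunk_id, vector, text):
        """Insert or overwrite a chunk, keeping vector and text side by side."""
        self.records[chunk_id] = Record(chunk_id, vector, text)

    def get_text(self, chunk_id):
        return self.records[chunk_id].text

store = InMemoryVectorStore()
store.upsert("refunds-0", [0.1, 0.9, 0.0], "Refunds take 5 days")
print(store.get_text("refunds-0"))  # Refunds take 5 days
```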


The Query Pipeline (Step by Step)

Step 1 — Embed the question
When a user asks a question, it goes through the exact same embedding model (critical — different models produce incompatible vector spaces).

Step 2 — Similarity search
The resulting query vector is compared against the stored chunk vectors using cosine similarity, which measures how closely two vectors point in the same direction regardless of their length.
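
Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch follows; returning 0.0 for a zero-length vector is a convention chosen here, not something the article specifies.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # direction is undefined for a zero vector
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0  (unrelated)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)
```

The length check also catches the mixed-model mistake from Step 1 whenever the two models happen to produce vectors of different sizes.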

Step 3 — Retrieve and inject
The top-K most similar chunks are pulled out with their original text and packed into the LLM's prompt as context.
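
A sketch of this step under some loud assumptions: the three-number vectors are made up by hand rather than produced by a model, `top_k` and `build_prompt` are hypothetical helpers, and the prompt wording is only one of many workable templates.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def top_k(query_vec, records, k=3):
    """records is a list of (vector, text) pairs; return the k best (score, text)."""
    scored = ((cosine(query_vec, vec), text) for vec, text in records)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

def build_prompt(question, chunks):
    """Pack the retrieved chunk texts into the prompt as numbered context."""
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hand-made vectors for illustration only.
records = [
    ([0.9, 0.1, 0.0], "Refunds take 5 days"),
    ([0.0, 0.2, 0.9], "Our office is in Berlin"),
    ([0.8, 0.3, 0.1], "Reimbursement timeline: 5 business days"),
]
query_vec = [1.0, 0.2, 0.0]

hits = top_k(query_vec, records, k=2)
prompt = build_prompt("How long do refunds take?", [text for _, text in hits])
print(prompt)
```

With K set to 2, the two refund-related chunks make it into the prompt and the unrelated office chunk is left out.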


Why a Vector DB Specifically?

Finding the 5 nearest vectors out of 10 million rows needs to happen in under 100ms.

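Approximate nearest-neighbor indexes like HNSW (Hierarchical Navigable Small World) do this efficiently by trading a small amount of recall for a large speedup. Without such an index, a database has to compare the query against every single row one by one, which is impractical at scale. That is also why pgvector appears in the list below: it adds this kind of index to Postgres.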
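
To put rough numbers on the every-row scan, here is a back-of-envelope calculation using the 10 million rows from above and the 1536-dimension vectors from the embedding step. Storing each number as a 4-byte float32 is an assumption.

```python
rows = 10_000_000        # stored chunk vectors
dims = 1536              # numbers per vector
bytes_per_float32 = 4    # assumed storage format

# One multiply-add per dimension per row for a single query.
multiply_adds = rows * dims
bytes_scanned = multiply_adds * bytes_per_float32

print(f"{multiply_adds:,} multiply-adds per query")                       # 15,360,000,000
print(f"{bytes_scanned / 1e9:.2f} GB of vector data touched per query")   # 61.44 GB
```

An index like HNSW avoids most of that work by visiting only a small fraction of the stored vectors for each query.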

Popular tools built for this exact problem:

Tool     | Type
Pinecone | Managed cloud
Weaviate | Open source / cloud
Chroma   | Lightweight / local
pgvector | Postgres extension

Summary

RAG is the practical answer to the question: "How do I give an LLM access to my knowledge base without it being slow, expensive, or hallucinating?"

The key insight is that retrieval and generation are separate concerns — get retrieval right first, and the generation almost takes care of itself.

