Suraj Sharma

Posted on May 25 • Edited on May 27

RAG Explained: How Retrieval-Augmented Generation Actually Works

#ai #llm #rag #machinelearning

The Two Phases of RAG

RAG (Retrieval-Augmented Generation) splits into two separate pipelines:

Ingestion pipeline — runs once (or on a schedule) to process your documents
Query pipeline — runs live for every user request

Why Not Just Send All Your Text to the LLM?

Three hard problems:

Cost — millions of tokens per query = $$$
Context limits — even 128K token windows can't hold an entire knowledge base
Quality — LLMs get confused when buried in irrelevant text

RAG surgically extracts only the relevant 3–5 chunks needed for each question.
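
A rough back-of-the-envelope comparison makes the cost and context points concrete. This is only a sketch: the knowledge-base size and the per-token price below are made-up placeholders, not real pricing. Only the chunk count and the ~500-token chunk size come from this post.

```python
# Values marked "assumed" are illustrative placeholders, not real pricing or sizes.
KNOWLEDGE_BASE_TOKENS = 2_000_000   # assumed size of the whole knowledge base
CHUNK_TOKENS = 500                  # chunk size used in the ingestion pipeline
TOP_K = 5                           # chunks retrieved per question
PRICE_PER_MILLION_TOKENS = 1.00     # assumed input price in dollars


def prompt_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Dollar cost of sending `tokens` input tokens at the assumed price."""
    return tokens / 1_000_000 * price_per_million


full_tokens = KNOWLEDGE_BASE_TOKENS
rag_tokens = TOP_K * CHUNK_TOKENS

print(f"Whole knowledge base: {full_tokens:,} tokens, ${prompt_cost(full_tokens):.4f} per query")
print(f"RAG (top {TOP_K} chunks): {rag_tokens:,} tokens, ${prompt_cost(rag_tokens):.4f} per query")
print(f"Reduction: {full_tokens // rag_tokens}x fewer input tokens")
```

Whatever the real price is, the ratio is what matters: the prompt size depends on the number of retrieved chunks, not on how large the knowledge base grows.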

Why Store Vectors Instead of Just Doing Text Search?

Keyword search matches on the words themselves, so it misses paraphrases. Vectors capture meaning.

These three phrases are completely different strings, but their embeddings are very similar:

"Refunds take 5 days"
"money-back in a week"
"reimbursement timeline: 5 business days"

They cluster close together in embedding space, which is exactly what we want.
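
To see the keyword side of this, here is a small sketch that measures plain word overlap (Jaccard similarity) between the three phrases. The helper names are hypothetical, and the sketch deliberately covers only the keyword half: checking the embedding half needs a real embedding model, which is not included here.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(a: str, b: str) -> float:
    """Jaccard overlap: shared words divided by all distinct words."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


phrases = [
    "Refunds take 5 days",
    "money-back in a week",
    "reimbursement timeline: 5 business days",
]

# Compare every pair of phrases by shared words only.
for i in range(len(phrases)):
    for j in range(i + 1, len(phrases)):
        score = keyword_overlap(phrases[i], phrases[j])
        print(f"{phrases[i]!r} vs {phrases[j]!r}: {score:.2f}")
```

"money-back in a week" shares no words at all with the other two phrases, so a pure keyword match would never connect it to a question about refunds.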

The Ingestion Pipeline (Step by Step)

Why chunk? An LLM has a fixed context window (e.g. 128K tokens). Your knowledge base could be millions of tokens. You can't send it all. Chunking lets you retrieve only the 3–5 most relevant pieces and send those — keeping the prompt small and focused. Overlap prevents losing context at chunk boundaries.

Step 1 — Chunking
Split documents into ~500-token pieces with overlap so no idea gets cut off at a boundary.
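
A minimal sketch of overlapping chunking, under two assumptions that are not from the post: whitespace-separated words stand in for real model tokens, and the overlap is set to 50 tokens. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Assumption: words approximate tokens; a real pipeline would use the model's tokenizer.
document = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_tokens(document.split(), chunk_size=500, overlap=50)
for n, chunk in enumerate(chunks):
    print(n, chunk[0], chunk[-1], len(chunk))
```

Each chunk starts with the last 50 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.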

Step 2 — Embedding
The embedding model (e.g. text-embedding-3-small) converts each chunk into a vector of 1536 numbers by default.

Step 3 — Storage
Both the vector and the original text are stored in the vector DB together — you need the text back when it's retrieved later.
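
Steps 2 and 3 can be sketched together. Everything here is a stand-in: `toy_embed` is a hypothetical hashing function that does not capture meaning (a real pipeline would call an embedding model instead), and `InMemoryVectorStore` is an illustrative class, not the API of any real vector DB. The point is only the shape of the data: each record keeps the vector and the original text side by side.

```python
import hashlib
import math
from dataclasses import dataclass

DIM = 64  # toy dimension chosen for this sketch; real embedding models use far more


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Stand-in for a real embedding model: hashes words into a unit-length vector.

    This does NOT capture meaning; it only gives the store something to hold.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class Record:
    chunk_id: int
    vector: list[float]
    text: str  # kept next to the vector so it can be returned at query time


class InMemoryVectorStore:
    """Hypothetical minimal store; a real vector DB adds indexing and persistence."""

    def __init__(self) -> None:
        self.records: list[Record] = []

    def add(self, text: str) -> Record:
        record = Record(chunk_id=len(self.records), vector=toy_embed(text), text=text)
        self.records.append(record)
        return record


store = InMemoryVectorStore()
for chunk in ["Refunds take 5 days", "money-back in a week"]:
    store.add(chunk)

print(len(store.records), len(store.records[0].vector), store.records[0].text)
```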

The Query Pipeline (Step by Step)

Step 1 — Embed the question
When a user asks a question, it goes through the exact same embedding model (critical — different models produce incompatible vector spaces).

Step 2 — Similarity search
The resulting query vector is compared against the stored chunk vectors using cosine similarity, which measures how closely two vectors point in the same direction, regardless of their length.
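
Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch follows. The 2D vectors are hand-made for illustration, and returning 0.0 for a zero vector is a convention chosen here, not a standard.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_a * norm_b)


# Hand-made 2D vectors, purely to show the behaviour.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # differs only in length
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, perpendicular, opposite)
```

Because the lengths are divided out, only direction matters: a vector and a scaled copy of it score as a perfect match.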

Step 3 — Retrieve and inject
The top-K most similar chunks are pulled out with their original text and packed into the LLM's prompt as context.
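
A sketch of this last step, assuming the chunks are already embedded. The 3D vectors are hand-made toy values, "an unrelated chunk" is placeholder text, and the prompt wording and the helper names `top_k` and `build_prompt` are illustrative choices rather than a fixed template.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def top_k(query_vec: list[float], records: list[tuple[list[float], str]], k: int = 3):
    """Score every (vector, text) record against the query and keep the k best."""
    scored = [(cosine_similarity(query_vec, vec), text) for vec, text in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Pack the retrieved chunk texts into the prompt as numbered context."""
    context = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy vectors standing in for real embeddings.
records = [
    ([0.9, 0.1, 0.0], "Refunds take 5 days"),
    ([0.8, 0.2, 0.1], "reimbursement timeline: 5 business days"),
    ([0.0, 0.1, 0.9], "an unrelated chunk"),
]
query_vec = [1.0, 0.0, 0.0]  # pretend this is the embedded question

hits = top_k(query_vec, records, k=2)
prompt = build_prompt("How long do refunds take?", [text for _, text in hits])
print(prompt)
```

Only the text of the winning chunks reaches the LLM. The vectors have done their job once the ranking is finished.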

Why a Vector DB Specifically?

Finding the 5 nearest vectors out of 10 million rows needs to happen in under 100ms.

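Approximate nearest-neighbor indexes like HNSW (Hierarchical Navigable Small World) do this efficiently by searching a graph instead of scanning everything. A regular SQL database without a vector index would have to compare the query against every single row one by one, which is completely impractical at scale.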
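
For contrast, here is what the row-by-row approach looks like. This is a brute-force baseline, not HNSW: it is exact, but it scores every stored vector on every query. The dataset is random toy data and the function names are hypothetical.

```python
import heapq
import math
import random


def random_unit_vector(dim: int, rng: random.Random) -> list[float]:
    """Random direction scaled to length 1, so dot product equals cosine similarity."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def brute_force_search(query: list[float], vectors: list[list[float]], k: int = 5):
    """Exact top-k by scoring every stored vector; work grows linearly with len(vectors)."""
    comparisons = 0
    scored = []
    for idx, vec in enumerate(vectors):
        score = sum(q * x for q, x in zip(query, vec))
        scored.append((score, idx))
        comparisons += 1
    return heapq.nlargest(k, scored), comparisons


rng = random.Random(0)
vectors = [random_unit_vector(8, rng) for _ in range(1_000)]  # tiny toy dataset
query = vectors[42]  # reuse a stored vector so the best match is known in advance

results, comparisons = brute_force_search(query, vectors, k=5)
print("best match index:", results[0][1])
print("comparisons made:", comparisons)
```

The number of comparisons always equals the number of stored vectors. At 10 million rows with real embedding dimensions, that per-query scan is the cost a vector index is built to avoid, usually by accepting approximate rather than exact results.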

Popular tools built for this exact problem:

Tool	Type
Pinecone	Managed cloud
Weaviate	Open source / cloud
Chroma	Lightweight / local
pgvector	Postgres extension

Summary

RAG is the practical answer to the question: "How do I give an LLM access to my knowledge base without it being slow, expensive, or prone to hallucination?"

The key insight is that retrieval and generation are separate concerns — get retrieval right first, and the generation almost takes care of itself.

Found this useful? Drop a ❤️ or share it with someone building LLM-powered apps.

Top comments (2)

Hartmut B. • May 26

Well written.
Personally, I miss a section where RAG and LLM-Wiki use cases are discussed.
Due to the popularity of hermes and the included Obsidian-based ingestion pipeline, this might interest many people.

Harjot Singh • May 31

Clean explanation, and the part most RAG intros underweight is the third reason you listed: quality, not just cost and context limits. People frame RAG as a money-saver (don't send millions of tokens), but the deeper win is that stuffing the whole knowledge base in actively makes answers worse: the model anchors on the wrong passage when buried in irrelevant text, the same way a person skim-reading grabs the first plausible line. Surgically pulling 3-5 relevant chunks isn't just cheaper; it's how you get a sharper answer. That makes retrieval quality the real ballgame. RAG is only as good as the chunks it fetches, and a confident answer built on the wrong-but-similar chunk is worse than no answer because it looks grounded. So the underrated work isn't the generation; it's chunking strategy, reranking, and knowing when retrieval came back empty so the model abstains instead of confabulating. Garbage-in-confident-garbage-out. That retrieval-quality-decides-everything view is core to how I build with RAG in Moonshift. What moved your answer quality most: better chunking or adding a reranker on top of vector search?