The Two Phases of RAG
RAG (Retrieval-Augmented Generation) splits into two separate pipelines:
- Ingestion pipeline — runs once (or on a schedule) to process your documents
- Query pipeline — runs live for every user request
Why Not Just Send All Your Text to the LLM?
Three hard problems:
- Cost — millions of tokens per query = $$$
- Context limits — even 128K token windows can't hold an entire knowledge base
- Quality — LLMs get confused when buried in irrelevant text
RAG surgically extracts only the relevant 3–5 chunks needed for each question.
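To put rough numbers on the cost argument, here is a back-of-the-envelope sketch. The price and the knowledge-base size are made-up values for illustration, not taken from any provider's price list; only the chunk size and chunk count come from this post.

```python
# Hypothetical numbers for illustration only; substitute your provider's real pricing.
PRICE_PER_MILLION_INPUT_TOKENS = 1.00  # USD, assumed
KNOWLEDGE_BASE_TOKENS = 2_000_000      # assumed size of the whole corpus
CHUNK_TOKENS = 500                     # chunk size used later in this post
TOP_K = 5                              # chunks retrieved per question


def prompt_cost(tokens: int) -> float:
    """Input-token cost in USD for a single request."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


full_dump_cost = prompt_cost(KNOWLEDGE_BASE_TOKENS)
rag_cost = prompt_cost(CHUNK_TOKENS * TOP_K)

print(f"Send everything:      ${full_dump_cost:.4f} per query")
print(f"RAG (5 x 500 tokens): ${rag_cost:.4f} per query")
print(f"Ratio: {full_dump_cost / rag_cost:.0f}x")
```

With these assumed values, a corpus of that size would not even fit in a 128K-token window, so the comparison is about scale rather than a realistic alternative.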
Why Store Vectors Instead of Just Doing Text Search?
Keyword search matches literal terms, so it misses paraphrases and synonyms. Vectors capture meaning.
These three phrases share almost no words, yet a good embedding model maps them to nearby vectors:
"Refunds take 5 days"
"money-back in a week"
"reimbursement timeline: 5 business days"
They cluster close together in embedding space, which is exactly what we want.
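The keyword half of that claim is easy to check without any model. The sketch below tokenizes the three phrases the way a naive keyword index might and prints the words each pair shares. The `keywords` and `jaccard` helpers are illustrative names, not part of any library, and the embedding half is not run here because it needs a real model.

```python
import re
from itertools import combinations

phrases = [
    "Refunds take 5 days",
    "money-back in a week",
    "reimbursement timeline: 5 business days",
]


def keywords(text: str) -> set[str]:
    """Lowercased word tokens: roughly what a naive keyword index would see."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of distinct words that two token sets have in common."""
    return len(a & b) / len(a | b)


for left, right in combinations(phrases, 2):
    shared = keywords(left) & keywords(right)
    score = jaccard(keywords(left), keywords(right))
    print(f"{left!r} vs {right!r}: shared={sorted(shared)} jaccard={score:.2f}")
```

Two of the three pairs share no words at all, so a literal keyword match would treat them as unrelated.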
The Ingestion Pipeline (Step by Step)
Why chunk? An LLM has a fixed context window (e.g. 128K tokens), and your knowledge base could be millions of tokens, so you can't send it all. Chunking lets you retrieve only the 3–5 most relevant pieces and send those, keeping the prompt small and focused. Overlap between neighboring chunks prevents losing context at chunk boundaries.
Step 1 — Chunking
Split documents into ~500-token pieces with overlap so no idea gets cut off at a boundary.
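A minimal sketch of overlapping chunking, assuming the document has already been turned into a list of tokens. The 500-token size follows the text; the 50-token overlap is an assumed value, and `chunk_tokens` is an illustrative name.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split a token list into windows that share `overlap` tokens with their neighbor."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder tokens; a real pipeline would use the embedding model's tokenizer.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Each chunk starts with the last 50 tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk.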
Step 2 — Embedding
The embedding model (e.g. text-embedding-3-small) converts each chunk into a vector of 1,536 numbers (that model's default output size).
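To show the shape of this step without calling a hosted model, here is a toy stand-in. `toy_embed` is a hypothetical helper that hashes words into a fixed-length vector; it only reflects word overlap, not meaning, so treat it as a placeholder for a real embedding model rather than a substitute.

```python
import hashlib
import math

DIM = 1536  # same length as the default output of text-embedding-3-small


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag of words, scaled to unit length."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim  # stable slot for this word
        vec[index] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


vector = toy_embed("Refunds take 5 days")
print(len(vector))
```

The properties that carry over to a real model are the ones that matter for the pipeline: every input maps to a vector of the same length, and the same input always maps to the same vector.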
Step 3 — Storage
Both the vector and the original text are stored in the vector DB together — you need the text back when it's retrieved later.
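The idea of keeping vector and text side by side can be sketched with a small in-memory store. `Record` and `InMemoryVectorStore` are illustrative names, not the API of any vector database; real products add indexing, persistence, and filtering on top of this.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    chunk_id: str
    vector: list[float]
    text: str  # original chunk text, returned to the LLM at query time
    metadata: dict[str, str] = field(default_factory=dict)


class InMemoryVectorStore:
    """Keeps each vector together with the text it was computed from."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self._records: dict[str, Record] = {}

    def upsert(self, record: Record) -> None:
        if len(record.vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(record.vector)}")
        self._records[record.chunk_id] = record

    def get(self, chunk_id: str) -> Record:
        return self._records[chunk_id]

    def __len__(self) -> int:
        return len(self._records)


store = InMemoryVectorStore(dim=3)  # tiny dimension to keep the example readable
store.upsert(Record("faq-0", [0.1, 0.2, 0.3], "Refunds take 5 days", {"source": "faq"}))
print(store.get("faq-0").text)
```

The dimension check is a small guard against mixing vectors from different embedding models in one store.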
The Query Pipeline (Step by Step)
Step 1 — Embed the question
When a user asks a question, it goes through the exact same embedding model (critical — different models produce incompatible vector spaces).
Step 2 — Similarity search
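The resulting query vector is compared against the stored chunk vectors using cosine similarity, which measures the angle between two vectors: essentially, "how closely do these two vectors point in the same direction?" A score near 1 means very similar; a score near 0 means unrelated.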
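Cosine similarity itself is only a few lines. This is a plain-Python sketch for clarity; production systems typically use vectorized or index-backed implementations.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # perpendicular
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the result depends only on direction, a long chunk and a short chunk about the same topic can still score as close.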
Step 3 — Retrieve and inject
The top-K most similar chunks are pulled out with their original text and packed into the LLM's prompt as context.
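A sketch of the retrieve-and-inject step, assuming similarity scores have already been computed. The scores, the third chunk, and the prompt wording are made up for illustration; `top_k_chunks` and `build_prompt` are hypothetical helper names.

```python
import heapq


def top_k_chunks(scored_chunks: list[tuple[float, str]], k: int = 3) -> list[str]:
    """Return the text of the k highest-scoring chunks, best first."""
    best = heapq.nlargest(k, scored_chunks, key=lambda pair: pair[0])
    return [text for _, text in best]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Pack the retrieved chunks into the prompt as numbered context."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# (similarity score, chunk text) pairs; the scores are illustrative.
scored = [
    (0.12, "Our office is closed on Sundays"),
    (0.91, "Refunds take 5 days"),
    (0.88, "reimbursement timeline: 5 business days"),
]

prompt = build_prompt("How long do refunds take?", top_k_chunks(scored, k=2))
print(prompt)
```

Numbering the chunks is a design choice that makes it easier to ask the model to cite which piece of context it used.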
Why a Vector DB Specifically?
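For an interactive app, finding the 5 nearest vectors out of 10 million rows needs to happen fast, ideally in under 100 ms.
Approximate nearest-neighbor (ANN) indexes such as HNSW (Hierarchical Navigable Small World) make this possible by walking a graph of neighboring vectors instead of scanning every one, trading a small amount of recall for a large speedup. A regular SQL database with no vector index would have to compare the query against every single row, which is impractical at scale.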
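To see why a full scan hurts, here is the arithmetic for one brute-force query. The row count and vector size come from this post; the 4 bytes per number assumes vectors are stored as 32-bit floats.

```python
NUM_VECTORS = 10_000_000  # row count from the example above
DIMENSIONS = 1536         # vector size from the embedding step
BYTES_PER_FLOAT32 = 4     # assumes 32-bit float storage

# One dot product per stored vector, one multiply-add per dimension.
multiply_adds_per_query = NUM_VECTORS * DIMENSIONS
raw_vector_bytes = NUM_VECTORS * DIMENSIONS * BYTES_PER_FLOAT32

print(f"{multiply_adds_per_query:,} multiply-adds per query")
print(f"{raw_vector_bytes / 1e9:.2f} GB of raw vector data to read")
```

An index such as HNSW avoids most of that work by visiting only a small fraction of the vectors for each query.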
Popular tools built for this exact problem:
| Tool | Type |
|---|---|
| Pinecone | Managed cloud |
| Weaviate | Open source / cloud |
| Chroma | Lightweight / local |
| pgvector | Postgres extension |
Summary
RAG is the practical answer to the question: "How do I give an LLM access to my knowledge base without it being slow, expensive, or hallucinating?"
The key insight is that retrieval and generation are separate concerns — get retrieval right first, and the generation almost takes care of itself.

