Ege Pakten
What is RAG? A Beginner's Guide to Retrieval-Augmented Generation (With a Full Pipeline Walkthrough)

If you've ever wondered how ChatGPT-style apps can suddenly "know" about your company's internal documents, product manuals, or legal files without being retrained, the answer is almost always RAG — Retrieval-Augmented Generation. In this post, we'll break down what RAG is, why it exists, and walk through the full pipeline step-by-step with a real example.


1. What is RAG?

Retrieval-Augmented Generation (RAG) is an AI framework that integrates an information retrieval component into the generation process of Large Language Models (LLMs) to improve factuality and relevance.

In plain English:

Instead of making the LLM remember everything, we let it look things up in a knowledge base right before answering.

The term RAG was coined in a 2020 research paper by Patrick Lewis et al. ("Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks") published on arXiv. The core insight: combine a parametric memory (the LLM's weights) with a non-parametric memory (a searchable document store) — and you get the best of both worlds.


2. Why RAG? The Motivation

Three big problems drove the invention of RAG:

LLM Limitations

LLMs are frozen snapshots. Once a model is trained, it only knows what was in its training data. It doesn't know:

  • What your company policies say
  • What happened after its training cutoff
  • What's in your private documents
  • What yesterday's sales numbers were

And even with what it does know, it can hallucinate confidently.

Cost of Retraining vs. Dynamic Retrieval

You could retrain or fine-tune the model every time your data changes. But:

  • Retraining a large model can cost tens of thousands to millions of dollars
  • It takes days or weeks
  • You have to do it again every time the data updates

Dynamic retrieval (looking things up at query time) is vastly cheaper and always up-to-date.

Need for Grounded, Up-to-Date Knowledge

For regulated industries (finance, healthcare, legal), you can't ship answers that come from "the model's memory." You need answers backed by sources you can cite and audit.

RAG addresses all three challenges by decoupling knowledge from the model.


3. The RAG Pipeline — Step-by-Step With a Real Example

This is the part most tutorials rush through. We're going to slow down.

Let's use a concrete example. Imagine you're building an internal developer assistant at a company called Acme Corp. Employees can ask it questions about the engineering handbook, API docs, and on-call runbooks.

A developer asks:

"How do I rotate the database credentials for the billing service?"

Here's exactly what happens behind the scenes.


Phase 1: Indexing (Done Once, Ahead of Time)

Before anyone can ask anything, we need to prepare the knowledge base.

Step 1a — Knowledge Corpus

First, we gather every document we want the assistant to know about:

  • The engineering handbook (Markdown files)
  • API documentation (HTML + Swagger specs)
  • Runbooks (Confluence pages)
  • Past incident post-mortems (Google Docs)
  • Security policies (PDFs)

Let's say this gives us 8,000 documents.

Step 1b — Document Chunking

An LLM can't efficiently search through a 50-page PDF. And you don't want to return a whole 50-page PDF to the user either — you want the one paragraph that actually answers their question.

So we chunk each document into smaller pieces. A common approach:

  • 500 tokens per chunk (roughly 375 words)
  • 50 token overlap between chunks (so we don't split an idea across a boundary)

One chunk in our knowledge base might look like this:

[Chunk #4729 — Source: runbooks/billing-service.md]
"To rotate database credentials for the billing service:
1. Generate a new password in AWS Secrets Manager.
2. Update the 'billing-db' secret with the new value.
3. Trigger a rolling restart via: kubectl rollout restart deploy/billing.
4. Verify health endpoints return 200 OK.
5. Revoke the old credentials after 24h grace period."

After chunking, our 8,000 documents become maybe 120,000 chunks.
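A minimal sketch of the chunking step. This uses whitespace-separated words as a rough stand-in for tokens (a real pipeline would count model tokens with a tokenizer like tiktoken), and chunk_text is a hypothetical helper, not a library function:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks.

    Words stand in for tokens here; real code would use the
    embedding model's tokenizer to count tokens exactly.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # advance less than a full chunk
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the document
    return chunks

# A 1,200-word document becomes 3 overlapping chunks:
docs = " ".join(f"w{i}" for i in range(1200))
print(len(chunk_text(docs)))  # 3
```

The overlap means the last 50 words of one chunk are repeated as the first 50 of the next, so an idea straddling a boundary still appears whole in at least one chunk.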

Step 1c — Vector Embeddings

For each chunk, we call an embedding model (like BERT, OpenAI's text-embedding-3-small, or Cohere's embedder). This turns each chunk into a vector — a list of numbers (1,536 of them in the case of text-embedding-3-small) that represents the meaning of that chunk.

Chunk #4729  [0.12, -0.08, 0.44, ..., 0.91]   (1,536 numbers)

Step 1d — Vector Database

We store all 120,000 of these vectors in a vector database — something like FAISS, Pinecone, Weaviate, Milvus, or Qdrant. The database indexes them so we can search across all of them in milliseconds.

Indexing is done. This usually runs as a background job, and you only re-run it when documents change.
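To make the store-and-search idea concrete, here's a toy in-memory stand-in for a vector database. TinyVectorStore is purely illustrative: it does a brute-force cosine scan over every vector, where FAISS, Pinecone, or Qdrant would use an ANN index, but the add/search interface is the same shape:

```python
import math

class TinyVectorStore:
    """Brute-force stand-in for a real vector database."""

    def __init__(self):
        self.ids, self.vectors, self.payloads = [], [], []

    def add(self, chunk_id, vector, payload):
        self.ids.append(chunk_id)
        self.vectors.append(vector)
        self.payloads.append(payload)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    def search(self, query_vector, k=3):
        # Score every stored vector, then keep the top k.
        scored = [(self._cosine(query_vector, v), i)
                  for i, v in enumerate(self.vectors)]
        scored.sort(reverse=True)
        return [(self.ids[i], s, self.payloads[i]) for s, i in scored[:k]]

store = TinyVectorStore()
store.add("chunk-4729", [0.9, 0.1], "credential rotation runbook")
store.add("chunk-3180", [0.0, 1.0], "secrets manager guide")
```

At 120,000 vectors this linear scan would still work, just slower; ANN indexes exist precisely to avoid comparing against every vector.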


Phase 2: Retrieval (Happens at Query Time)

Now a developer types:

"How do I rotate the database credentials for the billing service?"

Step 2a — User Query

The question comes in as plain text.

Step 2b — Query Embedding

We run the same embedding model on the question, producing a query vector:

Query  [0.15, -0.11, 0.48, ..., 0.87]

This is critical: you must embed the query with the same model you used to embed the chunks, otherwise the vectors live in different spaces and similarity becomes meaningless.

Step 2c — Similarity Search

Now we ask the vector database: "Which chunks have vectors closest to this query vector?"

Closeness is measured with a similarity metric, most commonly cosine similarity — it measures the angle between two vectors. The smaller the angle, the more similar the meaning.

Under the hood, the database uses Approximate Nearest Neighbors (ANN) tricks to search 120,000 vectors in ~5 milliseconds instead of comparing one by one.

Step 2d — Relevant Passages

The database returns the top-k most similar chunks (typically k=3 to k=10). For our query, we might get:

1. Chunk #4729 (score 0.94) — billing-service runbook, credential rotation
2. Chunk #3180 (score 0.89) — AWS Secrets Manager general guide
3. Chunk #5512 (score 0.85) — rolling restart playbook

These are the passages most likely to contain the answer.


Phase 3: Augmentation

Now we have relevant chunks, but we don't just show them to the user. We want the LLM to write a nice, synthesized answer using them.

Step 3a — Original Prompt

The user's raw question:

"How do I rotate the database credentials for the billing service?"

Step 3b — Augmented Prompt

We wrap it in a prompt template that injects the retrieved chunks as context:

You are Acme Corp's internal engineering assistant.
Answer the user's question using ONLY the context below.
If the answer isn't in the context, say you don't know.

---CONTEXT---
[Chunk #4729]: To rotate database credentials for the billing
service: 1. Generate a new password in AWS Secrets Manager...
[Chunk #3180]: AWS Secrets Manager allows you to store and
rotate database credentials...
[Chunk #5512]: Rolling restarts can be triggered via kubectl
rollout restart deploy/<service-name>...
---END CONTEXT---

USER QUESTION: How do I rotate the database credentials
for the billing service?

This is the "A" in RAG — Augmentation. We're augmenting the prompt with retrieved knowledge.
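Augmentation is plain string assembly. Here's a sketch that reproduces the template above (build_augmented_prompt is a hypothetical helper, not part of any framework):

```python
def build_augmented_prompt(question, chunks):
    """Wrap the user's question in a template with retrieved context.

    chunks: list of (chunk_id, text) pairs, best match first.
    """
    context = "\n".join(f"[{cid}]: {text}" for cid, text in chunks)
    return (
        "You are Acme Corp's internal engineering assistant.\n"
        "Answer the user's question using ONLY the context below.\n"
        "If the answer isn't in the context, say you don't know.\n\n"
        "---CONTEXT---\n"
        f"{context}\n"
        "---END CONTEXT---\n\n"
        f"USER QUESTION: {question}"
    )

prompt = build_augmented_prompt(
    "How do I rotate the database credentials for the billing service?",
    [("Chunk #4729", "To rotate database credentials for the billing service...")],
)
```

The resulting string is what actually gets sent to the LLM; the model never sees your vector database, only this assembled text.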


Phase 4: Generation

Step 4a — LLM Call

We send the augmented prompt to an LLM (GPT-3/4/5, Claude, LLaMA, T5, Mistral — your choice).

Because the model has the actual documentation right there in its context window, it doesn't need to remember anything — it just reads and synthesizes.

Step 4b — Final Response

The model produces a grounded answer:

"To rotate the database credentials for the billing service:

  1. Generate a new password in AWS Secrets Manager.
  2. Update the billing-db secret with the new value.
  3. Trigger a rolling restart with kubectl rollout restart deploy/billing.
  4. Verify that the health endpoints return 200 OK.
  5. Revoke the old credentials after a 24-hour grace period.

Source: runbooks/billing-service.md"

Notice three things:

  1. The answer is accurate — it comes from your actual docs.
  2. The answer is current — if you update the runbook, the next query uses the new version. No retraining needed.
  3. The answer can be cited — you know exactly which document it came from.

That's the whole RAG pipeline. Indexing → Retrieval → Augmentation → Generation.


4. The Retrieval Component in Detail

Three pieces make retrieval work:

Embedding Models

The model that turns text into vectors. Examples: BERT, text-embedding-3-small, Cohere Embed, Sentence-BERT. Choose one that's trained well for your language and domain.

Vector Stores

Databases optimized for vector similarity search. Popular options: FAISS (local, Facebook), Pinecone (managed), Weaviate, Milvus, Qdrant, and pgvector (Postgres extension).

Similarity Metrics

How we measure "closeness" between vectors. The go-to is cosine similarity, but Euclidean distance and dot product also show up. Cosine similarity is popular because it ignores vector length and focuses on direction — which is what semantic meaning lives in.
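A quick numeric check of that last point, using toy 2-d vectors: scale a vector to twice its length and cosine similarity stays at 1.0 (same direction), while Euclidean distance grows with the magnitude.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

v = [0.3, 0.4]
scaled = [0.6, 0.8]  # same direction, twice the length

print(cosine(v, scaled))     # ~1.0: direction (meaning) unchanged
print(euclidean(v, scaled))  # ~0.5: distance grows with magnitude
```

This is also why dot product and cosine similarity coincide when embeddings are pre-normalized to unit length, which many embedding APIs do for you.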


5. Augmentation & Generation in Detail

Prompt Templates

The structure that tells the LLM how to use the retrieved context. Good templates specify:

  • The assistant's role
  • What to do if context is missing
  • Output format (JSON, bullet points, prose)
  • Citation rules

Managing Model Context

The LLM only has so much context window. If retrieval returns 30 chunks but each chunk is 500 tokens, that's 15,000 tokens just for context. You have to:

  • Pick top-k carefully (more isn't always better)
  • Rerank retrieved chunks
  • Sometimes summarize chunks before injection
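A sketch of the simplest budget strategy: keep the highest-ranked chunks until the context budget runs out. This assumes chunks arrive best-first and uses word count as a rough token proxy (a real system would count tokens with the model's tokenizer); fit_to_budget is a hypothetical helper:

```python
def fit_to_budget(ranked_chunks, max_context_tokens=4000):
    """Greedily keep top-ranked chunks that fit the token budget.

    ranked_chunks: list of (score, text) pairs, best first.
    """
    kept, used = [], 0
    for score, text in ranked_chunks:
        cost = len(text.split())  # crude token proxy
        if used + cost > max_context_tokens:
            break  # next-best chunk doesn't fit; stop here
        kept.append((score, text))
        used += cost
    return kept
```

Reranking and summarization are refinements on top of this: rerank first so the best chunks are genuinely first, or summarize each chunk so more of them fit under the same budget.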

LLM Choices

Any generative LLM can work: GPT-3/4/5, T5, LLaMA, Claude, Mistral, Gemini. The RAG pipeline is mostly model-agnostic.


6. Applications and Benefits

RAG is behind a huge number of real-world AI products:

  • Knowledge-centric chatbots — customer support bots grounded in your docs
  • Document summarization & Q&A — ask questions about contracts, research papers, medical records
  • Enterprise search & knowledge management — "Glean for your company" style tools

Benefits:

  • No retraining required when data changes
  • Answers are traceable back to sources
  • Private data stays in your vector DB — never baked into model weights
  • Cheaper than fine-tuning for most use cases
  • Can mix multiple knowledge bases with one model

7. Challenges and Future Directions

RAG isn't magic. Here are the real tradeoffs.

Source Reliability & Bias

Garbage in, garbage out. If your knowledge base has outdated or biased content, your RAG system will confidently repeat it. Curation matters.

Latency & System Complexity

A RAG query is actually: embed → ANN search → rerank → build prompt → LLM call. That's a lot of moving parts. Each step adds latency, and each step can fail. Production RAG systems require serious observability.

Privacy & Security Safeguards

Your vector DB now contains sensitive content. Access control, encryption, and embedding leakage (yes, embeddings can leak information) all need attention.

Research Frontiers

Where RAG is heading:

  • Multi-hop retrieval — answering questions that need multiple retrieval rounds (e.g., "Find X, then look up Y for X")
  • Adapters — lightweight modules that specialize the LLM for using retrieved content better
  • Self-improving retrieval — the model learns which retrievals helped and which didn't

Wrapping Up — The TL;DR

What is RAG?
A framework that lets an LLM look things up in a knowledge base before answering, so its responses are grounded in real, current, specific information.

Why does it exist?
Because retraining is expensive, LLMs hallucinate, and most real-world apps need to answer from your data — not what the model memorized during training.

How does the pipeline work?

  1. Indexing — Chunk documents → embed each chunk → store vectors in a vector DB.
  2. Retrieval — Embed the user query → find nearest chunks with cosine similarity.
  3. Augmentation — Inject retrieved chunks into a prompt template.
  4. Generation — Send augmented prompt to an LLM → return grounded answer.
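The four phases above can be glued together in a few lines. This toy sketch swaps in keyword overlap for embedding similarity and stubs out the LLM call, so it only illustrates the shape of the pipeline, not a real implementation:

```python
def retrieve(query, corpus, k=2):
    """Toy retrieval: keyword overlap stands in for embedding similarity."""
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

def augment(question, passages):
    """Inject retrieved passages into a prompt template."""
    ctx = "\n".join(f"- {p}" for p in passages)
    return f"Answer using ONLY this context:\n{ctx}\n\nQuestion: {question}"

def generate(prompt):
    """Stub for the LLM call (an openai/anthropic client in real code)."""
    return f"(grounded answer based on)\n{prompt}"

corpus = [
    "To rotate billing credentials use AWS Secrets Manager",
    "Deploys go through the CI pipeline",
    "Health endpoints should return 200 OK after restart",
]
answer = generate(augment("how do I rotate billing credentials",
                          retrieve("how do I rotate billing credentials", corpus)))
```

Swap the keyword overlap for real embeddings plus a vector database and the stub for a real LLM client, and this is structurally the same pipeline described above.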

Where do you use it?
Anywhere you need an AI that can answer from your own content: internal docs, product support, legal research, medical Q&A, academic search, knowledge management, developer assistants, and more.

Once you understand the four-phase pipeline — Indexing, Retrieval, Augmentation, Generation — every RAG system you encounter becomes a variation on the same theme.


If this helped you finally "get" RAG, drop a reaction. More notes coming soon.
