Let’s talk about one of the most talked-about LLM patterns: RAG. The acronym stands for Retrieval-Augmented Generation.
Before diving in, a quick detour: what’s the difference between search and retrieval? It matters for understanding RAG.
- Search is the act of finding something among many candidates. You search the web for a specific page; you press Ctrl/Cmd+F to search inside a long document.
- Retrieval literally means bringing back or pulling out items from a collection. In a library, you retrieve a specific book from the stacks; in a database, you retrieve a particular row.
In practice, retrieval usually includes search: you first find what you need, then you pull it out to use it. So when people say RAG, they really mean search + fetch relevant items → use them to improve generation. “Search-and-retrieve-augmented generation” would be more precise but also a mouthful. Since we search precisely to retrieve and use the results, “Retrieval-Augmented Generation” is a fine name.
Alright, why did RAG show up in the first place?
The Chronic Problem with LLMs
Decoder-style LLMs do one thing extremely well: predict the next token given the previous tokens. If a model sees “Once upon a,” it learns to continue with “time.” How? During training it ingests a massive amount of text and learns statistical regularities.
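You can see this behavior directly. Here is a minimal sketch, assuming the Hugging Face `transformers` package and the small GPT-2 checkpoint are available; any causal LM would illustrate the same point.

```python
# Minimal next-token prediction sketch (assumes `transformers` is installed
# and the small "gpt2" checkpoint can be downloaded).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Once upon a", return_tensors="pt")
# Greedy decoding: at each step, pick the single most likely next token.
outputs = model.generate(**inputs, max_new_tokens=1, do_sample=False)
print(tokenizer.decode(outputs[0]))  # typically continues with " time"
```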
But what happens when the model faces inputs outside its training experience? Humans can say “I don’t know.” Models generally can’t - machine learning systems don’t naturally have a concept of knowing that they don’t know. So they still predict the next token as best they can. That’s when hallucinations happen: the model produces fluent, confident, but factually wrong text.
Think of someone who’s incredibly persuasive, hates admitting ignorance, and knows nothing about, say, medieval European history. Ask them about it, and they’ll spin a very convincing yet completely fabricated story. LLMs can behave the same way: plausible text, shaky truth.
Ask a general-purpose model, “Who’s the tallest person on our team?” and it might confidently answer, “Alex is 6′8″, the tallest.” In reality, there may be no Alex on your team—and nobody is 6′8″. The model isn’t lying on purpose; it’s just doing next-token prediction with insufficient grounding.
We’ve tried to mitigate this by training models to admit uncertainty more often. That reduces some blatant hallucinations, but it creates two persistent problems:
- The model sometimes says it doesn’t know—even when the answer should be available somewhere.
- The model still occasionally hallucinates confident, wrong answers.
At the root, both issues happen when the model lacks access to the right facts at generation time.
A Simple Idea to Fix It
Contrary to the hype, LLMs don’t know everything. For creative tasks (drafting emails, wordsmithing, brainstorming), missing facts aren’t always fatal. But for factual tasks, missing or stale knowledge is a big problem.
One “obvious” fix: teach the model your specific facts. For example, to answer “Who’s the tallest person on our team?”, you could fine-tune the LLM with your team’s height data.
Two problems:
- Tiny signal in a huge sea: your small dataset may barely influence a giant model.
- Cost and complexity: even with modern techniques (LoRA/QLoRA, parameter-efficient fine-tuning), high-quality fine-tuning remains non-trivial and expensive—especially if your facts change often.
So a clever idea emerged:
Keep the base LLM as is, and at query time retrieve the relevant facts and feed them into the prompt.
Retrieve → Augment → Generate. RAG.
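In pseudocode, the whole idea fits on one line. The names `retrieve`, `augment`, and `generate` below are hypothetical stand-ins for your retriever, prompt template, and LLM call, not any specific library's API.

```python
# The entire RAG pattern, as hypothetical pseudocode:
# retrieve facts -> augment the prompt with them -> generate the answer.
answer = generate(augment(prompt, retrieve(query)))
```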
Instruction Tuning and Prompts (Why They Matter for RAG)
Raw pre-trained LLMs are trained to continue text, not necessarily to follow instructions. Ask a raw model “What’s the highest mountain in the world?” and it might continue with, “That’s a common question...” instead of “Mount Everest.”
Instruction tuning fixes this. We collect pairs of prompts and desired outputs and do an additional round of training so the model learns to follow instructions.
Example training pair:
Input (prompt):
Please answer the question: "What is the highest mountain in the world?"
Output:
Mount Everest.
After instruction tuning, the model is far more likely to answer directly and succinctly. This matters for RAG because, once we augment the prompt with retrieved facts, we need the model to pay attention to those facts and follow the instruction: “answer using the provided context.”
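Concretely, instruction-tuning data is usually just a collection of prompt/response records. The schema below is only an illustration of the shape; real datasets use whatever fields their training framework expects.

```python
# Illustrative instruction-tuning records (field names are hypothetical;
# the point is the pairing of an instruction-style prompt with the desired output).
instruction_pairs = [
    {
        "prompt": 'Please answer the question: "What is the highest mountain in the world?"',
        "response": "Mount Everest.",
    },
    {
        "prompt": "Summarize the following paragraph in one sentence: ...",
        "response": "...",
    },
]
```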
How RAG Works (Step by Step)
Consider the RAG diagram that AWS publishes (and that many guides reproduce): https://aws.amazon.com/ko/what-is/retrieval-augmented-generation/
At a high level:
- User asks a question. The system receives the user query along with an instruction-style prompt (e.g., “Answer the question below.”).
- Retrieve relevant information (documents, snippets, database rows) that likely contain the answer.
- Augment the original prompt with those retrieved passages.
- Generate the answer with the LLM, which now conditions on the augmented prompt.
- Return the answer (optionally with citations).
A concrete example:
- User: “Who’s the tallest person on our team?”
- Retrieval finds a record like: “Team heights — Alex 6′4″, Jordan 5′9″, Emily 5′4″, James 6′2″.”
- Augmented prompt might look like:
Context:
Team heights — Alex 6′4″, Jordan 5′9″, Emily 5′4″, James 6′2″.
Using ONLY the context above, answer the question below.
Question: Who’s the tallest person on our team?
- LLM output: “Alex is the tallest at 6′4″.”
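In code, that same flow might look like the sketch below. `search_team_docs` and `call_llm` are hypothetical placeholders for a real retriever and LLM API; the prompt wording mirrors the example above.

```python
# End-to-end sketch of the team-heights example. `search_team_docs` and
# `call_llm` are hypothetical placeholders, not a specific library's API.
def rag_answer(question: str) -> str:
    # 1) Retrieve: find passages likely to contain the answer.
    passages = search_team_docs(question, top_k=3)

    # 2) Augment: put the retrieved passages and the instruction into one prompt.
    context = "\n".join(passages)
    prompt = (
        f"Context:\n{context}\n\n"
        "Using ONLY the context above, answer the question below.\n"
        f"Question: {question}"
    )

    # 3) Generate: the LLM now conditions on the augmented prompt.
    return call_llm(prompt)

print(rag_answer("Who's the tallest person on our team?"))
# Expected: something like "Alex is the tallest at 6'4\"."
```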
This flow hinges on two capabilities: neural retrieval and instruction-following generation.
Neural Retrieval (a quick tour)
For RAG, most systems use neural retrieval, which measures semantic similarity between the user question and candidate passages.
How it works:
- Embed each document/passage into a vector using an embedding model. Semantically similar texts map to nearby vectors.
- Store these vectors in a vector database (purpose-built for fast nearest-neighbor search).
- At query time, embed the user’s question and find the nearest vectors in the DB.
- Retrieve the attached passages to feed into the prompt.
This is why we emphasize “retrieval” rather than just “search”: the system doesn’t just find matching strings; it pulls out semantically relevant chunks to be used downstream by the LLM.
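Here is a minimal sketch of that retrieval step, assuming the `sentence-transformers` package is installed. The passages are toy data, and the model name is just a common small embedding model; any embedding model and vector store would work the same way.

```python
# Minimal neural-retrieval sketch (assumes `sentence-transformers` is installed).
import numpy as np
from sentence_transformers import SentenceTransformer

passages = [
    "Team heights — Alex 6′4″, Jordan 5′9″, Emily 5′4″, James 6′2″.",
    "Office wifi password rotation policy: changes every 90 days.",
    "Quarterly OKR review happens on the first Monday of each quarter.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
# Embed once, offline; a real system would store these in a vector database.
passage_vecs = model.encode(passages, normalize_embeddings=True)

query = "Who's the tallest person on our team?"
query_vec = model.encode([query], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity is just a dot product.
scores = passage_vecs @ query_vec
best = int(np.argmax(scores))
print(passages[best])  # the team-heights record should score highest
```

A production setup swaps the in-memory dot product for a vector database's nearest-neighbor search, but the logic is the same.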
Augmented Prompts (and why they usually work)
After retrieval, we stuff the relevant passages into the prompt and tell the model to use them. Thanks to instruction tuning, modern chat models usually follow these instructions well. If they don’t (or if you need higher accuracy), you can fine-tune with RAG-style training data—inputs that include context passages and outputs that cite and constrain themselves to that context.
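A typical augmented-prompt template looks something like the sketch below. The wording is illustrative, not a standard; the "say you don't know" clause is a common way to keep the model from falling back on ungrounded guesses, and numbering the passages makes it easy to ask for citations later.

```python
# Illustrative augmented-prompt template: constrain the model to the retrieved
# context and give it an explicit way out when the context lacks the answer.
def build_prompt(question: str, passages: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Context:\n{context}\n\n"
        "Answer the question using ONLY the context above. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Question: {question}"
    )
```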
When Should You Use RAG?
RAG is popular because it’s high-leverage: relatively straightforward to implement and extremely effective at:
- Reducing hallucinations (answers are grounded in retrieved text).
- Injecting missing knowledge without re-training the base model.
Common use cases:
1) Question-answering over fresh information
LLMs only “know” up to their training cutoffs (unless the product offers integrated browsing or tools). For example, the original GPT-4’s knowledge cutoff was September 2021, and GPT-4 Turbo’s was April 2023. Ask either about something that happened this month, and you need to supply context. With RAG, you can retrieve current data (news, docs, analytics) and get an up-to-date answer.
2) Question-answering over proprietary or niche data
Think company policies, internal wikis, compliance manuals, product specs, or your team’s meeting notes—things that are not in public training data. You could fine-tune a model with these, but you’d have to redo it whenever the docs change. With RAG, you just update your document store; no model retraining required. When someone asks, “How many wellness points do we get per year?”, the system retrieves the HR policy page and the model answers based only on that page.
Putting It Together
RAG isn’t magic—it’s a practical pattern that plugs a big gap in LLMs:
- LLMs are great at language and reasoning patterns, but they’re brittle on facts they don’t have.
- Retrieval brings the right facts to the prompt at the right time.
- Instruction-tuned models then follow directions and generate answers grounded in those facts.
For many real-world applications, RAG offers the best balance of accuracy, maintainability, and cost—often beating naive fine-tuning attempts, especially when your knowledge changes frequently.