Part 3 of a series on building reliable AI systems
In Part 1, we explored why testing AI systems is different.
In Part 2, we built evaluation pipelines.
Now let’s focus on one of the most widely used (and misunderstood) patterns:
Retrieval-Augmented Generation (RAG).
RAG is often seen as a solution to hallucinations.
In reality, it just shifts the problem.
The Core Problem with RAG
A typical RAG pipeline looks like this:
User Query
↓
Retriever → Context
↓
LLM → Response
When something goes wrong, it’s not always obvious where the failure is.
- Did retrieval fail?
- Was the context irrelevant?
- Did the model ignore the context?
- Or did it hallucinate anyway?
Without proper evaluation, everything looks like a “model problem.”
RAG Has Two Systems, Not One
This is the key insight:
You are not evaluating a single system—you are evaluating two tightly coupled systems.
- Retriever (search problem)
- Generator (language problem)
If you don’t evaluate them separately, debugging becomes guesswork.
What Should You Measure?
To evaluate RAG properly, you need to break it into components.
1. Retrieval Quality
Question: Did we fetch the right information?
Metrics to consider:
- Top-K relevance
- Context recall (was the correct doc retrieved?)
- Ranking quality
Example failure:
The correct document exists—but wasn’t retrieved.
No model can fix missing context.
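Context recall and top-K relevance are straightforward to compute once each eval sample is labeled with the documents that *should* have been retrieved. A minimal sketch (the function name and doc IDs are illustrative, not from any specific library):

```python
def recall_at_k(retrieved_ids, expected_ids, k=5):
    """Fraction of expected documents that appear in the top-k retrieved results."""
    if not expected_ids:
        return 1.0  # nothing to find counts as success
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for doc_id in expected_ids if doc_id in top_k)
    return hits / len(expected_ids)

# Two of the three expected docs appear in the top 5 -> recall of 2/3
print(recall_at_k(["d1", "d7", "d3", "d9", "d2"], ["d1", "d3", "d8"]))
```

A recall of 0 on a sample tells you immediately that no amount of prompt tuning will fix that answer: the generator never saw the evidence.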
2. Context Relevance
Question: Is the retrieved content actually useful?
Even if retrieval “works,” the context may be:
- Noisy
- Partially relevant
- Outdated
This leads to weak or incorrect answers.
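A crude but useful first-pass filter for irrelevant context is lexical overlap between the query and each retrieved chunk. Embedding similarity or an LLM judge are the heavier-weight versions of the same check; this sketch (function name and stopword list are my own) just catches the obviously off-topic chunks:

```python
def context_relevance(query, context,
                      stopwords=frozenset({"the", "a", "an", "is", "of", "to", "for"})):
    """Share of (non-stopword) query terms that appear in a retrieved chunk."""
    query_terms = {t for t in query.lower().split() if t not in stopwords}
    context_terms = set(context.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & context_terms) / len(query_terms)

# Only "orders" overlaps; "refund" vs. "refunded" doesn't match lexically
print(context_relevance("refund policy for orders",
                        "Orders may be refunded within 30 days."))
```

The low score on a clearly related chunk also shows the limits of lexical matching, which is exactly why production systems tend to move to embeddings or judges for this metric.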
3. Grounding / Faithfulness
Question: Did the model use the retrieved context?
This is one of the most critical checks.
Failure patterns:
- Model ignores context
- Adds unsupported information
- Mixes correct and hallucinated facts
Evaluation idea:
Compare the response against the retrieved context, not just against the expected answer.
4. Answer Correctness
Question: Is the final answer actually correct?
This is what users see—but it’s the last layer.
Important:
Correct answers can still be poorly grounded, which is risky.
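For short factual answers, a normalized exact match is a reasonable starting point; free-form answers usually need semantic similarity or an LLM judge instead. A sketch (function name is mine):

```python
import string

def normalized_match(predicted, expected):
    """Exact match after lowercasing, stripping punctuation, and collapsing whitespace."""
    def norm(text):
        text = text.lower()
        text = text.translate(str.maketrans("", "", string.punctuation))
        return " ".join(text.split())
    return norm(predicted) == norm(expected)

print(normalized_match("  The answer is 42. ", "the answer is 42"))  # True
```

Note that this only scores the final answer; pair it with the grounding check above, since a correct-but-ungrounded answer is a lucky guess, not a reliable system.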
5. Hallucination Rate
Question: How often does the model generate unsupported information?
This is especially important in:
- Customer support
- Healthcare
- Finance
Track it explicitly—it won’t surface automatically.
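Tracking it explicitly can be as simple as aggregating a boolean flag from your grounding evaluator across eval runs. A minimal sketch (the `hallucinated` field is an assumption about how your evaluator labels each result):

```python
def hallucination_rate(results):
    """Share of responses flagged as containing unsupported claims.

    Assumes each result dict carries a boolean "hallucinated" field
    produced upstream by a grounding/faithfulness evaluator.
    """
    if not results:
        return 0.0
    flagged = sum(1 for r in results if r["hallucinated"])
    return flagged / len(results)

runs = [{"hallucinated": False}, {"hallucinated": True},
        {"hallucinated": False}, {"hallucinated": False}]
print(hallucination_rate(runs))  # 0.25
```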
A Practical Evaluation Flow
Here’s how you can structure RAG evaluation:
Input (Query)
↓
Retrieve Documents
↓
Evaluate Retrieval
↓
Generate Answer
↓
Evaluate Grounding + Correctness
Example Evaluation Loop
```python
for sample in dataset:
    # 1. Retrieve, and score the retrieval step on its own
    docs = retriever.retrieve(sample["query"])
    retrieval_score = evaluate_retrieval(docs, sample["expected_docs"])

    # 2. Generate, then score grounding and correctness separately
    answer = llm.generate(sample["query"], context=docs)
    grounding_score = evaluate_grounding(answer, docs)
    correctness_score = evaluate_answer(answer, sample["expected_answer"])

    # 3. Log every component score so failures can be attributed
    log({
        "query": sample["query"],
        "retrieval": retrieval_score,
        "grounding": grounding_score,
        "correctness": correctness_score,
    })
```
Real-World Failure Patterns
These show up again and again:
1. “Looks correct, but isn’t grounded”
- Answer sounds right
- Not supported by retrieved context
2. “Right data, wrong answer”
- Correct document retrieved
- Model misinterprets it
3. “No retrieval, full hallucination”
- Retriever fails
- Model still generates confident answer
4. “Too much context”
- Irrelevant documents dilute signal
- Model produces vague responses
Common Mistakes
- Evaluating only final answer
- Ignoring retrieval metrics
- Assuming RAG eliminates hallucinations
- Not separating retrieval vs generation failures
Practical Tips
- Start with a small, high-quality dataset
- Log retrieved documents for every query
- Evaluate components separately
- Track metrics over time (not just one run)
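Tracking metrics over time doesn't require infrastructure to start: appending each run's aggregate scores to a JSONL file is enough to spot regressions between retriever or prompt changes. A minimal sketch (file name and fields are illustrative):

```python
import json
import time

def log_run(metrics, path="eval_runs.jsonl"):
    """Append one evaluation run's metrics to a JSONL file with a timestamp."""
    record = {"timestamp": time.time(), **metrics}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_run({"retrieval": 0.82, "grounding": 0.74, "correctness": 0.69})
```

Comparing the last N records after each change tells you whether a "fix" to generation quietly degraded retrieval, which a single-run eval would miss.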
What’s Next
In the next part, I’ll go deeper into:
- Evaluating AI agents (multi-step workflows)
- Tracing and debugging agent behavior
- Measuring task success and failure modes
Final Thoughts
RAG doesn’t remove hallucinations—it changes where they come from.
If you only evaluate outputs, you’ll miss the real problem.
Reliable RAG systems come from:
- Strong retrieval
- Grounded generation
- Continuous evaluation
Because in RAG, the answer is only as good as the context behind it.