Part 3 of a series on building reliable AI systems
In Part 1, we explored why testing AI systems is different.
In Part 2, we built evaluation pipelines.
Now let’s focus on one of the most widely used (and misunderstood) patterns:
Retrieval-Augmented Generation (RAG).
RAG is often seen as a solution to hallucinations.
In reality, it just shifts the problem.
The Core Problem with RAG
A typical RAG pipeline looks like this:
User Query
↓
Retriever → Context
↓
LLM → Response
When something goes wrong, it’s not always obvious where the failure is.
- Did retrieval fail?
- Was the context irrelevant?
- Did the model ignore the context?
- Or did it hallucinate anyway?
Without proper evaluation, everything looks like a “model problem.”
RAG Has Two Systems, Not One
This is the key insight:
You are not evaluating a single system—you are evaluating two tightly coupled systems.
- Retriever (search problem)
- Generator (language problem)
If you don’t evaluate them separately, debugging becomes guesswork.
What Should You Measure?
To evaluate RAG properly, you need to break it into components.
1. Retrieval Quality
Question: Did we fetch the right information?
Metrics to consider:
- Top-K relevance
- Context recall (was the correct doc retrieved?)
- Ranking quality
Example failure:
The correct document exists—but wasn’t retrieved.
No model can fix missing context.
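Context recall and top-K relevance are straightforward to compute once each eval sample is labeled with the documents that *should* have been retrieved. A minimal sketch (the function name and doc IDs are illustrative, not from any specific library):

```python
def recall_at_k(retrieved_ids, expected_ids, k=5):
    """Fraction of expected documents that appear in the top-k retrieved results."""
    if not expected_ids:
        return 1.0  # nothing to find counts as success
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for doc_id in expected_ids if doc_id in top_k)
    return hits / len(expected_ids)

# Two of the three expected docs appear in the top 5 -> recall of 2/3
print(recall_at_k(["d1", "d7", "d3", "d9", "d2"], ["d1", "d3", "d8"]))
```

A recall of 0 on a sample tells you immediately that no amount of prompt tuning will fix that answer: the generator never saw the evidence.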
2. Context Relevance
Question: Is the retrieved content actually useful?
Even if retrieval “works,” the context may be:
- Noisy
- Partially relevant
- Outdated
This leads to weak or incorrect answers.
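A crude but useful first-pass filter for irrelevant context is lexical overlap between the query and each retrieved chunk. Embedding similarity or an LLM judge are the heavier-weight versions of the same check; this sketch (function name and stopword list are my own) just catches the obviously off-topic chunks:

```python
def context_relevance(query, context,
                      stopwords=frozenset({"the", "a", "an", "is", "of", "to", "for"})):
    """Share of (non-stopword) query terms that appear in a retrieved chunk."""
    query_terms = {t for t in query.lower().split() if t not in stopwords}
    context_terms = set(context.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & context_terms) / len(query_terms)

# Only "orders" overlaps; "refund" vs. "refunded" doesn't match lexically
print(context_relevance("refund policy for orders",
                        "Orders may be refunded within 30 days."))
```

The low score on a clearly related chunk also shows the limits of lexical matching, which is exactly why production systems tend to move to embeddings or judges for this metric.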
3. Grounding / Faithfulness
Question: Did the model use the retrieved context?
This is one of the most critical checks.
Failure patterns:
- Model ignores context
- Adds unsupported information
- Mixes correct and hallucinated facts
Evaluation idea:
Compare the response against the retrieved context, not just against the expected answer.
4. Answer Correctness
Question: Is the final answer actually correct?
This is what users see—but it’s the last layer.
Important:
Correct answers can still be poorly grounded, which is risky.
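For short factual answers, a normalized exact match is a reasonable starting point; free-form answers usually need semantic similarity or an LLM judge instead. A sketch (function name is mine):

```python
import string

def normalized_match(predicted, expected):
    """Exact match after lowercasing, stripping punctuation, and collapsing whitespace."""
    def norm(text):
        text = text.lower()
        text = text.translate(str.maketrans("", "", string.punctuation))
        return " ".join(text.split())
    return norm(predicted) == norm(expected)

print(normalized_match("  The answer is 42. ", "the answer is 42"))  # True
```

Note that this only scores the final answer; pair it with the grounding check above, since a correct-but-ungrounded answer is a lucky guess, not a reliable system.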
5. Hallucination Rate
Question: How often does the model generate unsupported information?
This is especially important in:
- Customer support
- Healthcare
- Finance
Track it explicitly—it won’t surface automatically.
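Tracking it explicitly can be as simple as aggregating a boolean flag from your grounding evaluator across eval runs. A minimal sketch (the `hallucinated` field is an assumption about how your evaluator labels each result):

```python
def hallucination_rate(results):
    """Share of responses flagged as containing unsupported claims.

    Assumes each result dict carries a boolean "hallucinated" field
    produced upstream by a grounding/faithfulness evaluator.
    """
    if not results:
        return 0.0
    flagged = sum(1 for r in results if r["hallucinated"])
    return flagged / len(results)

runs = [{"hallucinated": False}, {"hallucinated": True},
        {"hallucinated": False}, {"hallucinated": False}]
print(hallucination_rate(runs))  # 0.25
```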
A Practical Evaluation Flow
Here’s how you can structure RAG evaluation:
Input (Query)
↓
Retrieve Documents
↓
Evaluate Retrieval
↓
Generate Answer
↓
Evaluate Grounding + Correctness
Example Evaluation Loop
```python
for sample in dataset:
    # 1. Retrieve, and score the retrieval step on its own
    docs = retriever.retrieve(sample["query"])
    retrieval_score = evaluate_retrieval(docs, sample["expected_docs"])

    # 2. Generate, then score grounding and correctness separately
    answer = llm.generate(sample["query"], context=docs)
    grounding_score = evaluate_grounding(answer, docs)
    correctness_score = evaluate_answer(answer, sample["expected_answer"])

    # 3. Log every component score so failures can be attributed
    log({
        "query": sample["query"],
        "retrieval": retrieval_score,
        "grounding": grounding_score,
        "correctness": correctness_score,
    })
```
Real-World Failure Patterns
These show up again and again:
1. “Looks correct, but isn’t grounded”
- Answer sounds right
- Not supported by retrieved context
2. “Right data, wrong answer”
- Correct document retrieved
- Model misinterprets it
3. “No retrieval, full hallucination”
- Retriever fails
- Model still generates confident answer
4. “Too much context”
- Irrelevant documents dilute signal
- Model produces vague responses
Common Mistakes
- Evaluating only final answer
- Ignoring retrieval metrics
- Assuming RAG eliminates hallucinations
- Not separating retrieval vs generation failures
Practical Tips
- Start with a small, high-quality dataset
- Log retrieved documents for every query
- Evaluate components separately
- Track metrics over time (not just one run)
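Tracking metrics over time doesn't require infrastructure to start: appending each run's aggregate scores to a JSONL file is enough to spot regressions between retriever or prompt changes. A minimal sketch (file name and fields are illustrative):

```python
import json
import time

def log_run(metrics, path="eval_runs.jsonl"):
    """Append one evaluation run's metrics to a JSONL file with a timestamp."""
    record = {"timestamp": time.time(), **metrics}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_run({"retrieval": 0.82, "grounding": 0.74, "correctness": 0.69})
```

Comparing the last N records after each change tells you whether a "fix" to generation quietly degraded retrieval, which a single-run eval would miss.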
What’s Next
In the next part, I’ll go deeper into:
- Evaluating AI agents (multi-step workflows)
- Tracing and debugging agent behavior
- Measuring task success and failure modes
Final Thoughts
RAG doesn’t remove hallucinations—it changes where they come from.
If you only evaluate outputs, you’ll miss the real problem.
Reliable RAG systems come from:
- Strong retrieval
- Grounded generation
- Continuous evaluation
Because in RAG, the answer is only as good as the context behind it.