Faizal

Posted on Jun 11

RAG-Based Testing Series — Part 3: Faithfulness & Hallucination Detection

#testing #ai #rag #python


"The scariest bug in software is the one that looks correct."

In Part 2, we tested retrieval quality.

We wrote real tests. We calculated Precision@K, Recall@K, and MRR. We built assertions that fail loudly when the wrong documents are fetched.

And let's say your retrieval is now solid. ✅

The right documents are being fetched. Scores are green. The context flowing into your LLM is accurate and relevant.

You're done, right?

Wrong. 😬

Because here's the problem nobody tells you about until it bites them in production 👇

🔴 The Problem With a "Perfect" Retriever

You can have flawless retrieval and still get completely wrong answers.

How?

Because once the retrieved context reaches the LLM — the LLM decides what to do with it.

And sometimes, the LLM doesn't use it.

Sometimes it ignores the context entirely and answers from its own training data — which could be outdated, biased, or just plain wrong.

Sometimes it uses part of the context and silently fills in the gaps with invention.

Sometimes it confidently states something that contradicts the very document it was given.

This is hallucination — and it lives in the generation layer, not the retrieval layer.

Two completely separate failure modes. Two separate testing strategies. 🎯

🧠 Two Types of Hallucination in RAG Systems

Before we test, we need to understand what we're actually dealing with.

Type 1 — Intrinsic Hallucination

The LLM generates an answer that directly contradicts the retrieved context.

Retrieved context: "Refunds are available within 30 days of purchase."
LLM answer:        "You have 60 days to request a refund."

The context was there. The LLM ignored it. 🔴

This is the most dangerous type — because the correct answer was available, and the system chose not to use it.

Type 2 — Extrinsic Hallucination

The LLM generates information that isn't present in the retrieved context at all — it neither confirms nor contradicts it. The LLM just invented it.

Retrieved context: "Premium subscribers can contact support via the portal."
LLM answer:        "Premium subscribers can contact support via the portal or by calling 1-800-555-0100."

Where did that phone number come from? 👻 Not from any document.

This is subtler and harder to catch — because the answer isn't necessarily wrong, it's just unverifiable from the context. And in production, unverifiable often becomes wrong.
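To make the distinction concrete, here is a minimal sketch that classifies an answer from per-claim labels. The `ClaimLabel` enum and `hallucination_type` function are hypothetical names for illustration, not part of any library, and the labels themselves are assumed to come from a human reviewer or a judge model.

```python
from enum import Enum


class ClaimLabel(Enum):
    SUPPORTED = "supported"            # stated in the retrieved context
    CONTRADICTED = "contradicted"      # conflicts with the retrieved context
    NOT_IN_CONTEXT = "not_in_context"  # neither confirmed nor contradicted


def hallucination_type(labels):
    """Classify an answer from the labels of its individual claims."""
    if ClaimLabel.CONTRADICTED in labels:
        return "intrinsic"
    if ClaimLabel.NOT_IN_CONTEXT in labels:
        return "extrinsic"
    return "faithful"


# The refund example: the single claim contradicts the context.
print(hallucination_type([ClaimLabel.CONTRADICTED]))
# The phone-number example: one grounded claim, one invented claim.
print(hallucination_type([ClaimLabel.SUPPORTED, ClaimLabel.NOT_IN_CONTEXT]))
```

Contradictions are checked first, so an answer that both contradicts and invents is reported as intrinsic. That ordering is a design choice in this sketch, not a standard.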

📐 The Metric That Captures This: Faithfulness

In RAG evaluation, we measure this with a metric called Faithfulness.

Faithfulness measures whether every claim in the generated answer can be traced back to the retrieved context.

The score ranges from 0 to 1.

1.0 = Every single statement in the answer is grounded in the retrieved context ✅
0.0 = The answer is entirely fabricated — no connection to the context at all 🔴
0.6 = Some statements are grounded, others are invented — a mixed, untrustworthy answer ⚠️

The formula conceptually:

Faithfulness = (Claims in answer supported by context) / (Total claims in answer)
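As a worked example of that ratio, the sketch below scores the phone-number answer from earlier by hand. The function name `faithfulness_score` and the hand-assigned verdicts are assumptions for illustration. This is the arithmetic only, not how any framework extracts or judges claims.

```python
def faithfulness_score(verdicts):
    """Return supported claims / total claims for a list of True/False verdicts."""
    if not verdicts:
        raise ValueError("an answer with no claims has no faithfulness score")
    return sum(verdicts) / len(verdicts)


# Hand-labelled claims from the extrinsic example above.
claims = {
    "Premium subscribers can contact support via the portal": True,
    "Premium subscribers can contact support by calling 1-800-555-0100": False,
}

print(faithfulness_score(list(claims.values())))  # 0.5
```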

In practice, labelling every claim by hand is impractical at scale. That's where RAGAS comes in. 👇

🛠️ Let's Build Hallucination Detection Tests

We're going to use RAGAS, a widely used open-source Python framework for RAG evaluation, to measure faithfulness automatically.

Setup

pip install ragas
pip install openai
pip install datasets
pip install langchain-openai

You'll need an OpenAI API key — RAGAS uses an LLM under the hood to evaluate faithfulness. (We'll talk about what that means in a moment.)

Step 1 — Understand What RAGAS Needs

RAGAS faithfulness evaluation needs three things for each test case:

1. question      — the user's original query
2. answer        — the LLM's generated response
3. contexts      — the list of retrieved document chunks used to generate the answer

That's it. No ground truth answer required for faithfulness. ✅

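This matters because in production you often don't have a "correct" answer to compare against. Faithfulness only checks whether the answer is grounded in what was retrieved. One caveat: these column names have changed across RAGAS releases (newer versions use user_input, response, and retrieved_contexts), so check the docs for the version you install.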
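Before sending anything to an evaluator, it can help to check that each record really has those three fields in the expected shape. The sketch below is one way to do that. `validate_record` and `REQUIRED_FIELDS` are hypothetical helpers, not RAGAS functions, and the field names are the ones used in this article.

```python
REQUIRED_FIELDS = ("question", "answer", "contexts")


def validate_record(record):
    """Return a list of problems found in one test case (empty list means valid)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in record]

    for field in ("question", "answer"):
        if field in record:
            value = record[field]
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{field} must be a non-empty string")

    if "contexts" in record:
        contexts = record["contexts"]
        if not isinstance(contexts, list) or not contexts:
            problems.append("contexts must be a non-empty list")
        elif not all(isinstance(c, str) and c.strip() for c in contexts):
            problems.append("every context chunk must be a non-empty string")

    return problems


# A frequent slip: passing one string instead of a list of chunks.
bad = {"question": "How do I reset my password?", "answer": "Click the link.", "contexts": "one big string"}
print(validate_record(bad))  # ['contexts must be a non-empty list']
```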

Step 2 — Build Your Test Dataset

from datasets import Dataset

# Each entry = one RAG interaction to evaluate
faithfulness_test_data = [
    {
        "question": "What is the refund policy for premium subscribers?",
        "answer": "Premium subscribers can request a full refund within 30 days of purchase through the support portal.",
        "contexts": [
            "Premium subscribers are eligible for a full refund within 30 days of purchase. Requests must be submitted via the support portal."
        ]
    },
    {
        "question": "What is the refund policy for premium subscribers?",
        # Intrinsic hallucination (says 60 days, context says 30) plus an invented support line
        "answer": "Premium subscribers are entitled to a full refund within 60 days. They can also call our support line directly.",
        "contexts": [
            "Premium subscribers are eligible for a full refund within 30 days of purchase. Requests must be submitted via the support portal."
        ]
    },
    {
        "question": "How do I reset my password?",
        "answer": "Click the Forgot Password link on the login page. A reset link will be sent to your registered email address.",
        "contexts": [
            "To reset your password, click the Forgot Password link on the login page. A reset link will be sent to your registered email."
        ]
    },
    {
        "question": "How do I reset my password?",
        # Extrinsic hallucination — invented the SMS option
        "answer": "Click the Forgot Password link on the login page. You can also reset via SMS if you have your phone number registered.",
        "contexts": [
            "To reset your password, click the Forgot Password link on the login page. A reset link will be sent to your registered email."
        ]
    },
]

dataset = Dataset.from_list(faithfulness_test_data)

Notice we have two versions of some questions — one faithful answer and one hallucinated answer. This lets us verify our detection is working correctly. If RAGAS scores the hallucinated answers low and the faithful answers high, we know the evaluation is doing its job. 🎯

Step 3 — Run RAGAS Faithfulness Evaluation

import os

from ragas import evaluate
from ragas.metrics import faithfulness
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Read the key from the environment instead of hardcoding it in the test file
api_key = os.environ["OPENAI_API_KEY"]

# Set up the judge LLM and embeddings for RAGAS
llm = ChatOpenAI(model="gpt-4o-mini", openai_api_key=api_key)
embeddings = OpenAIEmbeddings(openai_api_key=api_key)

# Run evaluation
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness],
    llm=llm,
    embeddings=embeddings
)

print(results)

Sample output:

{'faithfulness': 0.7}

That's the average across all test cases. But we need per-case scores for proper assertions. Here's how to get them 👇

import pandas as pd

# Convert to DataFrame for per-case analysis
df = results.to_pandas()
print(df[["question", "answer", "faithfulness"]])

Output:

question                                    answer                                              faithfulness
What is the refund policy...                Premium subscribers can request a full refund...    1.0
What is the refund policy...                Premium subscribers are entitled to a full refund   0.3
How do I reset my password?                 Click the Forgot Password link...                   1.0
How do I reset my password?                 Click the Forgot Password link... also via SMS      0.5

The hallucinated answers score significantly lower. ✅
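Since the dataset pairs each faithful answer with a hallucinated one, you can also check the detector itself: does every hallucinated case score below every faithful case? Here is a small sketch using the illustrative scores from the table above. `detector_separates` is a hypothetical helper, and the `hallucinated` flags are labels we assign by hand when writing the test data.

```python
def detector_separates(scores, hallucinated):
    """True if every hallucinated case scores below every faithful case."""
    bad = [s for s, h in zip(scores, hallucinated) if h]
    good = [s for s, h in zip(scores, hallucinated) if not h]
    if not bad or not good:
        raise ValueError("need at least one faithful and one hallucinated case")
    return max(bad) < min(good)


scores = [1.0, 0.3, 1.0, 0.5]               # illustrative per-case scores
hallucinated = [False, True, False, True]    # hand-assigned labels, in dataset order

print(detector_separates(scores, hallucinated))  # True
```

If this ever returns False, the problem may be in the judge rather than in your RAG pipeline, which is worth knowing before you trust any threshold.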

Step 4 — Add Pytest Assertions

import pytest

FAITHFULNESS_THRESHOLD = 0.8  # answers must be at least 80% grounded in context


def test_faithfulness_scores():
    results = evaluate(
        dataset=dataset,
        metrics=[faithfulness],
        llm=llm,
        embeddings=embeddings
    )

    df = results.to_pandas()

    failures = []
    for _, row in df.iterrows():
        if row["faithfulness"] < FAITHFULNESS_THRESHOLD:
            failures.append({
                "question": row["question"],
                "answer": row["answer"],
                "score": row["faithfulness"]
            })

    if failures:
        failure_report = "\n\n".join([
            f"❌ HALLUCINATION DETECTED\n"
            f"   Question: {f['question']}\n"
            f"   Answer:   {f['answer']}\n"
            f"   Score:    {f['score']} (threshold: {FAITHFULNESS_THRESHOLD})"
            for f in failures
        ])
        pytest.fail(f"\n\n{len(failures)} faithfulness failure(s) detected:\n\n{failure_report}")


def test_no_critical_hallucinations():
    """
    Critical test — any answer scoring below 0.3 is an outright fabrication.
    This should never happen in a production system.
    """
    CRITICAL_THRESHOLD = 0.3

    results = evaluate(
        dataset=dataset,
        metrics=[faithfulness],
        llm=llm,
        embeddings=embeddings
    )

    df = results.to_pandas()
    critical_failures = df[df["faithfulness"] < CRITICAL_THRESHOLD]

    assert len(critical_failures) == 0, (
        f"\n🚨 CRITICAL HALLUCINATION(S) DETECTED — answers are almost entirely fabricated:\n"
        f"{critical_failures[['question', 'answer', 'faithfulness']].to_string()}"
    )
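Two practical notes on these tests. First, the Step 2 dataset contains deliberately hallucinated answers, so test_faithfulness_scores is expected to fail on it. In a real suite you would point it at answers from your live pipeline. Second, both tests call evaluate, which means two rounds of paid LLM calls. One option is to run the evaluation once in a module-scoped pytest fixture and keep the threshold logic in a pure function that can be unit-tested without any API call. The sketch below shows only that pure part. `collect_failures` and `format_report` are hypothetical helper names, and the rows are illustrative.

```python
def collect_failures(rows, threshold):
    """Return the rows whose faithfulness score is strictly below the threshold."""
    return [row for row in rows if row["faithfulness"] < threshold]


def format_report(failures, threshold):
    """Build a readable multi-line report, one entry per failing answer."""
    return "\n\n".join(
        f"Question: {f['question']}\n"
        f"Score:    {f['faithfulness']} (threshold: {threshold})"
        for f in failures
    )


# Illustrative rows, shaped like results.to_pandas().to_dict("records")
rows = [
    {"question": "What is the refund policy for premium subscribers?", "faithfulness": 1.0},
    {"question": "What is the refund policy for premium subscribers?", "faithfulness": 0.3},
    {"question": "How do I reset my password?", "faithfulness": 1.0},
    {"question": "How do I reset my password?", "faithfulness": 0.5},
]

print(len(collect_failures(rows, 0.8)))  # 2
print(len(collect_failures(rows, 0.3)))  # 0, because 0.3 is not strictly below 0.3
```

Note the boundary: with a strict comparison, an answer scoring exactly 0.3 passes the critical check. Decide deliberately whether you want < or <= there.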

🤖 Wait — How Does RAGAS Actually Measure Faithfulness?

This is the question I always get. And it's a good one. 👇

RAGAS uses an LLM-as-judge approach.

Here's what happens under the hood:

Step 1 — RAGAS breaks the generated answer into individual claims
         "You have 60 days to request a refund"
         "You can call our support line directly"

Step 2 — For each claim, RAGAS asks the evaluation LLM:
         "Can this claim be inferred from the retrieved context? YES or NO"

Step 3 — Faithfulness score = supported claims / total claims

This is why RAGAS needs an LLM (like GPT-4o-mini) to run evaluations — the LLM is the judge.
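The shape of that loop can be sketched in a few lines. This is not the RAGAS implementation: claim extraction (Step 1) is assumed to have already happened, and `toy_judge` is a crude substring check standing in for the LLM call so the example runs offline. All names here are hypothetical.

```python
from typing import Callable, List

# A judge takes (claim, context) and returns True if the claim is supported.
Judge = Callable[[str, str], bool]


def score_answer(claims: List[str], context: str, judge: Judge) -> float:
    """Steps 2 and 3: ask the judge about each claim, then take the ratio."""
    if not claims:
        raise ValueError("no claims to judge")
    verdicts = [judge(claim, context) for claim in claims]
    return sum(verdicts) / len(verdicts)


def toy_judge(claim: str, context: str) -> bool:
    """Stand-in for the LLM judge: case-insensitive substring match only."""
    return claim.lower() in context.lower()


context = "Premium subscribers are eligible for a full refund within 30 days of purchase."
claims = ["full refund within 30 days", "full refund within 60 days"]

print(score_answer(claims, context, toy_judge))  # 0.5
```

Keeping the judge as a parameter makes the scoring logic testable on its own, and lets you swap in a real model call later.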

Which raises an obvious question 👇

🤔 Can You Trust an LLM to Evaluate Another LLM?

Honestly? It's imperfect. But it's the best tool we have right now.

Here's the nuanced reality:

LLM-as-judge works well when:

Claims are factual and specific ("30 days" vs "60 days")
The context is clear and unambiguous
You're checking whether a claim follows directly from the context, not deep reasoning

LLM-as-judge struggles when:

The context itself is ambiguous
The claim requires multi-step reasoning to verify
The evaluation LLM has biases toward certain writing styles

The practical answer:

Use LLM-as-judge as your automated gate in CI/CD — it catches the obvious hallucinations and regressions. Then add human review for edge cases, low-confidence scores (0.4–0.7 range), and any new category of query you haven't seen before.

Automated + human review is the production-grade approach. Not one or the other. 🎯
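One way to wire the human-review band into a pipeline is a small routing helper. This sketch assumes the 0.4 to 0.7 band mentioned above, treated as inclusive on both ends. `needs_human_review` and `review_queue` are hypothetical names, and the rows are illustrative.

```python
def needs_human_review(score, low=0.4, high=0.7):
    """True if the score falls inside the low-confidence band (inclusive)."""
    return low <= score <= high


def review_queue(rows, low=0.4, high=0.7):
    """Return the rows a human should look at, in their original order."""
    return [row for row in rows if needs_human_review(row["faithfulness"], low, high)]


rows = [
    {"question": "What is the refund policy for premium subscribers?", "faithfulness": 0.3},
    {"question": "How do I reset my password?", "faithfulness": 0.5},
    {"question": "How do I reset my password?", "faithfulness": 1.0},
]

print([row["faithfulness"] for row in review_queue(rows)])  # [0.5]
```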

📊 Faithfulness vs Answer Relevance — Don't Confuse Them

While we're here, let me clarify something that trips up a lot of people.

RAGAS has two metrics that sound similar but measure completely different things:

Metric	Question It Answers	What Can Fail It
Faithfulness	Is the answer grounded in the retrieved context?	LLM ignoring or contradicting context
Answer Relevance	Is the answer actually addressing the question?	LLM going off-topic, giving irrelevant info

You can have high faithfulness but low relevance — the answer is perfectly grounded in context, but it's answering the wrong question.

You can have high relevance but low faithfulness — the answer addresses the question perfectly, but it's made up.

You need both. Here's how to measure them together 👇

from ragas.metrics import faithfulness, answer_relevancy

results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy],
    llm=llm,
    embeddings=embeddings
)

df = results.to_pandas()
print(df[["question", "faithfulness", "answer_relevancy"]])
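Once you have both scores per answer, a tiny helper can turn them into one of four diagnoses. This is a sketch under an assumption: a single 0.8 threshold is used for both metrics here, which you would likely tune separately. `diagnose` is a hypothetical name.

```python
def diagnose(faithfulness_score, relevancy_score, threshold=0.8):
    """Map the two scores onto the four combinations described above."""
    grounded = faithfulness_score >= threshold
    on_topic = relevancy_score >= threshold
    if grounded and on_topic:
        return "ok"
    if grounded:
        return "grounded but off-topic"
    if on_topic:
        return "on-topic but ungrounded"
    return "ungrounded and off-topic"


print(diagnose(1.0, 0.4))  # grounded but off-topic
print(diagnose(0.3, 0.9))  # on-topic but ungrounded
```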

🔴 Real Hallucination Patterns You'll See in Production

These are the patterns I've encountered in real RAG systems — not theoretical edge cases:

Pattern 1 — The Confident Extrapolator

The LLM takes a true fact and confidently extends it beyond what the context supports.

Context:  "Express shipping is available."
Answer:   "Express shipping is available and typically delivers within 1-2 business days."

The timeframe was never in the context. The LLM invented a plausible detail. 👻

Why it's dangerous: The answer sounds authoritative. Users trust it. If the actual timeframe is 3-5 days — you have a customer expectation problem.

Pattern 2 — The Outdated Memory Override

The LLM's training data contradicts the retrieved context — and the LLM trusts its training over the context.

Context:  "As of Q1 2025, the premium plan costs $29/month."
Answer:   "The premium plan costs $19/month."

The LLM "remembered" the old price from its training data. 🔴

Why it's dangerous: This is the RAG failure mode that defeats the entire purpose of having a RAG system. And it's more common than you'd think with older LLMs.
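Patterns 1 and 2 both introduce a number that the context never mentions, which makes them candidates for a cheap pre-check before any LLM judge runs. The sketch below flags numeric tokens that appear in the answer but not in the context. It is a heuristic with obvious blind spots ("thirty" vs "30", reformatted amounts), and `ungrounded_numbers` is a hypothetical helper, so treat it as a supplement to faithfulness scoring rather than a replacement.

```python
import re

# Matches integers, decimals, and dollar amounts such as 30, 1.5, $29
NUMBER = re.compile(r"\$?\d+(?:[.,]\d+)*")


def ungrounded_numbers(answer, context):
    """Return numeric tokens that appear in the answer but not in the context."""
    seen = set(NUMBER.findall(context))
    return [token for token in NUMBER.findall(answer) if token not in seen]


# Pattern 2: the price in the answer is not the price in the context.
print(ungrounded_numbers(
    "The premium plan costs $19/month.",
    "As of Q1 2025, the premium plan costs $29/month.",
))  # ['$19']

# Pattern 1: the delivery timeframe was never in the context.
print(ungrounded_numbers(
    "Express shipping is available and typically delivers within 1-2 business days.",
    "Express shipping is available.",
))  # ['1', '2']
```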

Pattern 3 — The Synthesis Fabricator

Multiple documents are retrieved. The LLM combines them in a way that creates a new "fact" that wasn't in any individual document.

Doc 1: "Premium subscribers get priority support."
Doc 2: "Our support team is available 24/7."
Answer: "Premium subscribers get 24/7 priority support with a dedicated account manager."

The dedicated account manager came from nowhere. 👻

Why it's dangerous: Each individual document is accurate. The synthesis is not. Very hard to catch with simple checks.

🧩 The Full Picture So Far

Let's zoom out and see where we are in the testing stack:

Layer 1 — RETRIEVAL QUALITY (Part 2) ✅
          Are the right documents being fetched?
          → Precision@K, Recall@K, MRR, NDCG

Layer 2 — FAITHFULNESS & HALLUCINATION (Part 3) ✅  ← You are here
          Is the answer grounded in what was retrieved?
          → Faithfulness score, Answer Relevancy score

Layer 3 — EDGE CASES (Part 4) ← Up next
          What happens when things go wrong by design?
          → No relevant docs, conflicting context, adversarial queries

Layer 4 — FULL FRAMEWORK (Part 5)
          All layers combined into one runnable test suite

Layer 5 — CI/CD AUTOMATION (Part 6)
          Running automatically on every change

Each layer builds on the previous one. Skip any layer — and you have blind spots. 🎯

🔖 Key Takeaways From Part 3

Good retrieval doesn't guarantee a faithful answer — the LLM can still hallucinate even with perfect context
Hallucination has two forms — intrinsic (contradicts context) and extrinsic (invents beyond context)
Faithfulness is measurable — RAGAS gives you a 0–1 score per answer using LLM-as-judge
LLM-as-judge is imperfect but practical — use it as your automated gate, add human review for edge cases
Faithfulness ≠ Answer Relevance — measure both, they catch different failure modes
Set hard thresholds and fail loudly — a faithfulness score below 0.3 is a production incident waiting to happen

🚀 What's Next

In Part 4, we intentionally break things.

We've now tested the happy path — good retrieval, faithful answers.

But real users don't stay on the happy path. They ask questions your knowledge base can't answer. They send contradictory queries. They push the system into territory it was never designed for.

Edge cases are where RAG systems fail in ways you didn't predict.

Part 4 covers:

What happens when there are zero relevant documents
Conflicting context — two documents saying different things
Adversarial queries designed to trigger hallucinations
How to test all of these systematically

Part 1 — What Is RAG & Why It Needs Different Testing       ✅ Done
Part 2 — Testing Retrieval Quality: Are You Fetching Right? ✅ Done
Part 3 — Faithfulness & Hallucination Detection             ← You are here
Part 4 — Edge Cases: What Breaks RAG & How to Catch It      ← Up next
Part 5 — Building a RAG Test Framework from Scratch
Part 6 — Automating RAG Quality Checks in CI/CD

Follow me so you don't miss Part 4 — edge case testing is where QA instincts really shine. This is the part where 7.5 years of breaking things on purpose finally pays off. 😄

Drop a comment below 👇

Have you encountered hallucinations in a RAG system you've tested or built?
Which pattern surprised you most — intrinsic, extrinsic, or the synthesis fabricator?
Any questions about RAGAS or LLM-as-judge before we move to Part 4?

All levels welcome. Let's learn this together. 🙌

Faizal Shaikh | Senior Automation Engineer | AI & RAG-Based Testing
Connect with me on LinkedIn
