How I Built a Hallucination Detector for RAG Pipelines in Python
Every developer who has shipped a RAG application knows this moment.
You retrieve the right documents. You pass them to the LLM. The response comes back confident, well-structured, and fluent. You ship it.
Then a user reports that the LLM cited a statistic that wasn't in any of your documents. Or named a person who doesn't exist. Or described a process that contradicts your source material — while sounding completely authoritative.
This is hallucination in a RAG pipeline. And it is surprisingly hard to catch systematically.
I built HallucinationBench to solve this in the simplest way possible.
The core idea
The approach is straightforward: use GPT-4o-mini as a structured judge.
Given a context (your retrieved documents) and a response (the LLM's output), the judge:
- Breaks the response into individual factual claims
- Classifies each claim as grounded (supported by context) or hallucinated (absent or contradicted)
- Returns a faithfulness score and a verdict
That's it. No embeddings, no vector databases, no infrastructure. Just one API call.
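Conceptually, the faithfulness score is just the fraction of claims that survive the check. A minimal sketch of that arithmetic (the formula is my reading of the output format, not necessarily the library's exact computation):

```python
def faithfulness(grounded: list[str], hallucinated: list[str]) -> float:
    """Fraction of extracted claims that are supported by the context."""
    total = len(grounded) + len(hallucinated)
    # An empty response makes no claims, so nothing can be hallucinated.
    return len(grounded) / total if total else 1.0
```

With 2 grounded and 3 hallucinated claims, this yields 0.40 — exactly the score in the example below.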
Usage
```python
from hallucinationbench import score

context = """
The Eiffel Tower is located in Paris, France. It was constructed between
1887 and 1889 as the entrance arch for the 1889 World's Fair.
The tower is 330 metres tall.
"""

response = """
The Eiffel Tower is in Paris. It was built in 1889 and stands 330 metres
tall. It was designed by Leonardo da Vinci and attracts over 7 million
visitors every year.
"""

result = score(context=context, response=response)
print(result)
```
Output:

```text
Verdict      : FAIL
Faithfulness : 0.40

Grounded claims (2):
  ✓ The Eiffel Tower is in Paris.
  ✓ It stands 330 metres tall.

Hallucinated claims (3):
  ✗ It was built in 1889.
  ✗ It was designed by Leonardo da Vinci.
  ✗ It attracts over 7 million visitors every year.
```
Two outright fabrications are caught cleanly: Leonardo da Vinci (it was Gustave Eiffel) and the 7 million visitor figure, which appears nowhere in the context. The third flag, "It was built in 1889", is a borderline call: the context says construction ran from 1887 to 1889, so the judge treats the compressed date as unsupported. With 2 of 5 claims grounded, the faithfulness score is 0.40.
The result object
```python
result.faithfulness_score   # float, 0.0 – 1.0
result.grounded_claims      # list of supported statements
result.hallucinated_claims  # list of fabricated statements
result.verdict              # "PASS" | "WARN" | "FAIL"
result.model                # judge model used
```
| Verdict | Faithfulness score |
|---|---|
| ✅ PASS | >= 0.8 |
| ⚠️ WARN | >= 0.5 and < 0.8 |
| ❌ FAIL | < 0.5 |
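The thresholds above reduce to two comparisons. A sketch of the mapping (the function name is mine, not the library's):

```python
def to_verdict(faithfulness_score: float) -> str:
    """Map a faithfulness score onto the PASS / WARN / FAIL bands."""
    if faithfulness_score >= 0.8:
        return "PASS"
    if faithfulness_score >= 0.5:
        return "WARN"
    return "FAIL"
```

Note that the boundaries are inclusive at the top of each band: a score of exactly 0.8 passes, and exactly 0.5 warns.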
How the judge prompt works
The system prompt instructs GPT-4o-mini to return raw JSON only — no markdown, no explanation, no code fences. Just the structured object.
```json
{
  "grounded_claims": ["claim 1", "claim 2"],
  "hallucinated_claims": ["claim A", "claim B"],
  "faithfulness_score": 0.75
}
```
Two design decisions that matter here:
response_format: json_object. This is OpenAI's native JSON mode. It guarantees that the output is syntactically valid JSON on every call; it does not enforce a particular schema, but it eliminates the most common class of parsing failures.

temperature: 0. Hallucination detection should be deterministic: the same context and response should always produce the same verdict. Temperature 0 enforces this.
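Put together, the judge call looks roughly like this. It's a sketch under assumptions: the prompt wording and helper names are mine; only the model choice and the two settings above come from the post.

```python
import json

# Hypothetical condensed prompt; the real system prompt is longer.
JUDGE_SYSTEM_PROMPT = (
    "You are a hallucination judge. Split the RESPONSE into factual claims, "
    "classify each as grounded in or absent from the CONTEXT, and return raw "
    "JSON with keys grounded_claims, hallucinated_claims, faithfulness_score."
)

def build_messages(context: str, response: str) -> list[dict]:
    """Assemble the chat messages for one evaluation."""
    return [
        {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
        {"role": "user", "content": f"CONTEXT:\n{context}\n\nRESPONSE:\n{response}"},
    ]

def judge(context: str, response: str) -> dict:
    """One deterministic, JSON-mode call to GPT-4o-mini."""
    from openai import OpenAI  # imported lazily so the helpers above work without the SDK

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,                            # deterministic verdicts
        response_format={"type": "json_object"},  # native JSON mode
        messages=build_messages(context, response),
    )
    return json.loads(completion.choices[0].message.content)
```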
Why GPT-4o-mini as the judge?
Three reasons:
Cost. Each evaluation costs approximately $0.001, about a tenth of a cent. You can run thousands of evaluations for a few dollars.
Speed. GPT-4o-mini is fast. A typical evaluation completes in under 2 seconds.
Accuracy. For claim-level classification against a provided context, GPT-4o-mini performs well. It is not doing open-domain knowledge retrieval — it is comparing claims against text you have already provided. That is a much simpler task.
Installation
```shell
pip install openai python-dotenv
```
Set your OpenAI API key:
```text
# .env
OPENAI_API_KEY=your_key_here
```
Clone and run the Streamlit demo:
```shell
git clone https://github.com/bdeva1975/hallucinationbench.git
cd hallucinationbench
pip install -r requirements.txt
streamlit run app.py
```
What's next
The roadmap for v0.2.0:
- Batch evaluation — score multiple context/response pairs in one call
- CSV upload in the Streamlit app
- Custom judge model — bring your own OpenAI model
- LangChain and LlamaIndex integration hooks
- CI/CD example — run hallucination checks as a GitHub Actions gate before deployment
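As a preview of the CI/CD item, a gate can be as small as one assertion over a batch of scored pairs. Everything here is hypothetical; v0.2.0 may look different:

```python
def gate(faithfulness_scores: list[float], threshold: float = 0.8) -> bool:
    """Return True only if every evaluated pair clears the threshold.

    In CI, `assert gate(scores)` fails the build on any low-faithfulness case.
    """
    return all(s >= threshold for s in faithfulness_scores)
```

A pytest step in GitHub Actions would then run the evaluation suite and fail the workflow before deployment whenever the gate returns False.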
The bigger picture
As RAG applications move from prototype to production, hallucination detection is becoming a non-negotiable quality gate — not an optional debugging tool.
Gartner predicts that by 2028, 60% of software engineering teams will use AI evaluation and observability platforms to build user trust in AI applications, up from just 18% in 2025.
HallucinationBench is the lightweight, zero-infrastructure entry point to that category. No account, no cloud, no dashboard required. Just pip install and two lines of code.
GitHub: https://github.com/bdeva1975/hallucinationbench
Star it if you find it useful. PRs and feedback welcome.
Built with Python, OpenAI GPT-4o-mini, and Streamlit.