Gabriel Anhaia

4 Types of Hallucinations: One Detection Pattern Per Type


A customer pasted three sentences from the assistant into a ticket. The first cites a paper from 2024 that does not exist. The second sentence contradicts the third. None of them appear anywhere in the document the user actually uploaded.

If you are running a single hallucination check against that paragraph, you will catch one of those three problems and miss the other two. They are not the same defect. They come from different failure modes and need different detectors. Treating "hallucination" as one bucket is why your eval suite passes while support escalates.

The Ji et al. (2023) survey on hallucination in natural language generation splits the problem into intrinsic and extrinsic. SelfCheckGPT (Manakul et al., 2023) showed that sampled responses diverge on hallucinated facts and converge on grounded ones, and TruthfulQA (Lin, Hilton, and Evans, 2021) isolates the factual-mimicry failure on its own. Each maps to a different shape of error in production.

Four shapes, one harness — Python below.

Type 1: factual fabrication

The model invents an entity: a date, a citation, a statute, a function signature. The string is well-formed and the syntax is correct, but the referent does not exist.

You catch this by grounding extracted entities against an authoritative source: the detector is just a lookup against something you trust.

# detectors/factual.py
import re
from typing import Callable

CITATION_RE = re.compile(
    r"\b([A-Z][a-z]+(?:\s+et\s+al\.)?)[,\s]+\(?(\d{4})\)?"
)

def factual_fabrication(
    text: str,
    lookup: Callable[[str, int], bool],
) -> list[dict]:
    """Return list of unverified citations.
    `lookup(author, year) -> True if found in trusted source.`"""
    findings = []
    for match in CITATION_RE.finditer(text):
        author, year = match.group(1), int(match.group(2))
        if not lookup(author, year):
            findings.append({
                "kind": "factual_fabrication",
                "span": match.group(0),
                "reason": f"no record of {author} ({year})",
            })
    return findings

The regex is illustrative; for entities that matter (drug names, ICD codes, internal user IDs, ticker symbols), swap it for a domain extractor and a real index. The point is the contract: extract candidates, look them up in something you trust, fail loudly when the lookup misses.
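
To make the contract concrete, here is a minimal lookup backed by an in-memory index. The index contents and the citation strings are hypothetical stand-ins for whatever trusted source you actually have (a bibliography export, an internal registry, a ticker list).

# example: wiring factual_fabrication to a trusted index (illustrative)
from detectors.factual import factual_fabrication

# Hypothetical index; in practice this comes from a source you trust.
VERIFIED_CITATIONS = {("Manakul et al.", 2023), ("Lin", 2021)}

def citation_lookup(author: str, year: int) -> bool:
    return (author, year) in VERIFIED_CITATIONS

findings = factual_fabrication(
    "As Smith et al. (2024) showed, ...", citation_lookup
)
# -> [{"kind": "factual_fabrication",
#      "span": "Smith et al. (2024)",
#      "reason": "no record of Smith et al. (2024)"}]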

This is the type the customer noticed first. It is also the easiest to catch.

Type 2: intrinsic contradiction

The output disagrees with itself. Sentence three negates sentence one. The first paragraph says the patient is allergic to penicillin; the third paragraph recommends amoxicillin.

This one cannot be caught by lookup. It can be caught by sampling, which is the move SelfCheckGPT formalized. If a claim is grounded, repeated samples agree. Invented claims drift apart across samples. You generate k responses to the same prompt and measure pairwise agreement.

# detectors/contradiction.py
from itertools import combinations
from typing import Callable

def intrinsic_contradiction(
    samples: list[str],
    nli: Callable[[str, str], str],  # entail|neutral|contradict
) -> dict:
    """Pairwise NLI across k samples. Flag if any pair contradicts."""
    contradictions = 0
    pairs = list(combinations(range(len(samples)), 2))
    for i, j in pairs:
        if nli(samples[i], samples[j]) == "contradict":
            contradictions += 1
    rate = contradictions / max(1, len(pairs))
    return {
        "kind": "intrinsic_contradiction",
        "contradiction_rate": rate,
        "flagged": rate > 0.0,
    }

The nli callable is a natural-language-inference classifier. A small fine-tuned NLI model such as DeBERTa-MNLI works and is cheap enough to run on CPU. Calling a second LLM with a strict entail/neutral/contradict prompt also works, and that variant needs no model hosting, which is why most teams ship it first.
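
A minimal sketch of that second-LLM variant, assuming a hypothetical call_llm(prompt) -> str helper around whatever client you already use; the prompt wording and the neutral fallback are assumptions, not a fixed recipe.

# example: nli adapter backed by a second LLM (illustrative)
def nli(premise: str, hypothesis: str) -> str:
    """Return one of: entail | neutral | contradict."""
    prompt = (
        "Premise:\n"
        f"{premise}\n\n"
        "Hypothesis:\n"
        f"{hypothesis}\n\n"
        "Answer with exactly one word: entail, neutral, or contradict."
    )
    # call_llm is a hypothetical wrapper around your own model client.
    answer = call_llm(prompt).strip().lower()
    # Fall back to "neutral" if the model does not follow the format.
    return answer if answer in {"entail", "neutral", "contradict"} else "neutral"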

The aggregate signal is the contradiction rate across pairs. Treat one contradicting pair out of ten as a pointer for a deeper look. Don't make it a kill switch.

Type 3: prompt-vs-output divergence

The output ignores what the user gave you. The user uploaded a contract dated 2018 and asked for a summary; the assistant summarized a different contract. Or the user pastes error logs and the response answers a question they didn't ask.

This is the type that is easiest to test for and the most often skipped. The test is: does the output stay faithful to the input? Run NLI in one direction: does the input entail the output's claims about the input? Flag anything that drifts.

# detectors/divergence.py
from typing import Callable
def prompt_output_divergence(
    user_input: str,
    output: str,
    nli: Callable[[str, str], str],
    splitter: Callable[[str], list[str]],
) -> list[dict]:
    """For each output sentence that asserts something about
    the input, check input entails it."""
    findings = []
    for sentence in splitter(output):
        verdict = nli(user_input, sentence)
        if verdict == "contradict":
            findings.append({
                "kind": "prompt_output_divergence",
                "span": sentence,
                "reason": "input contradicts output sentence",
            })
    return findings

Two practical notes. The splitter matters: splitting on periods is fine for prose and terrible for code blocks and lists; use nltk.sent_tokenize or an equivalent. And "neutral" is not a fail: an output sentence that adds a generic disclaimer is an addition, not a divergence. Only "contradict" is a hard signal.
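
A splitter sketch along those lines, using nltk.sent_tokenize for prose and keeping fenced code blocks as single units. The triple-backtick handling is an assumption about what your outputs contain; adapt it to your own formatting.

# example: splitter that keeps code blocks intact (illustrative)
import re
from nltk.tokenize import sent_tokenize  # needs the nltk "punkt" data package

FENCE_RE = re.compile(r"```.*?```", re.DOTALL)

def splitter(text: str) -> list[str]:
    sentences: list[str] = []
    cursor = 0
    for block in FENCE_RE.finditer(text):
        sentences.extend(sent_tokenize(text[cursor:block.start()]))
        sentences.append(block.group(0))  # keep the whole code block as one unit
        cursor = block.end()
    sentences.extend(sent_tokenize(text[cursor:]))
    return [s for s in sentences if s.strip()]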

Type 4: tool-call hallucination (the short version)

Strict schemas guarantee the shape of a tool call. They guarantee nothing about the values. A model can emit delete_user(user_id="usr_4f9...") against a user who never existed and the schema will be happy.

The detection pattern is a runtime existence check before the side effect runs.

# detectors/tool_call.py
from typing import Callable
def tool_call_hallucination(
    tool_name: str,
    args: dict,
    resolver: Callable[[str, dict], dict | None],
) -> dict | None:
    """`resolver` looks up the referenced entity in your DB.
    Returns None if the referenced entity exists; finding if not."""
    resolved = resolver(tool_name, args)
    if resolved is None:
        return {
            "kind": "tool_call_hallucination",
            "tool": tool_name,
            "args": args,
            "reason": "referenced entity does not exist",
        }
    return None

That is the whole pattern. Schema-validate the call, then validate the values against state, then run the side effect.
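
Put together, the gate in front of a side effect looks roughly like this. db, dispatch, and the delete_user tool are hypothetical stand-ins for your own data access and tool-execution layer, and schema validation is assumed to have happened upstream.

# example: gating a side effect behind the resolver (illustrative)
from detectors.tool_call import tool_call_hallucination

def resolver(tool_name: str, args: dict) -> dict | None:
    # Hypothetical data access; replace with your own per-tool lookups.
    if tool_name == "delete_user":
        return db.get_user(args["user_id"])  # db is your store; None if absent
    return {}  # tools that reference no entity pass through

def execute(call: dict) -> dict:
    # Schema validation happens upstream; this is the value check.
    finding = tool_call_hallucination(call["name"], call["args"], resolver)
    if finding is not None:
        raise ValueError(finding["reason"])  # block before any side effect runs
    return dispatch(call)  # dispatch() actually runs the tool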

The harness, end to end

The file below combines all four detectors and expects four pluggable callables (citation_lookup, nli, splitter, tool_resolver) so you bring your own backends.

# halluharness.py
from dataclasses import dataclass, field
from typing import Callable, Optional

from detectors.factual import factual_fabrication
from detectors.contradiction import intrinsic_contradiction
from detectors.divergence import prompt_output_divergence
from detectors.tool_call import tool_call_hallucination

@dataclass
class Run:
    user_input: str
    output: str
    samples: list[str]
    tool_calls: list[dict] = field(default_factory=list)

@dataclass
class Verdict:
    findings: list[dict]
    score: float        # 1.0 clean, 0.0 fully hallucinated
    blocked: bool

def check(
    run: Run,
    citation_lookup: Callable[[str, int], bool],
    nli: Callable[[str, str], str],
    splitter: Callable[[str], list[str]],
    tool_resolver: Callable[[str, dict], Optional[dict]],
    block_on: tuple[str, ...] = (
        "factual_fabrication",
        "tool_call_hallucination",
    ),
) -> Verdict:
    findings: list[dict] = []
    findings.extend(
        factual_fabrication(run.output, citation_lookup)
    )
    contra = intrinsic_contradiction(run.samples, nli)
    if contra["flagged"]:
        findings.append(contra)
    findings.extend(
        prompt_output_divergence(
            run.user_input, run.output, nli, splitter
        )
    )
    for call in run.tool_calls:
        f = tool_call_hallucination(
            call["name"], call["args"], tool_resolver
        )
        if f is not None:
            findings.append(f)
    n = len(findings)
    score = 1.0 if not n else max(0.0, 1.0 - 0.25 * n)
    blocked = any(f["kind"] in block_on for f in findings)
    return Verdict(findings=findings, score=score, blocked=blocked)
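
A minimal sketch of calling the harness on one case, assuming a hypothetical generate(prompt) -> str wrapper around your model client; citation_lookup, nli, splitter, and resolver are the callables sketched earlier. Note the k extra samples feeding the contradiction check.

# example: running the harness on one case (illustrative)
from halluharness import Run, Verdict, check

K = 3  # extra samples for the self-consistency check

def evaluate(user_input: str, prompt: str) -> Verdict:
    output = generate(prompt)                        # generate() wraps your model client
    samples = [generate(prompt) for _ in range(K)]   # same prompt, nonzero temperature
    run = Run(user_input=user_input, output=output, samples=samples)
    return check(
        run,
        citation_lookup=citation_lookup,
        nli=nli,
        splitter=splitter,
        tool_resolver=resolver,
    )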

Compared to a single end-to-end "is this a hallucination" prompt, the harness wins on three counts that matter in production:

Each detector stays separate. When a check fires, you know which one and why: the finding carries a kind, a span, and a reason, so the on-call gets something concrete instead of a vibe.

Samples are a first-class input. Self-consistency is not optional for the contradiction check. If your eval pipeline only generates one response per case, the contradiction detector cannot see anything; you have to wire k > 1 sampling at the generator step. The same idea I wrote about for stochastic judges applies on the generator side: one sample is one observation, not the truth.

Block and report are different defaults. Factual fabrication and tool-call hallucination both have hard ground truth (the lookup either matches or it does not), so they default to blocking. Contradiction and divergence depend on a probabilistic NLI verdict; log them and route for human review. You will tune those defaults per surface; a customer-facing chat answer and a backend agent action have different risk budgets. The 0.25-per-finding penalty in score is illustrative; in production you weight by kind and severity.
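
For that per-kind weighting, a sketch of what a production score might look like; the weights are placeholders you would tune per surface, not recommendations.

# example: severity-weighted scoring (illustrative weights)
KIND_WEIGHTS = {
    "factual_fabrication": 0.40,
    "tool_call_hallucination": 0.50,
    "intrinsic_contradiction": 0.20,
    "prompt_output_divergence": 0.25,
}

def weighted_score(findings: list[dict]) -> float:
    penalty = sum(KIND_WEIGHTS.get(f["kind"], 0.25) for f in findings)
    return max(0.0, 1.0 - penalty)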

What to do with this on Monday

Pick the type that hurts you the most right now. If your assistant cites things that do not exist, ship the factual detector first; the lookup is the load-bearing piece, not the regex. If your agents touch databases, ship the tool-call resolver first and gate side effects behind it. If your RAG answers contradict the document, the divergence check is what to ship first.

Then add the other three. Run them all on every output. Log findings to your tracing layer alongside the trace ID. The next time a customer pastes three sentences from a broken response, you will know which of the four types fired, on which span, with which reason, and you will ship a fix to the right detector instead of a new prompt.

If this was useful

The LLM Observability Pocket Guide covers how to wire detectors like these into the eval and tracing tools that already live in your stack: where to put the checks (online vs. offline), how to sample for self-consistency without doubling your inference bill, and what to alert on. It also walks through threading per-finding rationales through OpenTelemetry spans, so the on-call gets a span and a reason instead of a one-line "hallucination=true" flag.

