Hallucination Is Not a Vibe: How to Actually Detect Ungrounded Claims in Agent Output

#ai #agents #observability #evaluation

Every team I talk to says their agent "sometimes hallucinates," and almost none of them can tell me how often. That gap — between knowing it happens and being able to count it — is the whole problem. You cannot fix, gate, or even trend a failure mode you only detect by feel.

Here is the opinion I will defend: hallucination detection is not a model-quality problem, it's an instrumentation problem. The reason you can't measure it is that you threw away the evidence the moment the agent finished running. Detecting an ungrounded claim requires knowing what the agent was allowed to claim, and that lives in the tool outputs and retrieved context, not in the final answer string. If you don't capture those, every hallucination check you write is guessing.

Let me break down what hallucination actually is in an agentic system, why the popular detection methods miss the common case, and how to wire up a number you can put in CI.

"Hallucination" is three different bugs wearing one coat

The word is overloaded, and the overloading is why detection efforts flail. In a tool-using agent, there are at least three distinct failures people lump together:

Parametric leakage. The agent answers from training-data memory instead of the tool result it was given. The answer might even be correct — but it's correct by luck, not because it used the data you grounded it on. Tomorrow the same code path produces a confidently wrong answer and you have no idea why.
Fabricated grounding. The agent cites a source, a record ID, a field, or a number that does not appear anywhere in its retrieved context. This is the dangerous one because it looks grounded. It has the shape of a sourced claim.
Unsupported synthesis. Every individual fact is present in the context, but the agent combined them into a conclusion the source never makes. No single token is fabricated; the inference is.

These need different detectors. Lumping them under one "hallucination score" gives you a number nobody trusts, because it conflates a lucky-but-ungrounded answer with an invented customer ID. The first move toward measuring hallucination is refusing to treat it as one metric.
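One way to keep the three modes from collapsing into a single score is to tag every finding with its kind and report a count per kind. This is a minimal sketch: the `HallucinationKind` and `Finding` types, the `tallyByKind` helper, and the example data are hypothetical names made up for illustration, not part of any library.

```typescript
// Hypothetical types for illustration: one tag per failure mode named above.
type HallucinationKind =
  | "parametric_leakage"
  | "fabricated_grounding"
  | "unsupported_synthesis";

interface Finding {
  runId: string;
  kind: HallucinationKind;
  detail: string;
}

// Count findings per kind so each failure mode gets its own number
// instead of being averaged into one opaque score.
function tallyByKind(findings: Finding[]): Record<HallucinationKind, number> {
  const counts: Record<HallucinationKind, number> = {
    parametric_leakage: 0,
    fabricated_grounding: 0,
    unsupported_synthesis: 0,
  };
  for (const f of findings) counts[f.kind] += 1;
  return counts;
}

// Made-up findings, only to show the shape of the report.
const exampleFindings: Finding[] = [
  { runId: "run-1", kind: "fabricated_grounding", detail: "CUST-99999 not in evidence" },
  { runId: "run-2", kind: "parametric_leakage", detail: "answer absent from tool output" },
  { runId: "run-3", kind: "fabricated_grounding", detail: "DOC-deadbeef not in evidence" },
];

console.log(tallyByKind(exampleFindings));
```

Three separate counters are harder to game than one blended score, and each one points at a different fix.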

Why "ask the model if it hallucinated" is the weakest option

The most common detection approach is to hand the output back to an LLM and ask "is this faithful to the context?" It's appealing because it's one API call. It's also the method most likely to wave through the exact failures you care about.

The self-consistency variant (sample the answer five times, flag disagreement) catches unstable hallucinations but misses stable ones. If the agent reliably leaks the same wrong fact from parametric memory every time, all five samples agree and your detector reports high confidence. The model is reproducibly wrong, and consistency was the very signal you were trusting. That's not a corner case; in my experience it's one of the most common hallucinations in production.
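To make the blind spot concrete, here is a small sketch of an agreement score next to a grounding check. Everything in it is hypothetical: the `agreement` helper, the sample answers, and the tool output are invented for illustration, and real self-consistency checks usually compare answers more loosely than exact string equality.

```typescript
// Fraction of samples that match the most common answer.
function agreement(samples: string[]): number {
  if (samples.length === 0) return 0;
  const counts: Record<string, number> = {};
  let top = 0;
  for (const s of samples) {
    counts[s] = (counts[s] ?? 0) + 1;
    if (counts[s] > top) top = counts[s];
  }
  return top / samples.length;
}

// Made-up scenario: the tool returned 40 seats, but the model keeps
// answering 50 from memory, identically on every sample.
const toolOutput = '{"plan":"Pro","seats":40}';
const stableLeak = ["50 seats", "50 seats", "50 seats", "50 seats", "50 seats"];
const unstable = ["40 seats", "50 seats", "40 seats", "25 seats", "40 seats"];

console.log(agreement(stableLeak)); // every sample agrees, so nothing is flagged
console.log(agreement(unstable));   // disagreement is visible here

// A grounding check looks at the evidence, not at how often the model repeats itself.
const claimedSeats = stableLeak[0].match(/\d+/)?.[0] ?? "";
const grounded = toolOutput.includes(`"seats":${claimedSeats}`);
console.log(grounded);
```

The stable leak gets a perfect agreement score while failing the grounding check, which is exactly the case a consistency-only detector cannot see.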

Model-as-judge faithfulness scoring is genuinely useful, but mostly for unsupported synthesis, the fuzzy case where you actually need judgment. Fabricated grounding doesn't need an LLM at all; it needs set membership. So does parametric leakage whenever the leaked fact has a checkable shape, such as an ID or an amount. And a deterministic check that you can fully explain beats a 0.7-from-a-judge that you can't, every time.

Grounding is checkable when you keep the grounding

Here's the core technique, and it's almost embarrassingly mechanical: extract the verifiable claims from the output, and check each one against the actual text the agent retrieved. The catch — the entire reason this is hard in practice — is that "the actual text the agent retrieved" has usually evaporated by the time you want to check.

This is exactly why I treat tracing and evaluation as one workflow rather than two tools. AgentLens captures the execution trace: every tool call with its raw output, the resolved context that actually went into the model, the final answer — the full ground-truth record of what the agent had access to. agent-eval is the other half: it takes that trace plus the output and runs the grounding checks, returning a pass/fail verdict you can gate a build on. The pairing is the point. agent-eval can only check a claim against the source if AgentLens kept the source. A faithfulness scorer with no trace behind it is reduced to asking a model to vibe-check itself — which is where we came in.

Here's what a layered detector looks like over a captured trace:

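import { getTrace } from "agentlens";
import { defineScorer } from "agent-eval";

// `judge` stands in for your own LLM-judge wrapper; it is not imported from
// either package. It is declared here only so the example type-checks.
declare function judge(args: {
  system: string;
  evidence: string;
  claim: string;
}): Promise<{ supported: boolean; confidence: number; reason: string }>;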

// Pull the agent's actual evidence out of the trace: every tool result
// and the resolved retrieval context the model was actually shown.
function collectGrounding(trace: Awaited<ReturnType<typeof getTrace>>): string {
  return trace.steps
    .filter((s) => s.kind === "tool" || s.kind === "retrieval")
    .map((s) => JSON.stringify(s.output))
    .join("\n");
}

// Detector 1 (deterministic): fabricated grounding.
// Any structured reference the agent emits MUST appear in the evidence.
// Catches invented record IDs, citation keys, dollar amounts.
const noFabricatedRefs = defineScorer({
  name: "no_fabricated_refs",
  async score({ output, runId }) {
    const evidence = collectGrounding(await getTrace(runId));

    // Reference shapes this agent is allowed to cite.
    const patterns = [/CUST-\d{5}/g, /DOC-[a-f0-9]{8}/g, /\$[\d,]+\.\d{2}/g];
    const claimed = patterns.flatMap((p) => [...output.matchAll(p)].map((m) => m[0]));

    const fabricated = claimed.filter((ref) => !evidence.includes(ref));
    return {
      pass: fabricated.length === 0,
      value: fabricated.length,
      detail: fabricated.length ? `ungrounded: ${fabricated.join(", ")}` : "ok",
    };
  },
});

// Detector 2 (judge): unsupported synthesis.
// The fuzzy case — every fact is present but the CONCLUSION isn't supported.
// This is the only layer that needs a model, and it needs the real evidence.
const faithfulSynthesis = defineScorer({
  name: "faithful_synthesis",
  async score({ output, runId }) {
    const evidence = collectGrounding(await getTrace(runId));
    const verdict = await judge({
      system: "Return supported=false if any claim is not entailed by EVIDENCE. " +
              "Correct-but-absent-from-evidence counts as NOT supported.",
      evidence,
      claim: output,
    });
    return { pass: verdict.supported, value: verdict.confidence, detail: verdict.reason };
  },
});

Two design decisions in there carry the whole thing, and I'll defend both.

The deterministic detector runs first and is the one I trust most. Fabricated reference IDs and invented dollar amounts are not a matter of judgment: a claimed ID either appears in the tool output or it doesn't. That's a String.prototype.includes call, not a 0.7-from-a-judge. It never flakes, costs nothing, and when it fails it hands you the exact ungrounded token. Most of your scary, customer-visible hallucinations are this category, and they're catchable without an LLM in the loop.
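One caveat with plain substring matching: tool outputs often carry amounts as JSON numbers (`1234.5`) while the answer formats them as `$1,234.50`, so a grounded amount can fail the check on formatting alone. Below is a minimal sketch of one way to handle that, assuming amounts can be compared in whole cents. The `toCents` and `isAmountGrounded` helpers and the sample evidence are hypothetical, and the number regex is deliberately crude.

```typescript
// Convert "$1,234.50" or "1234.5" to integer cents so formats compare equal.
function toCents(raw: string): number {
  return Math.round(parseFloat(raw.replace(/[$,]/g, "")) * 100);
}

// A claimed amount is grounded if it appears verbatim in the evidence,
// or if some plain number in the evidence equals it once both are in cents.
function isAmountGrounded(claim: string, evidence: string): boolean {
  if (evidence.includes(claim)) return true; // exact formatted match
  const numbers = evidence.match(/\d+(?:\.\d+)?/g) ?? [];
  const claimCents = toCents(claim);
  return numbers.some((n) => toCents(n) === claimCents);
}

// Made-up evidence: the balance is a JSON number, not a formatted string.
const sampleEvidence = '{"customer":"CUST-00042","balance":1234.5}';

console.log(isAmountGrounded("$1,234.50", sampleEvidence));
console.log(isAmountGrounded("$9,999.00", sampleEvidence));
```

The trade-off is that numeric matching is looser than substring matching: in this sample, `$42.00` would match the digits inside `CUST-00042`. If that matters for your data, restrict the comparison to fields you know hold amounts rather than scanning the whole evidence string.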

The judge instruction explicitly defines correct-but-ungrounded as a failure. This is the line that catches parametric leakage. A naive faithfulness prompt rewards correct answers, so a lucky memory-leak passes. By forcing "absent from evidence = not supported," you separate grounded from merely right — which is the distinction that actually predicts whether the agent will be wrong tomorrow when its luck runs out.

A hallucination rate is a trend, not a verdict

One run telling you "this output was grounded" is nearly worthless, because hallucination is a property of the distribution, not of a single answer. The number that matters is the rate — what fraction of production runs emit an ungrounded claim — and its slope over time.

This is where keeping the trace pays off a second time. Because every AgentLens trace carries the evidence inline, you can re-run these detectors across a window of historical production traffic without re-invoking the agent, and watch the rate move:

import { queryTraces } from "agentlens";
import { runScorers } from "agent-eval";

async function hallucinationRate(sinceHours: number): Promise<number> {
  const traces = await queryTraces({ sinceHours, hasOutput: true });
  const reports = await Promise.all(
    traces.map((t) => runScorers([noFabricatedRefs, faithfulSynthesis], { runId: t.id })),
  );
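  // No scored runs in the window: return 0 rather than dividing 0 by 0 (NaN).
  if (reports.length === 0) return 0;
  const flagged = reports.filter((r) => !r.passed).length;
  return flagged / reports.length; // e.g. 0.031 == 3.1% of runs ungrounded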
}

Now "the agent sometimes hallucinates" becomes "3.1% of runs last week emitted an ungrounded claim, up from 1.8% — here are the trace IDs." That's a number you can put on a dashboard, gate a release on, and hand to a skeptic. The eval gives you the rate; the trace behind each flagged run gives you the specific tool output the claim should have come from and didn't. You stop arguing about whether hallucination is a problem and start clicking into the step where it happened.
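Gating a release on that number can be a few lines of arithmetic. The sketch below is one hedged way to do it, not a feature of either tool: `RateWindow`, `GateResult`, `gateOnRate`, and the thresholds are all hypothetical, and the counts are the illustrative 3.1% and 1.8% figures from above applied to an assumed 1,000 runs per window.

```typescript
interface RateWindow {
  flagged: number; // runs with at least one failed grounding check
  total: number;   // runs scored in the window
}

interface GateResult {
  pass: boolean;
  rate: number;
  reason: string;
}

// Fail if the window is too small to trust, the rate exceeds an absolute
// ceiling, or the rate rose more than `maxIncrease` over the previous window.
function gateOnRate(
  current: RateWindow,
  previous: RateWindow,
  opts: { maxRate: number; maxIncrease: number; minRuns: number },
): GateResult {
  if (current.total === 0 || current.total < opts.minRuns) {
    return { pass: false, rate: 0, reason: `only ${current.total} runs scored` };
  }
  const rate = current.flagged / current.total;
  // With no previous data, treat the trend as flat.
  const prevRate = previous.total > 0 ? previous.flagged / previous.total : rate;
  if (rate > opts.maxRate) {
    return { pass: false, rate, reason: `rate ${rate} exceeds ceiling ${opts.maxRate}` };
  }
  if (rate - prevRate > opts.maxIncrease) {
    return { pass: false, rate, reason: `rate rose from ${prevRate} to ${rate}` };
  }
  return { pass: true, rate, reason: "ok" };
}

const gateResult = gateOnRate(
  { flagged: 31, total: 1000 },
  { flagged: 18, total: 1000 },
  { maxRate: 0.05, maxIncrease: 0.01, minRuns: 200 },
);
console.log(gateResult);
```

In a CI script you would exit non-zero when `pass` is false. Checking the increase as well as the ceiling means a regression fails the build even while the absolute rate still looks acceptable.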

The takeaway

Stop treating hallucination as an inherent, unmeasurable property of language models and start treating it as a grounding check you forgot to instrument. Split it into its three real failure modes. Catch fabricated references with deterministic set-membership checks, no judge required. Reserve model-as-judge for the genuinely fuzzy synthesis case, and instruct it to fail correct-but-ungrounded answers so parametric leakage doesn't slip through. And capture the trace, because every one of these checks is impossible without the evidence the agent actually saw.

Your agent hallucinates at a specific, knowable rate. The only reason you don't know yours is that you let the evidence disappear. Capture the path with AgentLens, score the grounding with agent-eval, and the vibe becomes a number, which is the only form of the problem you can actually fix.