Detecting the hard case of LLM hallucination from generation dynamics, and why magnitude beats direction.
TL;DR
- The hard case in hallucination detection is confident confabulation: plausible, fluent, wrong, and produced with no hesitation. Methods that key on the model "sounding unsure" are weakest exactly here.
- Across ~124 prompts, the mean internal response to confident confabulation is statistically indistinguishable from truth. The model does not move in a consistent "lying direction."
- What separates the two is magnitude and variance: confabulation produces larger, more dispersed swings in the model's internal trajectory. The variance ratio between confabulation and truth is roughly 7× on the representational-shift channel (Cohen's d ≈ 0.58, p ≈ 0.005).
- The variability scales with fabrication intensity (a dose-response), which is the strongest evidence that this is a property of confabulation and not noise.
- Practical upshot: detect instability, not a direction; integrate the signal over the generated span; and couple the detector to an intervention rather than using it as a standalone gate.
The hard case
It is by now well established that a model's internal states carry information about whether its output is true: the line of work running from Azaria & Mitchell's "the internal state of an LLM knows when it's lying" through to more recent results showing that truthfulness is encoded in activations and that models often "know more than they show." It's also become standard to separate confabulation (arbitrary, plausible, confidently-wrong generation) from the broader grab-bag of "hallucination," following Farquhar et al.'s Nature work on semantic entropy.
The uncomfortable subcase is confident confabulation. Uncertainty- and dispersion-based detectors work well when the model is visibly unsure. But the failure that actually burns people in production (a fabricated citation, a confidently invented dose, a made-up precedent) arrives with the same surface confidence as a correct answer. The question I wanted to answer is narrow: when a model confabulates confidently, does anything in its generation dynamics give it away?
What I measured
I tracked two internal observables around the answer span:
- An entropy / predictive-uncertainty signal (call it Δentropy): how the model's output distribution shifts as it produces the answer.
- A representational-shift signal (Δcosine): how much the model's internal representation moves step to step.
A note on dimensionality, since it matters for honest reporting: I originally tracked four signals, but two pairs turned out to be perfectly correlated (r = 1.000), which means they're affine images of each other, not independent measurements. So there are really two independent axes, an uncertainty axis and a representational-shift axis, and I report on those.
The dataset is ~124 prompts spanning seven domains (science, history, medical, legal, technical, math, geography) and five fabrication levels (L0 = ordinary factual questions, through L3 to L4 = prompts built on increasingly fabricated premises, including pure counterfactuals). Each generation was behaviorally coded into one of three regimes:
- Factual: correct answer.
- Confident confabulation: confidently produces the false/ungrounded answer.
- Recognizes fabrication: flags the premise as false rather than playing along.
Two controls worth stating up front: the pre-generation baseline states were statistically identical across regimes (all p > 0.8), so nothing here is predictable from the resting state, only from the dynamics of generating the answer. And there was no within-session drift (all p > 0.7), ruling out the obvious temporal confound.
Results
The mean doesn't move. Comparing factual to confident confabulation, none of the raw directional signals separates the two: Δentropy p ≈ 0.28, Δcosine p ≈ 0.37. There is no consistent direction the model travels when it confabulates. This is the part that makes confident confabulation feel "indistinguishable from truth": on the mean, it is.
The magnitude does. Switch from the signed deltas to their absolute values, and a clear separation appears: |Δcosine| gives Cohen's d ≈ 0.58 (p ≈ 0.005), with a variance ratio of ~7× between confabulation and truth. Truth sits in a tight cluster; confabulation fans out. The discriminating quantity is dispersion, not displacement.
It's dose-dependent. Step-to-step representational variability climbs monotonically with fabrication level: the SD of Δcosine rises from ≈0.009 at L0 to ≈0.024 at L3, while the means bounce around with no trend. Within the fabrication conditions, pure fabrications produce roughly 2× the |Δcosine| of partial/half-truths (d ≈ 1.19, p ≈ 0.02), and counterfactuals are the most extreme at ~3.3× the global average. The more there is to fabricate, the more the trajectory destabilizes. A dose-response on the variance is the closest thing here to a causal fingerprint.
Recognition is the one directional regime. When the model catches the false premise rather than confabulating, it behaves differently in a directional way: entropy rises and representational similarity drops. Δcosine separates "recognizes fabrication" from "confident confabulation" at AUC ≈ 0.68. Modest, but the only place a single signed feature does meaningful work. So there appear to be three distinct internal postures: truth (stable), confident confabulation (same center, high variance), and recognition (a directional move toward higher entropy / lower cosine).

Figure 1. The three regimes in the Δentropy-Δcosine space. The clouds overlap heavily (which is why per-instance separation is hard), but the centroids differ, and the recognition regime sits toward the high-entropy / low-cosine region.

Figure 2. Confident confabulation shows the long tails and outliers in Δcosine that drive the variance gap; the recognition regime is the one with a visible shift in Δentropy.

Figure 3. Single observables barely separate factual from confident confabulation (AUC ≈ 0.45 to 0.56). Δcosine separates confident confabulation from recognition at AUC ≈ 0.68. A linear combination weighted toward the magnitude features reaches AUC ≈ 0.72.
What it means
The clean statement is: confident confabulation is directionally indistinguishable from truth but magnitude-distinguishable. Lying doesn't push the model along a "deception axis"; it destabilizes the trajectory. Truth is a stable attractor; confident confabulation explores a larger volume of representation space at the same average location.
That framing matters because it picks a side in a live methodological split. Most internal-state work looks for a direction (the "geometry of truth" line, contrastive and mass-mean probes, steering vectors), and that program keeps running into generalization trouble (probes that fail on negation, separability that's strongly layer-dependent, geometry that changes when you simply ask the model to assess correctness). Meanwhile the strongest output-side method, semantic entropy, is fundamentally a dispersion measure. This result is essentially the dispersion insight relocated to the internal side: for the confident case, the internal signal is variance, not a vector.
How this fits the literature
The nearest neighbor is Semantic Entropy Probes (Kossen et al.), which approximate semantic entropy from the hidden states of a single generation. The distinction I'd draw: SEPs predict an output-dispersion label via a direction in activation space, whereas this measures the variance of the trajectory itself, directly, and finds the discriminating signal in the second moment rather than the first. If a trajectory-variance statistic beats a probe-style approach specifically on confident confabulation, that's a contribution on the exact case the field concedes is unsolved.
Limitations
I'd rather state these plainly than have them found.
- Per-instance discriminability is modest. AUC ≈ 0.72 for the best linear combination; single features sit between chance and 0.68. This is a real aggregate effect, not a deployable per-token oracle.
- One model, ~124 prompts. Replication on a second architecture is the obvious next requirement.
- The domain breakdown is underpowered. Several domain × level cells have n ≤ 7 (one has n = 1), so I'd read no domain structure off it yet (Figure 4).
- Everything here is observational. The signatures correlate with confabulation; nothing yet shows you can change the behavior by acting on the signal.

Figure 4. Mean Δentropy by domain and fabrication level. Cell counts are small (n = 1 to 7), so this is included for completeness, not for domain-level claims.
Where this goes
Three concrete directions, in order of how much they'd move the result:
- Integrate the signal over the span. If the discriminating quantity is variance, then a single delta is the wrong feature; variance is a property of a trajectory. A running-variance or path-length statistic computed over the generated tokens should recover signal that snapshot features throw away, and I'd expect it to push discriminability well past the 0.72 of the per-point linear combination.
- Run the interventional test. The experiment that would actually matter: when the instability signal spikes mid-generation and you inject grounded context, does the trajectory variance collapse, and does the model shift from the confabulation posture toward the recognition posture (entropy up, cosine down) or toward abstention? That converts "instability correlates with confabulation" into "grounding causally restabilizes generation."
- Couple detection to intervention, not to a gate. At AUC ≈ 0.72, a hard suppression gate censors true statements about as often as it catches false ones. The better use is as a soft trigger for a grounded retrieval/memory layer: raise uncertainty and pull in evidence when the trajectory destabilizes, rather than silently dropping tokens. This is the direction I'm building toward with an active memory substrate (Recall) that can supply grounded context into the loop on demand.
Top comments (0)