Spicy

Posted on May 24

Why Does AI Make Things Up? A Dev's Guide to Hallucination

#ai #machinelearning #llm #developers

Quick version: LLMs don't look things up. They predict probable token sequences. When the model's training data is thin or absent on a topic, it doesn't stop — it keeps predicting. Fluently. Confidently. Incorrectly.

If you've been building with LLMs for more than a few weeks, you've hit this. Your app returns a convincing-sounding answer that is just wrong. A citation that doesn't exist. A method that was never in an SDK. A regulatory requirement that was invented. Let's break down why this happens and what you can actually do about it.

How LLMs Work — The Part That Explains Everything

An LLM is a next-token predictor. At inference time, the model takes your prompt plus its trained weights — which encode statistical patterns from an enormous corpus of text — and produces a probability distribution over possible next tokens. It samples from that distribution. Repeats until a stop token. Done.

There's no fact database behind this. No retrieval step unless you explicitly add one. No confidence threshold that pauses generation when the model isn't sure. The model just keeps predicting, because that's what it does.

When the training data carried a strong signal on a topic, the predictions tend to be accurate. When the signal is weak, outdated, or absent, the predictions still look fluent; they're just not grounded in anything real. From the inside, the model has no reliable way to distinguish accurate recall from fluent confabulation.
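To make the predict-sample-repeat loop concrete, here is a minimal sketch. The lookup table stands in for a real model and is entirely hypothetical (a real LLM computes its scores from the whole context, not just the last token), but the control flow is the same: turn scores into probabilities, sample, and repeat until a stop token.

```python
import math
import random

# Hypothetical toy "model": maps the previous token to raw scores (logits)
# for each candidate next token. The flat scores after "client" and "sdk"
# mimic a topic with weak training signal.
TOY_LOGITS = {
    "<start>": {"the": 3.0, "a": 1.0},
    "the": {"client": 2.5, "sdk": 2.0},
    "a": {"client": 1.0, "sdk": 1.0},
    "client": {"supports": 0.1, "exposes": 0.1, "requires": 0.1},
    "sdk": {"supports": 0.1, "exposes": 0.1, "requires": 0.1},
    "supports": {"<stop>": 1.0},
    "exposes": {"<stop>": 1.0},
    "requires": {"<stop>": 1.0},
}


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def generate(logits_table, rng, max_tokens=10):
    """Sample one token at a time until the stop token or the length limit."""
    token, output = "<start>", []
    for _ in range(max_tokens):
        probs = softmax(logits_table[token])
        # Note what is missing here: no check that the distribution is
        # peaked enough to trust. A flat distribution still yields a token.
        token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token == "<stop>":
            break
        output.append(token)
    return output


print(" ".join(generate(TOY_LOGITS, random.Random(0))))
```

After "client" or "sdk" the three candidates are equally likely, so the verb that comes out is effectively a coin flip. Nothing in the loop treats that step differently from the well-supported ones.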

The Confidence Problem

Here's what makes this genuinely tricky for production systems: how confident the output sounds tells you very little about whether it is accurate.

Research has found that, in certain benchmark conditions, LLMs express higher confidence on incorrect answers than on correct ones. The model doesn't "know" it's guessing, and it rarely hedges unless you explicitly prompt it to. This is partly a training artifact: the text these models learn from is mostly assertive. Academic papers, documentation, and news writing rarely say "I'm not totally sure, but…". So the model defaults to the same assured tone whether it's recalling a well-documented fact or fabricating one from statistical noise.

For your users, this means they have no reliable signal for when to trust the output. Everything reads with the same confidence. That's the real problem.

Where Hallucination Hits Hardest in Dev Workflows

In practice, the highest-friction failure modes I've seen:

Library and API references for releases after the model's training cutoff: the model will describe method signatures that no longer exist or were never added
Less popular SDKs and packages: low training coverage means the model will invent plausible-sounding but wrong implementations
Security and cryptography guidance: subtle misstatements about auth flows or key handling are genuinely dangerous
Legal and compliance text: any LLM output touching regulatory specifics should be treated as unverified until checked against primary sources
Citation-heavy research tasks: the model generates convincing author names, journal titles, and publication years that don't exist

Hallucination Rates Across Major Models (April 2026)

Benchmarks vary significantly by methodology and task type, but this breakdown from the Vectara Hallucination Evaluation Framework (April 2026) gives a useful reference point:

Model	General Hallucination Rate	"I Don't Know" Rate
Claude Opus 4.x	Very low	High (~18.7%)
Gemini 2.0 Flash	Low	Medium (~12.3%)
GPT-4o	Low–medium	Medium
Llama 4 Maverick	Medium	Lower (~8.9%)

The "I don't know" rate is as important as raw hallucination rate. A model that refuses to answer when uncertain is structurally safer for high-stakes tasks than one that guesses fluently. Claude's refusal behavior makes it better suited for domains where a wrong answer is worse than no answer.

One important caveat: HalluHard (2026) found that even the best-performing configuration with web search still hallucinated roughly 30% of the time on complex multi-turn citation tasks. Benchmark scores have improved dramatically — the underlying problem hasn't been eliminated.

Prompting for Uncertainty

One immediately practical mitigation: explicitly instruct the model to surface its own uncertainty. A simple addition to your system prompt:

If you are not confident in a specific fact, citation, API method, or 
technical detail, say so explicitly. Do not fabricate sources or invent 
function signatures. If you don't know, say: "I don't have reliable 
information about this — please verify independently."

In practice, this consistently reduces fabricated citations and invented API references — without eliminating them entirely. Pair it with lower temperature settings (0.0–0.3) for factual tasks. Lower temperature reduces creative variation and, with it, some hallucination frequency. It's not a fix, but it moves the distribution in the right direction.
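To see why a lower temperature narrows the output, it helps to look at the arithmetic. The sketch below assumes the common formulation in which logits are divided by the temperature before the softmax. The logit values are made up for illustration, and individual providers may implement sampling differently.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize to probabilities."""
    if temperature <= 0:
        # Temperature 0 is usually treated as greedy decoding (argmax),
        # which this division-based formula cannot express.
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]
for temperature in (1.0, 0.3):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

With these example scores, the top candidate's share rises from roughly 63% at temperature 1.0 to roughly 96% at 0.3. The ranking of candidates never changes, which is why a lower temperature cuts down on unlikely continuations but cannot help when the most likely continuation is itself wrong.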

RAG as a Structural Fix

For production use cases where factual accuracy on specific content matters — internal documentation Q&A, support bots, repo-aware code assistants — Retrieval-Augmented Generation is the most effective structural mitigation available right now.

Instead of asking the model to recall from training data, you retrieve relevant source documents at inference time and include them in the context window. The model answers from the retrieved content rather than from its weights. This grounds the output in verifiable sources and makes errors traceable to a specific retrieved chunk instead of leaving them as invisible fabrications.

RAG doesn't eliminate hallucination entirely: models can still misinterpret retrieved content, and retrieval quality is its own problem. But it shifts the failure mode from silent fabrication to identifiable misreading. That's a much more debuggable state.
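A minimal sketch of the retrieve-then-answer flow might look like this. Everything here is illustrative: the documents and their ids are invented, and keyword overlap is a deliberately crude stand-in for the embedding search a real system would use. The point is the shape of the pipeline, including the case where nothing relevant is retrieved.

```python
import re

# Hypothetical knowledge base; in practice these would be chunks of your docs.
DOCS = [
    {"id": "billing-1", "text": "Refunds are processed within 5 business days."},
    {"id": "auth-2", "text": "API keys rotate every 90 days and are scoped per project."},
]


def tokenize(text):
    """Lowercase the text and split it into a set of word-like tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(question, documents, top_k=2):
    """Rank documents by keyword overlap with the question (toy retriever)."""
    question_tokens = tokenize(question)
    scored = [(len(question_tokens & tokenize(doc["text"])), doc) for doc in documents]
    scored = [pair for pair in scored if pair[0] > 0]  # drop irrelevant docs
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_grounded_prompt(question, chunks):
    """Build a prompt that restricts the model to the retrieved sources."""
    if not chunks:
        # Nothing retrieved: return None so the caller can decline to answer
        # instead of falling back to whatever the model recalls from training.
        return None
    sources = "\n".join(f"[{chunk['id']}] {chunk['text']}" for chunk in chunks)
    return (
        "Answer using only the sources below and cite the source id for each "
        "claim. If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


question = "How often do API keys rotate?"
print(build_grounded_prompt(question, retrieve(question, DOCS)))
```

Because each source carries an id, a wrong answer can be traced back to the chunk the model was shown, which is what makes this failure mode easier to debug than recall from weights.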

The Practical Checklist

Always validate AI-generated citations, API references, and version-specific details before shipping or publishing
Add explicit uncertainty instructions to your system prompt on any task where factual accuracy matters
Use RAG for use cases requiring accuracy on specific proprietary or domain content
Choose higher-refusal models for high-stakes domains where a wrong answer causes real harm
Build human review into any automated workflow where hallucinated output has downstream consequences
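Part of the first checklist item can be automated. As one possible approach for Python code, the sketch below checks whether a dotted name suggested by a model (for example `json.dumps`) actually resolves in the installed environment. The function name is hypothetical. It only confirms that the name exists, not that the signature or behavior matches what the model described, and importing a module executes its top-level code, so run it only against packages you already trust.

```python
import importlib


def api_reference_exists(dotted_name):
    """Return True if a dotted name like 'json.dumps' resolves locally."""
    parts = dotted_name.split(".")
    # Try the longest importable module prefix first, then walk the
    # remaining parts as attributes on that module.
    for split in range(len(parts), 0, -1):
        module_name = ".".join(parts[:split])
        try:
            obj = importlib.import_module(module_name)
        except (ImportError, ValueError):
            continue  # not a module; try a shorter prefix
        for attr in parts[split:]:
            if not hasattr(obj, attr):
                return False
            obj = getattr(obj, attr)
        return True
    return False


# 'json.dump_pretty' is a made-up name of the kind a model might invent.
for name in ("json.dumps", "os.path.join", "json.dump_pretty"):
    print(name, api_reference_exists(name))
```

A check like this fits naturally in CI or a pre-commit hook for generated code: it catches invented methods cheaply, leaving human review for the subtler errors.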

Hallucination is a property of the architecture, not a defect waiting to be patched. The models will keep improving — but working with them safely means accounting for it now.

For the non-technical take — why this matters beyond dev contexts, including documented legal cases and the broader trust problem — the full article is here: Why Does AI Make Things Up and Sound So Confident

Top comments (2)

Harjot Singh • May 31

The reframe that fixes most people's mental model is right at the top: the model isn't lying or broken when it hallucinates. It's doing the only thing it ever does, predicting a probable continuation, and "I don't know" is rarely the most probable continuation. That's why hallucinations are so dangerous: they're not a degraded output, they're a confident one produced by the exact same machinery as the correct answers, so there's no internal signal that says "this part is made up." The practical consequence I'd hammer: you cannot prompt your way out of this reliably, because "don't make things up" is just more tokens to the predictor. The durable fixes are architectural: ground the model in retrieved real data so the answer is a summary of facts rather than a recollection, force citations and verify they resolve, and constrain output to a schema so invented fields can't slip through. Move correctness out of the model's good intentions and into a checkable structure. That ground-and-verify-don't-trust-recall stance is the whole basis of how I build Moonshift. Which of these has bought you the most reliability in practice: RAG grounding or hard output validation?

Ethan Walker • May 25

Good intro. The piece readers most often miss is that post-hoc detectors are flagging outputs after the user already saw them. We run semantic entropy at generation time across 5 samples and gate any response above the 95th percentile of historical entropy. Cost is 4x on flagged slices, about 6% of production traffic, and on a 1k human-labeled audit 73% of the flagged outputs were genuinely wrong. The detector that fires at generation time beats the one that fires in the dashboard the next morning. What is the smallest validation set you would trust for calibrating a hallucination flag?