Originally published on Skila AI
OpenAI's own September 2025 paper proved AI hallucinations are mathematically inevitable. The total error rate is at least 2x the yes/no error rate, and 9 of 10 frontier benchmarks reward guessing over honesty. Stop waiting for a fix. Plan around it.
On September 4, 2025, Adam Kalai, Ofir Nachum, and Edwin Zhang from OpenAI — plus Santosh Vempala from Georgia Tech — published arXiv:2509.04664, "Why Language Models Hallucinate". The paper proves that for any large language model trained on next-token prediction, hallucinations are not a bug. They are a mathematically unavoidable feature.
Eight months later, the implications are landing. On May 12, 2026, Anthropic shipped 12 new legal Claude plugins and 20 legal connectors — deposition prep, bar-exam coaching, case-law research, file drafting, plus integrations with DocuSign, Thomson Reuters, Harvey, and Everlaw. AI is now being pushed into the single highest-stakes hallucination zone in the economy. The same week, Damien Charlotin's public legal-hallucination database ticked past 120 documented court cases where AI tools fabricated quotes, made up case names, or invented citations that don't exist.
The myth this article busts is the most expensive one in AI: "hallucinations will be fixed in the next model." They won't. And the math says why.
The Myth Everyone Believes
Walk into any room where AI buyers gather. Boardrooms. Investor calls. Procurement meetings. The same sentence shows up: "GPT-5 still hallucinates a bit, but the next version will solve it."
That sentence funds budgets. It silences risk officers. It postpones every serious workflow audit by another quarter. And it is wrong in a way that can be proven on a whiteboard.
Here is the version of the myth the industry quietly sells you:
- Hallucinations are a temporary engineering flaw.
- Better data + bigger models + more RLHF = the rate drops to zero.
- One day soon, an AI will be "trustworthy enough" to ship without human review.
OpenAI itself just published the paper that demolishes all three claims.
What the OpenAI Paper Actually Proves
The Kalai & Vempala paper has two arguments. The first is statistical. The second is sociological. Both kill the myth.
1. Generating sentences is mathematically harder than answering yes/no
Imagine a model that's wrong on a simple binary question ("Did X happen, yes or no?") 5% of the time. The paper proves something brutal: when that same model has to generate a sentence containing the same fact, its error rate is at least double the binary rate. Errors compound across every prediction it has to chain together.
This is the "Is-It-Valid" reduction. The paper formally shows that generating a valid sentence cannot be easier than checking whether one is valid. So whatever error rate your best binary classifier achieves, your generator's error rate is at least double it. No architecture choice escapes this. It's a floor, not a ceiling.
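The compounding effect is easy to see numerically. The sketch below illustrates why chained predictions degrade; it is not the paper's formal reduction, just the intuition: if each prediction in a chain is wrong with probability p, the chance of an error somewhere in n chained predictions is 1 - (1 - p)^n.

```python
# Illustration only: how per-step errors compound across a chained
# generation. This mirrors the intuition above, not the paper's proof.

def sentence_error_rate(per_step_error: float, n_steps: int) -> float:
    """Probability that at least one of n chained predictions is wrong."""
    return 1 - (1 - per_step_error) ** n_steps

binary_error = 0.05  # 5% error on a single yes/no check
for n in (1, 2, 5, 10):
    print(n, round(sentence_error_rate(binary_error, n), 3))
```

By five chained steps, the error rate has already more than doubled the single-step rate, and it only climbs from there.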
2. 9 of 10 frontier benchmarks reward guessing over honesty
The paper's second argument is the one that should haunt every CTO. The authors surveyed 10 widely used benchmarks that frontier labs train against. Nine of the ten use binary grading. Right answer = 1 point. Wrong answer = 0 points. "I don't know" = 0 points.
Under that scoring, a model that guesses confidently always beats a model that hedges. So gradient descent does what gradient descent does: it learns to guess. Confidently. Even when it shouldn't. The training signal literally rewards making things up over saying "I'm not sure."
That's why even GPT-5, Claude Opus 4.7, and Gemini 3 still hallucinate. Not because the engineering is sloppy. Because the scoreboard is broken.
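The incentive is easy to make concrete. A minimal sketch of expected score under the binary grading rule described above, independent of any particular model:

```python
# Expected benchmark score under binary grading:
# 1 point for a right answer, 0 for wrong, 0 for "I don't know".

def expected_score(p_correct: float, strategy: str) -> float:
    """Expected points per question for a model with confidence p_correct."""
    if strategy == "guess":
        return p_correct * 1 + (1 - p_correct) * 0  # always answer
    if strategy == "abstain":
        return 0.0  # honest "I don't know" scores zero
    raise ValueError(f"unknown strategy: {strategy}")

# Even a 10%-confident guess beats honest abstention under this scoring.
print(expected_score(0.1, "guess"))    # → 0.1
print(expected_score(0.1, "abstain"))  # → 0.0
```

Any nonzero confidence makes guessing strictly better than abstaining, so the training signal never learns to hedge.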
The 2026 Hallucination Rates Nobody Wants to Print
So how bad is it in practice, eight months after the paper dropped? Worse than the marketing implies.
Two independent 2026 benchmark studies tested every frontier model on factual tasks. The results:
- Frontier hallucination range: 3.1% to 19.1% depending on task family.
- Citation accuracy is the worst-performing task — 12.4% hallucination rate even with extended thinking enabled.
- "Extended thinking" / reasoning modes help on math, but barely move citation accuracy. The reasoning loop just produces more confidently wrong citations.
- Smaller open-source models hallucinate 2-4x more than frontier ones — but the floor is still non-zero.
Translation: if you have an AI workflow that produces a 1,000-word deliverable with 8 factual claims, you should expect 0.25 to 1.5 hallucinations per output from the best models on the market. That number is not trending to zero. It hasn't moved meaningfully since GPT-4. The paper explains why it won't.
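The per-output estimate above is straightforward expected-value arithmetic, assuming each factual claim hallucinates independently:

```python
# Back-of-envelope check of the per-deliverable estimate above,
# assuming each factual claim can hallucinate independently.

claims_per_output = 8
low_rate, high_rate = 0.031, 0.191  # frontier hallucination range

expected_low = claims_per_output * low_rate    # ≈ 0.25
expected_high = claims_per_output * high_rate  # ≈ 1.53
print(f"{expected_low:.2f} to {expected_high:.2f} hallucinations per output")
```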
The May 12 News That Makes This Urgent
On May 12, 2026, Anthropic announced 12 new Claude plugins and 20 connectors for legal work. Bar-exam prep. Deposition drafting. Case-law research. Direct connectors to DocuSign, Thomson Reuters, Harvey, and Everlaw.
This is the most aggressive push of generative AI into a high-stakes citation-heavy domain anyone has ever shipped. And it lands in a world where Damien Charlotin's live tracker of fabricated AI citations in real court filings just passed 120 cases. Lawyers have been sanctioned. Briefs have been thrown out. Judges have started ordering AI-disclosure declarations.
Two things are now true at the same time:
- The math says hallucinations are permanent.
- The biggest AI labs are accelerating into the domains where hallucinations cause the most damage.
This is the gap. And it's not closing.
Why "Just Use RAG" Doesn't Save You
Every time someone shows a hallucination, an AI engineer says "that's why we use RAG" (retrieval-augmented generation — let the model cite a real document). It helps. It doesn't fix the underlying problem.
The Kalai/Vempala result still applies inside a RAG pipeline. The model still has to:
- Generate a query against the retrieval index (can hallucinate the query intent)
- Decide which retrieved chunk is relevant (binary classification — still error-prone)
- Synthesize an answer from the chunks (generation — still 2x the binary error)
- Cite the chunk correctly (separate generation task — also 2x)
That's why independent audits of RAG-based legal AI still find hallucination rates of 6-17% in the wild. RAG lowers the rate. It doesn't remove the floor.
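The compounding across those pipeline stages can be sketched the same way as before. The per-stage error rates below are illustrative assumptions, not measured values:

```python
# Illustrative sketch: end-to-end RAG reliability as a product of
# per-stage success rates. These stage error rates are assumptions
# chosen for illustration, not audit data.

rag_stage_error = {
    "query_generation": 0.02,   # hallucinated query intent
    "chunk_selection": 0.03,    # wrong chunk judged relevant
    "answer_synthesis": 0.06,   # generation over the retrieved chunks
    "citation": 0.06,           # separate generation task
}

p_clean = 1.0
for stage, err in rag_stage_error.items():
    p_clean *= 1 - err  # output is clean only if every stage succeeds

print(f"end-to-end hallucination-free rate: {p_clean:.3f}")
print(f"end-to-end hallucination rate: {1 - p_clean:.3f}")
```

Even with modest per-stage error rates, the end-to-end rate lands squarely in the range those audits report.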
What This Means for Your AI Strategy
If hallucinations are permanent, your AI roadmap can't be "wait for the model to get better." That roadmap is dead. Three new principles replace it.
Principle 1: Calibrate uncertainty, don't suppress it
Models trained on binary benchmarks suppress uncertainty signals because the scoring punishes them. You can undo this in your prompts. Force the model to output a confidence score with every claim. Reject any output below your threshold. Yes, you'll get fewer answers. The remaining answers will be 5-10x more trustworthy.
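A minimal sketch of that gating step. How you extract a per-claim confidence score is model- and prompt-specific; here each claim is assumed to arrive already paired with one:

```python
# Sketch of confidence gating. The (claim, confidence) pairs are assumed
# to come from a prompt that forces a score per claim; producing them is
# model-specific and not shown here.

THRESHOLD = 0.8  # reject anything the model is less than 80% sure of

def gate(claims: list[tuple[str, float]]) -> list[str]:
    """Keep only claims at or above the confidence threshold."""
    return [text for text, conf in claims if conf >= THRESHOLD]

output = [
    ("High-confidence claim", 0.95),
    ("Shaky claim", 0.40),  # dropped by the gate
]
print(gate(output))  # → ['High-confidence claim']
```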
Principle 2: Make verification cheaper than generation
If your workflow generates 100 AI outputs an hour but a human can only verify 10, the other 90 are unverified slop entering production. Invert it. Use AI to generate and verify, but cap generation throughput at human review capacity. You'll ship slower. Nothing you ship will embarrass you in court.
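The capacity inversion is a one-line budget rule. A sketch using the illustrative rates from the paragraph above:

```python
# Sketch of Principle 2: anything generated beyond review capacity
# accumulates as unverified output. Rates are the illustrative
# numbers from the text above.

def unverified_backlog(gen_rate: int, review_rate: int, hours: int) -> int:
    """Outputs generated but never human-verified after `hours`."""
    return max(0, gen_rate - review_rate) * hours

print(unverified_backlog(100, 10, 8))  # uncapped: 720 unverified outputs
print(unverified_backlog(10, 10, 8))   # capped at review capacity: 0
```

Capping generation at review throughput drives the unverified backlog to zero by construction.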
Principle 3: Buy tools that assume the math, not tools that deny it
This is the buying-decision layer. Vendors who pitch "99% accuracy" and "hallucination-free" are pitching against a published math proof. They will lose. Buy from vendors who tell you their hallucination rate, show you their human-in-the-loop workflow, and ship audit logs by default.
The Bigger Lesson: Stop Outsourcing Math to Marketing
Three years of AI hype trained the market to believe every limitation is a roadmap item. "Context window too small? Wait six months." "Cost too high? Wait six months." "Hallucinates? Wait six months."
For context and cost, that worked. Both dropped by orders of magnitude. For hallucinations, the math is structurally different. There is no Moore's-law curve here. There is a proof.
The companies who keep pretending will burn money, shipping unreliable agents into production and paying the cleanup bill. The companies who internalize the OpenAI paper will quietly build workflows where AI does 80% of the work and humans verify the last 20% — and they will dominate every regulated industry over the next five years.
The myth was: "AI hallucinations will be fixed soon."
The truth is: AI hallucinations are a permanent feature of next-token prediction. Design for them. Buy around them. Ship anyway.
Key Takeaways
- OpenAI's Sept 4 2025 paper (Kalai, Nachum, Zhang, Vempala) proves hallucinations are mathematically inevitable.
- 9 of 10 frontier AI benchmarks award zero points for "I don't know" — they train models to guess.
- Frontier 2026 hallucination rates still 3.1-19.1%. Citation accuracy worst at 12.4% even with extended thinking.
- 120+ court cases with AI-fabricated citations as of May 2026.
- Anthropic shipped 12 legal Claude plugins + 20 connectors on May 12, 2026.
- Design workflows that assume hallucination will happen. Verify before it hits a deliverable.
Read the full article with sources at news.skila.ai