Evan-dong

Opus 4.6 Hallucination Rate Hit 33% — Here's What Changed and How to Fix It

If your Claude Code sessions have been producing more errors, skipping files, or fabricating APIs that don't exist — you're not imagining it.

Over the past two weeks, developers across GitHub, X, and YouTube have reported a measurable decline in Opus 4.6's coding quality. Independent benchmarks now confirm it: the model's hallucination rate has nearly doubled.

This post covers the evidence, the root cause, and the exact settings to fix it.

The Data

BridgeBench Hallucination Benchmark

BridgeBench measures how often AI models fabricate false claims when analyzing code — 30 tasks, 175 questions, verified against ground truth.

Opus 4.6's trajectory:

  • Previous: #2 with 83.3% accuracy (~17% fabrication)
  • Current: #10 with 68.3% accuracy (33% fabrication)

One in three responses now contains fabricated information.

Current leaderboard (April 14, 2026):

| Model | Accuracy | Fabrication Rate | Rank |
| --- | --- | --- | --- |
| Grok 4.20 Reasoning | 91.8% | 10.0% | #1 |
| GPT-5.4 | 86.1% | 16.7% | #2 |
| Claude Opus 4.5 | 72.3% | 27.9% | #6 |
| Claude Sonnet 4.6 | 72.4% | 28.9% | #7 |
| Claude Opus 4.6 | 68.3% | 33.0% | #10 |

Notable: Sonnet 4.6 (smaller, cheaper) outperforms Opus 4.6 on accuracy.

Developer Testing

@om_patel5 ran the same prompt on Opus 4.6 and 4.5:

  • 4.6: failed 5 consecutive windows
  • 4.5: passed every time

His tweet got 682K views and 1,118 bookmarks. He now runs this as a "quantization canary" before every session.
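His exact prompt isn't public, but the idea generalizes: before a long session, replay a prompt with a known-good answer a few times and bail out if the model misses any run. A minimal sketch — the `ask` callable and names here are hypothetical stand-ins for whatever wraps your model call:

```python
from collections.abc import Callable

def run_canary(ask: Callable[[str], str], prompt: str, expected: str, runs: int = 5) -> bool:
    """Replay one known-answer prompt `runs` times; any miss fails the canary."""
    return all(expected in ask(prompt) for _ in range(runs))

# Stub model for illustration; in practice `ask` would hit the real API.
if run_canary(lambda _: "6 * 7 = 42", "What is 6 * 7?", "42"):
    print("canary passed, model looks sane")
```

If the canary fails, that's your cue to bump the effort level or switch models before starting real work.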

6,852-Session Analysis

An AMD executive analyzed 6,852 Claude Code sessions and measured a 67% drop in reasoning depth compared to pre-February behavior.

Root Cause: Two Default Changes

Anthropic made two changes in early 2026:

1. Effort level default: high → medium (March 3, 2026)

The model now "conserves thinking" by default. Complex problems that need deep reasoning get classified as "simple enough" and receive shallow analysis.

2. Adaptive thinking introduced (February 9, 2026)

The model dynamically allocates reasoning tokens per turn. Under medium effort, some turns receive zero reasoning tokens — the model answers without thinking at all.

These two changes compound: the model skips thinking precisely when you need it most.
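If you call the API directly rather than through Claude Code, you can sidestep adaptive allocation by pinning an explicit thinking budget on every request. A sketch of the request payload, assuming Anthropic's extended-thinking `thinking` parameter; the budget numbers are illustrative, not recommendations:

```python
def build_request(prompt: str, budget_tokens: int = 8000) -> dict:
    """Messages-API payload with a fixed thinking budget, so no turn is
    allowed to spend zero reasoning tokens."""
    return {
        "model": "claude-opus-4-5-20251101",  # dated snapshot named in this post
        "max_tokens": budget_tokens + 4000,   # must exceed the thinking budget
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }
```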

The Fix

Quick fix (per session)

```
/effort max
```

Permanent fix (environment variables)

```shell
export CLAUDE_CODE_EFFORT_LEVEL=max
export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1
```

Add these to your .bashrc or .zshrc.
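New terminals pick the exports up automatically; a quick sanity check (a sketch, using the variable names given above) confirms the current environment actually carries both overrides:

```python
import os

def effort_overrides_active(env=os.environ) -> bool:
    """True only when both overrides from the shell profile are in effect."""
    return (
        env.get("CLAUDE_CODE_EFFORT_LEVEL") == "max"
        and env.get("CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING") == "1"
    )

print("overrides active:", effort_overrides_active())
```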

Nuclear option: switch to Opus 4.5

Set the model to `claude-opus-4-5-20251101`. Slower and more expensive, but consistently reliable.
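In scripts, the downgrade is just a model-ID switch. A hedged sketch: the 4.5 snapshot ID comes from this post, while the 4.6 alias is my assumption — adjust both to whatever your account actually lists:

```python
PINNED_RELIABLE = "claude-opus-4-5-20251101"  # snapshot recommended above
CURRENT_DEFAULT = "claude-opus-4-6"           # assumed alias; verify yours

def pick_model(critical: bool) -> str:
    """Route reliability-critical work to the pinned 4.5 snapshot."""
    return PINNED_RELIABLE if critical else CURRENT_DEFAULT
```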

Quick Reference

| Problem | Fix | Command |
| --- | --- | --- |
| Session feels dumb | Max effort | `/effort max` |
| Resets every session | Env var | `CLAUDE_CODE_EFFORT_LEVEL=max` |
| Zero-reasoning turns | Disable adaptive thinking | `CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1` |
| Still unreliable | Use Opus 4.5 | Model: `claude-opus-4-5-20251101` |

Model Switching in Production

If you're calling Claude via API in production, switching models means changing endpoints, auth, and billing for each provider. A unified API gateway like EvoLink lets you swap between 30+ models by changing one parameter. The Smart Router (evolink/auto) can automatically route deep-reasoning tasks to more reliable models.
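Assuming the gateway speaks the common OpenAI-style chat-completions format (the `evolink/auto` router name comes from this post; the other model ID is my guess at the ID form, so check EvoLink's own docs), the swap really is one string:

```python
def chat_payload(prompt: str, model: str = "evolink/auto") -> dict:
    """OpenAI-style chat payload; switching backends means changing only `model`."""
    return {
        "model": model,  # e.g. the Smart Router, or a pinned Claude snapshot
        "messages": [{"role": "user", "content": prompt}],
    }

# Same call shape, different backend:
fast = chat_payload("refactor this loop", model="claude-sonnet-4-6")
deep = chat_payload("debug this race condition")  # left to the Smart Router
```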


Sources:

  1. BridgeBench Hallucination Benchmark — bridgebench.ai/hallucination
  2. @om_patel5 on X (Apr 10, 2026, 682K+ views)
  3. GitHub Issue #42796 — github.com/anthropics/claude-code/issues/42796
  4. Digit.in — AMD executive's 6,852-session analysis
  5. pasqualepillitteri.it — effort/adaptive thinking configuration guide
