Evan-dong

Opus 4.6 Hallucination Rate Hit 33% — Here's What Changed and How to Fix It

If your Claude Code sessions have been producing more errors, skipping files, or fabricating APIs that don't exist — you're not imagining it.

Over the past two weeks, developers across GitHub, X, and YouTube have reported a measurable decline in Opus 4.6's coding quality. Independent benchmarks now confirm it: the model's hallucination rate has nearly doubled.

This post covers the evidence, the root cause, and the exact settings to fix it.

The Data

BridgeBench Hallucination Benchmark

BridgeBench measures how often AI models fabricate false claims when analyzing code — 30 tasks, 175 questions, verified against ground truth.

Opus 4.6's trajectory:

  • Previous: #2 with 83.3% accuracy (~17% fabrication)
  • Current: #10 with 68.3% accuracy (33% fabrication)

One in three responses now contains fabricated information.

Current leaderboard (April 14, 2026):

| Model | Accuracy | Fabrication Rate | Rank |
| --- | --- | --- | --- |
| Grok 4.20 Reasoning | 91.8% | 10.0% | #1 |
| GPT-5.4 | 86.1% | 16.7% | #2 |
| Claude Opus 4.5 | 72.3% | 27.9% | #6 |
| Claude Sonnet 4.6 | 72.4% | 28.9% | #7 |
| Claude Opus 4.6 | 68.3% | 33.0% | #10 |

Notable: Sonnet 4.6 (smaller, cheaper) outperforms Opus 4.6 on accuracy.

Developer Testing

@om_patel5 ran the same prompt on Opus 4.6 and 4.5:

  • 4.6: failed 5 consecutive windows
  • 4.5: passed every time

His tweet got 682K views and 1,118 bookmarks. He now runs this as a "quantization canary" before every session.
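His exact prompt isn't public, but the idea generalizes: before a long session, replay a prompt with a known-good answer a few times and bail out if the model misses any run. A minimal sketch — the `ask` callable and names here are hypothetical stand-ins for whatever wraps your model call:

```python
from collections.abc import Callable

def run_canary(ask: Callable[[str], str], prompt: str, expected: str, runs: int = 5) -> bool:
    """Replay one known-answer prompt `runs` times; any miss fails the canary."""
    return all(expected in ask(prompt) for _ in range(runs))

# Stub model for illustration; in practice `ask` would hit the real API.
if run_canary(lambda _: "6 * 7 = 42", "What is 6 * 7?", "42"):
    print("canary passed, model looks sane")
```

If the canary fails, that's your cue to bump the effort level or switch models before starting real work.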

6,852-Session Analysis

An AMD executive analyzed 6,852 Claude Code sessions and measured a 67% drop in reasoning depth compared to pre-February behavior.

Root Cause: Two Default Changes

Anthropic made two changes in early 2026:

1. Effort level default: high → medium (March 3, 2026)

The model now "conserves thinking" by default. Complex problems that need deep reasoning get classified as "simple enough" and receive shallow analysis.

2. Adaptive thinking introduced (February 9, 2026)

The model dynamically allocates reasoning tokens per turn. Under medium effort, some turns receive zero reasoning tokens — the model answers without thinking at all.

These two changes compound: the model skips thinking precisely when you need it most.
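If you call the API directly rather than through Claude Code, you can sidestep adaptive allocation by pinning an explicit thinking budget on every request. A sketch of the request payload, assuming Anthropic's extended-thinking `thinking` parameter; the budget numbers are illustrative, not recommendations:

```python
def build_request(prompt: str, budget_tokens: int = 8000) -> dict:
    """Messages-API payload with a fixed thinking budget, so no turn is
    allowed to spend zero reasoning tokens."""
    return {
        "model": "claude-opus-4-5-20251101",  # dated snapshot named in this post
        "max_tokens": budget_tokens + 4000,   # must exceed the thinking budget
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }
```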

The Fix

Quick fix (per session)

```
/effort max
```

Permanent fix (environment variables)

```shell
export CLAUDE_CODE_EFFORT_LEVEL=max
export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1
```

Add these to your .bashrc or .zshrc.
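New terminals pick the exports up automatically; a quick sanity check (a sketch, using the variable names given above) confirms the current environment actually carries both overrides:

```python
import os

def effort_overrides_active(env=os.environ) -> bool:
    """True only when both overrides from the shell profile are in effect."""
    return (
        env.get("CLAUDE_CODE_EFFORT_LEVEL") == "max"
        and env.get("CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING") == "1"
    )

print("overrides active:", effort_overrides_active())
```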

Nuclear option: switch to Opus 4.5

Set the model to `claude-opus-4-5-20251101`. Slower and more expensive, but consistently reliable.
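In scripts, the downgrade is just a model-ID switch. A hedged sketch: the 4.5 snapshot ID comes from this post, while the 4.6 alias is my assumption — adjust both to whatever your account actually lists:

```python
PINNED_RELIABLE = "claude-opus-4-5-20251101"  # snapshot recommended above
CURRENT_DEFAULT = "claude-opus-4-6"           # assumed alias; verify yours

def pick_model(critical: bool) -> str:
    """Route reliability-critical work to the pinned 4.5 snapshot."""
    return PINNED_RELIABLE if critical else CURRENT_DEFAULT
```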

Quick Reference

| Problem | Fix | Command |
| --- | --- | --- |
| Session feels dumb | Max effort | `/effort max` |
| Resets every session | Env var | `CLAUDE_CODE_EFFORT_LEVEL=max` |
| Zero-reasoning turns | Disable adaptive thinking | `CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1` |
| Still unreliable | Use Opus 4.5 | Model: `claude-opus-4-5-20251101` |

Model Switching in Production

If you're calling Claude via API in production, switching models means changing endpoints, auth, and billing for each provider. A unified API gateway like EvoLink lets you swap between 30+ models by changing one parameter. The Smart Router (evolink/auto) can automatically route deep-reasoning tasks to more reliable models.
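Assuming the gateway speaks the common OpenAI-style chat-completions format (the `evolink/auto` router name comes from this post; the other model ID is my guess at the ID form, so check EvoLink's own docs), the swap really is one string:

```python
def chat_payload(prompt: str, model: str = "evolink/auto") -> dict:
    """OpenAI-style chat payload; switching backends means changing only `model`."""
    return {
        "model": model,  # e.g. the Smart Router, or a pinned Claude snapshot
        "messages": [{"role": "user", "content": prompt}],
    }

# Same call shape, different backend:
fast = chat_payload("refactor this loop", model="claude-sonnet-4-6")
deep = chat_payload("debug this race condition")  # left to the Smart Router
```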


Sources:

  1. BridgeBench Hallucination Benchmark — bridgebench.ai/hallucination
  2. @om_patel5 on X (Apr 10, 2026, 682K+ views)
  3. GitHub Issue #42796 — github.com/anthropics/claude-code/issues/42796
  4. Digit.in — AMD executive's 6,852-session analysis
  5. pasqualepillitteri.it — effort/adaptive thinking configuration guide
