Chetan Gupta

Anthropic Claude Opus 4.6

Anthropic has officially released Claude Opus 4.6 — and the benchmark numbers speak volumes.

Key Performance Highlights
• GDPval-AA Elo: Opus 4.6 outperforms its predecessor (Opus 4.5) by ~190 Elo points and beats OpenAI’s GPT-5.2 by ~144 Elo points on economically valuable knowledge-work tasks (see the sketch after this list for what gaps of that size imply head-to-head).

• Terminal-Bench 2.0 (agentic coding): scores ~65.4%, the top result on this benchmark of real-world coding and task automation.

• Long-context retention: on an 8-needle, 1M-token variant of MRCR v2 (a needle-in-a-haystack benchmark), Opus 4.6 scores 76% vs. ~18.5% for Sonnet 4.5, a major uplift in long-context retrieval (a toy version of this setup is sketched below).

• BigLaw Bench (legal reasoning): scores 90.2% overall, with perfect scores on 40% of tasks and scores above 0.8 on 84%.

• Across internal evaluations, Opus 4.6 leads other frontier models on deep multi-step reasoning, search, and agentic workflows.
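To put those Elo gaps in perspective: under the standard Elo model, a rating gap maps directly to an expected head-to-head preference rate. The sketch below assumes GDPval-AA uses the conventional 400-point Elo scale, which this post doesn't spell out, so treat the exact percentages as illustrative.

```python
# Convert an Elo rating gap into an expected head-to-head win rate using the
# standard Elo logistic. Assumption: GDPval-AA uses the conventional
# 400-point scale; the benchmark's actual scaling is not stated here.

def expected_win_rate(elo_gap: float) -> float:
    """Probability the higher-rated model is preferred, given its Elo lead."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))

if __name__ == "__main__":
    for gap in (144, 190):  # the gaps cited above vs. GPT-5.2 and Opus 4.5
        print(f"+{gap} Elo -> ~{expected_win_rate(gap):.0%} expected win rate")
```

Under that assumption, a ~144-point lead means Opus 4.6's output would be preferred roughly 70% of the time in pairwise comparisons, and ~190 points pushes that to roughly 75%.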
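And to make the 8-needle long-context claim concrete, here is a toy needle-in-a-haystack probe. This is not the actual MRCR v2 harness; the filler text, needle format, and question wording are all invented for illustration.

```python
import random

# Toy needle-in-a-haystack probe (illustrative only, not the MRCR v2 harness).
# Plants several distinct "needle" facts at random positions in filler text,
# then builds a prompt asking the model to recall one specific needle.

FILLER = "The quick brown fox jumps over the lazy dog. "

def build_probe(num_needles: int = 8, filler_sentences: int = 20_000,
                seed: int = 0) -> tuple[str, str, str]:
    rng = random.Random(seed)
    chunks = [FILLER] * filler_sentences
    secrets: dict[int, str] = {}
    for needle_id, pos in enumerate(
            rng.sample(range(filler_sentences), num_needles), start=1):
        code = f"{rng.randrange(10**6):06d}"
        secrets[needle_id] = code
        chunks[pos] = f"Needle #{needle_id}: the secret code is {code}. "
    target = rng.choice(list(secrets))
    question = f"What is the secret code in needle #{target}?"
    return "".join(chunks), question, secrets[target]

haystack, question, answer = build_probe()
prompt = f"{haystack}\n\n{question}"
# Send `prompt` to the model and score its reply against `answer`.
```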

What this means:
This isn’t just an incremental update — it’s a meaningful leap in real-world task performance for coding, reasoning, multi-agent planning, and large-context work. Whether you’re building AI agents, automating workflows, or tackling enterprise knowledge work, these numbers signal greater reliability and capability on complex tasks.
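If you want to try the model yourself, here is a minimal sketch using Anthropic's Python SDK. The model ID string is an assumption based on Anthropic's usual naming pattern, so check the official model list before relying on it.

```python
import anthropic

# Minimal call to the Messages API. Requires `pip install anthropic` and
# ANTHROPIC_API_KEY set in the environment.
client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-opus-4-6",  # assumed model ID; confirm in Anthropic's docs
    max_tokens=1024,
    messages=[
        {"role": "user",
         "content": "Plan a step-by-step migration of a cron-based ETL job "
                    "to an event-driven pipeline."}
    ],
)
print(message.content[0].text)
```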

Opus 4.6 sets a new bar for frontier LLM performance, especially where depth, persistence, and real-world reasoning matter most.
