Claude Opus 4.8: What Developers Need to Know About Anthropic's New Flagship

#ai #llm #claude #devops

Anthropic shipped Claude Opus 4.8 today. Same price as Opus 4.7, fast mode at 2.5x speed, fast mode 3x cheaper than before. Alongside the model release: dynamic workflows in Claude Code and effort control in claude.ai.

This post covers the benchmark numbers, the practical changes for coding and agents, and what teams building on Claude should pay attention to.

Benchmark Numbers

The numbers that matter most for developers:

SWE-Bench Pro (agentic coding): Opus 4.8 = 69.2%, Opus 4.7 = 64.3%, GPT-5.5 = 58.6%, Gemini 3.1 Pro = 54.2%. A 4.9 point gain over the previous version and a 10.6 point lead over GPT-5.5.

Terminal-Bench 2.1 (agentic terminal coding): Opus 4.8 = 74.6%, GPT-5.5 = 78.2%, Gemini 3.1 Pro = 70.3%. GPT-5.5 leads this benchmark. Opus 4.8 still jumps 8.5 points over Opus 4.7's 66.1%.

OSWorld-Verified (agentic computer use): Opus 4.8 = 83.4%, GPT-5.5 = 78.7%. Browser agent hits 84% on Online-Mind2Web, beating both Opus 4.7 and GPT-5.5.

Humanity's Last Exam (reasoning, with tools): Opus 4.8 = 57.9%, GPT-5.5 = 52.2%, Gemini 3.1 Pro = 51.4%.

Finance Agent v2: Opus 4.8 = 53.9%, GPT-5.5 = 51.8%. First model to break 10% on the all-pass Legal Agent Benchmark.

For cost comparisons across models and workloads, the LLM calculator on ComparEdge is useful for running specific scenarios.

What Changed for Code Quality and Tool Calling

The most relevant change for daily work: Opus 4.8 is roughly 4x less likely than Opus 4.7 to let code flaws pass unremarked. It catches its own mistakes more often, and it pushes back when a plan has problems.

Devin's team confirmed the improvements directly: "Claude Opus 4.8 uses tools cleanly and follows instructions with the consistency our autonomous engineering workloads need to keep running unattended. It improves on Opus 4.6 and fixes the comment-verbosity and tool-calling issues we saw with Opus 4.7."

CursorBench reported that Opus 4.8 exceeds prior Opus models across every effort level, with more efficient tool calling overall.

Tom Pritchard, Staff Engineer at Shopify: "Claude Opus 4.8 has noticeably better judgment. In Claude Code, it asks the right questions, catches its own mistakes, pushes back when a plan isn't sound, and builds up confidence around complex, multi-service explorations before making big changes. It's a great model to build with."

Kay Zhu, Co-Founder and CTO: "On our Super-Agent benchmark, Claude Opus 4.8 is the only model to complete every case end-to-end, beating prior Opus models and GPT-5.5 at parity on cost."

Dynamic Workflows in Claude Code

The biggest feature launch alongside the model: dynamic workflows, available as a research preview in Claude Code. The model plans work and runs hundreds of parallel subagents in a single session. Anthropic says this enables codebase-scale migrations across hundreds of thousands of lines of code, from kickoff to merge.

Available for Enterprise, Team, and Max plans.

This is particularly relevant for large refactors, framework migrations, and cross-service changes where manual orchestration of multiple Claude sessions was previously the only option.

Alignment Improvements

Misaligned behavior (deception, cooperation with misuse) is substantially lower than Opus 4.7. Opus 4.8 scores near 1.83 on Anthropic's misalignment metric, comparable to Mythos Preview (their best-aligned model). Opus 4.7 scored 2.47. This matters for teams running autonomous agents where the model operates without constant human review.

Pricing

Same price as Opus 4.7. Fast mode at 2.5x speed, 3x cheaper than fast mode on previous models. Databricks reported 61% cheaper token cost for their Genie agent compared to Opus 4.7.

Top comments (1)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.