Owen

Posted on May 28 • Originally published at ofox.ai

Claude Opus 4.8: Benchmarks, Fast Mode, and What Actually Changed

#ai #anthropic #claude #benchmarks

TL;DR: Anthropic shipped Claude Opus 4.8 on May 28, 2026, at the same $5/$25 price as 4.7. It tops Artificial Analysis's GDPval-AA real-work leaderboard at 1890 Elo (121 points ahead of GPT-5.5, 137 ahead of Opus 4.7), hits 69.2% on SWE-bench Pro, and uses about 35% fewer output tokens per task than 4.7.

What Anthropic Shipped

Claude Opus 4.8 launched May 28, 2026, at the same list price as Opus 4.7: $5 per million input tokens and $25 per million output tokens. It has a 1M-token context window by default on the Claude API (200K on Microsoft Foundry) and a 128K-token maximum output.
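To make the list prices concrete, here is a minimal cost sketch. The token counts are hypothetical, chosen only to show how the roughly 35% output-token reduction quoted in the TL;DR would change the bill for a single task; real workloads will differ.

```python
# List prices quoted in the article (USD per million tokens).
INPUT_USD_PER_MTOK = 5.00
OUTPUT_USD_PER_MTOK = 25.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the list-price cost in USD for one request."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical task: 50,000 input tokens and 20,000 output tokens.
baseline = estimate_cost_usd(50_000, 20_000)
# Same task with 35% fewer output tokens (20,000 -> 13,000).
reduced = estimate_cost_usd(50_000, 13_000)

print(f"baseline: ${baseline:.3f}, with 35% fewer output tokens: ${reduced:.3f}")
```

Because output tokens cost five times as much as input tokens, output-heavy agentic tasks benefit most from the reduction.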

The key differentiator: it scores higher than Opus 4.7 while using fewer output tokens per task (about 35% fewer on GDPval-AA).

The GDPval-AA Result

Opus 4.8 (max effort) debuts at 1890 Elo, 121 points clear of second-place GPT-5.5 (1769) and 137 points above its own predecessor, Opus 4.7 (1753).

GDPval-AA is an independent evaluation from Artificial Analysis that tests models on real economic work tasks across 44 occupations, giving each model shell access and web browsing inside an agentic loop.

Opus 4.8 reached this score using 15% fewer turns and 35% fewer output tokens per task than Opus 4.7.
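For a rough sense of what those Elo gaps mean, the sketch below converts a rating gap into an expected head-to-head score. It assumes GDPval-AA uses the conventional logistic Elo scale with a 400-point divisor; Artificial Analysis's exact methodology may differ, so treat the results as an approximation.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score of the higher-rated model under the standard Elo formula.

    Assumes the conventional 400-point logistic scale (an assumption, not
    something the article confirms for GDPval-AA).
    """
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))


# Gaps quoted in the article: 121 points over GPT-5.5, 137 over Opus 4.7.
for rival, gap in [("GPT-5.5", 121), ("Opus 4.7", 137)]:
    print(f"vs {rival}: expected score ~{elo_expected_score(gap):.2f}")
```

Under that assumption, both gaps correspond to Opus 4.8 being preferred in roughly two out of three pairwise comparisons.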

Benchmarks vs. the Field

Benchmark	Opus 4.8	Opus 4.7	GPT-5.5	Gemini 3.1 Pro
SWE-bench Pro	69.2%	64.3%	58.6%	54.2%
OSWorld-Verified (computer use)	83.4%	82.8%	78.7%	76.2%
Terminal-Bench 2.1	74.6%	66.1%	78.2%	70.3%
Humanity's Last Exam (with tools)	57.9%	—	—	—
Finance Agent v2	53.9%	—	—	—
GDPval-AA (Elo)	1890	1753	1769	—

GPT-5.5 still wins Terminal-Bench 2.1 (78.2% vs 74.6%). If your workload is heavy on raw terminal command sequences, that's a real data point, not a rounding error.
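The margins in the table are easy to check yourself. This small sketch copies the rows where more than one model has a score and reports the leader and its lead over the runner-up; the helper name `leader_and_margin` is just illustrative.

```python
# Scores copied from the table above (rows with at least two models).
scores = {
    "SWE-bench Pro": {"Opus 4.8": 69.2, "Opus 4.7": 64.3, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2},
    "OSWorld-Verified": {"Opus 4.8": 83.4, "Opus 4.7": 82.8, "GPT-5.5": 78.7, "Gemini 3.1 Pro": 76.2},
    "Terminal-Bench 2.1": {"Opus 4.8": 74.6, "Opus 4.7": 66.1, "GPT-5.5": 78.2, "Gemini 3.1 Pro": 70.3},
    "GDPval-AA (Elo)": {"Opus 4.8": 1890, "Opus 4.7": 1753, "GPT-5.5": 1769},
}


def leader_and_margin(row: dict) -> tuple:
    """Return (leading model, lead over the runner-up) for one benchmark row."""
    ranked = sorted(row.items(), key=lambda item: item[1], reverse=True)
    (top_name, top_score), (_, second_score) = ranked[0], ranked[1]
    return top_name, round(top_score - second_score, 1)


for benchmark, row in scores.items():
    name, margin = leader_and_margin(row)
    print(f"{benchmark}: {name} leads by {margin}")
```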

What's New Under the Hood

Fast Mode. A research preview that serves the same Opus 4.8 model at up to 2.5x higher output tokens per second, at premium pricing.

Mid-conversation system messages. Users can now insert system messages after user turns, preserving prompt-cache hits on earlier turns and reducing input costs.

Adaptive thinking, effort default high. Use thinking: {"type": "adaptive"} and the effort parameter instead of extended thinking budgets.

Better tool triggering and compaction. Improvements in long-horizon agentic coding with fewer compactions and better recovery.

Prompting Opus 4.8: What Actually Changed

Effort is now the main dial. Start at xhigh for coding and agentic use cases, and keep a minimum of high for anything intelligence-sensitive.
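As a sketch of that guidance, the helper below maps a workload label to an effort level. The `task_kind` labels and the function names are hypothetical, and the flat `effort` key in the returned dict is a placeholder: the article does not show where the effort parameter sits in the request, so check the API reference before wiring this into a real call.

```python
# Workload labels are hypothetical; adjust them to your own taxonomy.
XHIGH_TASK_KINDS = {"coding", "agentic"}


def pick_effort(task_kind: str) -> str:
    """Start at xhigh for coding/agentic work, otherwise use the default high."""
    return "xhigh" if task_kind in XHIGH_TASK_KINDS else "high"


def build_request_options(task_kind: str) -> dict:
    """Assemble the thinking and effort settings for one request.

    The placement of "effort" here is an assumption for illustration only.
    """
    return {
        "thinking": {"type": "adaptive"},
        "effort": pick_effort(task_kind),
    }


print(build_request_options("coding"))
print(build_request_options("summarization"))
```

Keeping the choice in one function makes it easy to A/B effort levels per workload instead of hard-coding one value everywhere.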

It follows instructions literally. The model won't silently generalize instructions or infer unstated requests.

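It favors reasoning over tool calls. The model tends to think a problem through before reaching for a tool, so if your agent depends on tool use, raise effort: high and xhigh produce substantially more tool calls.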

The code-review recall trap. Opus 4.8 is genuinely better at finding bugs (higher precision and recall in Anthropic's evals), but if your review harness says "only report high-severity issues" or "be conservative," 4.8 follows that more faithfully than older models. The result can look like lower recall, even though the model found the issues and then withheld them as instructed.
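One way to sidestep this is to ask the model to report everything it finds with a severity label, then apply the threshold in your own code. The sketch below assumes the model replies with a JSON array; the prompt wording, field names, and severity levels are all hypothetical and not taken from Anthropic's guidance.

```python
import json

# Hypothetical severity scale used for filtering after the review.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

# Hypothetical prompt: ask for full coverage, and defer filtering to code.
REVIEW_PROMPT = (
    "Review the diff and report every issue you find, including low-severity "
    "and uncertain ones. Respond with a JSON array of objects with the keys "
    "'title' and 'severity' (low, medium, high, or critical)."
)


def filter_findings(raw_json: str, min_severity: str = "high") -> list:
    """Keep only findings at or above min_severity; unknown labels count as low."""
    threshold = SEVERITY_RANK[min_severity]
    findings = json.loads(raw_json)
    return [
        finding
        for finding in findings
        if SEVERITY_RANK.get(finding.get("severity"), 0) >= threshold
    ]


sample = json.dumps([
    {"title": "Unchecked nil dereference", "severity": "high"},
    {"title": "Typo in log message", "severity": "low"},
])
print(filter_findings(sample))
```

This keeps the full list available for auditing while your harness still surfaces only the issues above the bar.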

The "Most Honest" Claim

Anthropic positions Opus 4.8 as having fewer confident fabrications, less sycophancy, and clearer refusals.

Launched Alongside: Dynamic Workflows in Claude Code

Dynamic workflows let Claude orchestrate tens to hundreds of parallel subagents in a single session.

The featured example involved porting Bun from Zig to Rust — roughly 750,000 lines of code, with a 99.8% test-suite pass rate, in 11 days.

Two limitations: it's plan-gated (dynamic workflows run on Claude Code Max, Team, and Enterprise plans), and token consumption is substantially higher than a normal session.

How to Access Opus 4.8 via ofox.ai

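On ofox.ai the model ID is anthropic/claude-opus-4.8, available through the same OpenAI-compatible endpoint with no separate billing. The snippet below goes through the Anthropic-compatible base URL with the Anthropic Python SDK, where the ID is written claude-opus-4-8: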

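import anthropic

# Point the Anthropic SDK at ofox's Anthropic-compatible endpoint.
client = anthropic.Anthropic(
    base_url="https://api.ofox.ai/anthropic",
    api_key="your-ofox-key",  # replace with your ofox API key
)

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=4096,
    thinking={"type": "adaptive"},  # adaptive thinking instead of a fixed budget
    messages=[{"role": "user", "content": "Audit this service for race conditions..."}],
)

# Print only the text blocks; the response may also contain thinking blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)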

Verdict

Opus 4.8 is the rare upgrade with no asterisk on price: same $5/$25, higher scores than 4.7 across coding and computer use, top of the independent real-work leaderboard, and fewer output tokens per task.

Originally published on ofox.ai/blog.
