Claude Opus 4.8: Benchmarks, Fast Mode, and What Actually Changed
TL;DR — Anthropic shipped Claude Opus 4.8 on May 28, 2026, at the same $5/$25 price as 4.7. It tops Artificial Analysis's GDPval-AA real-work leaderboard at 1890 Elo (+121 over GPT-5.5, +137 over 4.7), hits 69.2% on SWE-bench Pro, and does it using ~35% fewer output tokens than 4.7.
What Anthropic Shipped
Claude Opus 4.8 launched May 28, 2026, maintaining the same list price as Opus 4.7: $5 per million input, $25 per million output. 1M-token context window by default on the Claude API (200K on Microsoft Foundry), 128K max output tokens.
The key differentiator is that it achieves superior performance while reducing token consumption compared to its predecessor.
The GDPval-AA Result
Opus 4.8 (max effort) debuts at 1890 Elo, pulling 121 points clear of GPT-5.5 in second place and +137 over its own predecessor.
Independent evaluation from Artificial Analysis tested models on real economic work tasks across 44 occupations, providing each with shell access and web browsing capabilities within an agentic loop.
Opus 4.8 reached this score using 15% fewer turns and 35% fewer output tokens per task than Opus 4.7.
Benchmarks vs. the Field
| Benchmark | Opus 4.8 | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Pro | 69.2% | 64.3% | 58.6% | 54.2% |
| OSWorld-Verified (computer use) | 83.4% | 82.8% | 78.7% | 76.2% |
| Terminal-Bench 2.1 | 74.6% | 66.1% | 78.2% | 70.3% |
| Humanity's Last Exam (with tools) | 57.9% | — | — | — |
| Finance Agent v2 | 53.9% | — | — | — |
| GDPval-AA (Elo) | 1890 | 1753 | 1769 | — |
GPT-5.5 still wins Terminal-Bench 2.1 (78.2% vs 74.6%). If your workload is heavy on raw terminal command sequences, that's a real data point, not a rounding error.
What's New Under the Hood
Fast Mode. A research preview that serves the same Opus 4.8 model at up to 2.5x higher output tokens per second, at premium pricing.
Mid-conversation system messages. Users can now insert system messages after user turns, preserving prompt-cache hits on earlier turns and reducing input costs.
Adaptive thinking, effort default high. Use thinking: {"type": "adaptive"} and the effort parameter instead of extended thinking budgets.
Better tool triggering and compaction. Improvements in long-horizon agentic coding with fewer compactions and better recovery.
Prompting Opus 4.8: What Actually Changed
Effort is now the main dial. Start at xhigh for coding and agentic use cases, and keep a minimum of high for anything intelligence-sensitive.
It follows instructions literally. The model won't silently generalize instructions or infer unstated requests.
It favors reasoning over tool calls. Raising effort to high/xhigh produces substantially more tool use.
The code-review recall trap. Opus 4.8 is genuinely better at finding bugs (higher precision and recall in Anthropic's evals), but if your review harness says "only report high-severity issues" or "be conservative," 4.8 follows that more faithfully than older models.
The "Most Honest" Claim
Anthropic positions Opus 4.8 as having fewer confident fabrications, less sycophancy, and clearer refusals.
Launched Alongside: Dynamic Workflows in Claude Code
Dynamic workflows let Claude orchestrate tens to hundreds of parallel subagents in a single session.
The featured example involved porting Bun from Zig to Rust — roughly 750,000 lines of code, with a 99.8% test-suite pass rate, in 11 days.
Two limitations: it's plan-gated (dynamic workflows run on Claude Code Max, Team, and Enterprise plans), and token consumption is substantially higher than a normal session.
How to Access Opus 4.8 via ofox.ai
The model ID is anthropic/claude-opus-4.8, accessible through the same OpenAI-compatible endpoint with no separate billing.
import anthropic
client = anthropic.Anthropic(
base_url="https://api.ofox.ai/anthropic",
api_key="your-ofox-key",
)
response = client.messages.create(
model="claude-opus-4-8",
max_tokens=4096,
thinking={"type": "adaptive"},
messages=[{"role": "user", "content": "Audit this service for race conditions..."}],
)
Verdict
Opus 4.8 is the rare upgrade with no asterisk on price: same $5/$25, higher scores across coding and computer-use, top of the independent real-work leaderboard, and fewer output tokens per task.
Originally published on ofox.ai/blog.
Top comments (0)