Anthropic shipped Claude Opus 4.8 today. The benchmark numbers went up, as they always do. But that's not why I'm switching my default model, and I want to explain the part that actually changed how I work.
The numbers, quickly
Here's the official comparison:
The highlights:
- SWE-Bench Pro: 69.2% — up from 64.3% on 4.7, well ahead of GPT-5.5 (58.6%) and Gemini 3.1 Pro (54.2%).
- Computer use (OSWorld-Verified): 83.4% — still the model to beat for clicking around real UIs.
- Knowledge work (GDPval-AA): 1890 vs 1769 for GPT-5.5.
- Reasoning (Humanity's Last Exam): 49.8% no tools / 57.9% with tools — top of the table.
And one I'll call out honestly: on Terminal-Bench 2.1, Opus 4.8 scores 74.6% and GPT-5.5 wins at 78.2%. 4.8 jumped a lot from its predecessor (66.1%), but it isn't first on that one. Pick your model for what you actually do.
The part that matters more than any benchmark
Opus 4.8 is roughly 4x less likely than 4.7 to let a code flaw pass without flagging it. It proactively points out uncertainty, questions sketchy inputs, and pushes back on plans it thinks are unsound.
That sounds small. It isn't.
When you hand work to an agent, raw capability was never the real bottleneck — silent failure was. The model that writes a subtle off-by-one and says nothing costs you more than the model that's slightly worse but says "I'm not sure this input is ever non-null, can you confirm?"
Concretely:
Before: it writes a function that looks clean, ships a hidden edge-case bug, says nothing. You find it in production.
After: it writes the same function and adds "there's an edge case here I'm not confident about — double-check the input is non-empty," or flat-out tells you your plan has a hole.
For anyone treating Claude as a coworker that ships work unattended, that calibrated honesty is worth more than a few benchmark points.
Three product changes worth knowing
- Dynamic Workflows (Claude Code research preview) — runs hundreds of parallel subagents for big jobs like migrating a codebase across hundreds of thousands of lines.
- Effort control (claude.ai, Cowork) — you pick how hard it thinks. Higher = deeper, lower = faster. The speed/quality trade-off is back in your hands.
-
Messages API now accepts
systementries mid-array without breaking the prompt cache — you can inject new instructions partway through a long task and keep your cache. If you build long-running agents, you already know why this matters.
Pricing didn't move
- Regular: $5 / 1M input, $25 / 1M output — same as 4.7.
- Fast mode: $10 / 1M input, $50 / 1M output — 3x cheaper than the previous fast tier, and it's still Opus, not a smaller model.
Databricks reported 61% lower token cost vs 4.7 on their workloads, because 4.8 uses tools more efficiently and takes fewer steps.
Model ID is claude-opus-4-8, available everywhere today.
My take
The next moat in agents isn't IQ. It's calibrated honesty — the model that tells you when it's unsure is the one you can actually delegate to. That's the upgrade I care about here.
Numbers and image from Anthropic's announcement. Full evals are in the system card.

Top comments (0)