Anthropic shipped Opus 4.8 into Claude Code with a familiar promise: better agentic coding. Does it make real developers more confident leaving Claude Code alone on production-shaped work?
TL;DR
- Anthropic calls Opus 4.8 a “modest but tangible improvement.” That is the right frame.
- The coding numbers are better, especially on harder agentic benchmarks, but they do not settle the model-ranking argument by themselves.
- Claude Code quality still depends on the harness: effort level, context compaction, prompt cache behavior, tool permissions, and launch stability.
- Pricing is premium. The right metric is not dollars per million tokens. It is dollars per accepted engineering outcome.
- My read: use Opus 4.8 for hard, multi-step work where a failed agent loop costs real time. Do not use it for cheap bulk edits by default.
What changed technically
The model details are developer-relevant. Anthropic lists Opus 4.8 at a 1M token context window on Claude API, Bedrock, and Vertex AI, with 128K maximum output tokens. Microsoft Foundry is capped lower at 200K context.
The Messages API now accepts system entries inside the messages array, which means instructions can be changed mid-task without breaking the prompt cache in the same way. That sounds small, but it is exactly the kind of feature that matters for long-running coding agents: “plan first,” “now patch,” “now review your patch under stricter rules.”
Claude Code also gets dynamic workflows in research preview for Max, Team, and Enterprise plans. Anthropic describes this as Claude planning work and launching many parallel subagents. The headline example is Jarred Sumner using it on the Bun Zig-to-Rust port: roughly 750,000 lines of Rust, 99.8% of the existing test suite passing, and 11 days from first commit to merge.
What developers are reporting
Simon Willison highlighted the mid-conversation system prompt feature as practically interesting, and posted a small cost example where the best result used 25 input tokens and 17,167 output tokens, costing about 43 cents.
Hacker News reaction is mixed in the usual useful way. Some developers see benchmark fatigue: Opus 4.6, 4.7, and 4.8 all claim improvements, but day-to-day coding gains are harder to feel. Others argue the current coding benchmarks miss the parts of software engineering that hurt: unclear requirements, repo-specific conventions, migrations, flaky tests, and review cost. A recurring practical tip was to set Claude Code effort to xhigh for serious work. Another thread reported launch-day Claude Code breakage around thinking blocks.
Theo Browne’s developer-focused take is that Opus 4.8 is a “modest but tangible improvement,” especially for TypeScript-heavy work and “Claude special” UI tasks, but not a reason to ignore the old Claude Code risks. He treats benchmark wins like SWE Bench Pro cautiously, still sees GPT-5.5 xhigh as stronger in his mini-SWE-agent harness, and warns that dynamic workflows / “Ultra Code” can turn Claude into a powerful parallel coordinator for audits, bug hunts, and migrations while also burning money absurdly fast. His practical advice is to write detailed prompts up front, keep a root CLAUDE.md, monitor spend with CC usage, resume from summaries when limits hit, and verify everything, because Opus 4.8 can still hallucinate details like CLI flags.
Pricing: what it costs in reality
The obvious objection is price.
| Model/path | Sticker price | Practical read |
|---|---|---|
| Claude Opus 4.8 | $5 input / $25 output per MTok | Expensive, plausible for hard agent loops |
| GPT-5.5 | $5 / $30 short context; $10 / $45 long context | Similar frontier tier, output can cost more |
| Gemini 3.5 Flash | $1.50 / $9 | Better default for cheaper high-volume work |
| DeepSeek V4 | Much cheaper | Strong cost pressure; workflow quality varies |
Anthropic also offers batch pricing at half price: $2.50 input / $12.50 output per MTok for Opus 4.8. Prompt caching is $6.25/MTok for five-minute cache writes, $10/MTok for one-hour cache writes, and $0.50/MTok for cache hits. Fast Mode for Opus 4.8 is $10 input and $50 output per MTok for up to 2.5x higher output tokens per second, which is cheaper than the $30 input / $150 output Fast Mode pricing listed for Opus 4.6 and 4.7. But there is a catch: Anthropic says Opus 4.7 and later may use up to 35% more tokens for the same fixed text because of the tokenizer.
The real pricing question is not whether Opus 4.8 is expensive per token. It is whether it reduces failed or supervised loops enough to justify being on the critical path. If it prevents one botched refactor pass, it can be cheaper than a lower-priced model that burns context, leaves a half-correct diff, and hands the cleanup back to you.
Dynamic workflows complicate this. Parallel subagents can multiply progress. They can also multiply spend. Anthropic’s own docs warn that a workflow can use “meaningfully more tokens” than a normal conversation, counts against plan usage and rate limits, and can fan out to as many as 16 concurrent agents and 1,000 agents total per run. Public Claude Code issue reports make the downside less theoretical: Max users have reported hitting “out of extra usage” after one task and 155 tool uses in 9.5 minutes, and another Opus report claimed the limit arrived in roughly 10 minutes after about 20 prompts. Use workflows where the work decomposes cleanly and the result can be verified with tests, not as the default path for every substantial request.
Where Opus 4.8 fits
Artificial Analysis reports Opus 4.8 as a top-tier model, with a high Intelligence Index score and unusually heavy token usage during evaluation.
I would route Opus 4.8 to:
- multi-file planning and risky refactors,
- hard debugging across services,
- codebase exploration before architectural decisions,
- security-sensitive review with strict instructions,
- long Claude Code sessions where recovery matters.
I would route cheaper models to:
- rote edits,
- test scaffolding,
- formatting,
- classification,
- bulk summarization,
- low-risk subagents.
The split is simple: use Opus where failure is expensive. Use cheaper models where retrying is cheap.
Practical verdict
Try Opus 4.8 immediately if you already pay for Claude Code and have high-value work that currently fails because the agent loses the plot. Set effort deliberately. Demand tests. Review diffs like you would review a fast junior engineer with infinite stamina and uneven judgment.
Wait if your work is mostly small patches, hobby automation, or batch code cleanup. Gemini, DeepSeek, Sonnet-class models, and cheaper paths may get you most of the value for much less.
If you are already getting good results from GPT-5.5, I would not switch by default. Opus 4.8 looks better, but not enough to justify moving a working coding workflow unless your own repo evals show a clear reduction in failed loops, review time, or total cost per accepted change.
References
- Anthropic: Introducing Claude Opus 4.8
- Claude API docs: What’s new in Opus 4.8
- Anthropic pricing docs
- Claude Code dynamic workflows
- Claude Code docs: workflows
- Claude Code docs: manage costs
- Claude Code GitHub issue: 20x Max extra usage after one session
- Claude Code GitHub issue: usage limit reached in 10 minutes
- Claude Opus 4.8 system card
- Simon Willison on Claude Opus 4.8
- Hacker News discussion
- Artificial Analysis: Claude Opus 4.8
- OpenAI pricing, Gemini pricing, DeepSeek pricing
Top comments (0)