Mixture of Experts

Posted on May 29

Claude Opus 4.8 Is About Reliability

#ai #coding #claude #programming

Anthropic shipped Opus 4.8 into Claude Code with a familiar promise: better agentic coding. Does it make real developers more confident leaving Claude Code alone on production-shaped work?

TL;DR

Anthropic calls Opus 4.8 a “modest but tangible improvement.” That is the right frame.
The coding numbers are better, especially on harder agentic benchmarks, but they do not settle the model-ranking argument by themselves.
Claude Code quality still depends on the harness: effort level, context compaction, prompt cache behavior, tool permissions, and launch stability.
Pricing is premium. The right metric is not dollars per million tokens. It is dollars per accepted engineering outcome.
My read: use Opus 4.8 for hard, multi-step work where a failed agent loop costs real time. Do not use it for cheap bulk edits by default.

What changed technically

The model details are developer-relevant. Anthropic lists Opus 4.8 at a 1M token context window on Claude API, Bedrock, and Vertex AI, with 128K maximum output tokens. Microsoft Foundry is capped lower at 200K context.

The Messages API now accepts system entries inside the messages array, which means instructions can be changed mid-task without breaking the prompt cache in the same way. That sounds small, but it is exactly the kind of feature that matters for long-running coding agents: “plan first,” “now patch,” “now review your patch under stricter rules.”

Claude Code also gets dynamic workflows in research preview for Max, Team, and Enterprise plans. Anthropic describes this as Claude planning work and launching many parallel subagents. The headline example is Jarred Sumner using it on the Bun Zig-to-Rust port: roughly 750,000 lines of Rust, 99.8% of the existing test suite passing, and 11 days from first commit to merge.

What developers are reporting

Simon Willison highlighted the mid-conversation system prompt feature as practically interesting, and posted a small cost example where the best result used 25 input tokens and 17,167 output tokens, costing about 43 cents.

Hacker News reaction is mixed in the usual useful way. Some developers see benchmark fatigue: Opus 4.6, 4.7, and 4.8 all claim improvements, but day-to-day coding gains are harder to feel. Others argue the current coding benchmarks miss the parts of software engineering that hurt: unclear requirements, repo-specific conventions, migrations, flaky tests, and review cost. A recurring practical tip was to set Claude Code effort to xhigh for serious work. Another thread reported launch-day Claude Code breakage around thinking blocks.

Theo Browne’s developer-focused take is that Opus 4.8 is a “modest but tangible improvement,” especially for TypeScript-heavy work and “Claude special” UI tasks, but not a reason to ignore the old Claude Code risks. He treats benchmark wins like SWE Bench Pro cautiously, still sees GPT-5.5 xhigh as stronger in his mini-SWE-agent harness, and warns that dynamic workflows / “Ultra Code” can turn Claude into a powerful parallel coordinator for audits, bug hunts, and migrations while also burning money absurdly fast. His practical advice is to write detailed prompts up front, keep a root CLAUDE.md, monitor spend with CC usage, resume from summaries when limits hit, and verify everything, because Opus 4.8 can still hallucinate details like CLI flags.

Pricing: what it costs in reality

The obvious objection is price.

Model/path	Sticker price	Practical read
Claude Opus 4.8	$5 input / $25 output per MTok	Expensive, plausible for hard agent loops
GPT-5.5	$5 / $30 short context; $10 / $45 long context	Similar frontier tier, output can cost more
Gemini 3.5 Flash	$1.50 / $9	Better default for cheaper high-volume work
DeepSeek V4	Much cheaper	Strong cost pressure; workflow quality varies

Anthropic also offers batch pricing at half price: $2.50 input / $12.50 output per MTok for Opus 4.8. Prompt caching is $6.25/MTok for five-minute cache writes, $10/MTok for one-hour cache writes, and $0.50/MTok for cache hits. Fast Mode for Opus 4.8 is $10 input and $50 output per MTok for up to 2.5x higher output tokens per second, which is cheaper than the $30 input / $150 output Fast Mode pricing listed for Opus 4.6 and 4.7. But there is a catch: Anthropic says Opus 4.7 and later may use up to 35% more tokens for the same fixed text because of the tokenizer.

The real pricing question is not whether Opus 4.8 is expensive per token. It is whether it reduces failed or supervised loops enough to justify being on the critical path. If it prevents one botched refactor pass, it can be cheaper than a lower-priced model that burns context, leaves a half-correct diff, and hands the cleanup back to you.

Dynamic workflows complicate this. Parallel subagents can multiply progress. They can also multiply spend. Anthropic’s own docs warn that a workflow can use “meaningfully more tokens” than a normal conversation, counts against plan usage and rate limits, and can fan out to as many as 16 concurrent agents and 1,000 agents total per run. Public Claude Code issue reports make the downside less theoretical: Max users have reported hitting “out of extra usage” after one task and 155 tool uses in 9.5 minutes, and another Opus report claimed the limit arrived in roughly 10 minutes after about 20 prompts. Use workflows where the work decomposes cleanly and the result can be verified with tests, not as the default path for every substantial request.

Where Opus 4.8 fits

Artificial Analysis reports Opus 4.8 as a top-tier model, with a high Intelligence Index score and unusually heavy token usage during evaluation.

I would route Opus 4.8 to:

multi-file planning and risky refactors,
hard debugging across services,
codebase exploration before architectural decisions,
security-sensitive review with strict instructions,
long Claude Code sessions where recovery matters.

I would route cheaper models to:

rote edits,
test scaffolding,
formatting,
classification,
bulk summarization,
low-risk subagents.

The split is simple: use Opus where failure is expensive. Use cheaper models where retrying is cheap.

Practical verdict

Try Opus 4.8 immediately if you already pay for Claude Code and have high-value work that currently fails because the agent loses the plot. Set effort deliberately. Demand tests. Review diffs like you would review a fast junior engineer with infinite stamina and uneven judgment.

Wait if your work is mostly small patches, hobby automation, or batch code cleanup. Gemini, DeepSeek, Sonnet-class models, and cheaper paths may get you most of the value for much less.

If you are already getting good results from GPT-5.5, I would not switch by default. Opus 4.8 looks better, but not enough to justify moving a working coding workflow unless your own repo evals show a clear reduction in failed loops, review time, or total cost per accepted change.

References

Top comments (1)

Self-Correcting Systems • Jun 1

Strong read. The line that matters most to me is “dollars per accepted engineering
outcome,” not dollars per token.

That is the right unit for coding agents.

A cheaper model that creates three half-correct passes, burns review time, and leaves the
repo in a confused state can be more expensive than a premium model that gets one hard
loop right. But the reverse is also true: using Opus-class models for rote edits is
usually wasteful.

The harness point is the real lesson. Model reliability does not exist by itself. It
depends on context discipline, permissions, tests, repo conventions, prompt structure,
and whether the agent has a way to recover when it loses the thread.

That connects directly to memory and authority too. A more reliable model with bad
context can still obey the wrong instruction confidently. Better reasoning does not
remove the need for verification gates, scoped instructions, and clear authority rules
inside the repo.

My practical split would be similar:

Use frontier models for ambiguous, multi-step, high-review-cost work.

Use cheaper models for repeatable, low-risk execution.

Then measure the whole loop: accepted diff rate, review time, rollback rate, test pass
rate, and total cost per merged change.

That is the only ranking that really matters in production-shaped coding.