DEV Community

RAXXO Studios
RAXXO Studios

Posted on • Originally published at raxxo.shop

Claude Opus 4.8 Is Here: Everything That Changed

  • Opus 4.8 is 4x less likely to let its own code bugs pass unremarked, and abstains when unsure

  • It beats Opus 4.7 on 6 of 7 shared benchmarks and tops GPT-5.5 on most, losing only terminal coding

  • List price held flat while the new Fast mode runs 2.5x faster at a third of the old fast-mode cost

  • Dynamic Workflows fans out hundreds of parallel subagents, but it is a Max, Team, and Enterprise research preview

  • For a daily Claude Code user it is a free quality bump at the same price, not a revolution

Claude Opus 4.8 shipped on May 28, 2026, just 41 days after 4.7. The headline number is a 4.9 point jump on SWE-Bench Pro. That is not the part that changed how I work. The part that changed how I work is that Opus 4.8 now tells me when it is not sure.

The honesty upgrade is the real headline

Anthropic's own framing: Opus 4.8 is around four times less likely than 4.7 to let a flaw in code it wrote pass unremarked. Read that again. The model is not just writing better code. It is catching more of its own mistakes before it hands the work back.

The second half of the story is abstention. Opus 4.8 is more likely to flag uncertainty and less likely to make claims it cannot back up. Simon Willison put it plainly after testing: the lower hallucination rate comes mostly from the model declining to answer when it is unsure, not from getting more answers right. A model that says "I am not certain" is worth more to me than one that is confidently wrong, because confidently wrong is the expensive kind.

Anthropic's staff engineers describe it as a model that asks the right questions, catches its own mistakes, and pushes back when a plan is not sound. I felt this within a day. Fewer silent bugs returned as "done." Fewer moments where I trusted output that did not deserve it.

For a one-person studio that ships every day, this is the whole game. My bottleneck has never been generating code. It has been auditing code I did not write line by line before it goes live. A model that flags its own weak spots gives me that time back. If you want the earlier chapter of this story, Claude Opus 4.7 Is Here covers where the last jump landed.

The benchmarks: where 4.8 wins, and where it loses

Here are the verified numbers, 4.8 against 4.7 and the two closest rivals.

| Benchmark | Opus 4.8 | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |

|---|---|---|---|---|

| SWE-Bench Pro (agentic coding) | 69.2% | 64.3% | 58.6% | 54.2% |

| SWE-Bench Verified | 88.6% | 87.6% | n/a | 80.6% |

| Terminal-Bench 2.1 (terminal coding) | 74.6% | 66.1% | 78.2% | 70.3% |

| OSWorld-Verified (computer use) | 83.4% | 82.8% | 78.7% | 76.2% |

| Humanity's Last Exam (with tools) | 57.9% | 54.7% | 52.2% | 51.4% |

| GDPval-AA (knowledge work, Elo) | 1890 | 1753 | 1769 | 1314 |

| GPQA Diamond | 93.6% | 94.2% | n/a | n/a |

The pattern is clear. Opus 4.8 beats 4.7 on six of the seven benchmarks they share, and it tops GPT-5.5 and Gemini 3.1 Pro on most of them. Knowledge work made the biggest leap, up 137 Elo points on GDPval-AA. Computer use crept forward too, and Anthropic says it hit 84% on Online-Mind2Web, the strongest browser-agent score it has tested. It is also the first model to break 10% on the Legal Agent Benchmark's all-pass standard, a brutal test where finishing most of a task earns nothing and only a complete, correct run counts.

Now the honest part, because numbers without the losses are marketing. GPT-5.5 still wins terminal automation. On Terminal-Bench 2.1 it scores 78.2% to Opus 4.8's 74.6%. And GPQA Diamond actually slipped a hair, from 94.2% to 93.6%. So if your daily work is heavy command-line orchestration, the gap with OpenAI's model narrowed in their favor, not yours. Everywhere else, 4.8 is ahead.

Same price, a much cheaper fast mode

The list price did not move from 4.7. Opus stays at 5 dollars per million input tokens and 25 per million output. A point upgrade at zero extra cost is rare, so I will take it.

What did change is Fast mode. It runs about 2.5 times faster than standard, and it costs roughly a third of the old 4.7 fast tier. Toggle it with /fast in Claude Code. It is still real Opus, not a smaller model wearing the name, which matters when the speed boost is for production work and not just chat.

Two more things that get buried in launch posts. The full 1M token context window is included at standard price with no surcharge. That is the variant this very article was researched on. And the effort control got a quiet retune: in Claude Code the default effort dropped from xhigh to high, and Anthropic reports coding quality went up at comparable token spend. Lower default effort, better results, same bill. That is the kind of change you only notice in the totals at the end of the month.

For anyone running this through the API rather than the app, the plumbing got cheaper too. Batch jobs run at half price, and cached reads land at roughly a tenth of the standard input rate, a 90% saving on repeated context. The Messages API also picked up a useful trick: you can now slot system instructions inside the messages array to update guidance mid-conversation without blowing up your prompt cache, and the minimum cacheable prompt dropped to 1,024 tokens. None of that makes a headline, but it is exactly the kind of detail that decides what an agent actually costs to run all day.

Availability was day one and everywhere that matters. Opus 4.8 went live immediately on the Claude API as claude-opus-4-8, plus Amazon Bedrock, Google Vertex AI, and Microsoft Foundry. The training cutoff stayed at January 2026, the same as 4.7, so this is a reasoning and behavior upgrade, not a fresher-knowledge one.

Dynamic Workflows: hundreds of agents, one script

This is the marquee feature, and full disclosure, it is what produced this article. Dynamic Workflows lets Claude write an orchestration script that fans out tens to hundreds of subagents, runs up to 16 of them at once, and caps at 1,000 per run. It can spin up adversarial agents that try to refute each other's findings before anything is treated as true. State is resumable, so a long run survives an interruption.

I did not just read about this. I researched and fact-checked this post with exactly that pattern: seven parallel agents each chased a different angle, then a second wave of verifiers attacked every single number against independent sources. Twenty-two claims went in, twenty-two came back confirmed, zero contradicted. That is a different kind of confidence than one model answering in one pass.

The catch for solo builders is real. Dynamic Workflows is a research preview, it needs Claude Code v2.1.154 or newer, and it is gated to the Max, Team, and Enterprise plans. So the headline capability exists, but most individual plans cannot fully use it yet. If you have been tracking what Anthropic ships next, the Sonnet 4.8 and Mythos rumor roundup is where I sorted the signal from the noise.

What it means for a one-person studio

If you live in Claude Code, the move is simple: update and keep working. You get the same price, fewer of your own bugs slipping through, and a model that admits uncertainty instead of inventing an answer.

Use Fast mode for interactive sessions where speed beats cost. Lean on the 1M window for whole-repo work, since it is free now. Keep the default effort where it is, because Anthropic already tuned it to spend fewer tokens for better code. The caveats are honest ones: Dynamic Workflows may sit behind a plan tier you are not on, and if your day is terminal automation, GPT-5.5 still has the edge.

The harder question for a solo budget is whether to run the top tier at all when cheaper models exist. My answer used to be "Opus for the hard parts, smaller models for the rest." With 4.8 that math shifts, because the honesty gain is worth the most exactly where mistakes are most expensive: production code, client work, anything I cannot babysit. A cheaper model that bluffs costs me an afternoon of debugging. A pricier model that flags its own doubt saves that afternoon. At an unchanged list price, the integrity upgrade is effectively free, and that is the part of this release I did not expect to care about as much as I do.

I broke down the exact daily routines that make an Opus tier pay for itself in Claude Opus 4.7 in My Solo Studio. Almost all of it carries straight over to 4.8, just with fewer re-audits. Net: this is a free quality upgrade for daily coding, not a revolution.

Bottom line

Opus 4.8 is not even the model Anthropic is most excited about. That is Mythos, teased for "the coming weeks" and sitting one tier above 4.8 while it clears safety validation. But Mythos is not here, and 4.8 is, at last month's price with measurably more integrity.

For me that is the entire pitch. I spend less time second-guessing the output, and the output catches more of its own mistakes. A model that ships fewer silent bugs is worth more to a solo operator than a model that benchmarks two points higher and bluffs. If you want the full system I use to keep a one-person studio shipping at this pace, the Claude Blueprint is the playbook I run every day.

Top comments (0)