Mahima Thacker

Posted on May 29

You Changed One Line and Called It a Migration. Opus 4.8 Has Other Plans.

#ai #claude #model #opus

What actually changes in your backend when you move from Claude Opus 4.7 to 4.8, the 9 things the benchmark screenshots don't tell you.

Most of us ran the same play this week: open the config, swap claude-opus-4-7 for claude-opus-4-8, redeploy, see the request come back 200, move on.

The API contract didn't break, so it felt finished. It wasn't.

Anthropic's migration guide says code that ran on 4.7 keeps running on 4.8 with no changes, and that's true, but it's a statement about the request format, not about behavior.

Your output lengths, token bills, tool-call counts, retries, and where your agent decides to stop can all shift without a single error code.

So for anyone running Opus in production, 4.8 reads less like a "smarter model" headline and more like a list of changed defaults you should check before trusting it with real traffic.

Here's that list:

1. What "no breaking changes" actually covers

It covers the call succeeding. That's about it.
The same prompt can come back longer or shorter, spend more or fewer thinking tokens, call tools more or less often, and cost more or less per request, all while returning 200. If the only thing you checked after the swap is that the endpoint still responds, you haven't really tested the upgrade yet.

2. The effort default quietly dropped from xhigh to high

This is the one most likely to catch you, and almost nobody's mentioning it.

On Opus 4.7 the default effort was xhigh. On 4.8 it's high, on every surface, API and Claude Code included. If you upgrade and change nothing, your high-autonomy and coding agents may now reason less by default than they did the day before.

It's not a regression. high is Anthropic's chosen balance of quality and cost, and they recommend setting xhigh yourself for coding and agentic work. But it's still a behavior change riding along inside a "no breaking changes" release, and the only way you'll notice is by looking.

Two things worth knowing when you set it:

effort goes in output_config, not inside thinking. Put it in the wrong place and you get a validation error rather than a silent fallback.

xhigh isn't new in 4.8. It shipped with 4.7, where it was the default. The change in 4.8 is that you now have to ask for it.

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=64000,                    # leave room to think + act
    output_config={"effort": "xhigh"},   # not inside `thinking`
    messages=[...],
)

The levels run low, medium, high (default), xhigh, max. Use xhigh for coding and long-running agents, step down to medium or low for cheap simple calls, and reach for max only when your own evals show it buys you something over xhigh, since it spends tokens with no ceiling.

3. Adaptive thinking is something you turn on

A common misread is that 4.8 "thinks adaptively now" out of the box. It doesn't. Adaptive thinking is a mode you enable with thinking: {type: "adaptive"}, and it's off until you set it. Once it's on, the model decides per turn whether a step needs real reasoning or a direct answer.

This matters because agent loops mix trivial steps and hard ones. "Read this file" and "refactor this module under these three constraints" don't deserve the same thinking budget, and without adaptive thinking you were often paying for deep reasoning on the trivial steps anyway. With it enabled, 4.8 wastes fewer thinking tokens than 4.7 did at the same effort level.

At high (the default) and max, the model will almost always think. At lower levels it may skip thinking on easy turns. Effort and adaptive thinking are separate knobs that interact, so it's worth setting both deliberately rather than leaving one on its default.

4. 1M context is the default.

That's not a reason to stop retrieving.
Opus 4.8 serves the full 1M-token window by default now, with no beta header and no long-context surcharge. If your client still sends an old context-window beta header for backward compatibility, you can drop it.

What the bigger window doesn't fix: noisy retrieval, duplicate files, stale context, sloppy chunking, missing citations, slow prompts, or a bill that quietly tripled because you started stuffing whole repos into every call.

5. The change you'll actually feel is tool behaviour

If there's one capability improvement to care about, it's this.

Anthropic calls out two things in 4.8: better tool triggering (the model is less likely to skip a tool call the task needed, which some 4.7 users ran into) and better compaction handling (long agentic runs stay on task with fewer derailments after the context gets compacted).

In practice, production agents don't usually fail because the model is dumb. They fail like this:

it should have read the file, but guessed
it should have run the tests, but skipped them
it should have called the API, but assumed the result
it had the constraint earlier in the session, then drifted off it

4.8 is aimed at that class of failure. For agent systems, raw intelligence was rarely the limiting factor. Picking the right tool, calling it at the right moment, using the result, and not losing the thread after compaction is most of what decides whether the thing works, and that's hard to read off a benchmark table but easy to feel in a long session.

6. System messages mid-conversation

Opus 4.8 accepts role: "system" messages placed immediately after a user turn in the messages array (there are placement rules). Earlier models, 4.7 included, rejected that with a 400.

It looks minor and it isn't, because it gives you a way to re-steer a running agent without rebuilding the whole prompt history. The agent is halfway through a refactor, tests start failing, your harness drops in a system entry like "fix the failing tests before continuing," and the agent adjusts.

Since you're not rewriting earlier turns, you keep your prompt-cache hits. Use the top-level system field for instructions that apply from the start, and in-array system messages for course corrections during the run.

7. Prompt caching now works for shorter prompts

The minimum cacheable prompt length on 4.8 dropped to 1,024 tokens, down from 4.7's higher threshold. Prompts that were too short to cache before will cache now, with no code change.

It's worth a few minutes to split the stable parts of your prompt from the dynamic ones so the cache actually earns its keep:

stable, cache it: system prompt, tool definitions, repo map, schemas, policy text
dynamic, don't: the user's question, retrieved chunks, latest tool output

Easy to skip in a demo, expensive to skip in production.

8. Same sticker price, different real cost

Standard pricing is unchanged from 4.7: $5 / $25 per million input/output tokens. But the sticker price isn't the bill. What you actually pay depends on effort level, context size, output length, tool calls, retries, cache hits, and batch usage.

There's also fast mode (research preview, Claude API, speed: "fast"): up to about 2.5x the output tokens per second from the same model, at $10 / $50 per million. Worth being precise here, because it's easy to misread: fast mode is a premium, roughly 2x the standard rate, not a discount. It's only cheaper compared to fast mode on the older 4.6/4.7 models, which ran $30 / $150.

The honest framing is that fast mode finally costs something reasonable for latency-critical paths, not that it got cheap.

The real cost decision is routing: which requests need 4.8, which can go to a smaller model, which run in batch, which justify fast mode.

9. Don't let "new model, must be better" replace your evals

This is the part I'd argue for hardest, and the rest of this post is why. Each change above is a thing to actually measure on your own workload before you trust 4.8 with production traffic:

The effort drop (§2): compare your hardest agent task at high vs xhigh. If you upgraded and left the default, did reasoning depth or token use quietly fall?

tool triggering (§5): count skipped-but-needed tool calls across your agent traces, 4.7 vs 4.8. This is the one most likely to have improved, so it's the one worth proving.

caching (§7): did your cache hit-rate change now that the floor is 1,024 tokens?

cost and latency (§8): track output tokens, tool-call counts, retries, and p95, and watch which direction each moved.

The question isn't whether 4.8 is better in the abstract. It's whether it's better for your workload under your cost and latency budget, and that's the one your evals can answer.

Bottom line

For casual chat, 4.8 just feels like a slightly better Claude. For anyone building agents, coding tools, RAG, or automation, the interesting changes aren't in the headline number.

They're in the defaults and primitives: opt-in adaptive thinking, the effort default moving back to high,fewer skipped tool calls, calmer long-running agents, 1M context by default, mid-conversation system messages, a lower cache floor, and fast-mode pricing that makes more sense than it used to.

So treat the model swap as the first step, not the whole job. Re-set xhigh for your agents before you forget, re-baseline your prompt caching, and run your evals before you flip production over, not after the bill shows up.

If you've already done the 4.7 to 4.8 move and hit something I didn't list here, leave it in the comments. The silent-default stuff is usually what you only find out the hard way.

Sources: Anthropic's Opus 4.8 migration guide, the Effort and Adaptive Thinking API docs, and the Opus 4.8 announcement (May 28, 2026).

DEV Community