LayerZero

Posted on May 28

Claude Opus 4.8 didn't raise the price. It raised the default. Here's what `effort=high` does to your bill.

#claudecode #anthropic #ai #agents

Anthropic shipped Claude Opus 4.8 on Thursday. The price didn't move: $5 per million input tokens, $25 per million output, same as 4.7.

Then they changed one default. effort now ships set to high — on the API, in Claude Code, in the web app, everywhere.

Your per-token price is flat. Your per-task token count is not. Open your dashboard Monday and you'll see it.

What actually shipped

Here's what landed on May 28, 2026, stripped of the launch-post adjectives:

Same headline price. Opus 4.8 is $5/M input and $25/M output — identical to 4.7. Anthropic led with this, and it's true.
Fast mode repriced. Fast mode runs at 2.5× the output speed and costs $10/M input, $50/M output. Anthropic's framing: "3× cheaper than fast mode was for previous models." Read that again — it's 3× cheaper than the old fast mode, not 3× cheaper than standard. Fast mode is still 2× the price of standard Opus.
effort defaults to high. This is the buried one. The effort parameter — high, xhigh, max — controls how many reasoning tokens the model spends before it answers. On 4.8 it defaults to high on every surface. You can set it down. The default does not.
Dynamic Workflows (research preview). Claude can now plan a task and spawn "hundreds of parallel subagents in a single session," pitched at "codebase-scale migrations across hundreds of thousands of lines of code."
A Messages API change. You can now inject system entries mid-array, mid-task, without breaking the prompt cache. One line in the changelog. It's the most quietly useful thing in the release.
Honesty. Anthropic says 4.8 is "around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked," and broadly less likely to bluff.
Benchmarks. Terminal-Bench 2.1: 86.5%. OSWorld-Verified: 84.0%. Finance Agent v2: 72.4%. Online-Mind2Web: 84%.
The competitive framing. Anthropic claims 4.8 is the only model to complete every case end-to-end on its Super-Agent benchmark, "beating GPT-5.5 at parity on cost." The phrase "at parity on cost" is doing real work — the pitch is no longer "smarter," it's "smarter for the same dollar." That's a tell about where the whole market is now competing.

That's the release. Now the part the launch post won't do for you: the math.

Why this lands on your invoice, not your changelog

"Same price as 4.7" is the headline. It's also the misdirection.

Price is dollars per token. Your bill is dollars per token, times tokens per task, times tasks per month. Anthropic froze the first number and raised the second one for you, by default.

effort=high means more reasoning tokens per call. Those are output tokens. Output tokens are the $25 side of the meter, not the $5 side. A task that cost you 4,000 output tokens of thinking on a lower effort setting can cost 12,000 on high — same model, same prompt, same price-per-token, 3× the line item.

Run it as a number, because that's the only way this argument is honest. Say you run a support-triage product: 500,000 agent calls a month, each with a 2,000-token cached prompt and a roughly 800-token answer. On a medium-equivalent reasoning budget, call it 1,500 output tokens per call all-in. At $25/M output that's 500,000 × 1,500 / 1,000,000 × $25 = $18,750/month on the output side. Flip every one of those calls to high and the reasoning budget jumps — say 4,000 output tokens per call. Same arithmetic: 500,000 × 4,000 / 1,000,000 × $25 = $50,000/month. You did not change a line of code. You did not change the model. You changed nothing — and your output bill went from ~$19k to ~$50k because the default moved under you. That $31k/month delta is the entire subject of this post.

Here's where you sit today:

You run Opus through the API in a product. Your unit economics just changed and you didn't ship anything. Every customer request now defaults to high effort.
You run Claude Code on a team. Every developer's every prompt now defaults to high effort. Multiply by headcount and working days.
You were about to turn on Dynamic Workflows. Hundreds of subagents is hundreds of parallel billing streams. Read the next section before you flip it.
You're a CTO who approved a Claude budget in Q1. That budget was sized on 4.7 defaults. It's now wrong.

The model got better. That part is real and I'll defend it. But "better" arrived bundled with "more expensive per task," and the bundle is invisible unless you read past the headline.

(LayerZero writes for people running AI in production, not testing it on weekends. One post a week. Subscribe if you ship.)

The mechanism: three changes that move your token count

Understand them in order.

1. The effort parameter is a token multiplier with a dial.

effort controls the reasoning budget — how long the model thinks before it commits to an answer. Higher effort, more thinking tokens, better answers on hard tasks, and a bigger output-token bill on every task, hard or trivial.

The trap is that high is now the floor you start from, not a setting you opted into:

# 4.8 behavior: high is the default; you have to lower it
client.messages.create(
    model="claude-opus-4-8",
    max_tokens=4096,
    effort="high",   # <- this is now implicit if you omit it
    messages=[...],
)

For a "classify this ticket into one of five buckets" call, high effort is pure waste — you're paying for a paragraph of reasoning to produce a one-word answer. For a "plan this migration" call, it's worth every token. The model can't tell the difference. You have to.

The asymmetry is the whole game. On a hard task, the marginal reasoning tokens buy you a real accuracy gain — that's the case Anthropic tuned the default around, and they're right that most hard tasks want high. But production traffic isn't mostly hard tasks. It's mostly classify, extract, route, summarize — high-volume, low-stakes calls where the extra reasoning changes the answer in well under 1% of cases and changes the bill in 100% of them. The default optimizes for the 5% of your calls that are hard and taxes the 95% that aren't. That's a fine default for a research demo and a terrible one for a high-volume product, which is exactly why you can't leave it implicit.

2. Dynamic Workflows multiply tasks, not just tokens.

A single Dynamic Workflow session can fan out into hundreds of subagents. Each subagent is its own context, its own reasoning budget, its own meter. The pitch — migrate hundreds of thousands of lines of code in one session — is real and genuinely impressive. It is also a billing pattern you have never had before: one human action, hundreds of parallel agent invocations, all defaulting to high effort.

If the task genuinely parallelizes — a mechanical migration across 400 files — this is a bargain versus a human doing it. If it doesn't — you fanned out 200 subagents to "explore the codebase" and 180 of them re-read the same five files — you just paid for 180 redundant context loads to get the answer one agent would have found.

3. The Messages API cache change quietly lowers cost — if you use it.

This is the one change that cuts the bill instead of raising it. You can now insert system entries mid-conversation without invalidating the prompt cache:

# Before: injecting an instruction mid-task busted the cache,
# re-billing the full prefix at the uncached input rate.
# After: append a system entry in-array, keep the cache hit.
messages.append({
    "role": "system",
    "content": "Constraint update: the user is now on the EU data plane. "
               "Do not call tools that route through us-east.",
})
# prompt cache stays warm; you pay cached-input rates on the prefix

For any long-running agent that updates its own instructions mid-task — which is most production agents — this is a real cut to the input side of the meter. Almost nobody will notice it, because it's one line in the changelog and it doesn't have a demo.

The opposing view: "the smart default is smart"

The reasonable counter, and you'll hear it from someone on your team by Tuesday:

"You're complaining that the smart default is smart. high effort gives better answers, the model's four times less likely to ship a bug, and fast mode is cheaper than it's ever been. Anthropic tuned the defaults for quality. Stop optimizing for a 30% token saving on tasks that don't matter."

This isn't wrong. For a lot of teams, the right move is to leave high on and ship better output. If your Opus spend is $400/month, chasing effort tuning is a waste of an engineer's afternoon — the juice isn't worth the squeeze, and the quality bump is free leverage.

And the honesty improvement is not marketing fluff. "Four times less likely to let its own code flaws pass" is the kind of change that shows up as fewer 2am incidents, and that's worth more than the token delta to most teams. If high effort is part of what produces that honesty gain — and it almost certainly is, since more reasoning is how the model catches its own mistakes — then turning effort down on your code-review path to save tokens is penny-wise and incident-foolish. The counter-argument has a real point here: there are paths where you want the expensive default, and code generation is the obvious one.

But two things stay true. First, the benchmark numbers — 86.5% on Terminal-Bench, 84% on OSWorld — are Anthropic's evals on Anthropic's task mix. They are a reason to test, not a reason to trust. The pre-launch skeptics who said "treat the claims as unconfirmed until you run your own evals" were right then, and they're right now; the only thing that changed is the claims are official. Second, "the default is smart" and "the default is free" are different sentences. The default is smart. It is not free. The teams that get hurt are the ones who hear the first and assume the second.

The playbook: five moves, in order

What I'd do on Monday.

1. Pin effort per task class, not per app

Stop letting high be implicit. Route effort by what the task is worth:

EFFORT_BY_TASK = {
    "classify":    "low",     # one-word answer, no reasoning needed
    "extract":     "low",
    "summarize":   "medium",
    "code_review": "high",    # worth the thinking tokens
    "migration":   "max",     # rare, high-stakes, parallelized
}

def call(task_type, messages):
    return client.messages.create(
        model="claude-opus-4-8",
        effort=EFFORT_BY_TASK.get(task_type, "medium"),
        max_tokens=4096,
        messages=messages,
    )

This one table is the highest-leverage change in this post. Most production traffic is classify/extract/summarize, and most of it does not need high. Pinning effort by class is where the bill actually moves.

Two notes that save you a week. First, make the default in .get() medium, not high — so a task type someone forgot to register degrades to "reasonable," not "expensive." The implicit failure mode should be cheap. Second, log the task_type alongside token usage in whatever you use for spend tracking. When the bill moves, you want to answer "which task class moved it" in one query, not one afternoon. The teams that survive a cost spike are the ones who can attribute spend to a task type; the ones who can't spend the spike and the investigation.

2. Cap Dynamic Workflows before you enable them

Treat subagent fan-out like a recursive function with no base case: put a ceiling on it before it runs in prod. Cap the subagent count, scope the file set explicitly, and log per-subagent token spend so a runaway shows up in your dashboard, not your invoice. If your harness doesn't expose a subagent cap yet, don't enable Dynamic Workflows on production credentials until it does.

Do the back-of-envelope before the first run. A 200-subagent fan-out, each subagent burning ~30,000 tokens of context plus reasoning at high, is 6 million tokens for one session — call it $30–$150 depending on the input/output split, for a single human "go." That's cheap if it migrated 400 files you'd have paid an engineer two days to touch. It's a fire if it was a glorified search you could have done with one agent and a grep. The feature is priced like a power tool, and like a power tool it removes a finger when you point it at the wrong job. Set the cap to the number of genuinely independent units of work, not to "however many it wants."

3. Decide fast mode with arithmetic, not vibes

Fast mode is 2× the standard price for 2.5× the speed. The question is never "is fast mode worth it" in the abstract — it's "is this specific latency worth 2× the tokens." For an interactive coding session where a developer is blocked and waiting, 2× cost to unblock a $150k engineer 2.5× faster is trivially worth it: the engineer's loaded hourly rate dwarfs the token delta, and the math isn't close. For an overnight batch job that no human is watching, paying 2× for speed nobody experiences is setting money on fire. Tag your workloads interactive or batch and let that decide, not the developer who likes the snappy feel.

The trap inside the trap is Anthropic's framing. "3× cheaper than fast mode used to be" is true and irrelevant to your decision — you're not choosing between today's fast mode and last year's, you're choosing between fast and standard today, and today fast is 2× standard. Anchor on the comparison you're actually making, not the one in the launch post. The historical-discount framing is designed to make fast mode feel like the default; resist it. Fast mode is an opt-in for latency-sensitive paths, not a free upgrade.

4. Run your own eval before you trust the honesty number

"4× less likely to let code flaws pass" is a claim about Anthropic's test set. Before you remove a human review step because the model is "more honest now," run your own regression set — your code, your failure modes — and measure the delta yourself. If you don't have an eval set, that's the project, not the model upgrade. The cheapest version of this: take the last 50 bugs that shipped past your current review process, feed each diff to 4.8 at high, and count how many it flags. If it catches 40 of 50, you have a real second reviewer and can reallocate human attention. If it catches 12, the honesty number doesn't transfer to your codebase and you just saved yourself a very expensive false sense of security.

5. Adopt the mid-array system cache change

If you run a long-lived agent, refactor mid-task instruction updates to use in-array system entries instead of restarting the conversation. This is a straight cost cut on the input side with no downside. Most teams won't even need a real refactor — it's a one-line difference in how you append the message, not a rearchitecture. It's the rare change that's all upside — take it.

If you ship interactive developer tooling, leave effort high and pin fast mode by workload (moves 1 and 3). If you ship a high-volume API product, pin effort low-by-default and cap workflows hard (moves 1 and 2). Same release, opposite playbook.

(If your Claude bill jumped this week, the effort default is the first place to look.)

When it breaks

The playbook closes most of the gap. Three places it doesn't.

max effort on a tight loop. Someone sets effort="max" because "max is best," wires it into a retry loop, and a transient tool error triggers three max-effort retries per request. The bill spikes 9×, the dashboard shows "normal request volume," and you spend a day finding it. Mitigation: ban max outside of explicitly human-triggered, rate-limited paths.
Dynamic Workflows on a non-parallel task. You point hundreds of subagents at a problem that's actually sequential — each step depends on the last. They can't parallelize the dependency, so they thrash, re-read, and burn tokens producing a worse answer than one focused agent. Mitigation: only fan out when the work is genuinely independent across units. If step N needs step N-1's output, subagents are the wrong tool.
Trusting the honesty delta on out-of-distribution code. The 4× number is on Anthropic's eval mix. On your weird legacy COBOL-to-Kotlin bridge, the delta may be smaller or gone. Mitigation: the honesty improvement lowers your review burden; it doesn't remove it. Keep a human in the loop on the code paths where a missed flaw costs you a customer.
The effort default drifting back in after you pin it. You pin effort everywhere, the bill drops, everyone moves on. Three months later someone adds a new endpoint, copies a snippet that omits the parameter, and that path silently runs at high. One forgotten call site doesn't move the monthly number enough to notice — until that endpoint goes viral and it's 60% of your traffic. Mitigation: enforce it in code, not discipline. A thin wrapper that requires an explicit effort argument and refuses to call the API without one turns "we forgot" into a failed lint, not a surprise invoice. Make the cheap path the only path that compiles.

The non-obvious takeaway

For two years the model release cycle was a price war. Each version, more capability per dollar, and the per-token number kept falling. We all learned to read a release by checking the price line.

4.8 ends that frame. The per-token price didn't move — it can't keep falling forever, and Anthropic just told you so by holding it flat. The competition moved up a layer: from price-per-token to tokens-per-task, and the effort parameter is the lever.

The cost story of 2026 is not "which model is cheaper per token." It's "who tuned their effort budget to the task."

Look at the Super-Agent claim again through this lens: "beats GPT-5.5 at parity on cost." Anthropic isn't selling you a smarter model anymore. It's selling you the same dollar spent better. When the vendor's own headline metric is denominated in cost-parity, the era of "just wait for the next model to get cheaper per token" is publicly, officially over. They told you. Most people read past it.

Here's the bet I'll defend in 90 days: by Q3 2026, every serious model provider ships an effort-style dial, and "effort tuning" becomes a named skill the way "prompt engineering" was in 2023 and "eval engineering" is becoming now. The teams that win on margin won't be the ones on the cheapest model. They'll be the ones who routed low effort to 70% of their traffic and saved max for the 5% that earns it. The per-token price war is over. The per-task spend war just started, and most teams don't know they're in it.

This week

Three things before Friday.

Grep your codebase for claude-opus-4 calls. Add an explicit effort to every one. Don't leave a single implicit high in production. The act of choosing forces the question "what is this task worth," which is the question that moves the bill.
Pull your last 7 days of Opus spend and project it forward at the new default. If you're on 4.8 already, compare this week to last. If the output-token line jumped, you found your effort problem. Bring the number to whoever owns the budget before they find it themselves.
Do not enable Dynamic Workflows on production credentials until you've set a subagent cap and a per-session token ceiling. Try it on a sandbox key first. Watch one real migration. Then decide.

Bet your CFO reads the effort config before they read the model card. Build for that reader.

DEV Community