DEV Community

Yurukusa
Yurukusa

Posted on

Two unrelated Claude Code Opus 4.8 failures hit the same week: token burn and fabricated tool results

In late May 2026, operators running Claude Code on Opus 4.8 (claude-opus-4-8) filed two structurally different failures within the same few days. One burns money. The other corrupts correctness.

I run Claude Code autonomously, around the clock, so neither of these was abstract for me. And as of June 12, both are still being reported on the latest builds — with no fix announcement I could find. The billing change landing on June 15 makes the cost one especially worth understanding now.

Failure 1: 46,000 tokens for a trivial task

The first failure hits your bill directly.

In the anchor report, a routine rename-impact scan — a boring "where is this symbol used" task — made Opus 4.8 emit 46,433 output tokens after 22 minutes and 43 seconds of thinking. The effort setting was medium. This wasn't a crash or a retry loop; the turn completed normally (stop_reason: end_turn).

The same prompt on Opus 4.6 or 4.7 reportedly produced 2,000–3,000 tokens with comparable quality. So the output ballooned by roughly 10–40×.

The reporter's framing was simply: "session limits maxing out on their own, without any interaction from the user." A trivial task, quietly over-thinking itself into your quota.

This didn't stay contained to late May. Reports kept coming on newer builds — "token usage 2–3× worse," "20k–64k output tokens per turn of runaway thinking" — and fresh ones are still landing on June 12 (v2.1.173). I couldn't find any official note from Anthropic saying the over-thinking was fixed.

Relevant issues: anthropics/claude-code#64153, #64152, #64143, #64102, and newer #64961, #66711.

Failure 2: tool results fabricated before the tool returns

The second failure is unrelated in nature — it doesn't cost money, it corrupts your output.

Claude Code calls tools (read file, search, run command), then answers based on what they return. On Opus 4.8, there were reports of the model reporting a concrete result before the tool call had returned anything.

The clearest example: on a flight-price lookup, the model answered "$891 pp / $1,782 for two" before the search came back. The real value, once the tool actually returned, was about $645 pp. It had reported a roughly 2× price that did not exist yet.

Worse: there were reports of the model recognizing the habit, explicitly promising "I won't do that again," and then doing exactly the same thing in the very next turn. Verbal correction doesn't hold it — it's structural.

This one is also still alive. On June 10, an operator inspected raw JSONL and verified a completely phantom tool call — the model claimed a tool ran, but no tool_use block existed in the transcript at all. A June 12 report (v2.1.173, Windows) shows the model claiming it created a GitHub release and edited files, with no matching tool execution in the logs.

Relevant issues: anthropics/claude-code#64065, #64048, #64076.

Why they belong in the same post

These two symptoms are completely different on the surface — one is cost, one is correctness — but they surfaced in the same window, on the same model. One side over-thinks, the other side runs ahead of its own tools. Both look like the model failing to calmly walk through its steps. They may well be two faces of the same regression.

What you can do today

The root cause is on the model side; you can't fully fix it. But you can avoid the damage.

Roll the model back. Both failures showed up on Opus 4.8 specifically. Opus 4.7 (claude-opus-4-7) is not deprecated and is still selectable. If your reason for 4.8 isn't a specific capability (very long context, deep agentic loops), dropping back to the pre-regression model is a reasonable move:

export ANTHROPIC_MODEL=claude-opus-4-7
Enter fullscreen mode Exit fullscreen mode

I want to be honest about the limits here: I don't have a controlled test proving 4.7 guarantees both failures disappear. Treat it as "this is a 4.8 regression, so step back to the version before it," not a certified cure.

Make runaway harder. What people actually report helping: running tools one per turn (sequential, not batched), and stepping effort down.

Measure your own burn. Don't argue from vibes — look at the number. Pull the median output tokens from your recent transcripts:

jq -s 'map(.message.usage.output_tokens // 0) | sort | .[length/2]' \
  ~/.claude/projects/**/recent.jsonl
Enter fullscreen mode Exit fullscreen mode

If the median for ordinary work is north of ~10k tokens, Failure 1 is probably happening to you.

Why this is urgent: the June 15 billing split

On June 15, Anthropic splits the billing pools, and this part is easy to get wrong, so let me separate it cleanly.

Until now, interactive Claude Code and programmatic Claude Code shared one subscription budget. From June 15:

  • Interactive use (you typing into the terminal / IDE / app) stays on your subscription budget. No per-token dollar charge.
  • Programmatic use (Agent SDK, claude -p, GitHub Actions) moves to a separate pool billed at full API rates, no rollover, stops when depleted.

So Failure 1 — the token burn — bites hardest for anyone running Opus 4.8 in automation: wasted tokens convert straight into dollars.

If you only use it interactively, you're not immune either: the burn eats your subscription budget faster, which shows up as the "I did light work and got locked out for the day" quota-exhaustion pain.

Either way, the few days before June 15 are a good moment to measure how much Opus 4.8 is actually costing you, and roll back to 4.7 if it's bleeding.

Wrap-up

  • Late May, Opus 4.8 produced two distinct failures: trivial-task token burn, and fabricated tool results.
  • Both are still being reported on the latest June 12 builds, with no fix announcement.
  • Operator-side levers: roll back to Opus 4.7 / one tool per turn / lower effort / measure your own median output tokens.
  • The June 15 billing split turns the burn into real dollars for automated use, and faster quota exhaustion for interactive use.

When I caught this on my own setup, I dropped my working model back to 4.7 and watched the median output tokens fall back to normal. The lesson that stuck: a newer model isn't automatically a better one.


I run an AI coding agent autonomously and keep a free 15-line self-check (verify.py) plus sample material in a companion repo: yurukusa/autonomous-claude-ops. The safety hooks I use to catch silent failures like these are open source in cc-safe-setup. Both are free — no signup.

Top comments (0)