Simon Paxton

Originally published at novaknown.com

Claude Code Lost Its Thinking Budget, and It Broke Workflows

A few weeks ago, one of the most useful facts about Claude Code stopped being visible.

Not the code it wrote. The thinking behind it. And according to a GitHub issue filed by AMD AI director Stella Laurenzo, that change lined up with a sharp Claude Code regression in real engineering work: analysis across 6,852 sessions, 234,760 tool calls, and 17,871 thinking blocks found that as thinking redaction rolled from 1.5% to 100% over about a week, the tool shifted from research-first behavior to edit-first behavior and became much less trustworthy on complex tasks.

That is the real story here.

The easy version is "Anthropic shipped a bad update." The harder version is more useful: coding agents fail when you remove the invisible parts of the system users were relying on all along. If your team treats an AI coding assistant like a chatbot with better autocomplete, you will miss the failure until it lands in your repo.

Claude Code’s degradation is not just a bug

Laurenzo's issue makes a specific claim. Starting in February, Claude Code got worse at senior engineering work, not in a vague "feels worse" way but in its behavior: less code reading before edits, more whole-file rewrites, more instruction violations, more sessions that stopped mid-task, more cases where the model claimed completion without actually doing what was asked.

That matters because these are not cosmetic errors. They are workflow errors.

A model that reads a third as much code before editing is not just faster and sloppier. It is using a different strategy. In a large codebase, "read less, edit sooner" is the exact failure mode that turns a useful assistant into an intern who opens files at random and starts typing confidently.

TechRadar's report says Laurenzo also tracked stop-hook violations rising from zero in early March to roughly 10 per day later on. Even if you ignore every angry Reddit comment, that pattern is operationally ugly. A coding agent that breaks your process hooks is not misbehaving at the margin. It is breaking contract.

The most important line in the GitHub issue is not the complaint. It is the framing: extended thinking is load-bearing for senior engineering workflows.

That is a much bigger claim than "this release had regressions."

It means the thing users thought was optional polish was actually structural. Remove it, and the system doesn't degrade smoothly. It changes shape.

We've seen this pattern before in a different form. In AI Builds AI: How Anthropic’s Claude Codes Its Future, the interesting part wasn't that AI writes code. It was that once AI enters the production loop, the tool itself becomes part of the engineering system. That means product tweaks are no longer UX tweaks. They are changes to the reliability of a teammate.

Why thinking redaction changed the tool’s behavior

A lot of people hear "thinking redaction" and assume the problem is merely that users lost transparency.

I don't think that's the whole thing.

According to the AMD analysis, the rollout of redacted thinking content matched the quality regression almost exactly, with March 8 called out as the point where redacted thinking crossed 50% and the drop became obvious. Correlation is not proof of mechanism. But the mechanism here is not mysterious.

Reasoning budget is behavior budget.

When an agent has room to think, it tends to do boring senior-engineer things first: inspect files, cross-check conventions, look up prior patterns, trace dependencies, hesitate before changing a hot path. When that budget gets squeezed—or reallocated, or hidden behind product logic—the model doesn't just become "a bit less smart." It starts taking a different path through the task.

That is why the user complaints sound so strange and so familiar at the same time. "Ignores instructions." "Does the opposite." "Claims completion." Those are not random defects. They are what it looks like when a system is still fluent but no longer doing enough internal search before acting.

The Reddit "car wash test" gets at this in a crude but useful way. Ask whether you should walk or drive to a car wash 50 feet away. A thin reasoning pass grabs the nearest heuristic—walk, because it is close. A thicker one notices the missing constraint: the car has to be there too. Same language. Different search depth.

Coding agents live or die on that difference.

This is also why hidden reasoning is dangerous even when the output still sounds polished. A model can produce clean prose, valid syntax, even plausible diffs while having done much less internal work. Builders often catch that late, because style survives longer than judgment.

The best objection is that redaction should only hide the chain of thought, not reduce the model's actual reasoning. In theory, yes. In practice, product systems are messy. Once you start optimizing for latency, cost, privacy, abuse prevention, and capacity at once, "hide the thoughts" can easily become "use fewer of them," or "route differently under load," or "compress the scratchpad until the behavior changes." Users don't need the exact internal implementation to notice the result.

They noticed.

The real lesson: coding agents need visible reasoning

Not visible in the sense that every token of chain-of-thought should be dumped to the user.

Visible in the sense that the budget and mode need to be inspectable.

If a coding agent is running with shallow reasoning, teams should know. If it is under capacity pressure, teams should know. If a session is being served with a smaller effective budget than yesterday, teams should know. Right now many tools present all of those states with the same glossy interface, which is absurd.

Imagine a CI system that silently ran only half your tests when load spiked.

You would not call that a nice product simplification. You would call it a broken contract.

That is what happened here. The outrage is not really about hidden thoughts. It is about hidden operating conditions.

This is why benchmark discourse keeps missing the point. Benchmarks mostly ask, "Can the model solve this task?" Teams need to ask, "Under what reasoning budget, tool policy, latency target, and failure mode did it solve it?" Those are different questions. The second one predicts whether your sprint survives contact with production.

We already have a close analogy in data systems. In AI Model Collapse Is Happening: Treat Data as Code Now, the practical lesson was that invisible upstream changes poison downstream trust. This is the same pattern, except the hidden dependency is not data provenance. It is inference provenance.

And once you see that, a lot of "model quality feels unstable" complaints stop sounding emotional. They sound like ops.
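One lightweight way to treat inference provenance as ops data is to record the serving conditions alongside every agent session, the same way you would log a build environment. The sketch below is a minimal illustration under assumptions: the field names and the `reasoning_mode` / `thinking_tokens` values are hypothetical, since most tools do not expose them yet, which is exactly the complaint.

```python
from dataclasses import dataclass, asdict
import json
import time


@dataclass
class InferenceProvenance:
    """Hypothetical per-session record of the conditions an answer was produced under."""
    session_id: str
    model_id: str          # exact model/version string the vendor reports
    reasoning_mode: str    # e.g. "extended" or "standard", if the tool exposes it
    thinking_tokens: int   # observed thinking budget if visible, else -1
    latency_ms: float
    timestamp: float


def log_provenance(record: InferenceProvenance, path: str = "provenance.jsonl") -> None:
    # Append-only JSONL log: cheap to write, easy to diff day over day.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")


record = InferenceProvenance(
    session_id="s-001",
    model_id="vendor-model-2025-03",  # placeholder, not a real model name
    reasoning_mode="extended",
    thinking_tokens=4096,
    latency_ms=1820.0,
    timestamp=time.time(),
)
log_provenance(record)
```

The point of an append-only log like this is that a regression stops being a vibe: if sessions with `reasoning_mode="standard"` start appearing where `"extended"` used to, you have a timestamped record instead of a Reddit thread.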

What teams should trust less, and what to measure instead

Teams should trust vibes less.

That sounds obvious. It isn't. Most teams still evaluate an AI coding assistant the way consumers evaluate headphones: use it for a week, notice whether it feels sharp, then decide whether to renew. That is fine for drafting emails. It is reckless for tools making edits in a live codebase.

What should replace vibes is simple instrumentation.

Here is the minimum useful set:

| Metric | What it catches | Why it matters |
| --- | --- | --- |
| Read-before-write ratio | Edit-first drift | Tells you whether the agent is inspecting enough context before changing code |
| Whole-file rewrite rate | Loss of local precision | Spikes when the model stops making surgical edits |
| Task abandonment / stop-hook violations | Workflow instability | Catches the point where the tool stops being reliable in automation |
| Instruction adherence | Planning collapse | Measures whether the model actually followed the requested constraints |
| Time-to-useful-diff | False efficiency | Separates "fast output" from "fast progress" |

A table like this is boring.

That is why it works.

If you track these across versions, providers, and time of day, you can spot regression before your developers turn into amateur anthropologists of model mood. You can also compare tools on the thing that matters: not benchmark brilliance, but operational steadiness.
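Computing the first two metrics does not require vendor cooperation if you already capture tool-call logs. A minimal sketch, assuming each event is a dict with a `tool` field and edits carry a `rewrote_whole_file` flag; the tool names and schema here are illustrative, not Claude Code's actual ones.

```python
def read_before_write_ratio(events: list[dict]) -> float:
    """Ratio of read-style to edit-style tool calls in one session.

    The tool-name buckets are illustrative; map them to whatever
    your agent's logs actually call these operations.
    """
    reads = sum(1 for e in events if e["tool"] in {"read_file", "grep", "list_dir"})
    writes = sum(1 for e in events if e["tool"] in {"edit_file", "write_file"})
    return reads / writes if writes else float("inf")


def whole_file_rewrite_rate(events: list[dict]) -> float:
    """Fraction of edits that replaced an entire file rather than a span."""
    edits = [e for e in events if e["tool"] in {"edit_file", "write_file"}]
    if not edits:
        return 0.0
    return sum(1 for e in edits if e.get("rewrote_whole_file")) / len(edits)


# A toy session: three reads, then two edits, one of them a full rewrite.
session = [
    {"tool": "read_file"},
    {"tool": "grep"},
    {"tool": "read_file"},
    {"tool": "edit_file", "rewrote_whole_file": False},
    {"tool": "write_file", "rewrote_whole_file": True},
]
print(read_before_write_ratio(session))  # 3 reads / 2 writes = 1.5
print(whole_file_rewrite_rate(session))  # 1 of 2 edits = 0.5
```

Track these per session, bucket by model version and time of day, and an edit-first drift like the one Laurenzo reported shows up as a trend line rather than a feeling.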

This is where Reduce LLM Hallucinations? Why ‘Make-No-Mistakes’ Fails becomes relevant. The trap is trying to force a model into never making visible mistakes, then being surprised when it gets evasive or shallow. Systems tuned to look safe often become worse at the kind of exploratory reasoning complex work needs. You can push errors out of sight without reducing them.

AMD's reported switch away from Claude Code is significant for exactly this reason. Not because one vendor lost one customer. But because a sophisticated team decided the adoption risk was higher than the switching cost.

That is the threshold vendors should fear.

Once teams conclude that quality depends on invisible throttles they cannot inspect, they stop building process around the tool. And the moment that happens, the vendor loses the compounding advantage of workflow lock-in.

Claude Code is a warning for every agent product

It would be comforting to treat this as a Claude-specific stumble.

I don't think it is.

Every agent product is built on hidden budgets: context windows, reasoning tokens, routing policies, tool permissions, rate limits, latency targets, safety filters, fallback models, queue pressure. Users mostly see one box and one promise. But underneath is a negotiated truce between capability and cost.

That truce breaks under growth.

The Reddit thread is full of guesses about capacity constraints and silent downgrades. Those specific explanations are unverified. The broader pattern is not. As products scale, the pressure to smooth demand and reduce expensive inference rises. If the easiest way to do that is to quietly narrow reasoning depth while preserving surface fluency, many companies will be tempted to do it.

That is why the Claude Code regression should be read as a design warning, not just a brand embarrassment.

Agent products need service-level concepts for reasoning. Not just uptime. Not just latency. They need something like: minimum reasoning mode, visible tool policy, stable model identity, and regression reporting that maps to actual workflows. If vendors won't expose that, enterprises will either instrument around them or move to providers that do.

And some will go local faster than expected.

Not because local models are better today. Because a worse model you can control is often more useful than a better one whose cognition changes behind glass.

Key Takeaways

  • Claude Code appears to have regressed in complex engineering workflows, and AMD's GitHub issue gives that claim concrete behavioral evidence rather than just frustration.
  • The important failure is not "the model got dumber." It is that reduced or hidden reasoning changed the agent from research-first to edit-first behavior.
  • Coding agents are operational systems. Hidden shifts in reasoning budget, tool policy, or routing can break workflows even when outputs still look polished.
  • Teams should measure read-before-write ratio, rewrite rate, instruction adherence, and stop-hook violations instead of trusting general impressions.
  • The broader risk is vendor opacity: once reasoning conditions become unstable and invisible, adoption risk rises faster than benchmark scores can compensate.

Further Reading

  • GitHub issue #42796 — Primary source from AMD AI director Stella Laurenzo with session-analysis claims about Claude Code regression.
  • TechRadar Pro report — Secondary reporting summarizing the data and the reliability complaints.
  • PC Gamer report — A clear rundown of the regression claims and why they matter to developers.
  • Reddit discussion — Useful for seeing how power users are interpreting the behavioral shift, though the rhetoric goes well beyond what reporting verifies.

The next generation of coding agents will not win by sounding smarter. They will win by making their reasoning constraints visible enough that teams can trust them when the repo is on fire.
