Luhui Dev

Posted on Jun 12

Claude Code Incident Review: What Anthropic's Three Production Bugs Teach Agent Engineers

#luhuidev #ai #claude

Intro

Last month, Anthropic published a rare kind of incident review.

The rare part was not that they had bugs. If you build large-model products, bugs are part of the deal.

The rare part was that they wrote up three production incidents in detail: how each one was introduced, why testing missed it, why it was hard to reproduce internally, and what they changed afterward.

After reading it, I think the review is worth studying closely. If you build LLM Agents, especially systems with multi-turn tasks, tool calls, context compression, and reasoning trace management, these failures are not edge cases. They are waiting for you down the road.

Three Bugs, Three Failure Modes

Here is the short version.

Bug one: On March 4, to address occasional UI freezes in Opus 4.6 under high reasoning mode, the team changed the default reasoning effort from high to medium.

Internal testing looked fine: intelligence dropped only slightly, and latency improved a lot.

After launch, users pushed back hard: Claude felt dumber.

On April 7, about a month later, the change was rolled back.

Bug two: On March 26, Anthropic shipped a cache optimization. The idea was simple: after a session had been idle for more than an hour, clear old thinking history to reduce the cost of resuming.

Sounds reasonable, right?

The production implementation had a bug. It was supposed to clear the old thinking once. Instead, it kept clearing it on every later turn.

So Claude kept working while repeatedly losing the memory of why it was doing the work. Users saw forgetting, repetition, strange tool calls, and increasingly odd behavior.

Worse, once thinking blocks kept disappearing, each request became a cache miss and burned through usage limits faster.

The root cause was not identified until April 10, two full weeks later.
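To make the "clear once" versus "clear every turn" difference concrete, here is a minimal sketch. It is not Anthropic's code; the review as summarized here does not show the implementation. `Session`, `begin_turn_buggy`, `begin_turn_fixed`, and the sticky `resumed_from_idle` flag are all hypothetical names, and the sticky flag is just one plausible way such a bug could arise.

```python
from dataclasses import dataclass, field

IDLE_LIMIT_SECONDS = 3600  # the "idle for more than an hour" threshold


@dataclass
class Session:
    # Each message is (kind, text); kind is "thinking", "user", or "assistant".
    messages: list = field(default_factory=list)
    last_active: float = 0.0
    resumed_from_idle: bool = False  # hypothetical sticky flag


def drop_thinking(session):
    session.messages = [m for m in session.messages if m[0] != "thinking"]


def begin_turn_buggy(session, now):
    if now - session.last_active > IDLE_LIMIT_SECONDS:
        session.resumed_from_idle = True
    if session.resumed_from_idle:
        # The flag is never reset, so this fires on every later turn.
        drop_thinking(session)
    session.last_active = now


def begin_turn_fixed(session, now):
    if now - session.last_active > IDLE_LIMIT_SECONDS:
        # Clear exactly once, at the moment the idle session resumes.
        drop_thinking(session)
    session.last_active = now


def run(begin_turn):
    """Resume after a long idle, then take two more turns.

    Returns how many thinking blocks exist after each turn.
    """
    s = Session(messages=[("thinking", "old plan")], last_active=0.0)
    counts = []
    for now in (4000.0, 4010.0, 4020.0):  # the first turn is the idle resume
        begin_turn(s, now)
        s.messages.append(("thinking", f"plan at {now}"))
        counts.append(sum(1 for m in s.messages if m[0] == "thinking"))
    return counts


print(run(begin_turn_buggy))  # [1, 1, 1]: reasoning never accumulates
print(run(begin_turn_fixed))  # [1, 2, 3]: reasoning builds up after the resume
```

In the buggy variant the Agent only ever sees the thinking from its current turn, which matches the symptom described above: it keeps working but cannot remember why.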

Bug three: On April 16, to reduce verbose output from Opus 4.7, the team added a line to the system prompt:

Use no more than 25 words between tool calls, and no more than 100 words in the final response.

That prompt line showed no obvious regression during weeks of internal testing.

After launch, coding quality dropped by 3%.

On April 20, it was rolled back.

These three bugs look different, but they point to the same issue: in an Agent system, things that look local, such as parameters, caches, and prompt lines, can still affect the core execution logic.

Touch them, and you may be touching the model's brain.

Reasoning History Is Working Memory, Not a Log

The second bug is the one I keep coming back to.

"Clear old thinking to save tokens" is a perfectly normal engineering optimization. Thinking blocks are long and expensive. If a session has been idle for an hour, the old reasoning chain can look less important.

But that is exactly the trap.

For an Agent, the reasoning trace is not just a log. It does not merely record what happened. Its more important job is to preserve why the Agent made earlier decisions.

That why is what lets a multi-turn task keep moving.

When it disappears, the Agent does not crash immediately. It can still talk, call tools, and return results. But it has already started forgetting.

It forgets which paths were ruled out, why the current path was chosen, and what problem the user was actually trying to solve.

The result is a nasty kind of degradation: nothing breaks outright, but output quality gets steadily worse and the task drifts.

This class of bug is painful because it is not a crash. It does not give you a clean stack trace. It slowly shows up in production as a feeling that the Agent has become strangely bad.

So context management cannot be a blunt token-count cut.

At minimum, we need to separate three categories:

Do not casually compress: decision rationale, task intent, hard constraints, reasoning path.
Can compress: intermediate observations, tool outputs, process material.
Can drop: formatting helpers, redundant explanations, temporary display content.
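The three categories above could be wired into a trimming pass roughly like this. It is a sketch under my own assumptions: the `Retention` enum, `trim_context`, and the character-count budget are hypothetical, and `summarize` is a truncation placeholder standing in for a real summarizer.

```python
from enum import Enum


class Retention(Enum):
    KEEP = "keep"          # decision rationale, task intent, hard constraints, reasoning path
    COMPRESS = "compress"  # intermediate observations, tool outputs, process material
    DROP = "drop"          # formatting helpers, redundant explanations, temporary display content


def summarize(text, limit=40):
    """Placeholder compressor: truncation stands in for a real summarizer."""
    return text if len(text) <= limit else text[:limit] + "..."


def trim_context(items, budget):
    """Trim a list of (Retention, text) pairs toward a character budget.

    Order of operations: drop disposable items first, then compress
    compressible ones. KEEP items are never modified, even if the result
    is still over budget; that case is left for the caller to handle.
    """
    def size(entries):
        return sum(len(text) for _, text in entries)

    if size(items) <= budget:
        return list(items)

    trimmed = [(r, t) for r, t in items if r is not Retention.DROP]
    if size(trimmed) <= budget:
        return trimmed

    return [
        (r, summarize(t)) if r is Retention.COMPRESS else (r, t)
        for r, t in trimmed
    ]


context = [
    (Retention.KEEP, "Goal: fix the flaky login test without changing the public API."),
    (Retention.COMPRESS, "tool output: " + "x" * 200),
    (Retention.DROP, "Rendering a progress bar..."),
]
result = trim_context(context, budget=100)
```

The design choice that matters is the last branch: when the budget still is not met, the function refuses to touch KEEP items rather than quietly cutting the rationale.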

Reasoning history is not cache garbage. In many cases, it is the Agent's working memory.

You may think you are saving tokens. You may actually be removing the part of the system that lets the Agent stay on task.

Every Prompt Line Is Code

The third bug is just as important.

How can adding one line that says "say less" reduce coding quality?

Because in model behavior, less output and less thinking are not always separate things.

If you require the final answer to be under 100 words and text between tool calls to be under 25 words, the model may not only compress expression. It may compress the decision process.

This is not a traditional bug. The model is sincerely optimizing for the target you gave it.

That is why Anthropic's follow-up discipline matters: every system prompt change should be ablated per model; if a change can be tested line by line, test it line by line; and changes that may affect intelligence need a gradual rollout and a soak period.
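A per-model, line-by-line ablation can be sketched as a leave-one-out loop. This is my own illustration, not Anthropic's tooling: `ablations`, `ablate_per_model`, and the `score(model, prompt)` callback are hypothetical, and `fake_score` is an invented stub so the example runs without a real eval suite.

```python
def ablations(prompt_lines):
    """Yield (removed_line, variant_prompt) with exactly one line left out."""
    for i, line in enumerate(prompt_lines):
        yield line, "\n".join(prompt_lines[:i] + prompt_lines[i + 1:])


def ablate_per_model(prompt_lines, models, score):
    """Compare each leave-one-out variant against the full prompt, per model.

    score(model, prompt) -> float is a caller-supplied eval; higher is better.
    Returns {(model, removed_line): delta}, where delta is the variant score
    minus the full-prompt score. A positive delta means the removed line was
    costing that model quality.
    """
    full = "\n".join(prompt_lines)
    report = {}
    for model in models:
        baseline = score(model, full)
        for removed, variant in ablations(prompt_lines):
            report[(model, removed)] = score(model, variant) - baseline
    return report


lines = [
    "You are a coding agent.",
    "Use no more than 25 words between tool calls.",
]


def fake_score(model, prompt):
    # Invented stub: pretends the word limit costs "model-b" 5 points.
    penalty = 5.0 if model == "model-b" and "25 words" in prompt else 0.0
    return 80.0 - penalty


report = ablate_per_model(lines, ["model-a", "model-b"], fake_score)
print(report[("model-b", lines[1])])  # 5.0: removing the line helps model-b
print(report[("model-a", lines[1])])  # 0.0: model-a is unaffected
```

The per-model loop is the point. A line that is harmless for one model can be expensive for another, and an average across models would hide that.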

That sounds heavy.

But if you accept that prompts are production code, the discipline is not excessive.

Clean Test Environments Are Unlike Production

There is a common Agent engineering problem that people do not like to face:

The test environment is too clean.

So clean that it stops looking like production.

Many Agent failures are not as simple as "input A produces wrong output B." They depend on a sequence of states:

session idle for more than one hour
resume
continue multi-turn tool use
thinking gets cleared
enter the next turn
thinking gets cleared again

That kind of state sequence is hard to cover with unit tests and easy to miss in e2e tests.
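One way to bring a sequence like this into a test is to inject a clock and replay the whole path, not just a single turn. The sketch below is an assumption-heavy illustration: `FakeClock`, `AgentSession`, and `replay_idle_resume` are hypothetical names, and the session is a toy stand-in for a real Agent runtime.

```python
class FakeClock:
    """Test clock that only moves when the test says so."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class AgentSession:
    """Toy session under test: clears old thinking once on idle resume."""

    IDLE_LIMIT = 3600.0

    def __init__(self, clock):
        self.clock = clock
        self.thinking = []
        self.last_active = clock.now
        self.clear_count = 0  # instrumentation so the test can count clears

    def turn(self, thought):
        if self.clock.now - self.last_active > self.IDLE_LIMIT:
            self.thinking.clear()
            self.clear_count += 1
        self.thinking.append(thought)
        self.last_active = self.clock.now


def replay_idle_resume(session, clock, turns_after_resume=3):
    """Drive the sequence: work, go idle for over an hour, resume, keep working."""
    session.turn("initial plan")
    clock.advance(3601)
    for i in range(turns_after_resume):
        session.turn(f"step {i}")
        clock.advance(30)
    return session


clock = FakeClock()
session = replay_idle_resume(AgentSession(clock), clock)

# The invariant the incident violated: one clear per idle resume, not one per turn.
assert session.clear_count == 1
assert session.thinking == ["step 0", "step 1", "step 2"]
```

The useful assertion is on the later turns, not the resume itself. A test that stops right after the resume would pass even with the clear-every-turn bug.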

We test features. Production runs a state machine.

That is also why Anthropic later had more internal employees use the exact same Claude Code build as the public release, rather than a test build.

Real usage is still the best integration test.

Defaults Are the Product

Now back to the first bug.

Changing high to medium made sense from an engineering angle: fewer freezes, lower latency, and only a slight intelligence drop.

The problem is that "slight drop" is benchmark language. It may not match user experience.

Users are not running average benchmarks. They are working with their code, their context, their workflow, and their messy problems.

In those highly personalized tasks, a small capability drop can feel like: it no longer understands me, it is not as sharp as before, it is making basic mistakes.

The product may offer a setting, but most users will not change it. The default is the product decision.

Closing

The most valuable part of this incident review is not the reminder that Claude Code can have bugs.

That is normal.

The valuable part is the reminder that Agent reliability often fails outside the model itself, in nearby system decisions that look local, technical, and low-risk.

Default parameters, cache strategy, context trimming, prompt constraints, and differences between test and production builds can all change Agent behavior.

Building an Agent is not just calling a smart model.

It is maintaining a complex system that can act, forget, misunderstand goals, and be shaped by context.

So the core discipline of Agent engineering may be this:

Do not only ask whether a change made the system faster, cheaper, or shorter. Also ask whether it took away the memory the model needs to finish the task.