GDS K S

Claude Code Felt Off for a Month. Here Is What Broke.

For about four weeks in March and April, Claude Code felt noticeably worse. I was not imagining it. Sessions got repetitive. Tool calls chose weird paths. My usage limits emptied faster than the work I got out of them. I kept switching models, clearing context, restarting CLIs, convinced I had broken something in my own setup.

Then on April 23 Anthropic published a postmortem. Three separate bugs, all resolved by April 20. Usage limits were reset for every subscriber as compensation. The apology was thorough. The more interesting part is what the timeline reveals about the failure modes of agent products, and what that means for any developer building on top of one.

TL;DR

The three bugs overlapped. All three felt like the same thing to users: Claude got dumber. None would have been caught by a normal test suite.

| # | Window | Bug | Fix |
|---|--------|-----|-----|
| 1 | Mar 4 – Apr 7 | Default reasoning effort lowered from high to medium for latency | Reverted Apr 7 |
| 2 | Mar 26 – Apr 10 | Cache cleared "thinking blocks" every turn after 1h idle, not once | CLI v2.1.101 |
| 3 | Apr 16 – Apr 20 | System prompt capped responses to 25/100 words, cost ~3% of benchmark intelligence | CLI v2.1.116 |

The three bugs, expanded

Bug 1: the silent default change

On March 4, Anthropic lowered the default reasoning effort from high to medium to reduce latency. It was a product call, not a bug per se, and the reasoning was defensible. Users noticed anyway. Reverted April 7.

Lesson: a "config" change to an agent's reasoning budget is a model change. It deserves a changelog entry. Quietly tuning it down to hit a latency target is the kind of decision that reads reasonable in a meeting and awful in a transcript.

Bug 2: the caching regression

This one is the scariest of the three. A prompt-cache optimization shipped March 26. The intent was to clear "thinking blocks" from the cache after an hour of inactivity, once. The implementation cleared them on every turn after that first hour.

The symptom was that Claude became forgetful. It repeated itself. It picked strange tools. It burned through usage quotas far faster than the same session would have a week earlier. None of that raised an HTTP 500 or fired a test alert. It only manifested as "this feels worse."

The exact shape of the regression:

The cache was supposed to do this:

```
t=0:  start session, thinking blocks written to cache
t=1h: cache cleared once due to inactivity
t=1h+1turn: normal flow resumes, new thinking blocks cached
```

Instead it did this:

```
t=0:  start session, thinking blocks written to cache
t=1h: cache cleared due to inactivity
t=1h+1turn: new thinking blocks cached... then cleared
t=1h+2turn: thinking blocks cached... then cleared
[repeat every turn]
```

Each turn paid the full cost of re-deriving reasoning from scratch, and lost the prior context.
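The difference fits in one function. Here is a hypothetical TypeScript sketch of the two behaviors; `cacheClearsAfterIdle` and its flag are my names for illustration, not Anthropic's actual code:

```typescript
// Hypothetical reconstruction of the regression, not Anthropic's code.
// Intended: once the session crosses the idle threshold, clear the
// thinking-block cache a single time, then resume normal caching.
// Buggy: the "already cleared" flag is never honored, so the clear
// fires again on every subsequent turn.
function cacheClearsAfterIdle(turnsAfterIdle: number, buggy: boolean): number {
  let hasClearedOnce = false;
  let clears = 0;
  for (let turn = 0; turn < turnsAfterIdle; turn++) {
    const shouldClear = buggy ? true : !hasClearedOnce;
    if (shouldClear) {
      clears++; // each clear forces the next turn to re-derive reasoning
      hasClearedOnce = true;
    }
  }
  return clears;
}
```

Over a 30-turn session past the idle mark, the intended path clears once; the buggy path clears 30 times, which matches the token burn and forgetfulness users reported.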


Bug 3: the word-limit constraint

Someone added this to the system prompt on April 16:

```
keep text between tool calls to ≤25 words.
keep final responses to ≤100 words.
```

It read fine. It even looks like the kind of instruction I would write to tighten my own agents. What it actually did was clip the model by roughly 3% across the benchmark suite. Reverted April 20.

What I learned running projects on top of this

Two lessons stuck.

Agent regressions are diffuse. A classic API regression has a clear signal: the test fails, the error rate spikes, the latency chart moves. An agent regression shows up as "my work feels worse." The only way I caught anything was by noticing I was redoing work more often, not because a metric turned red. If you are building tools that route to an agent, you need a drift signal that does not rely on exceptions or HTTP status codes.

Caching is the most dangerous optimization in agent systems. The bug with the biggest blast radius was, on paper, a small optimization with an obvious upside. It was not visible until you imagined what happens when "clear thinking blocks after inactivity" runs every turn instead of once. In my own tooling I have pulled back on aggressive cache-invalidation heuristics since reading this.

If it is not obvious that the cache key captures every meaningful state change, I would rather pay for the tokens.
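A minimal sketch of that rule, assuming Node's built-in crypto module; the `AgentState` shape and its field names are illustrative, not any real API:

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: derive the cache key from every input that can change
// the model's reasoning. Omitting any one of these fields is how a "small"
// cache optimization ends up serving stale reasoning.
interface AgentState {
  systemPromptVersion: string;
  model: string;
  reasoningEffort: "low" | "medium" | "high";
  conversationDigest: string; // digest of the conversation turns so far
}

function cacheKey(state: AgentState): string {
  // Fixed-shape object literal keeps JSON key order stable across calls
  return createHash("sha256").update(JSON.stringify(state)).digest("hex");
}
```

If any of those fields changes and the key does not, you are the caching bug.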

What this means for tool builders

I build a small local tool that orchestrates Claude to draft and publish articles. Reading this postmortem changed a few defaults in how I build.

1. Version your prompts

Every system prompt my tooling ships now embeds a version string in a trailing comment. Git tells me the same thing, but an ad hoc marker is faster to spot in a transcript.

```javascript
const SYSTEM_PROMPT = `You are a writing assistant for Dev.to articles.
Follow the style guide. Prefer short sentences for impact.
<!-- prompt-v: 2026-04-23.3 -->`;
```

When a session feels off, I grep for the version in the transcript. If it has drifted, I know where to look.

2. Log tokens per turn

A silent spike in tokens-per-turn is exactly what the caching bug would have produced in my transcripts. That was not a thing I was watching for.

```javascript
logger.info("turn", "completed", {
  turn: turnIndex,
  inputTokens: usage.input_tokens,
  outputTokens: usage.output_tokens,
  cacheHit: usage.cache_read_input_tokens ?? 0,
  promptVersion: "2026-04-23.3",
});
```

A rolling average of input tokens per turn is a cheap drift detector. If it doubles on a Tuesday and your code has not changed, the thing underneath you has.
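That detector takes a few lines. A sketch, with window size and threshold values that are illustrative rather than tuned:

```typescript
// Cheap drift detector: compare the mean input tokens per turn over a short
// recent window against the mean of everything before it, and flag when the
// recent mean jumps past a ratio threshold.
function tokensDrifted(
  perTurnInputTokens: number[],
  window = 5,
  ratioThreshold = 2,
): boolean {
  if (perTurnInputTokens.length < window * 2) return false; // not enough data
  const recent = perTurnInputTokens.slice(-window);
  const baseline = perTurnInputTokens.slice(0, -window);
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return mean(recent) >= ratioThreshold * mean(baseline);
}
```

Run it after every turn over the logged `inputTokens` values; a session hit by something like the caching bug trips the threshold within a handful of turns.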

3. Stop testing agents with toy tasks

The bugs Anthropic describes would not have shown up in a short demo. They only surface in sessions that run long enough for the agent to need its earlier reasoning, or for the cache to kick in.

| Eval type | Catches bug 1? | Bug 2? | Bug 3? |
|---|---|---|---|
| Single-prompt accuracy bench | maybe | no | no |
| Short multi-turn demo | no | no | maybe |
| 30-minute real session | yes | yes | yes |

If you are benchmarking an agent for production use, your evaluation needs sessions that last at least half an hour of real work. Anything shorter gives false confidence.
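One way to make that concrete is a recall probe: plant a fact early, pad the session with enough filler turns to matter, then check whether the agent can still use the fact. A synchronous TypeScript sketch, where `runTurn` is a placeholder for whatever calls your agent and the deploy-token fact is a made-up example:

```typescript
// Long-session eval sketch (synchronous for brevity; a real agent call
// would be async). Plants a fact on turn 1, runs filler turns, then
// probes whether the agent's answer still contains the fact.
type RunTurn = (prompt: string) => string;

function longSessionRecallEval(runTurn: RunTurn): boolean {
  runTurn("Remember: the deploy token lives in config/secrets.env.");
  for (let i = 0; i < 30; i++) {
    runTurn(`Filler task ${i}: summarize the previous answer.`);
  }
  const answer = runTurn("Where does the deploy token live?");
  return answer.includes("config/secrets.env");
}
```

An agent suffering the caching regression fails this probe while still passing every single-prompt benchmark, which is the whole point of running it.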

The tradeoff nobody is naming

Here is the uncomfortable part. The fixes Anthropic describes are the kind you only find by rolling back and listening to complaints. Each of the three initial changes was made for a reasonable-sounding reason.

| Change | Stated reason | Actual cost |
|---|---|---|
| Lower default effort | Latency matters | Perceived IQ drop |
| Cache aggressively | Money matters | Thinking history destroyed every turn |
| Constrain response length | Readability matters | 3% benchmark hit |

The problem is that agent quality is multi-dimensional, and the postmortem made visible a tradeoff most vendors quietly hope users never notice. You cannot A/B test an agent the way you A/B test a search ranker. The feedback loop is noisy, delayed, and routed through the subjective experience of people who already had a rough day.

Anthropic did the right thing by publishing the timeline. Most vendors do not.

At Glincker

At Glincker I am building tooling that assumes the underlying model is going to wobble, not that it is going to be stable. The orchestration layer logs enough to catch regressions my users cannot articulate. The retry logic notices when the agent is burning effort without making progress. The postmortem did not change my plan. It confirmed it.

The bottom line

  • Three bugs, four weeks, an ecosystem of developers quietly wondering if they should switch tools.
  • Short evals would not have caught any of them.
  • Prompt versions, per-turn token logging, and long-session evals are the minimum viable drift detectors.
  • When you build on top of an agent, you are depending on a product team's prompt, their cache policy, and their default settings. All three can change without a version bump you would notice.

Read the full postmortem. Then look at whether your own tooling would have caught any of these regressions. Mine would not have, at least not fast enough. I have some work to do.
