GDS K S

Claude Code Felt Off for a Month. Here Is What Broke.

For about four weeks in March and April, Claude Code felt noticeably worse. I was not imagining it. Sessions got repetitive. Tool calls chose weird paths. My usage limits emptied faster than the work I got out of them. I kept switching models, clearing context, restarting CLIs, convinced I had broken something in my own setup.

Then on April 23 Anthropic published a postmortem. Three separate bugs, all resolved by April 20. Usage limits were reset for every subscriber as compensation. The apology was thorough. The more interesting part is what the timeline reveals about the failure modes of agent products, and what that means for any developer building on top of one.

TL;DR

The three bugs overlapped. All three felt like the same thing to users: Claude got dumber. None would have been caught by a normal test suite.

| # | Window | Bug | Fix |
|---|--------|-----|-----|
| 1 | Mar 4 – Apr 7 | Default reasoning effort lowered from high to medium for latency | Reverted Apr 7 |
| 2 | Mar 26 – Apr 10 | Cache cleared "thinking blocks" every turn after 1h idle, not once | CLI v2.1.101 |
| 3 | Apr 16 – Apr 20 | System prompt capped responses to 25/100 words, cost ~3% of benchmark intelligence | CLI v2.1.116 |

The three bugs, expanded

Bug 1: the silent default change

On March 4, Anthropic lowered the default reasoning effort from high to medium to reduce latency. It was a product call, not a bug per se, and the reasoning was defensible. Users noticed anyway. Reverted April 7.

Lesson: a "config" change to an agent's reasoning budget is a model change. It deserves a changelog entry. Quietly tuning it down to hit a latency target is the kind of decision that reads reasonable in a meeting and awful in a transcript.

Bug 2: the caching regression

This one is the scariest of the three. A prompt-cache optimization shipped March 26. The intent was to clear "thinking blocks" from the cache after an hour of inactivity, once. The implementation cleared them on every turn after that first hour.

The symptom was that Claude became forgetful. It repeated itself. It picked strange tools. It burned through usage quotas far faster than the same session would have a week earlier. None of that raised an HTTP 500 or fired a test alert. It only manifested as "this feels worse."

The exact shape of the regression:

The cache was supposed to do this:

```
t=0:  start session, thinking blocks written to cache
t=1h: cache cleared once due to inactivity
t=1h+1turn: normal flow resumes, new thinking blocks cached
```

Instead it did this:

```
t=0:  start session, thinking blocks written to cache
t=1h: cache cleared due to inactivity
t=1h+1turn: new thinking blocks cached... then cleared
t=1h+2turn: thinking blocks cached... then cleared
[repeat every turn]
```

Each turn paid the full cost of re-deriving reasoning from scratch, and lost the prior context.
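The difference fits in one function. Here is a hypothetical TypeScript sketch of the two behaviors; `cacheClearsAfterIdle` and its flag are my names for illustration, not Anthropic's actual code:

```typescript
// Hypothetical reconstruction of the regression, not Anthropic's code.
// Intended: once the session crosses the idle threshold, clear the
// thinking-block cache a single time, then resume normal caching.
// Buggy: the "already cleared" flag is never honored, so the clear
// fires again on every subsequent turn.
function cacheClearsAfterIdle(turnsAfterIdle: number, buggy: boolean): number {
  let hasClearedOnce = false;
  let clears = 0;
  for (let turn = 0; turn < turnsAfterIdle; turn++) {
    const shouldClear = buggy ? true : !hasClearedOnce;
    if (shouldClear) {
      clears++; // each clear forces the next turn to re-derive reasoning
      hasClearedOnce = true;
    }
  }
  return clears;
}
```

Over a 30-turn session past the idle mark, the intended path clears once; the buggy path clears 30 times, which matches the token burn and forgetfulness users reported.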


Bug 3: the word-limit constraint

Someone added this to the system prompt on April 16:

```
keep text between tool calls to ≤25 words.
keep final responses to ≤100 words.
```

It read fine. It even looks like the kind of instruction I would write to tighten my own agents. What it actually did was clip the model by roughly 3% across the benchmark suite. Reverted April 20.

What I learned running projects on top of this

Two lessons stuck.

Agent regressions are diffuse. A classic API regression has a clear signal: the test fails, the error rate spikes, the latency chart moves. An agent regression shows up as "my work feels worse." The only way I caught anything was by noticing I was redoing work more often, not because a metric turned red. If you are building tools that route to an agent, you need a drift signal that does not rely on exceptions or HTTP status codes.

Caching is the most dangerous optimization in agent systems. The bug with the biggest blast radius was, on paper, a small optimization with an obvious upside. It was not visible until you imagined what happens when "clear thinking blocks after inactivity" runs every turn instead of once. In my own tooling I have pulled back on aggressive cache-invalidation heuristics since reading this.

If it is not obvious that the cache key captures every meaningful state change, I would rather pay for the tokens.
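A minimal sketch of that rule, assuming Node's built-in crypto module; the `AgentState` shape and its field names are illustrative, not any real API:

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: derive the cache key from every input that can change
// the model's reasoning. Omitting any one of these fields is how a "small"
// cache optimization ends up serving stale reasoning.
interface AgentState {
  systemPromptVersion: string;
  model: string;
  reasoningEffort: "low" | "medium" | "high";
  conversationDigest: string; // digest of the conversation turns so far
}

function cacheKey(state: AgentState): string {
  // Fixed-shape object literal keeps JSON key order stable across calls
  return createHash("sha256").update(JSON.stringify(state)).digest("hex");
}
```

If any of those fields changes and the key does not, you are the caching bug.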

What this means for tool builders

I build a small local tool that orchestrates Claude to draft and publish articles. Reading this postmortem changed a few defaults in how I build.

1. Version your prompts

Every system prompt my tooling ships now embeds a version string in a trailing comment. Git tells me the same thing, but an ad hoc marker is faster to spot in a transcript.

```javascript
const SYSTEM_PROMPT = `You are a writing assistant for Dev.to articles.
Follow the style guide. Prefer short sentences for impact.
<!-- prompt-v: 2026-04-23.3 -->`;
```

When a session feels off, I grep for the version in the transcript. If it has drifted, I know where to look.

2. Log tokens per turn

A silent spike in tokens-per-turn is exactly what the caching bug would have produced in my transcripts. That was not a thing I was watching for.

```javascript
logger.info("turn", "completed", {
  turn: turnIndex,
  inputTokens: usage.input_tokens,
  outputTokens: usage.output_tokens,
  cacheHit: usage.cache_read_input_tokens ?? 0,
  promptVersion: "2026-04-23.3",
});
```

A rolling average of input tokens per turn is a cheap drift detector. If it doubles on a Tuesday and your code has not changed, the thing underneath you has.
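That detector takes a few lines. A sketch, with window size and threshold values that are illustrative rather than tuned:

```typescript
// Cheap drift detector: compare the mean input tokens per turn over a short
// recent window against the mean of everything before it, and flag when the
// recent mean jumps past a ratio threshold.
function tokensDrifted(
  perTurnInputTokens: number[],
  window = 5,
  ratioThreshold = 2,
): boolean {
  if (perTurnInputTokens.length < window * 2) return false; // not enough data
  const recent = perTurnInputTokens.slice(-window);
  const baseline = perTurnInputTokens.slice(0, -window);
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return mean(recent) >= ratioThreshold * mean(baseline);
}
```

Run it after every turn over the logged `inputTokens` values; a session hit by something like the caching bug trips the threshold within a handful of turns.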

3. Stop testing agents with toy tasks

The bugs Anthropic describes would not have shown up in a short demo. They only surface in sessions that run long enough for the agent to need its earlier reasoning, or for the cache to kick in.

| Eval type | Catches bug 1? | Bug 2? | Bug 3? |
|---|---|---|---|
| Single-prompt accuracy bench | maybe | no | no |
| Short multi-turn demo | no | no | maybe |
| 30-minute real session | yes | yes | yes |

If you are benchmarking an agent for production use, your evaluation needs sessions that last at least half an hour of real work. Anything shorter gives false confidence.
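One way to make that concrete is a recall probe: plant a fact early, pad the session with enough filler turns to matter, then check whether the agent can still use the fact. A synchronous TypeScript sketch, where `runTurn` is a placeholder for whatever calls your agent and the deploy-token fact is a made-up example:

```typescript
// Long-session eval sketch (synchronous for brevity; a real agent call
// would be async). Plants a fact on turn 1, runs filler turns, then
// probes whether the agent's answer still contains the fact.
type RunTurn = (prompt: string) => string;

function longSessionRecallEval(runTurn: RunTurn): boolean {
  runTurn("Remember: the deploy token lives in config/secrets.env.");
  for (let i = 0; i < 30; i++) {
    runTurn(`Filler task ${i}: summarize the previous answer.`);
  }
  const answer = runTurn("Where does the deploy token live?");
  return answer.includes("config/secrets.env");
}
```

An agent suffering the caching regression fails this probe while still passing every single-prompt benchmark, which is the whole point of running it.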

The tradeoff nobody is naming

Here is the uncomfortable part. The fixes Anthropic describes are the kind you only find by rolling back and listening to complaints. Each of the three initial changes was made for a reasonable-sounding reason.

| Change | Stated reason | Actual cost |
|---|---|---|
| Lower default effort | Latency matters | Perceived IQ drop |
| Cache aggressively | Money matters | Thinking history destroyed every turn |
| Constrain response length | Readability matters | 3% benchmark hit |

The problem is that agent quality is multi-dimensional, and the postmortem made visible a tradeoff most vendors quietly hope users never notice. You cannot A/B test an agent the way you A/B test a search ranker. The feedback loop is noisy, delayed, and routed through the subjective experience of people who already had a rough day.

Anthropic did the right thing by publishing the timeline. Most vendors do not.

At Glincker

At Glincker I am building tooling that assumes the underlying model is going to wobble, not that it is going to be stable. The orchestration layer logs enough to catch regressions my users cannot articulate. The retry logic notices when the agent is burning effort without making progress. The postmortem did not change my plan. It confirmed it.

The bottom line

  • Three bugs, four weeks, an ecosystem of developers quietly wondering if they should switch tools.
  • Short evals would not have caught any of them.
  • Prompt versions, per-turn token logging, and long-session evals are the minimum viable drift detectors.
  • When you build on top of an agent, you are depending on a product team's prompt, their cache policy, and their default settings. All three can change without a version bump you would notice.

Read the full postmortem. Then look at whether your own tooling would have caught any of these regressions. Mine would not have, at least not fast enough. I have some work to do.
