René Zander

Originally published at renezander.com

What Anthropic's April 23 Postmortem Reveals About Your Agent Harness

The April 23 Claude Code postmortem dropped last week. Three bugs, two months of degraded output, one usage-limit reset for every Pro subscriber.

I read it twice. The second time I started writing notes for my own agent harness.

It is unusually candid for a company at this scale, and it reads like a checklist of failure modes any team running production AI agents will eventually hit. Worth treating as a free engineering review.

Defaults that nobody can see

On March 4, the default reasoning effort dropped from high to medium. The reason was real. High mode was freezing the UI for some users. The fix was reasonable. The interesting bit: it shipped without an operator-visible knob, and quality regressed for a month before users complained loud enough.

Open question for your harness: how many silent defaults does it have? Temperature 0.7 because that was the framework default in 2024. Top-p 1.0 because nobody touched it. Max tokens 4096 because somebody picked the number once. Each of these is a quality lever. Which ones are worth surfacing in your dashboard?

A line worth saving from the postmortem: "users told us they'd prefer higher intelligence and opt into lower effort for simple tasks." Defaults can optimize for quality, with cost concerns as opt-in rather than opt-out.
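
If I were wiring that into my own harness, it would look something like the sketch below. Everything in it is illustrative: my parameter names, my values, not Anthropic's.

```python
# Hypothetical harness config: every generation knob pinned in one place,
# logged at startup, and overridable by an operator without a redeploy.
# The names and values are mine, not Anthropic's.
import json
import logging
import os
from dataclasses import asdict, dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("harness.defaults")


@dataclass(frozen=True)
class GenerationDefaults:
    reasoning_effort: str = "high"  # quality-first; operators opt into lower effort
    temperature: float = 0.7        # inherited framework default? decide it on purpose
    top_p: float = 1.0
    max_tokens: int = 4096


def load_defaults() -> GenerationDefaults:
    """Apply operator overrides from the environment, then log what's in effect."""
    overrides = json.loads(os.environ.get("HARNESS_GENERATION_OVERRIDES", "{}"))
    defaults = GenerationDefaults(**{**asdict(GenerationDefaults()), **overrides})
    # This log line is the dashboard hook: silent defaults stop being silent.
    log.info("generation defaults in effect: %s", asdict(defaults))
    return defaults


if __name__ == "__main__":
    load_defaults()
```

The specific mechanism matters less than the effect: the defaults are written down in one place, logged on every start, and changeable by whoever is on call.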

A cache rule that ate the working memory

On March 26 they shipped a thinking-cache clearing rule. Intent: clear reasoning history once after a session sits idle for more than an hour. Bug: it cleared on every turn for the rest of the session. Sessions felt forgetful. Tool choices got weird. Usage limits depleted faster because the model was rebuilding context every turn.
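
I do not know what their implementation looked like, so treat this as a sketch of the failure shape rather than their code: a clear-after-idle rule where the activity clock never gets reset.

```python
# Sketch of the failure shape (not Anthropic's code): the rule should clear
# the thinking cache once after a long idle gap, but because the activity
# timestamp is never refreshed, it keeps firing on every later turn.
import time

IDLE_LIMIT_SECONDS = 60 * 60


class Session:
    def __init__(self) -> None:
        self.thinking_cache: list[str] = []
        self.last_turn_at = time.monotonic()

    def on_turn_buggy(self) -> None:
        if time.monotonic() - self.last_turn_at > IDLE_LIMIT_SECONDS:
            self.thinking_cache.clear()
        # Bug: last_turn_at is never updated, so once the session has idled
        # past the limit, every subsequent turn looks idle and clears again.

    def on_turn_fixed(self) -> None:
        now = time.monotonic()
        if now - self.last_turn_at > IDLE_LIMIT_SECONDS:
            self.thinking_cache.clear()  # fires once per idle gap...
        self.last_turn_at = now          # ...because activity resets the clock
```

Nothing about either version looks wrong on a single call; the difference only shows up across turns.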

I have shipped this exact bug. Different system, same shape. A "small optimization" to a caching layer that turned every cache lookup into a miss. Cost went up 4x for two days before alerting caught it.

Useful question to bring to your team: do our caching tests cover multi-turn behavior, or only single-call hit/miss? Most teams I have asked answer "single call only". Surfacing that gap costs an afternoon and saves a quarter.
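
The gap-check can literally be one test. Here is a hedged sketch in pytest style; FakeCachedSession is a stand-in I made up, and in a real suite you would read hit and miss counts from your provider's response metadata instead of simulating them.

```python
# Hypothetical multi-turn cache test. FakeCachedSession stands in for whatever
# your harness exposes; the point is asserting across turns, not on a single call.
class FakeCachedSession:
    def __init__(self) -> None:
        self.cached_prefix = ""
        self.cache_hits = 0
        self.cache_misses = 0

    def run_turn(self, prompt: str) -> None:
        # A turn counts as a hit if the conversation prefix is still cached.
        if self.cached_prefix:
            self.cache_hits += 1
        else:
            self.cache_misses += 1
        self.cached_prefix += prompt  # the cached prefix grows every turn


def test_cache_survives_a_whole_conversation() -> None:
    session = FakeCachedSession()
    session.run_turn("Read main.py and summarize it.")  # first turn: expected miss
    hits_before = session.cache_hits

    follow_ups = [
        "Now list its imports.",
        "Any dead code?",
        "Refactor the CLI parsing.",
        "Write tests for it.",
    ]
    for prompt in follow_ups:
        session.run_turn(prompt)

    hit_rate = (session.cache_hits - hits_before) / len(follow_ups)
    assert hit_rate >= 0.75, f"multi-turn cache hit rate collapsed: {hit_rate:.0%}"
```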

A 25-word cap that cost 3% intelligence

On April 16 they added a system prompt instruction: limit text between tool calls to 25 words and final responses to 100. The intent was to clean up verbose narration. After ablation testing, they measured a 3% intelligence drop on coding tasks and reverted four days later.

Three percent doesn't sound like much, which is the part that stays with me. A prompt change hurting quality by 3% is invisible to anyone not running ablations. How many of us are? The honest answer in most rooms I sit in: not many.

Worth asking out loud: if you change a system prompt today, what catches a 3% regression?
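
One answer is a gate in CI: run the same eval set with the old prompt and the new one, and fail the change when the score drops past a budget. A minimal sketch, with evaluate standing in for whatever eval harness you already run; every name and number in it is made up.

```python
# Hypothetical CI gate for system prompt changes. `evaluate` stands in for
# your real eval harness and should return a pass rate in [0, 1].
import sys

REGRESSION_BUDGET = 0.02  # reject changes that cost more than 2 points of pass rate


def evaluate(system_prompt: str) -> float:
    # Placeholder: swap in your real eval run (same tasks, same seeds, new prompt).
    # These numbers are fake and only exist so the gate executes end to end.
    fake_scores = {
        "current system prompt": 0.81,
        "current prompt + 25-word cap": 0.78,  # the kind of 3% drop that needs catching
    }
    return fake_scores[system_prompt]


def gate(baseline_prompt: str, candidate_prompt: str) -> int:
    baseline = evaluate(baseline_prompt)
    candidate = evaluate(candidate_prompt)
    delta = candidate - baseline
    print(f"baseline={baseline:.3f} candidate={candidate:.3f} delta={delta:+.3f}")
    if delta < -REGRESSION_BUDGET:
        print("prompt change rejected: regression exceeds budget")
        return 1
    print("prompt change accepted")
    return 0


if __name__ == "__main__":
    sys.exit(gate("current system prompt", "current prompt + 25-word cap"))
```

The budget and the eval set are yours to pick; the point is that a 3% drop fails the pipeline today instead of being discovered weeks later.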

What two of three tells you

Of the three bugs, two were silent until users yelled. The third was visible only after dedicated ablation testing. That ratio is the most telling detail in the whole postmortem.

I run six production agents. I have eval coverage on three. The other three I monitor with output sampling and gut feel. That setup is probably close to median for the industry.

The postmortem hands you a free checklist anyway. Default knobs visible to operators. Cache-hit rate tracked across multi-turn conversations. System prompts gated by eval ablations. Three failure modes, each one a useful question to ask your own setup.

Did you check your harness this week?


I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at renezander.com.
