For six weeks, developers have been complaining that Claude got worse. You've seen the posts — "Claude Code is flaky", the AI-shrinkflation discourse.
Yesterday, Anthropic shipped a postmortem. Three unrelated bugs, all in the Claude Code harness. The model weights and the API were never touched.
That distinction is the whole point.
What actually broke
- Default reasoning effort got dialled down. A UX fix dropped Claude Code's default from "high" to "medium" in early March. Users noticed it felt dumber. Reverted April 7.
- A caching optimisation dropped prior reasoning every turn. Supposed to clear stale thinking once per idle session; a bug made it fire on every turn. Claude kept executing tasks with no memory of why it had started them. Surfaced as forgetfulness and repetition. Fixed April 10.
- A verbosity system prompt hurt coding quality. "Keep responses under 100 words." Internal evals missed a 3% regression on code. Caught by broader ablations. Reverted April 20.
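The caching bug is the classic harness trap: a cleanup step with the wrong trigger condition. Here's a minimal sketch of the pattern, not Anthropic's actual code; the `Session` class, field names, and the idle threshold are all assumptions for illustration.

```python
IDLE_THRESHOLD_S = 30 * 60  # assumed: treat 30 minutes of silence as "idle"

class Session:
    def __init__(self):
        self.reasoning: list[str] = []     # prior thinking carried between turns
        self.last_turn_at: float | None = None

    def on_turn_buggy(self, now: float) -> None:
        # BUG: the cleanup fires unconditionally, on every turn,
        # so the model loses its prior reasoning each step
        self.reasoning.clear()
        self.last_turn_at = now

    def on_turn_fixed(self, now: float) -> None:
        # FIX: only clear stale thinking after the session actually went idle
        if self.last_turn_at is not None and now - self.last_turn_at > IDLE_THRESHOLD_S:
            self.reasoning.clear()
        self.last_turn_at = now
```

The failure is silent by design: nothing crashes, every response is individually fine, and the only symptom is the model acting like it forgot the plan.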
None of this was a model change. The weights didn't move. The API was never in scope.
The lesson
The model is not the product. What your users experience is model + harness + system prompt + tool wiring + context management + caching. Each layer has its own bugs. When someone says "Claude got worse," the weights are usually the last thing that changed.
API-layer products were unaffected. If you're building directly against the Messages API, none of these bugs touched you. This is why "am I on Claude Code, or am I on the raw API?" matters.
"Eval passed" ≠ "no regression." The verbosity prompt passed Anthropic's initial evals. Only a broader ablation — removing lines one at a time — caught the 3% drop. Fixed eval suites miss behavioural drift; ablations catch it.
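The ablation technique is simple enough to sketch. This is a hypothetical version, not Anthropic's harness: `eval_fn` is an assumed callable that scores a system prompt on your benchmark, and the 2% threshold is an illustrative noise floor.

```python
def ablate_prompt(prompt_lines, eval_fn, threshold=0.02):
    """Re-run the eval with each prompt line removed; return lines whose
    absence *improves* the score beyond the noise floor, i.e. lines that
    were silently hurting quality."""
    baseline = eval_fn("\n".join(prompt_lines))
    harmful = []
    for i, line in enumerate(prompt_lines):
        trimmed = prompt_lines[:i] + prompt_lines[i + 1:]
        score = eval_fn("\n".join(trimmed))
        if score - baseline > threshold:  # removing this line helped
            harmful.append((line, score - baseline))
    return harmful
```

A fixed eval suite only tells you the whole prompt clears a bar; the ablation loop tells you which individual line is dragging the score down.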
What to actually do
- On Claude Code? Update to v2.1.116 or later and you're past all three bugs. Usage limits got reset as an apology.
- On the API directly? Nothing to do. Stay on whatever model you were on.
- Shipping your own harness on top of a frontier model? Read the postmortem twice, then audit your prompt + caching + context-management pipeline for the same silent-failure modes. The bugs Anthropic described are exactly the ones every harness reinvents.
The meta-lesson is boring and important: most of the quality variance lives between "the model" and "the thing your user sees." Ship good harnesses.
✏️ Drafted with KewBot (AI), edited and approved by Drew.