Andrew Kew
Harness bugs, not model bugs

For six weeks, developers had been complaining that Claude got worse. You've seen the posts: "Claude Code is flaky", the AI-shrinkflation discourse.

Yesterday, Anthropic shipped a postmortem. Three unrelated bugs, all in the Claude Code harness. The model weights and the API were never touched.

That distinction is the whole point.

What actually broke

  1. Default reasoning effort got dialled down. A UX fix dropped Claude Code's default from "high" to "medium" in early March. Users noticed it felt dumber. Reverted April 7.
  2. A caching optimisation dropped prior reasoning every turn. Supposed to clear stale thinking once per idle session; a bug made it fire on every turn. Claude kept executing without memory of why. Surfaced as forgetfulness and repetition. Fixed April 10.
  3. A verbosity system prompt hurt coding quality. "Keep responses under 100 words." Internal evals missed a 3% regression on code. Caught by broader ablations. Reverted April 20.
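Bug #2 is the classic harness failure: a reset that should fire once per idle session fires on every turn. The postmortem doesn't publish code, so here is a minimal hypothetical sketch of the failure mode (the `Session` shape and function names are mine, not Anthropic's):

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    """Minimal stand-in for a harness session (hypothetical)."""
    turns: list = field(default_factory=list)
    idle_reset_done: bool = False

def maybe_clear_reasoning(session: Session) -> bool:
    """Clear stale reasoning blocks once per idle session.

    The buggy variant skips the guard flag entirely, so the clear
    fires on every turn and the model loses its own plan: exactly
    the forgetfulness-and-repetition symptom users reported.
    """
    if not session.idle_reset_done:  # buggy version had no guard here
        session.turns = [t for t in session.turns if t.get("type") != "reasoning"]
        session.idle_reset_done = True
        return True
    return False
```

Note how small the diff between "correct" and "dumber model" is: one missing guard condition, zero model changes.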

None of this was a model change. The weights didn't move. The API was never in scope.

The lesson

The model is not the product. What your users experience is model + harness + system prompt + tool wiring + context management + caching. Each layer has its own bugs. When someone says "Claude got worse," the weights are usually the last thing that changed.

API-layer products were unaffected. If you're building directly against the Messages API, none of these bugs touched you. This is why "am I on Claude Code, or am I on the raw API?" matters.

"Eval passed" ≠ "no regression." The verbosity prompt passed Anthropic's initial evals. Only a broader ablation, removing prompt lines one at a time and re-running the eval, caught the 3% drop. Static eval suites miss behavioural drift; ablations catch it.
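The ablation technique itself is simple enough to sketch. This is an illustrative loop, assuming you have some `score_fn` that runs your eval suite over a candidate prompt and returns a quality score (none of these names come from Anthropic's tooling):

```python
def ablate_prompt(prompt_lines, score_fn, threshold=0.01):
    """Remove each system-prompt line in turn and re-score.

    Returns the lines whose removal improves the eval score by
    more than `threshold`, i.e. lines that are quietly hurting
    quality even though the full prompt "passes" the suite.
    """
    baseline = score_fn(prompt_lines)
    harmful = []
    for i, line in enumerate(prompt_lines):
        ablated = prompt_lines[:i] + prompt_lines[i + 1:]
        if score_fn(ablated) - baseline > threshold:
            harmful.append(line)
    return harmful
```

Run against a prompt containing a line like "Keep responses under 100 words.", this is the kind of sweep that surfaces a ~3% coding regression a fixed pass/fail suite would wave through.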

What to actually do

  • On Claude Code? Update to v2.1.116+. You're already through it. Usage limits got reset as an apology.
  • On the API directly? Nothing to do. Stay on whatever model you were on.
  • Shipping your own harness on top of a frontier model? Read the postmortem twice, then audit your prompt + caching + context-management pipeline for the same silent-failure modes. The bugs Anthropic described are exactly the ones every harness reinvents.
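One cheap guard for that audit: pin your harness's behavioural defaults and diff the live config against them in CI. A hypothetical sketch (the keys map onto the three bugs above; this is not a real Claude Code config schema):

```python
# Pinned expectations for harness behaviour (illustrative names).
EXPECTED = {
    "reasoning_effort": "high",            # bug 1: silently dropped to "medium"
    "clear_reasoning": "once_per_idle_session",  # bug 2: fired every turn
    "max_response_words": None,            # bug 3: verbosity cap hurt coding
}

def audit_harness(config: dict) -> list[str]:
    """Return a human-readable diff of live config vs. pinned defaults."""
    return [
        f"{key}: expected {expected!r}, got {config.get(key)!r}"
        for key, expected in EXPECTED.items()
        if config.get(key) != expected
    ]
```

An empty list means your defaults haven't drifted; anything else is exactly the class of silent change that took six weeks to notice.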

The meta-lesson is boring and important: most of the quality variance lives between "the model" and "the thing your user sees." Ship good harnesses.


✏️ Drafted with KewBot (AI), edited and approved by Drew.
