Anil Kurmi
Claude Code didn't get worse. The harness did. And that ends one of the most common AI complaints of 2026.

For two months, the same complaint kept showing up on every developer forum I read: Claude Code feels worse. Sometimes worded politely, sometimes not. The vibe was unanimous enough that I almost started believing it on reputation alone.

Then on April 23, Anthropic published a postmortem that I think ends this whole class of complaint as a useful conversation. Read it. Even if you don't ship anything with Claude. Especially then.

Here's the position I'll defend: "the model got worse" is no longer a credible developer complaint without evidence. The Anthropic postmortem is proof that the user experience of an LLM product can degrade severely without anyone touching the weights. From now on, the responsible reply to "Claude feels worse this week" is "show me the harness diff," not the model card.

What actually broke

The thing that should make every AI product engineer sit up: none of the three regressions touched the model weights. They all lived in the layer most teams treat as boring infrastructure.

Regression 1 — reasoning depth got quietly downgraded. On March 4, Anthropic moved the default reasoning effort from high to medium to cut latency. Users reported lower intelligence. The complaints were real. The model was the same. The default wasn't. They reverted on April 7.
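
A harness default is just an argument nobody passes. Here's a minimal sketch of that failure mode, with invented names (DEFAULTS and call_model are illustrations, not Anthropic's internals):

```python
# Hypothetical harness wrapper; names are illustrative, not Anthropic's API.
DEFAULTS = {
    "model": "claude-sonnet",
    "reasoning_effort": "high",  # flip this to "medium" and every caller ships it
}

def call_model(prompt: str, **overrides) -> dict:
    """Merge per-call overrides onto the harness defaults and build a request."""
    params = {**DEFAULTS, **overrides}
    # Callers that never pass reasoning_effort silently inherit whatever the
    # default is this week. The weights never changed; the default did.
    return {"prompt": prompt, **params}
```

One line flips, every call site that relied on the implicit value inherits it, and neither the model nor the client code changed.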

Regression 2 — a caching bug ate prior reasoning. On March 26, an intended one-time clearing of old thinking in stale sessions was applied repeatedly. So context kept getting amputated mid-conversation. The model felt forgetful because it actually was forgetting. Fixed April 10.
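
That bug shape is worth internalizing: a migration written as "run once" that gets wired to run on every request. A hedged sketch of the buggy wiring versus the idempotent fix, with invented session fields:

```python
# Hypothetical session store; a sketch of the bug class, not Anthropic's code.
def prune_stale_thinking(session: dict) -> dict:
    """Drop old reasoning blocks; intended as a one-time cleanup."""
    session["messages"] = [
        m for m in session["messages"] if m.get("type") != "thinking"
    ]
    return session

def handle_turn_buggy(session: dict, user_msg: dict) -> dict:
    # Buggy wiring: the "one-time" cleanup runs on every turn, so reasoning
    # produced after the migration gets deleted on the next turn too.
    session = prune_stale_thinking(session)
    session["messages"].append(user_msg)
    return session

def handle_turn_fixed(session: dict, user_msg: dict) -> dict:
    # Fixed wiring: record on the session that the migration already ran.
    if not session.get("thinking_pruned"):
        session = prune_stale_thinking(session)
        session["thinking_pruned"] = True
    session["messages"].append(user_msg)
    return session
```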

Regression 3 — a brevity instruction tanked coding output. On April 16, a strict length nudge in the system prompt went out. It looked harmless. It wasn't. Anthropic's own expanded evals showed measurable coding quality drops. Reverted April 20.

The whole stack was clean again by April 20 in v2.1.116. InfoQ's writeup is a useful secondary read, but the original is better because it gives you the timelines.

Why this is the most important engineering document of 2026 (so far)

I don't say that lightly. Three reasons.

One: it kills the lazy mental model. Most teams I talk to debug AI features the way they debug a database query — assume one thing changed, find that one thing. Anthropic's incident shows the product layer is now a distributed system with its own failure modes: defaults, caches, prompts, all moving independently, on different timelines, affecting different traffic slices. You can't reason about it like a single component anymore.

Two: it sets a transparency precedent that other labs now have to match. Once one major lab publishes timelines, root causes, eval deltas, and reversion dates for a quality regression, the others can't keep claiming "we don't comment on user feedback." The bar moved.

Three: it implies that most teams shipping LLM products lack the reliability tests they need. If three independent changes can pass review and ship without anyone catching the cumulative quality cost, that's not an Anthropic problem. That's a "we as an industry haven't figured out evals for harness changes yet" problem. I would bet most teams reading this have CI that runs unit tests on their prompts approximately never.

The thing I want every AI product team to internalize

Your model isn't your system. Your harness is your system.

The harness is:

  • which model variant you call by default
  • which reasoning depth you allow by default
  • what survives a cache hit and what doesn't
  • what the system prompt nudges
  • which tools are allowed in which contexts
  • what the timeout / retry / fallback shape is

If you don't have an eval that runs when any of those change, you are flying blind. The model is the input. The harness is the product. Treat changes to the harness like you treat code changes — with reviews, rollout gates, eval deltas, and a rollback playbook.
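
One way to make that enforceable is to reify the harness as a single versioned object, so "the harness changed" becomes a diffable, loggable event. A minimal sketch with assumed field names; none of these are real Claude Code knobs:

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class HarnessConfig:
    """Everything the product sends to the model besides the user's words."""
    model: str = "claude-sonnet"
    reasoning_effort: str = "high"
    system_prompt_version: str = "2026-04-20"
    cache_ttl_seconds: int = 3600
    max_retries: int = 2
    allowed_tools: tuple = ("bash", "edit", "read")

    def fingerprint(self) -> str:
        """Stable hash of the whole surface; any knob change shows up here."""
        blob = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()[:12]
```

Log the fingerprint with every model call. When a complaint comes in, comparing fingerprints across dates tells you immediately whether the harness moved under the user.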

I think this is going to become the new bar for what "shipped responsibly" means in AI products. The teams that take it seriously this year will be the ones that look stable in 2027. The teams that don't will spend 2027 explaining quality regressions to angry users without any real diagnostic ability.

What I want pushback on

I want to be honest about where I might be overclaiming.

The skeptical read is: "Sure, this incident was harness-side. That doesn't mean all user complaints are harness-side. Some models really do degrade over time — distillation cycles, RLHF drift, evaluation Goodharting." That's fair. I'm not claiming model weights are sacred. I'm claiming the burden of proof flipped.

When someone says "the model got worse," the productive next question is: can you share a prompt + output that was good last month and bad this month, with timestamps? If they can, you have evidence. If they can't, you're working from vibes and the harness is the more likely culprit.

Where I want disagreement: if you think the harness-vs-weights distinction is too clean — that they're entangled in ways that make the framing misleading — I want to read your argument. I'm leaning hard on the separation. Convince me it's fragile.

What this changes for engineers shipping LLM features

Concrete actions worth doing this quarter, in priority order:

  1. Inventory your harness surface. Write down every knob: default model, default reasoning depth, system prompt, cache TTL, retry policy, tool-allow lists. You should be able to hand a new engineer one page that tells them what your product actually sends to the model.
  2. Build a harness eval that runs on every change to any of those knobs. Doesn't have to be fancy. 50 representative prompts with golden outputs is enough to start (a minimal sketch follows this list). The point is catching regressions before users do.
  3. Treat prompt edits as production changes. Reviews, rollout gates, the works. Yes, even the "just one more sentence" edits.
  4. Log enough trace data to reproduce a complaint. Session ID, prompt version, model variant, reasoning depth, cache state (see the second sketch after this list). When a user says "this got worse," you should be able to pull up the actual call.
  5. Write your own postmortems publicly. Anthropic raised the bar. The teams that meet it will earn trust that the silent ones can't.
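
On point 2, the starting eval really can be this small. A sketch assuming a call_model callable (prompt string in, output text out) and a directory of golden cases; exact-match scoring is crude, but it exists, which is the bar:

```python
import json
import pathlib

def run_harness_eval(call_model, golden_dir: str = "evals/golden") -> None:
    """Replay golden prompts and fail loudly if any output drifted.

    Each JSON file in golden_dir holds {"prompt": ..., "expected": ...}.
    Swap exact match for a similarity or rubric scorer once this runs in CI.
    """
    failures = []
    for case_file in sorted(pathlib.Path(golden_dir).glob("*.json")):
        case = json.loads(case_file.read_text())
        if call_model(case["prompt"]) != case["expected"]:
            failures.append(case_file.name)
    if failures:
        raise SystemExit(f"harness eval regressed on {len(failures)} cases: {failures}")
    print("harness eval passed")
```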
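
And on point 4, the trace record only needs the fields that make a complaint replayable. A sketch with assumed field names:

```python
import json
import logging
import time
import uuid

def log_model_call(logger: logging.Logger, *, session_id: str, prompt_version: str,
                   model: str, reasoning_effort: str, cache_hit: bool) -> str:
    """Emit one structured line per model call so a complaint maps back to a request."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "session_id": session_id,
        "prompt_version": prompt_version,
        "model": model,
        "reasoning_effort": reasoning_effort,
        "cache_hit": cache_hit,
    }
    logger.info(json.dumps(record))
    return record["trace_id"]
```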

If your team has shipped a quality regression in an LLM product and survived it, I'd love to know what you learned — especially the first thing that broke. My guess is it almost always wasn't the model.
