I tracked Claude Code and Codex pass-rates for 95 days — what "getting dumber" actually looks like

#ai #machinelearning #programming #productivity

Every few weeks a thread blows up: "Is Claude Code getting worse?" Someone swears Opus got lazy after an update; someone else says it's placebo. The arguments are always vibes — nobody posts numbers.

So I built a tracker. For ~95 days it's logged the daily SWE-Bench-Pro pass rate for Claude Code and Codex — the % of real coding tasks each agent completes unassisted — and plotted them as candlesticks (open = yesterday, close = today, wick = the 90% confidence interval for that day's sample). Same idea as a stock K-line, except the "price" is how often the agent actually solves the task.

Here's what the data says — and it's more interesting than "it got dumber."

Claude Code: a real step up, then a recent slide

Plotting per-model-version baselines (median of the first 14 days after each release) makes the story obvious:

Opus 4.6 era — baseline ~54%
Opus 4.7 era — baseline ~65%

That 4.6 → 4.7 jump is a genuine +11 percentage point step. Not placebo — the model got materially better at finishing tasks, and it held ~65% steady for a month.

Then the last ~7 days: today's pass rate is ~52%, well below the 65% baseline and past the significance threshold (p < 0.05). So the "Claude Code feels worse lately" crowd isn't imagining it — there's a real, recent drift below the current model's own established baseline. Whether it's a quantization change, a routing tweak, or load — the number moved, and it moved past noise.

The nuance most threads miss: Claude Code is both "much better than 6 months ago" and "drifting down this week." Both are true. Vibes can't hold two facts at once; data can.

Codex: three versions, basically flat

Now the part nobody expects. Across three Codex releases:

gpt-5.3-codex — ~58%
gpt-5.4-xhigh — ~54%
gpt-5.5-xhigh — ~56%

Three "major" version bumps, and the pass rate just oscillates in a 54–58% band. No step change. The releases didn't move the benchmark needle the way Opus 4.7 did. If you've felt like "new Codex doesn't feel smarter" — the data agrees: it's been flat.

Why candlesticks (and a fixed 0–100 axis)

Two design choices that matter if you want to read drift honestly:

Fixed 0–100% y-axis. Auto-scaling per time window makes a 5pp dip look catastrophic because the view zooms in. A 5pp drop should look like a 5pp drop whether you're comparing 30 days or 90, Claude or Codex.
Per-era baselines, not one flat line. A single baseline across model versions lies about the older model. Each release gets its own dashed reference, so you can see the step, not just the absolute level.

The live, daily-updating version (red/green toggle for CN vs Western convention, daily/weekly K, 30/90/all windows per agent) is here: Drift K-Line tracker →

What this means if you ship with these agents

Don't trust a single bad day. One red candle is inside the noise band. A week below baseline is signal. Watch the baseline line, not the last datapoint.
"Newer version" ≠ "smarter." Codex's flat line is the proof. Benchmark before you migrate a workflow to a new release.
Capability drifts. Your costs shouldn't have to. If an agent quietly drops 13pp, the last thing you want is to also be locked into one vendor's pricing while you wait it out.

Author note: I build keaiapi, a pay-as-you-go aggregator that routes Claude, GPT, Gemini, DeepSeek and 20+ models through one OpenAI-compatible endpoint — so when a model drifts, you can switch the one you point at without rewriting code or eating a subscription. The tracker above is a free tool we run; no signup needed to read it. Methodology notes are on the tracker page.

Top comments (1)

Harjot Singh • May 31

95 days of actual pass-rate tracking is the antidote to the endless vibes-based "Claude got dumber" threads - measuring instead of feeling is rare and valuable. The interesting question your data can settle: is perceived degradation real (quantization/routing changes on the provider side), or is it the user's bar rising + novelty wearing off + harder tasks over time? Both produce the same "feels worse" sensation from completely different causes.

The strategic takeaway regardless of the answer: if a model you depend on can silently drift in quality and you'd only notice via systematic tracking, that's an argument for not being locked to one model. Being model-agnostic means a provider's bad week is a routing change, not a crisis - which is a core reason Moonshift (a multi-agent pipeline shipping a prompt to a real SaaS) routes across models rather than marrying one. ~$3 flat per build, first run free. Excellent data work - did you find real degradation, or was it mostly variance/perception once you had the numbers? That's the question everyone argues without measuring.