BestCodes

Posted on Apr 1

Anthropic just dropped Claude Sonnet 5, and the benchmarks are kind of insane

#ai #programming #productivity #discuss

Actually an April Fool's Day prank

Okay so Anthropic quietly pushed a blog post live this morning and I think it's flying under the radar a bit — Claude Sonnet 5 is officially out as of today. Model string is claude-sonnet-5-20260401, already live in claude.ai as the new default and on the API at the same $3/$15 per million tokens pricing as Sonnet 4.6. No price hike. That part alone is worth stopping to think about.

What actually changed

The headline number is 92.4% on SWE-bench Verified. For context: Claude Opus 4.6, their previous flagship, sat at 80.8%. GPT-5.4 scores 57.7% on the same eval. Gemini 3.1 Pro is at 80.6%. Sonnet 5 just... leapfrogged all of them — including Anthropic's own Opus tier — at Sonnet pricing. That's a 12-point jump over Opus 4.6 in a single generation, from the mid-tier model.

Computer use is the other big story. 88.3% on OSWorld-Verified. The human expert baseline on that benchmark is 72.4%, meaning Sonnet 5 isn't just competitive with humans on desktop automation — it's meaningfully ahead. GPT-5.4 scored 75.0% when it launched last month, which was already a big deal. Sonnet 5 blows past it.

On the reasoning and science side: 96.2% on GPQA Diamond (PhD-level science questions). Gemini 3.1 Pro held the record on that one at 94.3% — Sonnet 5 takes it. And ARC-AGI-2, the abstract novel-reasoning benchmark that basically nobody was doing well on until recently: 84.7%. Gemini 3.1 Pro was at 77.1%, which itself was considered remarkable. Sonnet 5 is 7+ points ahead.

Why this matters for the competitive picture

The past few months have been genuinely interesting to watch. GPT-5.4 dropped on March 5th and made a lot of noise around computer use and context windows. Gemini 3.1 Pro launched February 19th and topped the GPQA Diamond leaderboard. Anthropic had Sonnet 4.6 as their current mid-tier, which was already punching above its weight — developers preferred it over Opus 4.5 59% of the time in head-to-heads.

But Sonnet 5 resets the entire scoreboard. Every benchmark category, across coding, computer use, abstract reasoning, and science knowledge — Sonnet 5 leads. And not just by a little in most cases. The SWE-bench number in particular is striking because that one is hard to game; it's measuring real GitHub issue resolution on novel codebases.

The pricing situation is what really stands out though. Gemini 3.1 Pro at $2/1M input is the cheapest frontier option. GPT-5.4 is at $2.50. Sonnet 5 is $3. For slightly higher input cost than GPT-5.4, you're getting significantly better performance across almost every axis. And compared to Opus 4.6 at $15/1M input, you're getting better benchmark scores at one-fifth the cost.

Context window and features

Sonnet 5 ships with the 2M token context window, now out of beta (the 1M window from Sonnet 4.6 has been promoted to stable and the 2M is available with the context-2m header). The adaptive thinking architecture from the 4.6 generation is still there, upgraded — Anthropic says the model dynamically allocates reasoning depth more effectively than before, which is probably where a lot of the benchmark gains are coming from.

Claude Code users on the early access build are already reporting noticeably better performance. Per Anthropic's internal numbers, devs preferred Sonnet 5 over Sonnet 4.6 in Claude Code roughly 82% of the time — citing fewer hallucinated completions, better cross-file context retention, and significantly improved frontend output quality.

The benchmarks

I grabbed the charts from Anthropic's launch page — attaching them here. The SWE-bench and OSWorld comparison is the one worth saving. The full comparison table at the bottom shows every major model side-by-side.

Quick take

If you're an Anthropic API user, just migrate to claude-sonnet-5-20260401. The improvements are across the board, the pricing is identical to what you were paying before, and based on what early users are saying in the dev Discord this morning it's a pretty significant day-to-day difference in actual use.

If you're not using Claude but you're on GPT-5.4 or Gemini 3.1 Pro — this is probably worth a benchmark run on your specific workload. Especially if you're doing anything coding-heavy or involving computer use. The numbers are hard to argue with.

Crazy time to be following this space. The pace of releases in Q1 2026 has been genuinely relentless — GPT-5.4 in early March, Gemini 3.1 Pro mid-February, Opus 4.6 in early February, and now this. No signs of slowing down.

Happy April Fool's Day! 🎉

Top comments (9)

BestCodes • Apr 1

Happy April Fool's Day!

BestCodes • Apr 2

Dev.to did this. I wonder if a moderator applied it?

Anmol Baranwal • Apr 2

😂 no way I was genuinely reading

BestCodes • Apr 2

lol

Varshith V Hegde • Apr 2

Nice I was locked in

BestCodes • Apr 2

Haha 😆

Rizèl Scarlett • Apr 3

im gullible lol

BestCodes • Apr 4

lol

Some comments may only be visible to logged-in visitors. Sign in to view all comments. Some comments have been hidden by the post's author - find out more