Benchmarks are useful. Benchmarks are also easy to misread.
This post explains what SWE-bench Verified measures, how it relates to agentic coding, and what changed in practice between Claude Sonnet 4.5 and Claude Sonnet 4.6.
1) SWE-bench in one paragraph
SWE-bench is a benchmark built from real GitHub issues and pull requests across popular open-source repositories. The task is simple to describe and hard to do well: you get a repo + an issue description, and you must produce a patch that makes failing tests pass.
It’s widely used because it’s closer to real software work than “write a function” coding prompts.
Key references
- SWE-bench overview and original description: https://www.swebench.com/SWE-bench/
- SWE-bench paper (ICLR 2024): https://arxiv.org/abs/2310.06770
2) Why “SWE-bench Verified” exists
OpenAI introduced SWE-bench Verified because some SWE-bench tasks are ambiguous or effectively unsolvable, which can systematically underestimate model ability. Verified is a human-validated subset of 500 tasks that engineers confirmed are solvable.
Key references
- OpenAI announcement: https://openai.com/index/introducing-swe-bench-verified/
- Dataset page (500 verified samples): https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified
- SWE-bench datasets guide (lists “Verified — 500 instances”): https://www.swebench.com/SWE-bench/guides/datasets/
3) What people mean by “agentic coding”
“Agentic coding” usually means the model doesn’t just output code: it operates like an engineer. Concretely, it:
- explores the repo,
- runs tests,
- reads logs,
- edits multiple files,
- writes/updates tests,
- iterates until it passes,
- and (ideally) explains what it changed and why.
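That loop can be sketched in a few lines of Python. Everything here is a stand-in: `propose_patch` is the model call, `apply_patch` writes the edit, and `run_tests` wraps your test command — none of these are a real agent framework API.

```python
import subprocess

def run_tests(cmd=("pytest", "-q")):
    """Run the test suite; return (passed, combined output)."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def agent_loop(propose_patch, apply_patch, run_tests, max_iters=5):
    """Minimal explore/edit/test loop.

    propose_patch(feedback) is the model call (stubbed by the caller),
    apply_patch(patch) writes the edit into the working tree, and
    run_tests() returns (passed, output).
    """
    feedback = ""
    for attempt in range(1, max_iters + 1):
        patch = propose_patch(feedback)   # model proposes an edit
        apply_patch(patch)                # apply it to the repo
        ok, output = run_tests()
        if ok:
            return attempt                # converged: tests are green
        feedback = output[-2000:]         # feed the failure log back in
    return None                           # no convergence within budget
```

The important design point is the `feedback` channel: real scaffolds differ mostly in how much test/log context they feed back into the next model call.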
This is why SWE-bench (and especially Verified) is often used as a proxy for “can it fix real bugs in real repos”.
A second proxy: Terminal-Bench 2.0
Agentic coding often happens in a terminal: installing deps, running test suites, grepping logs, editing config, etc.
Terminal-Bench 2.0 focuses specifically on this terminal-style, tool-using, end-to-end competence. It’s a suite of interactive tasks run inside containers, designed to test whether an agent can actually finish realistic command-line work.
Key references
- Terminal-Bench paper (Harbor tasks + harness): https://arxiv.org/html/2601.11868v1
- Terminal-Bench repo: https://github.com/laude-institute/terminal-bench
- Terminal-Bench 2.0 announcement (Harbor framework): https://www.tbench.ai/news/announcement-2-0
4) A benchmark snapshot (Sonnet 4.5 vs Sonnet 4.6 vs others)
Anthropic’s Sonnet 4.6 system card includes a table that reports (among other things):
- SWE-bench Verified
- Terminal-Bench 2.0
Here are the headline rows (percent pass rate):
| Model | SWE-bench Verified | Terminal-Bench 2.0 |
|---|---|---|
| Claude Sonnet 4.6 | 79.6% | 59.1% |
| Claude Sonnet 4.5 | 77.2% | 51.0% |
| Claude Opus 4.6 | 80.8% | 65.4% |
| Claude Opus 4.5 | 80.9% | 59.8% |
| Gemini 3 Pro | 76.2% | 56.2% |
| GPT-5.2 (all models) | 80.0% | 64.7% |
Source: Claude Sonnet 4.6 System Card (table and surrounding methodology): https://anthropic.com/claude-sonnet-4-6-system-card
Note: SWE-bench Verified scores in that table are averaged over multiple trials; details and scaffolds matter.
5) The major differences: Sonnet 4.5 → Sonnet 4.6 (for coding)
Let’s translate numbers and “model release notes” into what you feel day-to-day.
5.1) Higher “real repo bugfix” success rate
On SWE-bench Verified:
- Sonnet 4.5: 77.2%
- Sonnet 4.6: 79.6%

That’s a gain of +2.4 percentage points.
If you think in “issues solved out of 500”, that’s roughly:
- 4.5: ~386/500
- 4.6: ~398/500

(Approximate, because the reported score is averaged over multiple trials rather than a single deterministic run.)
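The conversion is just rate × 500, rounded:

```python
# Back-of-envelope: averaged pass rate -> "issues solved out of 500".
N = 500
for name, rate in [("Sonnet 4.5", 0.772), ("Sonnet 4.6", 0.796)]:
    print(f"{name}: ~{round(rate * N)}/{N}")
# prints ~386/500 and ~398/500
```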
Why it matters: SWE-bench Verified is closest to “take an issue, ship a patch, tests pass.” A couple points here can mean fewer dead-ends when you delegate real bugfix work.
5.2) Much better terminal-loop performance (the agentic part)
On Terminal-Bench 2.0:
- Sonnet 4.5: 51.0%
- Sonnet 4.6: 59.1%

That’s a gain of +8.1 points, the larger of the two jumps.
Why it matters: in agentic workflows, the bottleneck is often iteration:
“run tests → inspect failure → adjust → rerun”
If your model is weak at CLI-level completion, it will burn tokens/time in loops without converging.
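The “inspect failure” step of that loop is mostly log parsing. A minimal sketch against pytest’s standard `FAILED path::test` summary lines (the sample output below is invented for illustration):

```python
import re

def failing_tests(pytest_output: str) -> list[str]:
    """Extract failed test ids from pytest's short summary lines,
    e.g. 'FAILED tests/test_api.py::test_login - AssertionError'."""
    return re.findall(r"^FAILED (\S+)", pytest_output, flags=re.MULTILINE)

out = """\
tests/test_api.py::test_login FAILED
FAILED tests/test_api.py::test_login - AssertionError: bad token
FAILED tests/test_db.py::test_migrate - KeyError: 'url'
2 failed, 40 passed in 3.21s
"""
print(failing_tests(out))
# ['tests/test_api.py::test_login', 'tests/test_db.py::test_migrate']
```

A model that can reliably extract and act on this kind of signal converges; one that can’t reruns the whole suite blindly and burns its budget.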
5.3) More “tool-first” behavior (and better guidance for it)
In the Sonnet 4.6 system card, Anthropic shows a prompt modification that improves SWE-bench Verified, emphasizing:
- use tools heavily,
- write your own tests early,
- explore the codebase,
- fix root causes, not symptoms.
Even if you don’t copy that prompt verbatim, it reflects the direction: Sonnet 4.6 is designed to behave more like a tool-using agent — and it responds well to agent-style instructions.
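Those four directives translate naturally into an agent system prompt. The wording below is an illustrative paraphrase of that direction, not the actual prompt text from the system card:

```python
# Illustrative agent-style system prompt built from the four directives above.
# This is a paraphrase for demonstration, NOT Anthropic's published prompt.
AGENT_SYSTEM_PROMPT = """\
You are a software engineering agent working in a real repository.
- Use your tools heavily: read files, search the codebase, and run commands
  instead of guessing.
- Write or update your own tests early, so every change is verifiable.
- Explore the codebase until you understand how the affected modules interact.
- Fix root causes, not symptoms: if a test exposes a deeper bug, fix the bug
  rather than patching around the test.
"""
```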
5.4) Better planning under long context
Anthropic markets Sonnet 4.6 as a “full upgrade” across:
- coding,
- long-context reasoning,
- agent planning,
- computer use.
This matters for real repos because “coding ability” is often limited by:
- missing a constraint in a long issue thread,
- misunderstanding a config file,
- failing to connect test failures across modules,
- or breaking an edge case you didn’t see in docs.
Reference: Sonnet 4.6 release post: https://www.anthropic.com/news/claude-sonnet-4-6
6) How to interpret these numbers without fooling yourself
Benchmarks like SWE-bench Verified and Terminal-Bench are not just “model IQ scores”. They combine:
- model capability,
- agent scaffold (tools, harness, permissions),
- prompting and retry policies,
- environment stability (deps, timeouts, etc.).
So use them like you’d use performance tests in engineering:
- good for regression detection,
- good for “directionally better/worse”,
- risky to over-generalize to your stack.
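One concrete guard against over-reading a gap: with 500 tasks, sampling noise on a single pass rate is a couple of points by itself. A rough standard error for the difference between two pass rates, treating tasks as independent Bernoulli trials (a simplification, and trial-averaging shrinks this somewhat):

```python
import math

def diff_stderr(p1: float, p2: float, n: int = 500) -> float:
    """Approximate standard error of the difference between two pass
    rates measured on the same n-task benchmark (independence assumed)."""
    return math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)

se = diff_stderr(0.772, 0.796)
print(f"gap = {0.796 - 0.772:.3f}, stderr ~ {se:.3f}")
# The 2.4pp SWE-bench Verified gap is on the order of one standard error,
# so read it as "directionally better", not a precise ranking.
```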
7) Practical takeaway: when Sonnet 4.6 is the better choice
If you do any of these regularly, Sonnet 4.6’s improvements tend to show up immediately:
- “Fix this failing CI pipeline”
- “Patch this bug in a real repo”
- “Refactor this module and keep tests green”
- “Upgrade dependencies and adjust code until it builds”
- “Ship a PR-sized change with minimal handholding”
And if your workflow is heavily terminal/tool based, the Terminal-Bench jump is the most relevant signal.
8) If you want to evaluate models for your own codebase
Here’s a lightweight “internal SWE-bench” you can run in a weekend:
- Collect ~20 real issues/bugs from your history (closed tickets, failing tests, regressions).
- For each issue, define:
- starting commit
- reproduction steps
- success condition (tests, snapshot, output)
- Run each model with the same scaffold:
- same tools (read/write files, run tests, search)
- same retry policy
- same time budget
- Track:
- success rate
- time-to-green
- number of tool calls
- diff size (patch quality proxy)
- “regressions introduced” (new failing tests)
This will tell you far more than any single public leaderboard.
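A minimal harness for the per-task loop above might look like the sketch below. `run_agent` is a placeholder for your model-specific driver, and the whole thing assumes a git checkout and a test command — a weekend sketch, not a full harness:

```python
import subprocess
import time
from dataclasses import dataclass, field

@dataclass
class TaskResult:
    task_id: str
    success: bool
    seconds: float           # time-to-green (or time-to-give-up)
    tool_calls: int
    diff_lines: int          # patch-size proxy
    new_failures: list = field(default_factory=list)  # regressions introduced

def diff_size(numstat: str) -> int:
    """Sum added+removed lines from `git diff --numstat` output,
    skipping binary files (which git reports as '-')."""
    total = 0
    for line in numstat.splitlines():
        added, removed, _path = line.split("\t", 2)
        if added.isdigit() and removed.isdigit():
            total += int(added) + int(removed)
    return total

def evaluate_task(task_id, start_commit, run_agent, test_cmd, budget_s=1800):
    """Run one model+scaffold on one task and collect the metrics above.

    run_agent(task_id) is your driver (hypothetical here); it should
    return the number of tool calls it made.
    """
    subprocess.run(["git", "checkout", "-f", start_commit], check=True)
    t0 = time.monotonic()
    tool_calls = run_agent(task_id)
    elapsed = time.monotonic() - t0
    ok = subprocess.run(test_cmd).returncode == 0 and elapsed <= budget_s
    numstat = subprocess.run(["git", "diff", "--numstat"],
                             capture_output=True, text=True).stdout
    return TaskResult(task_id, ok, elapsed, tool_calls, diff_size(numstat))
```

Keeping the scaffold identical across models is the whole point: any metric difference then reflects the model, not the harness.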
Final thoughts
SWE-bench Verified tells you: “can it patch real repos reliably?”
Terminal-Bench tells you: “can it survive the messy terminal loop?”
Between Sonnet 4.5 and 4.6, the story is pretty consistent:
- a modest gain in verified bugfix success,
- a big gain in terminal/agent loop competence,
- and more emphasis on agentic, tool-first workflows.
If you’re using Cursor/Claude Code or any tool-using dev agent, Sonnet 4.6 is the more practical daily driver — and the benchmarks explain why.