Benchmarks are useful. Benchmarks are also easy to misread.
This post explains what SWE-bench Verified measures, how it relates to agentic coding, and what changed in practice between Claude Sonnet 4.5 and Claude Sonnet 4.6.
1) SWE-bench in one paragraph
SWE-bench is a benchmark built from real GitHub issues and pull requests across popular open-source repositories. The task is simple to describe and hard to do well: you get a repo + an issue description, and you must produce a patch that makes failing tests pass.
It’s widely used because it’s closer to real software work than “write a function” coding prompts.
Key references
- SWE-bench overview and original description: https://www.swebench.com/SWE-bench/
- SWE-bench paper (ICLR 2024): https://arxiv.org/abs/2310.06770
2) Why “SWE-bench Verified” exists
OpenAI introduced SWE-bench Verified because some SWE-bench tasks are ambiguous or effectively unsolvable, which can systematically underestimate model ability. Verified is a human-validated subset of 500 tasks that engineers confirmed are solvable.
Key references
- OpenAI announcement: https://openai.com/index/introducing-swe-bench-verified/
- Dataset page (500 verified samples): https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified
- SWE-bench datasets guide (lists “Verified — 500 instances”): https://www.swebench.com/SWE-bench/guides/datasets/
3) What people mean by “agentic coding”
“Agentic coding” usually means the model doesn’t just output code: it operates like an engineer. Concretely, it:
- explores the repo,
- runs tests,
- reads logs,
- edits multiple files,
- writes/updates tests,
- iterates until it passes,
- and (ideally) explains what it changed and why.
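That loop can be sketched in a few lines of Python. Everything here is a stand-in: `propose_patch` is the model call, `apply_patch` writes the edit, and `run_tests` wraps your test command — none of these are a real agent framework API.

```python
import subprocess

def run_tests(cmd=("pytest", "-q")):
    """Run the test suite; return (passed, combined output)."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def agent_loop(propose_patch, apply_patch, run_tests, max_iters=5):
    """Minimal explore/edit/test loop.

    propose_patch(feedback) is the model call (stubbed by the caller),
    apply_patch(patch) writes the edit into the working tree, and
    run_tests() returns (passed, output).
    """
    feedback = ""
    for attempt in range(1, max_iters + 1):
        patch = propose_patch(feedback)   # model proposes an edit
        apply_patch(patch)                # apply it to the repo
        ok, output = run_tests()
        if ok:
            return attempt                # converged: tests are green
        feedback = output[-2000:]         # feed the failure log back in
    return None                           # no convergence within budget
```

The important design point is the `feedback` channel: real scaffolds differ mostly in how much test/log context they feed back into the next model call.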
This is why SWE-bench (and especially Verified) is often used as a proxy for “can it fix real bugs in real repos”.
A second proxy: Terminal-Bench 2.0
Agentic coding often happens in a terminal: installing deps, running test suites, grepping logs, editing config, etc.
Terminal-Bench 2.0 focuses specifically on this terminal-style, tool-using, end-to-end competence. It’s a suite of interactive tasks run inside containers, designed to test whether an agent can actually finish realistic command-line work.
Key references
- Terminal-Bench paper (Harbor tasks + harness): https://arxiv.org/html/2601.11868v1
- Terminal-Bench repo: https://github.com/laude-institute/terminal-bench
- Terminal-Bench 2.0 announcement (Harbor framework): https://www.tbench.ai/news/announcement-2-0
4) A benchmark snapshot (Sonnet 4.5 vs Sonnet 4.6 vs others)
Anthropic’s Sonnet 4.6 system card includes a table that reports (among other things):
- SWE-bench Verified
- Terminal-Bench 2.0
Here are the headline rows (percent pass rate):
| Model | SWE-bench Verified | Terminal-Bench 2.0 |
|---|---|---|
| Claude Sonnet 4.6 | 79.6% | 59.1% |
| Claude Sonnet 4.5 | 77.2% | 51.0% |
| Claude Opus 4.6 | 80.8% | 65.4% |
| Claude Opus 4.5 | 80.9% | 59.8% |
| Gemini 3 Pro | 76.2% | 56.2% |
| GPT-5.2 (all models) | 80.0% | 64.7% |
Source: Claude Sonnet 4.6 System Card (table and surrounding methodology): https://anthropic.com/claude-sonnet-4-6-system-card
Note: SWE-bench Verified scores in that table are averaged over multiple trials; details and scaffolds matter.
5) The major differences: Sonnet 4.5 → Sonnet 4.6 (for coding)
Let’s translate numbers and “model release notes” into what you feel day-to-day.
5.1) Higher “real repo bugfix” success rate
On SWE-bench Verified:
- Sonnet 4.5: 77.2%
- Sonnet 4.6: 79.6%

That’s a gain of +2.4 percentage points.
If you think in “issues solved out of 500”, that’s roughly:
- 4.5: ~386/500
- 4.6: ~398/500

(Approximate, because the reported score is averaged over multiple trials rather than a single deterministic run.)
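The conversion is just rate × 500, rounded:

```python
# Back-of-envelope: averaged pass rate -> "issues solved out of 500".
N = 500
for name, rate in [("Sonnet 4.5", 0.772), ("Sonnet 4.6", 0.796)]:
    print(f"{name}: ~{round(rate * N)}/{N}")
# prints ~386/500 and ~398/500
```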
Why it matters: SWE-bench Verified is closest to “take an issue, ship a patch, tests pass.” A couple points here can mean fewer dead-ends when you delegate real bugfix work.
5.2) Much better terminal-loop performance (the agentic part)
On Terminal-Bench 2.0:
- Sonnet 4.5: 51.0%
- Sonnet 4.6: 59.1%

That’s a gain of +8.1 points, the larger of the two jumps.
Why it matters: in agentic workflows, the bottleneck is often iteration:
“run tests → inspect failure → adjust → rerun”
If your model is weak at CLI-level completion, it will burn tokens/time in loops without converging.
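The “inspect failure” step of that loop is mostly log parsing. A minimal sketch against pytest’s standard `FAILED path::test` summary lines (the sample output below is invented for illustration):

```python
import re

def failing_tests(pytest_output: str) -> list[str]:
    """Extract failed test ids from pytest's short summary lines,
    e.g. 'FAILED tests/test_api.py::test_login - AssertionError'."""
    return re.findall(r"^FAILED (\S+)", pytest_output, flags=re.MULTILINE)

out = """\
tests/test_api.py::test_login FAILED
FAILED tests/test_api.py::test_login - AssertionError: bad token
FAILED tests/test_db.py::test_migrate - KeyError: 'url'
2 failed, 40 passed in 3.21s
"""
print(failing_tests(out))
# ['tests/test_api.py::test_login', 'tests/test_db.py::test_migrate']
```

A model that can reliably extract and act on this kind of signal converges; one that can’t reruns the whole suite blindly and burns its budget.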
5.3) More “tool-first” behavior (and better guidance for it)
In the Sonnet 4.6 system card, Anthropic shows a prompt modification that improves SWE-bench Verified, emphasizing:
- use tools heavily,
- write your own tests early,
- explore the codebase,
- fix root causes, not symptoms.
Even if you don’t copy that prompt verbatim, it reflects the direction: Sonnet 4.6 is designed to behave more like a tool-using agent — and it responds well to agent-style instructions.
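Those four directives translate naturally into an agent system prompt. The wording below is an illustrative paraphrase of that direction, not the actual prompt text from the system card:

```python
# Illustrative agent-style system prompt built from the four directives above.
# This is a paraphrase for demonstration, NOT Anthropic's published prompt.
AGENT_SYSTEM_PROMPT = """\
You are a software engineering agent working in a real repository.
- Use your tools heavily: read files, search the codebase, and run commands
  instead of guessing.
- Write or update your own tests early, so every change is verifiable.
- Explore the codebase until you understand how the affected modules interact.
- Fix root causes, not symptoms: if a test exposes a deeper bug, fix the bug
  rather than patching around the test.
"""
```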
5.4) Better planning under long context
Anthropic markets Sonnet 4.6 as a “full upgrade” across:
- coding,
- long-context reasoning,
- agent planning,
- computer use.
This matters for real repos because “coding ability” is often limited by:
- missing a constraint in a long issue thread,
- misunderstanding a config file,
- failing to connect test failures across modules,
- or breaking an edge case you didn’t see in docs.
Reference: Sonnet 4.6 release post: https://www.anthropic.com/news/claude-sonnet-4-6
6) How to interpret these numbers without fooling yourself
Benchmarks like SWE-bench Verified and Terminal-Bench are not just “model IQ scores”. They combine:
- model capability,
- agent scaffold (tools, harness, permissions),
- prompting and retry policies,
- environment stability (deps, timeouts, etc.).
So use them like you’d use performance tests in engineering:
- good for regression detection,
- good for “directionally better/worse”,
- risky to over-generalize to your stack.
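One concrete guard against over-reading a gap: with 500 tasks, sampling noise on a single pass rate is a couple of points by itself. A rough standard error for the difference between two pass rates, treating tasks as independent Bernoulli trials (a simplification, and trial-averaging shrinks this somewhat):

```python
import math

def diff_stderr(p1: float, p2: float, n: int = 500) -> float:
    """Approximate standard error of the difference between two pass
    rates measured on the same n-task benchmark (independence assumed)."""
    return math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)

se = diff_stderr(0.772, 0.796)
print(f"gap = {0.796 - 0.772:.3f}, stderr ~ {se:.3f}")
# The 2.4pp SWE-bench Verified gap is on the order of one standard error,
# so read it as "directionally better", not a precise ranking.
```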
7) Practical takeaway: when Sonnet 4.6 is the better choice
If you do any of these regularly, Sonnet 4.6’s improvements tend to show up immediately:
- “Fix this failing CI pipeline”
- “Patch this bug in a real repo”
- “Refactor this module and keep tests green”
- “Upgrade dependencies and adjust code until it builds”
- “Ship a PR-sized change with minimal handholding”
And if your workflow is heavily terminal/tool based, the Terminal-Bench jump is the most relevant signal.
8) If you want to evaluate models for your own codebase
Here’s a lightweight “internal SWE-bench” you can run in a weekend:
- Collect ~20 real issues/bugs from your history (closed tickets, failing tests, regressions).
- For each issue, define:
- starting commit
- reproduction steps
- success condition (tests, snapshot, output)
- Run each model with the same scaffold:
- same tools (read/write files, run tests, search)
- same retry policy
- same time budget
- Track:
- success rate
- time-to-green
- number of tool calls
- diff size (patch quality proxy)
- “regressions introduced” (new failing tests)
This will tell you far more than any single public leaderboard.
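A minimal harness for the per-task loop above might look like the sketch below. `run_agent` is a placeholder for your model-specific driver, and the whole thing assumes a git checkout and a test command — a weekend sketch, not a full harness:

```python
import subprocess
import time
from dataclasses import dataclass, field

@dataclass
class TaskResult:
    task_id: str
    success: bool
    seconds: float           # time-to-green (or time-to-give-up)
    tool_calls: int
    diff_lines: int          # patch-size proxy
    new_failures: list = field(default_factory=list)  # regressions introduced

def diff_size(numstat: str) -> int:
    """Sum added+removed lines from `git diff --numstat` output,
    skipping binary files (which git reports as '-')."""
    total = 0
    for line in numstat.splitlines():
        added, removed, _path = line.split("\t", 2)
        if added.isdigit() and removed.isdigit():
            total += int(added) + int(removed)
    return total

def evaluate_task(task_id, start_commit, run_agent, test_cmd, budget_s=1800):
    """Run one model+scaffold on one task and collect the metrics above.

    run_agent(task_id) is your driver (hypothetical here); it should
    return the number of tool calls it made.
    """
    subprocess.run(["git", "checkout", "-f", start_commit], check=True)
    t0 = time.monotonic()
    tool_calls = run_agent(task_id)
    elapsed = time.monotonic() - t0
    ok = subprocess.run(test_cmd).returncode == 0 and elapsed <= budget_s
    numstat = subprocess.run(["git", "diff", "--numstat"],
                             capture_output=True, text=True).stdout
    return TaskResult(task_id, ok, elapsed, tool_calls, diff_size(numstat))
```

Keeping the scaffold identical across models is the whole point: any metric difference then reflects the model, not the harness.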
Final thoughts
SWE-bench Verified tells you: “can it patch real repos reliably?”
Terminal-Bench tells you: “can it survive the messy terminal loop?”
Between Sonnet 4.5 and 4.6, the story is pretty consistent:
- a modest gain in verified bugfix success,
- a big gain in terminal/agent loop competence,
- and more emphasis on agentic, tool-first workflows.
If you’re using Cursor/Claude Code or any tool-using dev agent, Sonnet 4.6 is the more practical daily driver — and the benchmarks explain why.