A0mineTV
SWE-bench, Agentic Coding, and What Actually Changed from Claude Sonnet 4.5 to 4.6

Benchmarks are useful. Benchmarks are also easy to misread.

This post explains what SWE-bench Verified measures, how it relates to agentic coding, and what changed in practice between Claude Sonnet 4.5 and Claude Sonnet 4.6.

1) SWE-bench in one paragraph

SWE-bench is a benchmark built from real GitHub issues and pull requests across popular open-source repositories. The task is simple to describe and hard to do well: you get a repo + an issue description, and you must produce a patch that makes failing tests pass.

It’s widely used because it’s closer to real software work than “write a function” coding prompts.

2) Why “SWE-bench Verified” exists

OpenAI introduced SWE-bench Verified because some SWE-bench tasks are ambiguous or effectively unsolvable, which can systematically underestimate model ability. Verified is a human-validated subset of 500 tasks that engineers confirmed are solvable.

3) What people mean by “agentic coding”

“Agentic coding” usually means the model doesn’t just output code; it operates like an engineer:

  • explores the repo,
  • runs tests,
  • reads logs,
  • edits multiple files,
  • writes/updates tests,
  • iterates until the tests pass,
  • and (ideally) explains what it changed and why.

This is why SWE-bench (and especially Verified) is often used as a proxy for “can it fix real bugs in real repos”.
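That iterate-until-green loop can be sketched as a small driver. The function names below (`run_tests`, `propose_patch`, `apply_patch`) are hypothetical tool callbacks for illustration, not any vendor's API:

```python
# Minimal sketch of an agentic bugfix loop. Tool names are illustrative,
# not a real SDK: run_tests -> (passed, log), propose_patch reads the log,
# apply_patch edits the repo.

def agent_fix(run_tests, propose_patch, apply_patch, max_iters=5):
    """Iterate: run tests -> inspect failure -> patch -> rerun."""
    for attempt in range(1, max_iters + 1):
        ok, log = run_tests()
        if ok:
            return attempt           # converged: tests green
        patch = propose_patch(log)   # model reads the failure log
        apply_patch(patch)           # edits files in the repo
    return None                      # budget exhausted without converging

# Toy simulation: the "repo" starts passing after two patches land.
state = {"patches": 0}
result = agent_fix(
    run_tests=lambda: (state["patches"] >= 2, "AssertionError in test_foo"),
    propose_patch=lambda log: "fix",
    apply_patch=lambda p: state.update(patches=state["patches"] + 1),
)
print(result)  # 3: failed twice, patched twice, passed on the third run
```

In a real scaffold, `run_tests` would shell out to the repo's test command and `propose_patch` would call the model with the failure log; the control flow stays the same.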

A second proxy: Terminal-Bench 2.0

Agentic coding often happens in a terminal: installing deps, running test suites, grepping logs, editing config, etc.

Terminal-Bench 2.0 focuses specifically on this terminal-style, tool-using, end-to-end competence. It’s a suite of interactive tasks run inside containers, designed to test whether an agent can actually finish realistic command-line work.

4) A benchmark snapshot (Sonnet 4.5 vs Sonnet 4.6 vs others)

Anthropic’s Sonnet 4.6 system card includes a table that reports (among other things):

  • SWE-bench Verified
  • Terminal-Bench 2.0

Here are the headline rows (percent pass rate):

Model                  SWE-bench Verified   Terminal-Bench 2.0
Claude Sonnet 4.6      79.6%                59.1%
Claude Sonnet 4.5      77.2%                51.0%
Claude Opus 4.6        80.8%                65.4%
Claude Opus 4.5        80.9%                59.8%
Gemini 3 Pro           76.2%                56.2%
GPT-5.2 (all models)   80.0%                64.7%

Source: Claude Sonnet 4.6 System Card (table and surrounding methodology): https://anthropic.com/claude-sonnet-4-6-system-card

Note: SWE-bench Verified scores in that table are averaged over multiple trials; details and scaffolds matter.

5) The major differences: Sonnet 4.5 → Sonnet 4.6 (for coding)

Let’s translate numbers and “model release notes” into what you feel day-to-day.

5.1) Higher “real repo bugfix” success rate

On SWE-bench Verified:

  • Sonnet 4.5: 77.2%
  • Sonnet 4.6: 79.6%

That’s a +2.4 percentage point gain.

If you think in “issues solved out of 500”, that’s roughly:

  • 4.5: ~386/500
  • 4.6: ~398/500

(Approximation: the benchmark is averaged over multiple trials rather than a single deterministic run.)
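The back-of-envelope arithmetic, spelled out:

```python
# Pass rate -> approximate issues solved out of the 500 Verified tasks.
total = 500
for name, rate in [("Sonnet 4.5", 0.772), ("Sonnet 4.6", 0.796)]:
    print(f"{name}: ~{round(rate * total)}/500")
# Sonnet 4.5: ~386/500
# Sonnet 4.6: ~398/500
# Delta: ~12 more issues, i.e. the +2.4 point gain.
```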

Why it matters: SWE-bench Verified is closest to “take an issue, ship a patch, tests pass.” A couple points here can mean fewer dead-ends when you delegate real bugfix work.

5.2) Much better terminal-loop performance (the agentic part)

On Terminal-Bench 2.0:

  • Sonnet 4.5: 51.0%
  • Sonnet 4.6: 59.1%

That’s +8.1 points, the bigger jump of the two.

Why it matters: in agentic workflows, the bottleneck is often iteration:

“run tests → inspect failure → adjust → rerun”

If your model is weak at CLI-level completion, it will burn tokens/time in loops without converging.
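One toy way to see why per-step completion rate compounds: if each run-tests/adjust iteration independently succeeds with probability p, the expected number of iterations is 1/p (a simplification that assumes independent attempts, which real agent loops are not, but it shows the shape of the cost):

```python
# Toy model: expected iterations to converge if each fix attempt
# succeeds independently with probability p (geometric distribution mean).
def expected_iterations(p):
    return 1 / p

print(expected_iterations(0.51))  # ~1.96 loops on average
print(expected_iterations(0.59))  # ~1.69 loops on average
```

Small differences in p translate into fewer wasted loops, and every avoided loop saves a full test run's worth of tokens and time.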

5.3) More “tool-first” behavior (and better guidance for it)

In the Sonnet 4.6 system card, Anthropic shows a prompt modification that improves SWE-bench Verified, emphasizing:

  • use tools heavily,
  • write your own tests early,
  • explore the codebase,
  • fix root causes, not symptoms.

Even if you don’t copy that prompt verbatim, it reflects the direction: Sonnet 4.6 is designed to behave more like a tool-using agent — and it responds well to agent-style instructions.
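For illustration, here is that guidance restated as a prompt fragment (a paraphrase of the four bullets above, NOT Anthropic's verbatim prompt from the system card):

```python
# Illustrative agent-style instruction block, paraphrased from the
# direction described in the system card. Not the published prompt.
AGENT_GUIDANCE = """\
- Use your tools heavily: read files, search, and run commands before editing.
- Write your own reproduction test early, before attempting a fix.
- Explore the codebase to understand how the failing path is wired.
- Fix the root cause, not the symptom, and keep existing tests green.
"""
print(AGENT_GUIDANCE)
```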

5.4) Better planning under long context

Anthropic markets Sonnet 4.6 as a “full upgrade” across:

  • coding,
  • long-context reasoning,
  • agent planning,
  • computer use.

This matters for real repos because “coding ability” is often limited by:

  • missing a constraint in a long issue thread,
  • misunderstanding a config file,
  • failing to connect test failures across modules,
  • or breaking an edge case you didn’t see in docs.

Reference: Sonnet 4.6 release post: https://www.anthropic.com/news/claude-sonnet-4-6

6) How to interpret these numbers without fooling yourself

Benchmarks like SWE-bench Verified and Terminal-Bench are not just “model IQ scores”. They combine:

  • model capability,
  • agent scaffold (tools, harness, permissions),
  • prompting and retry policies,
  • environment stability (deps, timeouts, etc.).

So use them like you’d use performance tests in engineering:

  • good for regression detection,
  • good for “directionally better/worse”,
  • risky to over-generalize to your stack.

7) Practical takeaway: when Sonnet 4.6 is the better choice

If you do any of these regularly, Sonnet 4.6’s improvements tend to show up immediately:

  • “Fix this failing CI pipeline”
  • “Patch this bug in a real repo”
  • “Refactor this module and keep tests green”
  • “Upgrade dependencies and adjust code until it builds”
  • “Ship a PR-sized change with minimal handholding”

And if your workflow is heavily terminal/tool based, the Terminal-Bench jump is the most relevant signal.

8) If you want to evaluate models for your own codebase

Here’s a lightweight “internal SWE-bench” you can run in a weekend:

  1. Collect ~20 real issues/bugs from your history (closed tickets, failing tests, regressions).
  2. For each issue, define:
    • starting commit
    • reproduction steps
    • success condition (tests, snapshot, output)
  3. Run each model with the same scaffold:
    • same tools (read/write files, run tests, search)
    • same retry policy
    • same time budget
  4. Track:
    • success rate
    • time-to-green
    • number of tool calls
    • diff size (patch quality proxy)
    • “regressions introduced” (new failing tests)

This will tell you far more than any single public leaderboard.
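The harness described above can be sketched in a few dozen lines. Everything here (`Issue`, `Result`, `run_agent`) is a hypothetical shape for your own code, not an existing library:

```python
# Hypothetical harness for a ~20-issue "internal SWE-bench".
# Issue/Result/run_agent are illustrative names, not a real API.
import time
from dataclasses import dataclass

@dataclass
class Issue:
    name: str
    start_commit: str
    check: callable          # success condition: () -> bool

@dataclass
class Result:
    issue: str
    passed: bool
    seconds: float
    tool_calls: int

def evaluate(issues, run_agent, time_budget=600):
    """Run one model/scaffold over all issues with the same budget."""
    results = []
    for issue in issues:
        t0 = time.monotonic()
        tool_calls = run_agent(issue, time_budget)  # returns # of tool calls
        results.append(Result(issue.name, issue.check(),
                              time.monotonic() - t0, tool_calls))
    success_rate = sum(r.passed for r in results) / len(results)
    return results, success_rate

# Toy usage: two fake issues and a stub agent that makes 7 tool calls.
issues = [Issue("bug-1", "abc123", lambda: True),
          Issue("bug-2", "def456", lambda: False)]
_, success_rate = evaluate(issues, run_agent=lambda issue, budget: 7)
print(success_rate)  # 0.5
```

Diff size and regressions introduced can be layered on by diffing against `start_commit` and rerunning the full suite after each patch; the key design choice is that every model sees the same tools, retries, and budget, so differences in the numbers reflect the model rather than the scaffold.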


Final thoughts

SWE-bench Verified tells you: “can it patch real repos reliably?”

Terminal-Bench tells you: “can it survive the messy terminal loop?”

Between Sonnet 4.5 and 4.6, the story is pretty consistent:

  • a modest gain in verified bugfix success,
  • a big gain in terminal/agent loop competence,
  • and more emphasis on agentic, tool-first workflows.

If you’re using Cursor/Claude Code or any tool-using dev agent, Sonnet 4.6 is the more practical daily driver — and the benchmarks explain why.
