SWE-bench Scores Are Lying to You: Half of "Passing" PRs Wouldn't Be Merged
The Benchmark That's Fooling Everyone
If you've been excited about SWE-bench scores lately, here's a reality check: roughly half of the PRs that pass SWE-bench would not be merged into a real codebase.
A new study from METR had maintainers from scikit-learn, Sphinx, and pytest review 296 AI-generated PRs that passed SWE-bench's automated grader. The result? Only about 50% would actually get merged.
Let that sink in.
The 24-Point Gap
The study found that maintainer merge decisions are 24 percentage points lower than what SWE-bench's automated grader reports. Here's what that looks like:
SWE-bench Automated Grader: █████████████████░░░░░░░ ~72%
Actual Maintainer Merge:    ████████████░░░░░░░░░░░░ ~48%
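To make the gap concrete, here is a minimal sketch with made-up per-PR outcomes chosen to match the chart's rough figures (this is illustrative data, not METR's actual review records):

```python
# Hypothetical per-PR review outcomes (illustrative, not METR's real data):
# each entry is (passed_automated_grader, maintainer_would_merge).
reviews = [(True, True)] * 48 + [(True, False)] * 24 + [(False, False)] * 28

total = len(reviews)
grader_pass_rate = sum(passed for passed, _ in reviews) / total
merge_rate = sum(merged for _, merged in reviews) / total
gap_pp = (grader_pass_rate - merge_rate) * 100

print(f"grader: {grader_pass_rate:.0%}, merged: {merge_rate:.0%}, gap: {gap_pp:.0f} pp")
# grader: 72%, merged: 48%, gap: 24 pp
```

The middle group, PRs that pass the grader but would not be merged, is exactly the population the benchmark score hides.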
The gap isn't small; it's massive. And it's not because AI can't solve the problems. It's because real code review is about more than just "does this pass the tests?"
Why PRs Get Rejected
The maintainers flagged AI-generated PRs for these reasons:
| Issue | What It Means |
|---|---|
| Code Quality | Doesn't follow repo standards, style, or conventions |
| Breaks Other Code | Solves the issue but breaks something else |
| Core Functionality | Doesn't actually solve the problem correctly |
| Fails Automated Grader | The easy filter: didn't even pass tests |
The surprising part? Core functionality failures (cases where the PR appears to work but doesn't actually solve the issue) were relatively rare. The bigger problems were code quality and breaking other code.
What This Means for You
If You're Using AI Coding Assistants
Don't trust benchmark scores at face value: a 60% SWE-bench score doesn't mean the AI can resolve 60% of your real issues.
Human review is still essential: even when AI writes correct code, it might not meet your project's standards.
Set realistic expectations: AI can help, but it's not replacing senior devs yet.
If You're Evaluating AI Tools
The lesson here is simple: benchmarks are one signal, not the whole picture. A model that scores 10% higher on SWE-bench might not actually produce 10% more usable code.
The Real Improvement Rate
Here's the concerning part: the study found that the improvement rate implied by maintainer merge decisions is 9.6 percentage points per year slower than the rate the automated grader shows.
Automated Grader improvement: ~15 pp/year
Maintainer merge improvement: ~5 pp/year
This suggests that as models get better at passing tests, they are not necessarily getting better at producing production-ready code at the same rate.
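A rough linear projection shows why this matters. The starting levels and per-year rates below are the post's approximate figures (~72% vs ~48% today, ~15 vs ~5 pp/year), and the linear extrapolation is an assumption for illustration only:

```python
# Illustrative linear projection; starting levels and rates are the
# post's rough figures, and linearity is an assumption, not a forecast.
grader_score, merge_score = 72.0, 48.0   # approximate levels today (%)
grader_rate, merge_rate = 15.0, 5.0      # approximate improvement (pp/year)

for year in range(3):
    gap = grader_score - merge_score
    print(f"year {year}: grader {grader_score:.0f}%, "
          f"merge {merge_score:.0f}%, gap {gap:.0f} pp")
    grader_score = min(100.0, grader_score + grader_rate)
    merge_score = min(100.0, merge_score + merge_rate)
```

Under these assumptions the grader saturates near 100% within a couple of years while the merge rate lags far behind, so the headline number becomes less informative over time, not more.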
The Bottom Line
SWE-bench is useful; it's just not telling you what you think it's telling you. A passing PR is a starting point, not a finish line.
If you are building AI-powered development tools or relying on AI assistants, the gap between benchmark performance and real-world code quality is the most important number to watch.
Because at the end of the day, what matters is not whether the tests pass; it's whether your maintainers will actually merge the PR.
This post was generated automatically from research published by METR on March 10, 2026.