SWE-bench Scores Are Lying to You: Half of "Passing" PRs Wouldn't Be Merged
The Benchmark That's Fooling Everyone
If you've been excited about SWE-bench scores lately, here's a reality check: roughly half of the PRs that pass SWE-bench would not be merged into a real codebase.
A new study from METR had maintainers from scikit-learn, Sphinx, and pytest review 296 AI-generated PRs that passed SWE-bench's automated grader. The result? Only about 50% would actually get merged.
Let that sink in.
The 24-Point Gap
The study found that maintainer merge decisions are 24 percentage points lower than what SWE-bench's automated grader reports. Here's what that looks like:
SWE-bench Automated Grader: █████████████████░░░░░░░ ~72%
Actual Maintainer Merge:    ████████████░░░░░░░░░░░░ ~48%
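To make the gap concrete, here is a minimal sketch with made-up per-PR outcomes chosen to match the chart's rough figures (this is illustrative data, not METR's actual review records):

```python
# Hypothetical per-PR review outcomes (illustrative, not METR's real data):
# each entry is (passed_automated_grader, maintainer_would_merge).
reviews = [(True, True)] * 48 + [(True, False)] * 24 + [(False, False)] * 28

total = len(reviews)
grader_pass_rate = sum(passed for passed, _ in reviews) / total
merge_rate = sum(merged for _, merged in reviews) / total
gap_pp = (grader_pass_rate - merge_rate) * 100

print(f"grader: {grader_pass_rate:.0%}, merged: {merge_rate:.0%}, gap: {gap_pp:.0f} pp")
# grader: 72%, merged: 48%, gap: 24 pp
```

The middle group, PRs that pass the grader but would not be merged, is exactly the population the benchmark score hides.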
The gap isn't small; it's massive. And it's not because AI can't solve the problems. It's because real code review is about more than just "does this pass the tests?"
Why PRs Get Rejected
The maintainers flagged AI-generated PRs for these reasons:
| Issue | What It Means |
|---|---|
| Code Quality | Doesn't follow repo standards, style, or conventions |
| Breaks Other Code | Solves the issue but breaks something else |
| Core Functionality | Doesn't actually solve the problem correctly |
| Fails Automated Grader | The easy filter: didn't even pass tests |
The surprising part? Core functionality failures (cases where the PR appears to work but doesn't actually solve the issue) were relatively rare. The bigger problems were code quality and breaking other code.
What This Means for You
If You're Using AI Coding Assistants
Don't trust benchmark scores at face value: a 60% SWE-bench score doesn't mean the AI can resolve 60% of your real issues.
Human review is still essential: even when AI writes correct code, it might not meet your project's standards.
Set realistic expectations: AI can help, but it's not replacing senior devs yet.
If You're Evaluating AI Tools
The lesson here is simple: benchmarks are one signal, not the whole picture. A model that scores 10% higher on SWE-bench might not actually produce 10% more usable code.
The Real Improvement Rate
Here's the concerning part: the study found that the improvement rate implied by maintainer merge decisions is 9.6 percentage points per year slower than the rate the automated grader shows.
Automated Grader improvement: ~15 pp/year
Maintainer merge improvement: ~5 pp/year
This suggests that as models get better at passing tests, they are not necessarily getting better at producing production-ready code at the same rate.
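A rough linear projection shows why this matters. The starting levels and per-year rates below are the post's approximate figures (~72% vs ~48% today, ~15 vs ~5 pp/year), and the linear extrapolation is an assumption for illustration only:

```python
# Illustrative linear projection; starting levels and rates are the
# post's rough figures, and linearity is an assumption, not a forecast.
grader_score, merge_score = 72.0, 48.0   # approximate levels today (%)
grader_rate, merge_rate = 15.0, 5.0      # approximate improvement (pp/year)

for year in range(3):
    gap = grader_score - merge_score
    print(f"year {year}: grader {grader_score:.0f}%, "
          f"merge {merge_score:.0f}%, gap {gap:.0f} pp")
    grader_score = min(100.0, grader_score + grader_rate)
    merge_score = min(100.0, merge_score + merge_rate)
```

Under these assumptions the grader saturates near 100% within a couple of years while the merge rate lags far behind, so the headline number becomes less informative over time, not more.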
The Bottom Line
SWE-bench is useful; it's just not telling you what you think it's telling you. A passing PR is a starting point, not a finish line.
If you are building AI-powered development tools or relying on AI assistants, the gap between benchmark performance and real-world code quality is the most important number to watch.
Because at the end of the day, what matters is not whether the tests pass; it's whether your maintainers will actually merge the PR.
This post was generated automatically from research published by METR on March 10, 2026.