
ayame0328

SWE-bench PRs Pass Tests but Won't Merge — The Security Gap Nobody's Talking About

METR just dropped a finding that should make every team using AI coding assistants pause and rethink their workflow: many SWE-bench-passing pull requests would never actually be merged into main.

The PRs pass automated tests. They solve the issue. But when human reviewers look at them, they find code that's brittle, over-engineered, or — and this is the part that keeps me up at night — silently insecure.

I've been building a security scanner specifically for AI-generated code for the past two weeks, and this research validates exactly what I've been seeing in the wild.

What METR Actually Found

The METR study evaluated AI-generated PRs that technically passed SWE-bench's test suites. The results:

  • PRs solved the stated problem ✅
  • PRs passed existing tests ✅
  • PRs would survive human code review ❌

The gap between "tests pass" and "this is production-ready code" turns out to be enormous. And security lives right in that gap.

Why Tests Don't Catch Security Issues

Here's something I learned the hard way while building CodeHeal's scan engine.

I started by running 6 sample files through my scanner — code that looked perfectly functional. Two files had bugs my rules missed initially:

  1. A shell command using unquoted variable expansion in rm -rf $DIR — tests passed because the test environment had no spaces in paths
  2. A fetch() call with user-controlled URLs — tests passed because the test server was localhost

Both would have sailed through any CI pipeline. Both were real vulnerabilities.

The fundamental problem: test suites verify behavior, not intent. An AI model that generates eval(userInput) can write a perfect test for it — because the test just checks that eval works. Nobody asked whether eval should be there.

The Patterns I Keep Seeing

After scanning hundreds of AI-generated code snippets, certain patterns repeat with alarming frequency:

1. Hardcoded secrets that "work"
AI models love embedding API keys directly in code. The app works. Tests pass. The key is on GitHub within minutes.
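A sketch of the fix -- the getApiKey helper is hypothetical; the point is that the secret lives in the environment, not in the repo:

```javascript
// What AI assistants tend to emit -- works, and ships the key to GitHub:
// const API_KEY = 'sk-live-abc123...';

// Safer pattern: read from the environment and fail fast when it's missing,
// so misconfiguration surfaces at startup instead of at the first API call.
function getApiKey(env = process.env) {
  const key = env.API_KEY;
  if (!key) {
    throw new Error('API_KEY is not set -- refusing to start');
  }
  return key;
}
```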

2. Overly permissive CORS
Access-Control-Allow-Origin: * appears in almost every AI-generated Express/Next.js backend I've scanned. It "works" for development. It's a security hole in production.
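A minimal sketch of the safer pattern, assuming an explicit allowlist of known front-end origins (the origins here are made up):

```javascript
// What AI output usually does -- fine in dev, a hole in production:
//   res.setHeader('Access-Control-Allow-Origin', '*');

// Safer: echo the request origin back only if it's on an explicit allowlist.
const ALLOWED_ORIGINS = new Set([
  'https://app.example.com',
  'https://admin.example.com',
]);

function corsOriginFor(requestOrigin) {
  // null means: send no Access-Control-Allow-Origin header at all
  return ALLOWED_ORIGINS.has(requestOrigin) ? requestOrigin : null;
}
```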

3. SQL queries without parameterization
The AI generates SELECT * FROM users WHERE id = ${userId}. It works. Tests pass (they use clean test data). SQL injection waiting to happen.
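The fix is one line of discipline: keep the SQL text constant and ship values separately. A sketch (placeholder syntax varies by driver; $1 here follows the pg convention):

```javascript
// UNSAFE: string interpolation. userId = "1 OR 1=1" dumps the whole table.
function unsafeQuery(userId) {
  return `SELECT * FROM users WHERE id = ${userId}`;
}

// SAFE: the SQL text is a constant; values travel separately and the
// driver handles escaping and typing.
function safeQuery(userId) {
  return { text: 'SELECT * FROM users WHERE id = $1', values: [userId] };
}
```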

4. Missing input validation at trust boundaries
AI-generated code tends to trust all inputs. No sanitization, no length limits, no type checking at API boundaries. The happy path works perfectly.
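A hand-rolled sketch of boundary validation (the schema is made up; in practice you'd likely reach for a library like zod or ajv):

```javascript
// Validate an incoming request body at the API boundary: type checks,
// length limits, and range checks before the data touches business logic.
function validateCreateUser(body) {
  if (typeof body !== 'object' || body === null) {
    return { ok: false, errors: ['body must be an object'] };
  }
  const errors = [];
  if (typeof body.name !== 'string' || body.name.length === 0 || body.name.length > 100) {
    errors.push('name must be a string of 1-100 characters');
  }
  if (!Number.isInteger(body.age) || body.age < 0 || body.age > 150) {
    errors.push('age must be an integer between 0 and 150');
  }
  return { ok: errors.length === 0, errors };
}
```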

5. Prototype pollution in object merging
Deep merge utilities that recursively copy properties without checking __proto__ or constructor. Tests pass because test objects are clean.
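A merge sketch that refuses the pollution vectors -- a hypothetical helper, not an actual CodeHeal rule:

```javascript
// Keys that let an attacker reach into Object.prototype via merged input.
const BLOCKED_KEYS = new Set(['__proto__', 'constructor', 'prototype']);

function safeMerge(target, source) {
  for (const key of Object.keys(source)) {
    if (BLOCKED_KEYS.has(key)) continue; // skip pollution vectors entirely
    const value = source[key];
    if (value && typeof value === 'object' && !Array.isArray(value)) {
      if (!target[key] || typeof target[key] !== 'object') target[key] = {};
      safeMerge(target[key], value);
    } else {
      target[key] = value;
    }
  }
  return target;
}
```

A malicious payload like `{"__proto__": {"polluted": true}}` (typically smuggled in via JSON.parse) is silently dropped instead of rewriting every object's prototype.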

What This Means for Your Team

If your team is adopting AI coding assistants (and statistically, you probably are), the METR finding means:

  1. Your test suite is not a security gate. Tests verify functionality, not safety.
  2. Code review is your last line of defense. But reviewers are increasingly trusting AI output because "it passed CI."
  3. You need automated security scanning that understands AI-generated patterns. Generic SAST tools flag known CVEs. They don't flag the subtle, "technically works" patterns that AI models produce.

The Google-Wiz Acquisition Context

This week also saw Google officially close its acquisition of Wiz, a cloud security company, in a deal reportedly worth $32 billion. The security market is exploding precisely because the attack surface is expanding faster than teams can manually review it.

AI-generated code is the next frontier of that expanding attack surface. And unlike human-written vulnerabilities that follow somewhat predictable patterns, AI-generated vulnerabilities are novel combinations that traditional scanners weren't designed to catch.

What I'm Doing About It

I built CodeHeal specifically for this problem. No LLM in the loop (ironic, I know) — pure static analysis with rules designed around the patterns AI models actually produce.

The scanner checks 14 vulnerability categories with 93+ detection rules. It's deterministic — same code, same results, every time. No API costs, no "it depends on the model's mood."

The hardest part wasn't building the rules. It was accepting that existing tools weren't enough. I spent my first week trying to configure Semgrep and ESLint to catch AI-specific patterns. They're great tools, but they're designed for human-written code patterns. The subtle "works but shouldn't" patterns that AI generates needed a purpose-built approach.
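To make the idea concrete, here's a toy version of a deterministic rule engine. The rules and IDs are illustrative, not CodeHeal's actual ruleset:

```javascript
// Fixed regex rules in, stable findings out -- no model in the loop,
// so the same source always produces the same report.
const RULES = [
  { id: 'eval-user-input', pattern: /\beval\s*\(/, message: 'eval() on dynamic input' },
  { id: 'wildcard-cors', pattern: /Access-Control-Allow-Origin['"]?\s*[,:]\s*['"]\*/, message: 'wildcard CORS origin' },
  { id: 'sql-interpolation', pattern: /SELECT[^;]*\$\{/, message: 'template literal inside SQL string' },
];

function scan(source) {
  const findings = [];
  source.split('\n').forEach((line, i) => {
    for (const rule of RULES) {
      if (rule.pattern.test(line)) {
        findings.push({ rule: rule.id, line: i + 1, message: rule.message });
      }
    }
  });
  return findings;
}
```

Real engines parse ASTs rather than lines, but the property that matters is the same: determinism makes findings diffable across commits and trivially cacheable.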


Scan Your Code Now

The METR finding isn't theoretical. If you're shipping AI-generated code that "passes tests," you likely have vulnerabilities sitting in production right now.

CodeHeal catches the patterns that test suites miss — hardcoded secrets, injection vectors, overly permissive configs, and 90+ other AI-specific vulnerability patterns. No LLM, no API costs, deterministic results.

Try CodeHeal free →
