gentic news

Originally published at gentic.news

SWE-Explore: AI coding agents find files but miss 81-86% of critical lines

SWE-Explore benchmark shows Claude Code, Codex cover only 14-19% of critical lines despite finding the right file. Model strength doesn't fix the structural weakness.

SWE-Explore tests 848 bug-fixing tasks across 203 open-source projects. Claude Code, Codex 5.3, and OpenHands all find the right file but cover only 14-19% of critical lines.

Key facts

  • SWE-Explore: 848 problems from 203 open-source projects.
  • Claude Code, Codex cover only 14-19% of critical lines.
  • Python dominates with 547 of 848 tasks.
  • File hit rates stay high; line-level accuracy collapses.
  • Six different models tested; pattern holds across all.

An international research team led by Shanghai Jiao Tong University has released SWE-Explore, a benchmark that isolates code search from the actual repair phase. The core finding, according to the source: AI coding agents reliably identify the correct source file, but their line-level coverage collapses to 14-19% of the lines that matter.

The benchmark uses 848 problems from 203 open-source projects across 10 languages (Python dominates with 547 tasks, followed by Go, JavaScript, and Rust). For each problem, at least two successful solution runs from models like GPT-5.4, Gemini 3 Pro, Claude Sonnet 4.6, or Kimi K2.6 establish the ground-truth set of relevant code sections. Passages that multiple independent solution paths converge on are marked as critical context.
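The article does not spell out the exact consensus rule, so the following is only a minimal sketch of the idea. It assumes each successful run is reduced to the set of `(file, line)` pairs it read, and that a line counts as critical once at least `min_runs` runs touched it. The function name, the data representation, the threshold, and the file path are all hypothetical.

```python
from collections import Counter


def critical_lines(runs, min_runs=2):
    """Return the (file, line) pairs read by at least `min_runs` runs.

    `runs` is a list of sets of (file_path, line_number) pairs, one set per
    successful solution run. The representation and the threshold are
    assumptions for illustration, not the benchmark's actual rule.
    """
    counts = Counter()
    for run in runs:
        counts.update(set(run))  # count each line at most once per run
    return {loc for loc, n in counts.items() if n >= min_runs}


# Two hypothetical runs that overlap on lines 11 and 12 of one file.
run_a = {("pkg/math.py", n) for n in range(10, 13)}  # lines 10-12
run_b = {("pkg/math.py", n) for n in range(11, 15)}  # lines 11-14
consensus = critical_lines([run_a, run_b])
print(sorted(consensus))
```

Under these assumptions, only the lines both runs converged on survive as ground truth; lines that a single run happened to read are dropped.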

File-level success, line-level failure

Traditional keyword search barely beats chance. The authors show that a bug description like "RuntimeWarning on Overflow" matches templates and docs more often than actual source code. AI agents pull ahead by searching step by step rather than ranking all hits at once.

But the moment evaluation zooms from file-level to line-level, the systems fall apart. General coding agents (Claude Code, Codex, OpenHands) plus four research systems designed specifically for code search all land in the same band: 14-19% line coverage. The various agent architectures "land strikingly close to each other," per the paper.
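To make the gap between the two granularities concrete, here is a rough sketch of how a file hit rate and a line coverage score could be computed for one task. The definitions are assumptions for illustration (the article does not give the paper's formulas), and the file path and line ranges are made-up data, not benchmark results.

```python
def file_hit_rate(explored, critical):
    """Fraction of files containing critical lines that the agent opened."""
    critical_files = {path for path, _ in critical}
    explored_files = {path for path, _ in explored}
    if not critical_files:
        return 0.0
    return len(critical_files & explored_files) / len(critical_files)


def line_coverage(explored, critical):
    """Fraction of critical (file, line) pairs that the agent actually read."""
    if not critical:
        return 0.0
    return len(critical & explored) / len(critical)


# Hypothetical task: 20 critical lines, all in one file.
critical = {("pkg/math.py", n) for n in range(100, 120)}

# Hypothetical agent: opens the right file, reads the first 20 lines,
# then reads only 3 of the 20 critical lines.
explored = {("pkg/math.py", n) for n in range(1, 21)}
explored |= {("pkg/math.py", n) for n in range(100, 103)}

print(file_hit_rate(explored, critical))  # right file was opened
print(line_coverage(explored, critical))  # most critical lines were missed
```

In this toy case the agent scores a perfect file hit while reading only 3 of 20 critical lines, which is the shape of the result the benchmark reports: a file-level metric can look healthy while line-level coverage stays low.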

Model strength doesn't fix it

The team ran the same agent architecture with six different models from OpenAI, Anthropic, Google, Moonshot, and Zhipu. GPT-family models lead, but the pattern holds: file hit rates stay high while line coverage remains low. Throwing a stronger language model at the problem doesn't close the gap.

[Image: Pipeline diagram of SWE-Explore showing benchmark construction on the left, from solved agent runs through read actions, line regions, and consensus t…]

This finding echoes a June 4 report in which a user said Claude Code quality dropped after Opus 4.6, with roughly 25% of instructions missed, while the same user reported 95% reliability for Codex 5.3. The SWE-Explore results suggest the weakness is structural: agents struggle to pinpoint the exact lines that need to change, regardless of model or architecture.

The benchmark exposes a blind spot in how AI coding is evaluated. Until now, the field judged agents by whether they fixed the bug or not. SWE-Explore shows that even successful fixes may rely on luck or over-broad context rather than precise understanding.

What to watch

Watch for follow-up work from the same team or competitors that attempts to improve line-level coverage. If Anthropic or OpenAI releases an agent that scores above 30% on SWE-Explore, it would signal a genuine architectural breakthrough rather than a model upgrade.

[Image: Side-by-side comparison showing a conventional benchmark on the left with its Explore, Patch, and Verify pipeline producing a single Resolve Rate, and…]


Source: the-decoder.com

