Breach Protocol

Originally published at groundtruth.day

Can an AI Agent Reproduce Real Science? A New Test Says: Rarely

NatureBench, a new benchmark that tests AI coding agents against ninety computational tasks drawn from peer-reviewed Nature-family journals, finds that the best models match or beat published state-of-the-art results on fewer than one in five tasks. When agents succeed, they do so by reshaping unfamiliar scientific problems into familiar templates — not by inventing new science.

Key facts

  • What: A new benchmark points coding agents at ninety computational tasks drawn from papers in top journals. The strongest models matched or beat the published results on fewer than one in five of them.
  • When: 2026-06-24
  • Primary source: arXiv 2606.24530

The researchers selected ninety computational tasks from published papers in the Nature family of journals — among the most scrutinized science anywhere. Each task presents the original data and scientific question and asks whether the agent can reproduce the finding that human researchers achieved and expert reviewers accepted. They then set today's strongest AI coding agents — the kind that write and run their own programs — loose on those tasks. To keep scoring fair and repeatable, the team built an automated system that wraps each task in a standardized environment, grading every agent the same way. That rigor matters; sloppy benchmarks are a real problem, as we explored in the story about how the leaderboard can be lying, and it connects to the broader question of how AI gets benchmarked at all.
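To make the "same grading for every agent" idea concrete, here is a minimal sketch of what a uniform scorer could look like. It is not NatureBench's actual harness: the names `TaskResult`, `matches_published`, and `match_rate`, the `tolerance` parameter, and the demo numbers are all hypothetical, and it assumes each task reduces to a single metric that can be compared against the published figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One agent run on one task. All fields are illustrative assumptions."""
    task_id: str
    agent_score: float          # metric the agent achieved
    published_score: float      # metric reported in the paper
    higher_is_better: bool = True


def matches_published(result: TaskResult, tolerance: float = 0.0) -> bool:
    """Return True if the agent matches or beats the published result."""
    if result.higher_is_better:
        return result.agent_score >= result.published_score - tolerance
    return result.agent_score <= result.published_score + tolerance


def match_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks on which the agent matched or beat the paper."""
    if not results:
        raise ValueError("no results to grade")
    return sum(matches_published(r) for r in results) / len(results)


# Made-up numbers for illustration only; this is not benchmark data.
demo = [
    TaskResult("task-a", agent_score=0.91, published_score=0.90),
    TaskResult("task-b", agent_score=0.70, published_score=0.85),
    TaskResult("task-c", agent_score=0.30, published_score=0.25, higher_is_better=False),
    TaskResult("task-d", agent_score=0.10, published_score=0.12, higher_is_better=False),
    TaskResult("task-e", agent_score=0.50, published_score=0.80),
]
print(f"matched {match_rate(demo):.0%} of tasks")
```

Under this kind of scoring, "fewer than one in five" on ninety tasks would mean a match rate below 0.2, which is at most 17 tasks.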

On the large majority of tasks, the agents fell short. The way they succeeded when they did is the most revealing part: agents win not by inventing new science, but by quietly translating a scientific problem into a familiar shape they already know how to handle — turning a novel question into a standard prediction exercise they have seen a thousand times. When real scientific invention was required, they mostly failed. The common failure modes were mundane: picking the wrong method for the problem, or simply not having enough computing power to finish the job properly.

The pattern is the difference between a brilliant student and a working scientist. A strong student can crush any problem that resembles their homework, because they recognize the template and apply it flawlessly. A scientist's actual job begins where the templates run out — when the problem does not look like anything in the textbook and you have to invent the approach. NatureBench shows that today's agents are superb students and not yet scientists: excellent at converting the unfamiliar into the familiar, and stuck when the unfamiliar refuses to be converted.

Enormous hype and serious money ride on the idea that AI is about to accelerate scientific discovery. This benchmark does not say that is impossible, but it draws a sharp, honest line around where the technology actually stands. Reproducing published results is, in an important sense, the easy version of the dream — the answer already exists and is known to be correct. If agents can match top-tier published work on only a small fraction of cases, the harder dream of generating genuinely new, correct discoveries is further off than the most excited headlines imply. It is a healthy corrective to a field that loves to extrapolate, and it complements other recent work pushing agents toward real lab science, like the systems that run their own experiments.

The caveat cuts both ways. On one side, a benchmark is a snapshot, and these agents are improving quickly: a score that looks modest today can climb fast, and "fewer than one in five" a year from now could read very differently. On the other side, even this number deserves scrutiny. Matching a published computational result is not the same as independently validating that the result is true, and an agent that hits the target by translating problems into familiar templates may be gaming the format rather than doing science. The real value here is not the score but the diagnosis: a clear, reproducible account of how these agents win and how they fail, which is worth more than any single percentage. It gives the field a concrete place to push next, instead of another round of vague claims about machines on the cusp of discovery.


Originally published on Ground Truth, where every claim is checked against the primary source.