SciCode: Epoch AI Launches Benchmark Measuring AI Research Ability

#ai #machinelearning #research #deeplearning

Epoch AI launched SciCode benchmark testing LLMs on real research coding tasks. Top models score below 30%, exposing gap between coding benchmarks and scientific ability.

Epoch AI launched SciCode, a benchmark for evaluating LLMs on real scientific research coding tasks. Early results show top models scoring below 30%, highlighting the gap between coding benchmarks and genuine research ability.

Key facts

Top LLMs score below 30% on SciCode benchmark.
SciCode includes problems from physics, chemistry, biology.
Epoch AI designed difficulty scaling by reasoning steps.
Benchmark aims to measure genuine scientific discovery ability.

Epoch AI has released SciCode, a new benchmark designed to test whether large language models can perform real scientific research coding. Unlike existing benchmarks that focus on algorithmic puzzles or software engineering tasks, SciCode requires models to solve problems drawn from actual research papers across physics, chemistry, and biology. The benchmark includes tasks such as implementing simulation code, analyzing experimental data, and reproducing key figures from published studies.

Early results reveal a significant gap between current LLM capabilities and the requirements of scientific research. Even the best-performing models, including GPT-5.5 and Gemini 3.5 Pro, scored below 30% on SciCode, according to Epoch AI's evaluation. This compares to scores above 80% on standard coding benchmarks like SWE-Bench and HumanEval, suggesting that existing evaluations overstate models' ability to contribute to scientific work.

Why SciCode matters

The benchmark addresses a growing tension in AI research: while LLMs are increasingly promoted as tools for scientific discovery, their evaluation has remained narrow. SciCode's design forces models to combine coding proficiency with domain knowledge and multi-step reasoning, mirroring the workflow of a research scientist. For example, one task requires implementing a Monte Carlo simulation from a condensed matter physics paper and reproducing a phase transition plot — a challenge that demands both physics understanding and coding skill.

Epoch AI's approach also includes a novel difficulty scaling mechanism. Problems are categorized by the number of reasoning steps required, the level of domain knowledge needed, and the length of the code solution. This allows researchers to track progress across specific dimensions of scientific capability.

Implications for AI development

The low scores on SciCode have practical implications for AI adoption in research settings. Companies like Google and OpenAI have positioned their models as scientific assistants, but SciCode suggests that current systems remain unreliable for tasks requiring deep domain integration. According to Epoch AI, the benchmark is designed to evolve as models improve, with new problems added from recent publications to prevent saturation.

The benchmark also highlights a structural weakness in current LLM training: models are trained on vast amounts of code from repositories like GitHub, but scientific code is often idiosyncratic, poorly documented, and requires understanding of the underlying theory. SciCode's results suggest that scaling alone may not close this gap, and that targeted training on scientific workflows may be necessary.

Key Takeaways

Epoch AI launched SciCode benchmark testing LLMs on real research coding tasks.
Top models score below 30%, exposing gap between coding benchmarks and scientific ability.

What to watch

Watch for the first model to surpass 50% on SciCode, which would indicate a meaningful advance in AI's research capability. Epoch AI plans to update the benchmark with new problems quarterly from recent preprints.

Source: news.google.com

[Updated 28 Jun via epoch_ai_gradient_updates_gn]

Epoch AI also unveiled MirrorCode, a benchmark that tests whether AI can rebuild entire software projects solely from observing program behavior, without access to source code. Early results show models successfully reconstructing programs up to 10,000 lines, but struggle beyond that threshold [per Epoch AI]. MirrorCode complements SciCode by measuring AI's ability to reverse-engineer and replicate existing codebases, a skill critical for understanding legacy scientific software.

Originally published on gentic.news