DEV Community

ohmygod


EVMbench: OpenAI and Paradigm's New Benchmark Proves AI Agents Can Exploit 71% of Smart Contract Vulns

TL;DR

OpenAI and Paradigm just dropped EVMbench, a benchmark that tests whether AI agents can detect, patch, and exploit real smart contract vulnerabilities. The headline number: GPT-5.3-Codex exploits 71% of high-severity vulns in a sandboxed environment, up from 33% with GPT-5 just six months ago.


What Is EVMbench?

EVMbench is an open evaluation framework built on 117 curated vulnerabilities from 40 real audits, primarily sourced from Code4rena audit competitions. It also includes scenarios from the Tempo blockchain, a payments-focused L1, extending coverage into stablecoin and payment contract security.

The benchmark evaluates three capability modes:

  • Detect - Agents audit a contract repo for vulnerabilities, scored on recall against human-identified ground truth
  • Patch - Agents fix vulnerable code without breaking functionality, verified through automated tests and exploit checks
  • Exploit - Agents execute end-to-end fund-draining attacks, graded via transaction replay on a sandboxed Anvil chain
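As a rough illustration of how Detect mode's recall scoring could work, here is a minimal sketch; the function name, finding identifiers, and exact-match logic are my own assumptions, not EVMbench's actual API:

```python
# Hypothetical sketch of Detect-mode scoring: recall of agent findings
# against the set of human-identified ground-truth vulnerabilities.
# Identifiers and matching logic are illustrative, not from EVMbench.

def detect_recall(ground_truth: set[str], agent_findings: set[str]) -> float:
    """Fraction of known vulnerabilities the agent rediscovered."""
    if not ground_truth:
        return 1.0  # nothing to find
    matched = ground_truth & agent_findings
    return len(matched) / len(ground_truth)

# Example: the audit identified three issues; the agent found one of them
# (plus a false positive, which recall does not penalize).
truth = {"reentrancy-withdraw", "oracle-staleness", "unchecked-transfer"}
found = {"reentrancy-withdraw", "spurious-gas-issue"}
print(detect_recall(truth, found))  # 1 of 3 ground-truth issues → 0.333...
```

Note that recall alone does not punish false positives, which is one reason an agent that "stops after one finding" still scores poorly: it leaves most of the ground truth unmatched.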

The infrastructure is serious: a Rust-based harness deploys contracts, replays transactions deterministically, and restricts unsafe RPC methods.
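To make the RPC restriction concrete, here is a minimal Python sketch of the idea, assuming a simple blocklist over JSON-RPC methods; the actual harness is written in Rust and is considerably more involved, and the blocked method names below are my illustrative picks:

```python
# Illustrative sketch of filtering unsafe JSON-RPC methods before they
# reach a sandboxed Anvil node. Blocklist contents and wiring are
# assumptions for illustration, not EVMbench's actual implementation.

BLOCKED_METHODS = {
    "anvil_setCode",              # rewriting bytecode would trivialize exploits
    "anvil_setBalance",           # minting ETH defeats fund-draining checks
    "anvil_impersonateAccount",   # free transactions from any address
}

def filter_rpc(request: dict, forward) -> dict:
    """Return a JSON-RPC error for blocked methods; otherwise pass the
    request through to the node via the `forward` callable."""
    if request.get("method") in BLOCKED_METHODS:
        return {
            "jsonrpc": "2.0",
            "id": request.get("id"),
            "error": {"code": -32601, "message": "method disabled in sandbox"},
        }
    return forward(request)
```

The point of a filter like this is that an agent must drain funds through legitimate contract interactions, not by cheating with node-admin methods that a real attacker would never have.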

The Numbers That Matter

Exploit mode:

  • GPT-5.3-Codex via Codex CLI: 71.0%, up from GPT-5's 33.3%
  • That is a 2.1x improvement in six months

Detect and Patch modes:

  • Still below full coverage
  • Agents often stop after finding one vulnerability instead of exhaustively auditing
  • Patching while preserving functionality remains genuinely hard

The asymmetry is telling: AI agents are better at breaking contracts than fixing them.

Why This Matters for DeFi Security

1. The Attacker-Defender Gap Is Widening

A 71% exploit rate on curated C4 vulnerabilities means AI-assisted attackers can already handle most known vulnerability patterns. If the trend holds, the progression from 33% to 71% in six months suggests 85%+ within a year.
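That projection is a back-of-the-envelope extrapolation. One way to sanity-check it: assume the miss rate shrinks by the same factor each six-month step as it did from GPT-5 to GPT-5.3-Codex (my assumption, not the benchmark authors'):

```python
# Back-of-the-envelope check of the "85%+ within a year" projection.
# Assumption: the miss rate (1 - success rate) shrinks by the same
# factor in the next six months as it did in the last six.

gpt5, codex = 0.333, 0.710
decay = (1 - codex) / (1 - gpt5)          # miss rate fell to ~43% of its value
projected = 1 - (1 - codex) * decay       # apply the same decay once more
print(f"{projected:.1%}")                 # 87.4%, consistent with 85%+
```

Constant-factor decay of the miss rate is a strong assumption; progress on the residual hard cases may well be slower. But it shows the 85%+ figure is not pulled from thin air.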

For protocol teams: if your contract has a known vulnerability class, assume it will be found and exploited automatically.

2. AI Auditing Is Real But Not Complete

AI agents are decent at finding individual bugs but poor at exhaustive auditing. They exhibit satisficing behavior, finding one issue and declaring victory.

Use case: AI as a first-pass scanner, not a replacement for manual review.

3. The Patch Gap Is the Real Opportunity

Patching scores lag significantly behind exploit scores. This is where human security researchers remain irreplaceable. If you are building audit tooling, automated patch suggestion and verification is the highest-impact area.
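The grading asymmetry is easy to state precisely. Here is a minimal sketch of what Patch-mode acceptance could look like, assuming both a functional test suite and an exploit replay check; all names are illustrative, not EVMbench's:

```python
# Sketch of the patch-verification asymmetry discussed above: a patch
# counts only if the test suite still passes AND the known exploit no
# longer succeeds. Names are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class PatchResult:
    tests_passed: bool       # functionality preserved?
    exploit_reverted: bool   # does the original attack now fail?

def grade_patch(result: PatchResult) -> bool:
    """Both conditions must hold: fixing the bug by breaking the
    contract scores zero, and so does a cosmetic change that leaves
    the exploit live."""
    return result.tests_passed and result.exploit_reverted

print(grade_patch(PatchResult(tests_passed=True, exploit_reverted=False)))  # False
```

The conjunction is what makes patching harder than exploiting: an exploit only has to satisfy one predicate (funds drained), while a patch has to satisfy two that pull in opposite directions.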

Practical Recommendations

For Protocol Teams:

  • Run automated and AI-assisted scans before your audit (Slither, Mythril, QuillShield, SolidityScan)
  • Assume known vulnerability patterns are zero-day for you
  • Invest in formal verification for critical paths (Halmos, Certora)

For Security Researchers:

  • Use EVMbench to benchmark your own tooling; it is fully open source
  • Focus on the detect gap: building agents that audit exhaustively is high-value work
  • Apply for OpenAI's $10M Cybersecurity Grant Program for API credits

For Auditors:

  • AI is your co-pilot, not your replacement. The 71% is on known, curated vulnerabilities
  • Update your threat models to assume AI-assisted exploit capability as baseline
  • Integrate EVMbench scenarios into your training

The Bigger Picture

EVMbench sits at the intersection of two accelerating trends:

  1. AI agents getting dramatically better at code reasoning
  2. DeFi protocols getting more complex with cross-chain bridges, restaking, intent-based architectures, and ZK systems

The security tooling stack must evolve as fast as the attack surface. OpenAI expanding Aardvark (their security research agent) and offering free codebase scanning for open-source projects signals where this is heading: AI-assisted security as infrastructure.

Limitations Worth Noting

  • Only single-chain environments, no cross-chain exploit scenarios
  • Clean Anvil state, not mainnet forks
  • Sequential transaction replay, no timing-dependent attacks
  • C4 vulnerabilities, not truly novel zero-days

71% on EVMbench is a floor, not a ceiling.

What to Watch

  • Multi-chain EVMbench extensions for bridge exploits
  • Economic exploit scenarios (governance attacks, oracle manipulation)
  • Defense benchmarks measuring ongoing security posture
  • Integration with Foundry and Hardhat workflows

The race between AI-powered attackers and defenders is the defining security dynamic of 2026. EVMbench gives us a scoreboard. Right now, the attackers are ahead.


DreamWork Security covers smart contract vulnerabilities, audit tools, and DeFi security research.
