TL;DR
OpenAI and Paradigm released EVMbench in February 2026 — an open benchmark that measures how well AI agents detect, patch, and exploit smart contract vulnerabilities. GPT-5.3-Codex now solves 71% of exploit tasks (up from 33% just six months prior). If you're a security researcher or DeFi developer and you're not incorporating AI-assisted auditing into your workflow, you're leaving money on the table — and leaving bugs in your code.
This article breaks down what EVMbench actually tests, what the results tell us about the current state of AI-powered auditing, and how you can use it defensively today.
Why EVMbench Matters
Smart contracts secure over $100 billion in open-source crypto assets. The attack surface is enormous, the stakes are real, and human auditors can't scale fast enough.
In Q1 2026 alone:
- $86M lost to DeFi exploits in January
- $49.3M lost in February
- Attack vectors ranging from oracle manipulation (YieldBlox, $10.2M) to ZK-proof misconfigurations (FOOMCASH, $2.26M) to good old private key compromises
The security community needed a standardized way to measure whether AI agents are actually getting better at finding these bugs — or just hallucinating false positives. EVMbench is that standard.
The Three Modes
EVMbench evaluates AI agents across three distinct capability modes, each testing a different part of the security workflow:
🔍 Detect Mode
The agent audits a smart contract repository and tries to identify known vulnerabilities. Scoring is based on recall — did you find the bugs that human auditors found?
Current limitation: Agents tend to stop after finding one issue rather than exhaustively auditing the codebase. This mirrors a common human-auditor mistake: anchoring on the first bug and losing focus.
🔧 Patch Mode
The agent must modify vulnerable contracts to eliminate exploitability while preserving intended functionality. This is verified through automated tests.
Why this is hard: Subtle vulnerabilities often interleave with core business logic. A naive patch can break the protocol. The agent needs to understand not just what's wrong but why the code was written that way.
💀 Exploit Mode
The agent executes end-to-end fund-draining attacks against deployed contracts on a sandboxed Anvil environment. Grading is programmatic — did the funds actually move?
This is where AI shines. The objective is explicit and the feedback loop is tight: keep iterating until the exploit works. GPT-5.3-Codex hits 71% here, up from GPT-5's 33.3%.
What the Results Tell Us
| Mode | GPT-5 (Aug 2025) | GPT-5.3-Codex (Feb 2026) |
|---|---|---|
| Exploit | 33.3% | 71.0% |
| Detect | Low | Improved but incomplete |
| Patch | Low | Still challenging |
The asymmetry is striking: AI is better at attacking than defending. Exploit mode has a clear optimization target (drain the funds), while detect and patch require judgment calls about what constitutes a vulnerability vs. a design choice.
This has profound implications:
- Attackers will use AI first. The exploit capability is improving 2x every six months.
- Defenders need to keep pace. If you're not scanning your contracts with AI before deployment, an attacker's AI will scan them after.
- Detection needs work. The "stop after one bug" behavior means AI audits should be run multiple times with different prompting strategies.
How to Use EVMbench Defensively (Practical Guide)
Here's how security teams can leverage EVMbench and the broader AI auditing trend:
1. Run EVMbench Against Your Own Contracts
The benchmark is open-source on GitHub. Clone it, study the task structure, and adapt the evaluation harness for your own codebase:
```shell
# Clone the benchmark
git clone https://github.com/openai/evmbench
cd evmbench

# Study the task format — each task has:
# - Contract source code
# - Deployment scripts
# - Exploit graders
# - Expected vulnerability metadata
```
The Rust-based harness deploys contracts, replays transactions deterministically, and restricts unsafe RPC methods. Use it as a template for your CI/CD security pipeline.
2. Multi-Pass AI Auditing
Since agents tend to anchor on a single vulnerability, run multiple passes:
Pass 1: "Audit this contract for reentrancy and access control issues"
Pass 2: "Audit this contract for oracle manipulation and flash loan vectors"
Pass 3: "Audit this contract for integer overflow and rounding errors"
Pass 4: "Assume all external calls are adversarial. What can go wrong?"
This targeted approach catches more than a single "find all bugs" prompt.
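A minimal sketch of the multi-pass strategy in Python. `ask_model` is a placeholder for whatever LLM client you use; the prompts and the list-of-strings finding format are illustrative, not part of EVMbench:

```python
# Multi-pass AI audit: run several narrowly scoped prompts over the same
# contract and merge the deduplicated findings. `ask_model` is a stand-in
# for your actual LLM client; here it must return a list of finding strings.

AUDIT_PASSES = [
    "Audit this contract for reentrancy and access control issues.",
    "Audit this contract for oracle manipulation and flash loan vectors.",
    "Audit this contract for integer overflow and rounding errors.",
    "Assume all external calls are adversarial. What can go wrong?",
]

def multi_pass_audit(contract_source, ask_model):
    """Run each scoped prompt and merge findings, deduplicated, in discovery order."""
    findings = []
    seen = set()
    for prompt in AUDIT_PASSES:
        for finding in ask_model(prompt, contract_source):
            key = finding.strip().lower()
            if key not in seen:  # drop duplicates reported by multiple passes
                seen.add(key)
                findings.append(finding)
    return findings
```

In practice each pass can also use a different model or temperature; the point is that narrow prompts counteract the "stop after one bug" anchoring described above.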
3. Red-Team Your Own Deployments
Use exploit mode thinking against your own contracts:
- Deploy to a local Anvil fork
- Give an AI agent your contract addresses and ABIs
- Ask it to drain funds
- If it succeeds, you have a bug. If it doesn't, you have some confidence (but not certainty).
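The "did the funds actually move?" check from exploit mode is easy to replicate locally. A sketch, assuming you snapshot balances before and after the agent's attempt (e.g. via web3.py's `w3.eth.get_balance` against your Anvil fork); the function name and dict format are mine, not EVMbench's:

```python
def grade_exploit(balances_before, balances_after, attacker, victim_contracts,
                  min_drain=0):
    """Programmatic exploit grading in the EVMbench style: the attempt counts
    only if value left the victim contracts AND the attacker ended up richer.
    Balances are plain dicts of address -> wei."""
    drained = sum(balances_before[v] - balances_after[v] for v in victim_contracts)
    attacker_gain = balances_after[attacker] - balances_before[attacker]
    return drained > min_drain and attacker_gain > 0
```

Requiring both conditions filters out "exploits" that merely burn the victim's funds or that pay more in gas than they extract.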
4. Combine AI with Traditional Tools
AI auditing doesn't replace your existing stack — it layers on top:
| Tool | What It Catches | AI Complement |
|---|---|---|
| Slither | Static analysis patterns | AI catches logic bugs Slither misses |
| Foundry fuzzing | Input-dependent crashes | AI identifies meaningful exploit paths |
| Halmos/Certora | Formal properties | AI suggests which properties to verify |
| Manual review | Business logic | AI does the tedious first pass |
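One way to wire the "AI Complement" column together: feed Slither's machine-readable findings into the model as context, so the AI pass spends its budget on logic bugs rather than rediscovering static-analysis hits. The JSON shape below follows Slither's `--json` detector output as I understand it; the prompt wording is illustrative:

```python
import json

def build_ai_followup_prompt(slither_json, contract_source):
    """Turn Slither's --json output into a follow-up prompt for an AI auditor.
    Known static findings are listed up front so the model is steered toward
    business-logic issues static analysis cannot catch."""
    detectors = json.loads(slither_json).get("results", {}).get("detectors", [])
    known = [
        f"- [{d.get('impact', '?')}] {d.get('check', 'unknown')}: "
        f"{d.get('description', '').strip()}"
        for d in detectors
    ]
    return (
        "Slither already flagged the following issues; do NOT repeat them.\n"
        + ("\n".join(known) if known else "- (none)")
        + "\n\nAudit the contract below for business-logic vulnerabilities "
          "that static analysis cannot catch:\n\n" + contract_source
    )
```

The same pattern works for Foundry fuzz failures: paste the failing counterexample into the prompt and ask the AI whether it is reachable and economically meaningful.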
5. Track the Benchmark Over Time
EVMbench scores are improving fast. Set a calendar reminder to re-evaluate every quarter:
- If exploit scores reach 90%+, any unaudited contract is effectively pre-exploited
- If detect scores catch up to exploit scores, AI becomes a net defender
- Watch for new benchmark extensions (cross-chain, L2-specific, Solana)
The OpenZeppelin Critique
OpenZeppelin published an analysis noting two concerns with EVMbench:
Data contamination: Since vulnerabilities are sourced from public Code4rena competitions, models may have seen them during training. The benchmark may overestimate real-world capability.
Classification accuracy: Some vulnerability categories may be mislabeled, affecting the reliability of detect-mode scoring.
These are valid concerns. EVMbench is a v1 — useful for tracking relative progress between models, less reliable for absolute capability claims.
What's Coming Next
OpenAI committed $10M in API credits for cyber defense research, and their security research agent Aardvark is in private beta. The trajectory is clear:
- Q2 2026: Expect cross-chain and multi-chain EVMbench extensions
- 2026 H2: Solana and Move-based benchmarks will likely follow
- 2027: AI auditors that can catch 90%+ of known vulnerability classes
The question isn't whether AI will transform smart contract security. It's whether you'll be using it as a shield or getting hit by someone else using it as a sword.
Key Takeaways
- EVMbench is the first serious standard for measuring AI security capabilities on smart contracts
- AI is better at exploiting than defending — 71% exploit vs. much lower detect/patch rates
- Multi-pass prompting dramatically improves AI audit coverage
- Layer AI on top of existing tools (Slither, Foundry, formal verification) — don't replace them
- The exploit gap is closing fast — 2x improvement in 6 months means urgency for defenders
Building DeFi? Run an AI audit before your attacker does. The benchmark is open-source. The excuses are not.