TL;DR
OpenAI and Paradigm's EVMbench benchmark claims GPT-5.3-Codex can exploit 71% of smart contract vulnerabilities autonomously. BlockSec's March 2026 re-evaluation challenged those numbers, finding that scaffold design inflated exploit scores. Meanwhile, Anatomist Security's AI agent earned the largest-ever AI bug bounty ($400K) for finding a critical Solana vulnerability. This article breaks down what EVMbench actually measures, where AI auditing genuinely works today, where it fails catastrophically, and a practical hybrid workflow that outperforms either humans or AI working alone.
The State of AI Auditing in March 2026
Three events in the past six weeks have forced a reckoning in smart contract security:
- EVMbench launch (February 2026): OpenAI and Paradigm release the first serious benchmark for AI agents auditing smart contracts — 117 vulnerabilities across 40 audits
- BlockSec re-evaluation (March 2026): Independent testing suggests EVMbench's exploit scores are inflated by scaffold design, but confirms AI detection capabilities are real
- Anatomist Security's $400K bounty (March 2026): An AI agent autonomously discovers a critical vulnerability in the Solana blockchain itself — not a DeFi app, the L1
These aren't incremental improvements. They represent a phase transition in how we should think about smart contract security tooling.
What EVMbench Actually Measures
EVMbench evaluates three distinct capabilities:
Detect Mode
The agent audits a smart contract repository and identifies known vulnerabilities. Scored on recall — did the agent find the bugs that human auditors found?
Patch Mode
The agent modifies vulnerable contracts to eliminate exploitability while preserving intended functionality. Verified through automated tests.
Exploit Mode
The agent executes end-to-end fund-draining attacks against contracts deployed on a sandboxed Anvil blockchain. Grading is programmatic — did funds actually move?
The benchmark draws from 117 curated vulnerabilities, mostly from Code4rena audit competitions, plus scenarios from the Tempo blockchain (a payment-focused L1).
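Detect-mode recall, as described above, reduces to simple set arithmetic over finding IDs. A minimal sketch (the IDs are hypothetical, not from EVMbench itself):

```python
def detect_recall(ground_truth: set, agent_findings: set) -> float:
    """Recall: share of known (human-found) vulnerabilities the agent rediscovered."""
    if not ground_truth:
        return 0.0
    return len(ground_truth & agent_findings) / len(ground_truth)

# Hypothetical finding IDs for one audit repo: one miss, one false positive.
known = {"H-01-reentrancy", "H-02-stale-oracle", "M-01-access-control"}
reported = {"H-01-reentrancy", "M-01-access-control", "FP-99-gas-nit"}
print(detect_recall(known, reported))  # 2 of 3 known bugs rediscovered
```

Note that recall ignores false positives entirely: the extra `FP-99-gas-nit` finding costs the agent nothing here, which is one reason detect scores alone overstate usefulness.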
The Headline Numbers
| Model | Detect (Recall) | Patch (Success) | Exploit (Success) |
|---|---|---|---|
| GPT-5.3-Codex | ~45% | ~38% | 71.0% |
| GPT-5 | ~30% | ~25% | 33.3% |
| Claude Opus 4.6 | ~52% | ~35% | ~48% |
The exploit numbers are striking — GPT-5.3-Codex more than doubled its predecessor's exploit rate in six months. But the detect and patch numbers tell a different story: AI agents still miss more than half of known vulnerabilities and fail to fix most of the ones they find.
BlockSec's Reality Check
In March 2026, BlockSec published a re-evaluation of EVMbench that raised important methodological concerns:
Scaffold Bias in Exploit Mode: EVMbench's exploit harness provides agents with deployment scripts, contract ABIs, and in some cases partial proof-of-concept code. In real auditing, none of this exists. The exploit task essentially asks: "Given a vulnerability and a test environment, can you write working exploit code?" — which is different from: "Can you find and exploit a vulnerability from scratch?"
Detection Is the Real Signal: BlockSec found that Claude Opus 4.6 successfully identified a significant number of real-world vulnerabilities in detection mode — without the scaffold assistance. This is closer to actual auditing work.
The Exploit ≠ Audit Paradox: An agent that's great at exploiting known bugs in prepared environments isn't necessarily great at finding unknown bugs in production code. EVMbench conflates execution capability with discovery capability.
What This Means for Practitioners
Don't use EVMbench scores to decide whether AI can replace your auditor. Use them to understand which subtasks of auditing AI handles well:
- AI is good at: Executing known attack patterns, generating exploit PoCs, identifying vulnerability classes it's been trained on
- AI is bad at: Exhaustive codebase review, finding novel vulnerability classes, understanding protocol-specific economic logic, cross-contract interaction analysis
The Anatomist Security Breakthrough
While EVMbench debates methodology, Anatomist Security demonstrated something more important: an AI agent finding a real critical vulnerability in production infrastructure.
The details:
- Target: Solana L1 blockchain (not a DeFi app — the base layer)
- Vulnerability: Critical severity, specific details undisclosed per responsible disclosure
- Bounty: $400,000 — largest ever awarded to an AI
- Previous work: Same team's AI found the first Remote Code Execution vulnerability on Solana
This is qualitatively different from EVMbench benchmarks. The agent wasn't given a curated vulnerability to exploit in a sandbox. It autonomously analyzed production code and found a bug that human researchers had missed.
Why This Matters More Than Benchmarks
Benchmarks measure capability in controlled environments. Anatomist's result demonstrates capability in the wild. The gap between benchmark performance and real-world performance is where most AI tools fail — and where the $400K bounty proves at least some AI approaches actually deliver.
Building a Practical AI-Augmented Audit Workflow
Based on EVMbench results, BlockSec's analysis, and real-world outcomes, here's a workflow that leverages AI strengths while compensating for its weaknesses:
Phase 1: AI-First Sweep (Hours 1-4)
Run multiple AI agents in parallel on the target codebase:
```bash
# Example: using multiple AI tools for the initial sweep
# (ai-classifier and ai-audit are illustrative placeholder CLIs, not specific products)

# Tool 1: static analysis with AI classification of the findings
slither . --json slither-output.json
cat slither-output.json | ai-classifier --model claude-opus --context "DeFi lending protocol"

# Tool 2: AI-powered code review
ai-audit --model gpt-5.3 --scope contracts/ --checklist common-defi-vulns.yaml

# Tool 3: pattern matching against known exploits
semgrep --config p/smart-contracts contracts/
```
What you're looking for: Known vulnerability patterns, reentrancy, access control issues, oracle dependencies, unchecked return values, integer overflow potential.
What AI catches here: ~50% of vulnerability classes, particularly those with well-known patterns.
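The Slither-to-LLM handoff above can be sketched in a few lines of Python. The JSON layout assumed here (`results.detectors`, with `check`, `impact`, and `description` fields) matches Slither's `--json` output in recent versions, but verify it against the version you run:

```python
import json

def triage_slither(report_path: str, min_impact: str = "Medium") -> list:
    """Keep only Slither findings at or above min_impact for LLM triage.

    Assumes Slither's --json layout: {"results": {"detectors": [...]}}.
    """
    order = {"Informational": 0, "Optimization": 0, "Low": 1, "Medium": 2, "High": 3}
    with open(report_path) as f:
        report = json.load(f)
    kept = []
    for det in report.get("results", {}).get("detectors", []):
        if order.get(det.get("impact", "Informational"), 0) >= order[min_impact]:
            kept.append({
                "check": det.get("check"),
                "impact": det.get("impact"),
                "description": det.get("description", "").strip(),
            })
    return kept
```

Filtering before the LLM call matters: sending hundreds of informational findings to a classifier wastes context and buries the signal.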
Phase 2: AI-Generated Attack Surface Map (Hours 4-8)
Use AI to map the protocol's economic attack surface:
Prompt:

```text
Analyze this DeFi protocol's architecture. Map every external call,
every price dependency, every privileged role, and every state transition that
involves value transfer. Identify assumptions that each component makes about
other components.
```
This is where large context windows shine. An AI agent can hold the entire protocol in context and trace value flows across contracts — something that takes human auditors days to do manually.
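A cheap way to seed that prompt is a mechanical first pass over the Solidity source. The sketch below uses naive regexes purely for illustration; a serious pass would walk Slither's IR or the solc AST instead, since regexes miss assembly blocks and inherited code:

```python
import re
from pathlib import Path

# Naive patterns to seed the attack-surface prompt (illustrative, not exhaustive).
PATTERNS = {
    "external_call": re.compile(r"\.(call|delegatecall|staticcall)\s*[({]"),
    "value_transfer": re.compile(r"\.(transfer|send|safeTransfer|safeTransferFrom)\s*\("),
    "privileged_role": re.compile(r"\b(onlyOwner|onlyRole|onlyAdmin)\b"),
    "oracle_read": re.compile(r"\b(latestRoundData|getPrice|consult)\s*\("),
}

def map_attack_surface(contracts_dir: str) -> dict:
    """Collect file:line hits for each attack-surface category under contracts_dir."""
    surface = {kind: [] for kind in PATTERNS}
    for path in Path(contracts_dir).rglob("*.sol"):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            for kind, pat in PATTERNS.items():
                if pat.search(line):
                    surface[kind].append(f"{path.name}:{lineno}: {line.strip()}")
    return surface
```

The resulting `surface` dict can be serialized straight into the prompt above, giving the model concrete anchors instead of asking it to rediscover every call site from scratch.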
Phase 3: Human Deep Dive on AI-Flagged Areas (Hours 8-40)
The human auditor focuses exclusively on:
- AI-flagged potential issues — verify, dismiss false positives, investigate true positives
- Areas AI explicitly marked as "uncertain" — these often contain the highest-severity bugs
- Protocol-specific economic logic — AI doesn't understand tokenomics, governance dynamics, or market conditions
- Cross-protocol interactions — composability risks that span multiple contracts and external protocols
Phase 4: AI-Assisted Exploit Verification (Hours 40-48)
For every confirmed vulnerability, use AI to generate working PoCs:
```solidity
// AI prompt: "Write a Foundry test that exploits this vulnerability.
// The vulnerable function is X in contract Y.
// The exploit should demonstrate fund extraction."
pragma solidity ^0.8.0;

import "forge-std/Test.sol";

contract ExploitTest is Test {
    // AI generates the full exploit PoC; `token`, `attacker`, and
    // BLOCK_NUMBER are per-target placeholders.
    function testExploit() public {
        // Setup: fork mainnet at the pre-exploit block
        vm.createSelectFork("mainnet", BLOCK_NUMBER);

        // Attack steps (AI-generated)
        // ...

        // Verify funds were extracted
        assertGt(token.balanceOf(attacker), 0);
    }
}
```
This is where EVMbench's 71% exploit success rate is actually useful — AI excels at turning known vulnerabilities into working PoCs.
Phase 5: AI-Powered Report Generation (Hours 48-52)
AI generates the initial audit report draft from findings. Human auditor reviews, corrects, and adds nuance. This cuts report writing time by ~60%.
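The drafting step is mostly templating. A minimal sketch of turning confirmed findings into a markdown skeleton for the auditor to rewrite (the finding schema here is an assumption, not a standard):

```python
SEVERITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}

def draft_report(protocol: str, findings: list) -> str:
    """Render findings ({"title", "severity", "description"} dicts) into a
    severity-sorted markdown draft. A starting point only: the human auditor
    rewrites severity rationale and adds protocol-specific context."""
    lines = [f"# Audit Report Draft: {protocol}", ""]
    for f in sorted(findings, key=lambda f: SEVERITY_ORDER.get(f["severity"], 9)):
        lines += [f"## [{f['severity']}] {f['title']}", "", f["description"], ""]
    return "\n".join(lines)
```

The time savings come from the boilerplate, not the analysis: the severity rationale and remediation guidance still have to be written by someone who understands the protocol.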
The Tools That Actually Work in 2026
For Detection
| Tool | Type | Best For | Limitation |
|---|---|---|---|
| Slither | Static analysis | Known patterns, data flow | High false positive rate |
| Cyfrin Aderyn | AST-based analysis | Solidity-specific patterns | Limited to known detectors |
| QuillShield | AI-powered analysis | Logical errors, novel patterns | Requires integration setup |
| Claude/GPT (direct) | LLM review | Broad understanding, explanation | Context limits, hallucination |
For Exploitation/Testing
| Tool | Type | Best For | Limitation |
|---|---|---|---|
| Foundry | Testing framework | Fast fuzzing, fork testing | Requires test writing |
| Echidna | Property-based fuzzing | Invariant violations | Property definition is manual |
| Medusa | Parallel fuzzing | Throughput, coverage | Experimental |
| Certora | Formal verification | Mathematical guarantees | Steep learning curve |
For Monitoring (Post-Deployment)
| Tool | Type | Best For | Limitation |
|---|---|---|---|
| Tenderly Firewall | AI monitoring | Real-time threat detection | Cost at scale |
| OpenZeppelin Sentinel | Automated response | Auto-pause on anomaly | Configuration complexity |
| Forta | Detection network | Community-driven alerts | Signal-to-noise ratio |
What AI Can't Do (Yet)
1. Understand Economic Context
AI doesn't know that a lending protocol's collateral factor is set too aggressively for the asset's real liquidity. The YieldBlox exploit ($10.97M, February 2026) succeeded because the oracle was technically correct but economically meaningless — no AI tool would have flagged "insufficient market liquidity for oracle reliability" as a vulnerability.
2. Reason About Governance Dynamics
The Moonwell governance attack (March 2026) required understanding that a $1,808 investment in governance tokens could create a proposal to drain $1M. AI can't model voter behavior, quorum dynamics, or the social engineering required to pass malicious proposals.
3. Anticipate Novel Attack Vectors
The GlassWorm malware using Solana memo fields as a C2 channel was a fundamentally creative attack. AI auditing tools look for patterns they've been trained on. When someone invents a new attack class, AI is the last to catch it.
4. Audit Across Trust Boundaries
Modern DeFi protocols don't exist in isolation. They depend on oracles, bridges, governance tokens, liquidity pools, and off-chain infrastructure (like Langflow instances running AI agents with wallet access — see CVE-2026-33017). No AI tool currently maps and audits these cross-boundary trust assumptions.
The Verdict: AI as Force Multiplier, Not Replacement
EVMbench proves AI agents have crossed the threshold of usefulness for smart contract security. But it also proves they're nowhere near sufficient on their own.
The optimal 2026 audit workflow:
- AI handles the first 50% of vulnerability detection (breadth)
- AI generates exploit PoCs 3-5x faster than manual writing
- AI cuts report generation time by 60%
- Human auditors spend their time on the remaining 50% that AI misses (depth)
- Human auditors validate AI findings and catch false positives
The numbers: A competent human auditor + AI tools finds ~80% of vulnerabilities. A human auditor alone finds ~65%. AI alone finds ~50%. The combination isn't additive — it's synergistic, because AI and humans tend to find different types of bugs.
The cost: AI-assisted audits cost ~30% less in auditor-hours while achieving higher coverage. This isn't about replacing auditors — it's about making the same auditor 2x more effective.
Action Items for Protocol Teams
- Today: Add Slither + AI classification to your CI/CD pipeline — it catches low-hanging fruit automatically
- This week: Run your protocol through at least one AI-powered audit tool (QuillShield, or direct LLM review) before your human audit
- Before your next audit: Prepare an AI-friendly audit package — clear documentation, architectural diagrams, and invariant specifications make AI tools dramatically more effective
- Post-deployment: Implement AI-powered monitoring (Tenderly Firewall or OpenZeppelin Sentinel) — the $137M lost in Q1 2026 proves that pre-deployment audits aren't enough
The question isn't whether to use AI in smart contract security. It's how to use it without developing a false sense of security. EVMbench gives us the data to make that distinction. Use it wisely.
DreamWork Security publishes daily DeFi security research. Follow @ohmygod on dev.to for vulnerability analysis, audit tooling, and defense patterns.