Can LLMs Audit Smart Contracts? Benchmarking Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro

I gave 56 known-vulnerable Solidity smart contracts to three frontier LLMs — Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro — and asked each one to find the bugs.

168 API calls, ~$5, and a couple of surprises later, here is what the data says.

Claude finds the most bugs (98.2%). GPT-5.5 localizes them most precisely (92.9% strict recall). Gemini sits in the middle at 89.3% — but only after I caught a benchmarking gotcha that was silently costing it 20 points.

This article walks through how the experiment was run, what the numbers actually mean, and why "which model is the best auditor" depends entirely on what you are optimizing for.

Why This Question Matters

DeFi protocols hold tens of billions of dollars in user funds, and every dollar is guarded by smart contract code that is open-source, public, and, once deployed, immutable. A professional audit from a reputable firm costs upwards of $50,000 per contract and takes weeks to complete. So when GPT-5.5 is marketed as a state-of-the-art coding model and Anthropic's Opus 4.7 tops reasoning benchmarks, an obvious question follows:

Can these models replace the auditor? Augment them? Or are they confidently wrong in ways that would let critical bugs ship to mainnet?

The honest answer requires data, not vibes. So I built a benchmark.

The DASP-10 Taxonomy

Comparing three models fairly requires a standardized vulnerability classification. The Decentralized Application Security Project (DASP) Top 10, maintained by NCC Group [4], is the de facto industry classification. Each smart contract bug is sorted into one of these categories:

| Category | What goes wrong |
| --- | --- |
| Reentrancy | An external call lets the attacker re-enter the function before state updates finish |
| Access control | Functions that should be owner-only are not, or unsafe tx.origin checks are used |
| Arithmetic | Integer overflow or underflow on older Solidity versions |
| Unchecked low-level calls | Return values of call(), send(), or delegatecall() are ignored |
| Denial of service | Contract can be permanently locked by attacker-controlled state |
| Bad randomness | Predictable seeds like block.timestamp or blockhash used as randomness |
| Front-running | Transactions ordered against the user's interest by a miner or searcher |
| Time manipulation | Logic depending on block.timestamp, which miners can shift within bounds |
| Short addresses | EVM accepts incorrectly padded calldata, enabling theft |
| Other | Edge cases that do not fit the above |

Three concrete examples in code. Reentrancy — the canonical DAO-hack pattern:

function withdraw(uint amount) public {
    require(balances[msg.sender] >= amount);
    msg.sender.call.value(amount)("");   // external call FIRST
    balances[msg.sender] -= amount;       // state update SECOND
}

Integer underflow, the same unchecked-arithmetic bug class as the BEC token hack:

function transfer(address to, uint256 value) public {
    require(balances[msg.sender] - value >= 0);  // always true: uint256 wraps instead of going negative
    balances[msg.sender] -= value;
    balances[to] += value;
}

Bad randomness — a generic lottery:

function lottery() public returns (bool) {
    bytes32 random = keccak256(abi.encodePacked(block.timestamp, block.number));
    return uint(random) % 2 == 0;  // miners control both inputs
}

A correct audit names the category and points to the offending line. Both are required.

Methodology

Dataset. I sampled 56 vulnerable Solidity contracts from SmartBugs Curated [1], the standard academic benchmark introduced at ICSE 2020. The sample covers 9 of the 10 DASP categories in roughly equal numbers: the "other" catch-all is excluded, and "short_addresses" contributes a single contract because that is all the upstream dataset contains. Each contract has hand-labeled vulnerable lines and a category, sourced from real Etherscan deployments and ConsenSys's SWC Registry [2].

Sanitization. SmartBugs contracts contain inline comments like // <yes> <report> REENTRANCY that label the bugs directly. Sending these to LLMs would let any model "cheat" by reading the labels. I stripped these markers — and the @vulnerable_at_lines header field — replacing them with whitespace so that line numbers stayed identical. A diff against every original confirmed that no leakage remained.
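
A minimal sketch of the stripping step (the regex is illustrative; the real markers follow SmartBugs' // <yes> <report> CATEGORY convention, with some variation in spacing):

import re

# Replace each SmartBugs label with spaces of the same length, so byte
# offsets, and therefore line numbers, are preserved exactly.
MARKERS = re.compile(r"//\s*<yes>\s*<report>\s*\w+|@vulnerable_at_lines[^\n]*")

def sanitize(source: str) -> str:
    return MARKERS.sub(lambda m: " " * len(m.group(0)), source)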

Prompt. Every model received the exact same prompt: "Identify every vulnerability and classify each one according to DASP-10. Return JSON in this schema: {vulnerabilities: [{category, lines, explanation}]}." Identical wording, identical output schema, identical evaluation pipeline.
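
For concreteness, this is roughly how each request was assembled. (build_request is a hypothetical helper; the vendor-specific SDK calls around it are omitted.)

PROMPT = (
    "Identify every vulnerability and classify each one according to DASP-10. "
    "Return JSON in this schema: "
    '{"vulnerabilities": [{"category": ..., "lines": ..., "explanation": ...}]}'
)

def build_request(contract_source: str) -> str:
    # The prompt and the sanitized contract are identical for all three
    # models; only the surrounding API call differs.
    return PROMPT + "\n\n" + contract_source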

Scoring. For each (contract, model) pair, I record two metrics:

  • Lenient recall — did the model report the correct DASP-10 category somewhere in its findings?
  • Strict recall — did it also identify a vulnerable line within ±2 lines of a labeled one?
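
In code, the per-contract check is small. A sketch, assuming each model's findings are parsed into (category, lines) tuples and labeled_lines holds the ground-truth annotations:

def score(findings, label_category, labeled_lines, tol=2):
    # findings: list of (category, [line, ...]) from the model's JSON.
    lenient = any(cat == label_category for cat, _ in findings)
    strict = any(
        cat == label_category
        and any(abs(line - gt) <= tol for line in lines for gt in labeled_lines)
        for cat, lines in findings
    )
    return lenient, strict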

For per-category metrics, each (contract, category) pair becomes a binary classification:

  • True positive: model reported category C, and the contract was labeled C.
  • False positive: model reported C, but the contract was not labeled C.
  • False negative: model did not report C, but the contract was labeled C.

The standard precision, recall, and F1 follow:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 = 2 · Precision · Recall / (Precision + Recall)
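
Translated directly, with the zero-division guards a real pipeline needs when a category is never reported or never labeled (a sketch over the counts defined above):

def prf1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1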

For inter-model agreement, I use Cohen's κ, which measures agreement above chance:

κ = (p_o − p_e) / (1 − p_e)

where p_o is observed agreement and p_e is the agreement expected by chance. A κ of 1.0 means perfect agreement, 0 means agreement at chance level, and negative values mean systematic disagreement.
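
For the binary hit-vs-miss case used here, κ reduces to a few lines (a sketch; a and b are parallel per-contract outcome lists for two models):

def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: both hit or both miss, assuming independent raters.
    pa, pb = sum(a) / n, sum(b) / n
    p_e = pa * pb + (1 - pa) * (1 - pb)
    return (p_o - p_e) / (1 - p_e)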

A Methodological Landmine: The Gemini Gotcha

Before showing the headline results, an important detour.

My first run gave Gemini 3.1 Pro a 69.6% recall — far behind Claude (98%) and GPT (95%). I pulled out the failed responses and found this:

{
  "vulnerabilities": [
    {
      "category": "reentrancy",
      "lines": [54, 56],
      "explanation": "The Collect function uses a low-level call to send Ether before deducting the requested amount from the user's balance, enabling a re

The JSON was being truncated mid-sentence. Every single failure followed the same pattern: valid start, correct category, cut-off body. The cause: Gemini 3.1 Pro consumes internal "thinking tokens" before producing visible output, and my max_output_tokens=2048 cap counted both. On larger contracts, the model burned through its budget reasoning, then had nothing left to write the JSON output [3].
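
The fix, reduced to its core: treat a JSON parse failure as a truncation signal and retry with a larger budget. (call_model is a hypothetical wrapper around the vendor SDK; the parameter name for the output cap varies by provider.)

import json

def audit_with_retry(call_model, contract: str, budgets=(2048, 16384)):
    for budget in budgets:
        raw = call_model(contract, max_output_tokens=budget)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # A valid JSON prefix that simply stops is the signature of a
            # model that spent its budget thinking, not writing output.
            continue
    raise RuntimeError("response unparseable even at the largest budget")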

Bumping the budget to 16K and re-running only the affected 13 contracts gave:

| Metric | Before fix | After fix |
| --- | --- | --- |
| Lenient recall | 69.6% | 89.3% |
| Strict recall | 64.3% | 83.9% |
| Parse failure rate | 23.2% | 0.0% |

Twenty percentage points of measured "performance" were a benchmarking artifact. This is exactly the kind of subtlety that makes inter-model comparisons hard to run fairly: defaults that work for one model architecture systematically penalize another. Anyone publishing LLM benchmarks in 2026 should explicitly report per-model output token budgets and the split between thinking tokens and visible output tokens.

All numbers below are from the corrected run.

Results

The per-category breakdown shows where each model holds up and where it breaks down:

  • Reentrancy, arithmetic, and unchecked low-level calls — all three models hit 100%. These are the textbook DASP-10 categories, the patterns every audit blog has covered for years. The models clearly memorized them well, which raises a training-data-contamination concern I will return to.
  • Time manipulation: Claude 100%, Gemini 60%, GPT 60%. Claude is unusually strong here.
  • Front-running: Claude 100%, GPT 75%, Gemini 75%. Same pattern.
  • Short addresses (n=1): Claude 0%, Gemini 0%, GPT 100%. Single sample, so do not read too much into it. Short-address attacks are an ABI-level EVM quirk that has been deprecated in modern Solidity and may simply be falling out of training data.

But the most interesting finding is not in this chart. It is in the next one.

The Strategy Contrast: Recall vs Localization

Look at the gap between each model's lenient and strict recall:

  • Claude: 98.2% → 82.1% (gap: 16 points)
  • Gemini: 89.3% → 83.9% (gap: 5 points)
  • GPT-5.5: 94.6% → 92.9% (gap: 2 points)

Claude is the only model with a large gap. Translation: Claude finds the right category almost always, but it often points to the wrong line. GPT-5.5 finds slightly fewer bugs but pinpoints them with surgical precision.

Why? The confusion matrix tells the story.

The access_control column in Claude's matrix is high almost everywhere. On reentrancy contracts, Claude reports access_control 88% of the time. On unchecked-low-level-call contracts, 100%. On arithmetic contracts, 62%. These are off-diagonal "extra" findings, on top of the correct diagonal classification.

Inspecting Claude's actual responses, the pattern becomes clear: Claude tends to flag every code-quality concern it sees as a vulnerability. Stylistic issues like "functions lack explicit visibility modifiers" get tagged as access_control even when the function being public was almost certainly intentional. The category technically exists in DASP-10, but the finding is not really a vulnerability.

GPT-5.5 takes the opposite approach: report one focused finding, get it right.

Two valid auditor archetypes, two different optimization targets. A real-world auditor would describe Claude's behavior as "high false-positive rate." But they would also note that surfacing every concern is what you want from a first-pass tool you intend to review by hand. GPT-5.5's behavior is what you want from a CI gate that will auto-block deploys, where false positives are expensive.

The Ensemble Insight

Pairwise Cohen's κ between the three models on hit-vs-miss decisions:

  • Claude ↔ Gemini: 0.26 (weak agreement)
  • Claude ↔ GPT: −0.03 (essentially independent)
  • Gemini ↔ GPT: 0.40 (moderate agreement)

These are low. By comparison, two human auditors evaluating the same contract set would typically score κ ≈ 0.7–0.9.

The models are catching different things. When one misses a bug, the others often catch it — which is what low κ means in this context. Across all 56 contracts, at least one of the three models identified the correct category in 100% of cases.

The practical implication: an ensemble that runs all three models in parallel and unions their findings would beat any individual model. The cost is paying for three API calls per contract instead of one (~$0.10 per contract instead of ~$0.04), and a higher false-positive rate to triage manually.
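
The union itself is a few lines. This sketch merges findings by category and records which models flagged each one, so manual triage can start with the findings only a single model surfaced:

def ensemble(per_model_findings: dict):
    # per_model_findings: {"claude": [(category, [line, ...]), ...], ...}
    merged = {}
    for model, findings in per_model_findings.items():
        for category, lines in findings:
            entry = merged.setdefault(category, {"lines": set(), "models": set()})
            entry["lines"].update(lines)
            entry["models"].add(model)
    return merged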

For high-stakes contracts where a single missed bug means lost user funds, that math probably works out.

Limitations

Three honest caveats are worth stating explicitly:

  1. Training-data contamination is a real concern. SmartBugs contracts are public, several years old, and indexed everywhere on GitHub. The 100% recall on reentrancy and arithmetic likely reflects pattern recognition more than actual reasoning — those exact contracts were almost certainly in every model's training set. A more rigorous follow-up would test on contracts deployed after each model's training cutoff date.

  2. SmartBugs labels are not exhaustive. A contract labeled reentrancy may genuinely also have an unlabeled access-control issue. This means raw "false positive" rates are upper bounds — some "wrong" findings may be uncovered real bugs that the upstream dataset simply did not annotate.

  3. n = 56 is small. I ran a balanced subset for cost and time reasons. The headline numbers carry ±5–10 percentage point error bars in places (especially short_addresses where n=1). A larger study with n in the hundreds would tighten those bounds.

Conclusion

In May 2026, three frontier LLMs can identify the correct DASP-10 vulnerability category in 89–98% of well-known vulnerable smart contracts. None of them is perfectly reliable as a standalone auditor. Combining all three closes the gap.

The choice of "best" model depends entirely on the application:

  • First-pass triage tool where you will manually review findings → Claude (highest recall, surfaces everything)
  • CI auto-blocker where false positives stop deploys → GPT-5.5 (highest precision, focused findings)
  • Cost-sensitive batch scanning → Gemini (cheapest after fixing the token budget)

And for anyone running LLM benchmarks of their own: explicitly report per-model output token budgets, and verify that truncation is not the explanation before concluding one model is "worse" than another. Twenty points of my benchmark were hiding in that single configuration detail.


References

[1] Durieux, Ferreira, Abreu, Cruz. Empirical Review of Automated Analysis Tools on 47,587 Ethereum Smart Contracts. Proceedings of the 42nd International Conference on Software Engineering (ICSE), 2020. SmartBugs Curated dataset.

[2] ConsenSys Diligence. Smart Contract Weakness Classification (SWC) Registry. swcregistry.io.

[3] Google AI for Developers. Gemini 3 Developer Guide — documentation on dynamic thinking and thinking_level parameter behavior. ai.google.dev.

[4] NCC Group. Decentralized Application Security Project (DASP) Top 10. dasp.co.
