A smart contract reviewed by the same model that wrote it is a managed risk at best. Models from the same family, given similar prompts, will apply similar reasoning patterns — not because they're "colluding," but because of their shared DNA. If they're trained on the same overlapping datasets (Common Crawl, GitHub, Stack Overflow), they'll likely converge on the same blind spots regarding obscure Solidity vulnerabilities or specific EIPs. The same interpretive pattern that shaped a flaw is the one most likely to miss it.
The fix isn't to stop using them; it's to increase coverage. Running reviews across different lineages — different training data, different alignment, different fine-tuning — minimises the chance of a shared blind spot. ChatGPT, Gemini, and Qwen are all transformers, but the paths they took to get here are different enough to matter.
For Week 1 on Base, I ran the audit on three models independently: ChatGPT, Gemini, and a local Qwen instance. Same contract, same checklist, parallel sessions. No "peeking" allowed.
What three independent audits found
The contract passed all three reviews; none found a dealbreaker. That's a useful signal, but it isn't a guarantee. LLM-based reviews can still miss critical vulnerabilities entirely; three models agreeing doesn't change the underlying tech's limits. What it does do is reduce the odds that a missed issue is merely a quirk of one model's training.
Interestingly, all three flagged the same bottleneck: getMessages() iterates over the entire message array to return results newest-first. This is an O(n) scaling issue. On Base (an L2), gas is cheap, but the block gas limit is still the ceiling. While off-chain view calls would handle the load, any on-chain transaction triggering that iteration would eventually revert — a Gas Limit DoS that grows silently alongside adoption.
Qwen called it Medium severity. ChatGPT and Gemini treated it as a Note. The resolution: spec-required, acceptable at current scale, no action before mainnet. The finding was consistent; the panic level was the variable.
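A minimal sketch of the flagged pattern, alongside a bounded alternative. Only getMessages() and its newest-first iteration come from the audit notes; the contract name, struct shape, and the paginated variant are assumptions for illustration:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

// Hypothetical shape of the contract under review.
contract MessageBoard {
    struct Message { address sender; string text; }
    Message[] private messages;

    // O(n): copies and reverses the entire array on every call.
    // Harmless as an off-chain view call; a gas-limit DoS risk if
    // any on-chain transaction path ever triggers this iteration.
    function getMessages() external view returns (Message[] memory out) {
        uint256 n = messages.length;
        out = new Message[](n);
        for (uint256 i = 0; i < n; i++) {
            out[i] = messages[n - 1 - i]; // newest-first
        }
    }

    // Bounded alternative: cost is O(limit), independent of total size.
    function getMessagesPaged(uint256 offset, uint256 limit)
        external view returns (Message[] memory out)
    {
        uint256 n = messages.length;
        if (offset >= n) return new Message[](0);
        uint256 end = offset + limit;
        if (end > n) end = n;
        out = new Message[](end - offset);
        for (uint256 i = offset; i < end; i++) {
            out[i - offset] = messages[n - 1 - i]; // newest-first
        }
    }
}
```

The spec requires newest-first ordering, which is why the finding was accepted as-is; pagination is the usual escape hatch if an on-chain consumer ever appears.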
The brief was the real problem
This was the biggest takeaway I didn't see coming.
My original brief for all three models was: "Audit the contract against the spec." That's a standard request, but it's also a trap. It frames the spec as the ceiling. A model following that instruction will check if the code matches the document, but it won't necessarily ask if the document itself is flawed or missing key security properties.
I caught the framing error early and updated the briefs: "Perform a full security audit; treat the spec as the correctness baseline, not the audit scope."
It's a minor wording tweak with a major shift in what the model optimises for. The spec becomes a reference (what the contract is supposed to do) rather than a boundary (the only thing to check). The first brief asks for conformance. The second asks for vulnerabilities.
The lesson: prompts are specifications. The same discipline that goes into writing a contract interface — precise, unambiguous, explicit — has to apply to the security brief. Vague input produces vague output. Not because the model is "lazy," but because the brief didn't ask the right question.
The new standing structure
Three-model review is now the standard. Each week, three parallel briefs go out — SECURITY_CHATGPT.md, SECURITY_GEMINI.md, SECURITY_QWEN.md — and are consolidated into a single security.md for the final handoff.
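The consolidation step is mechanical enough to script. A minimal sketch, assuming the three briefs sit alongside the final report (only the file names come from the text; the layout and headings are hypothetical):

```shell
#!/bin/sh
# Merge the three parallel model briefs into one consolidated report.
{
  echo "# Consolidated security review"
  for brief in SECURITY_CHATGPT.md SECURITY_GEMINI.md SECURITY_QWEN.md; do
    echo
    echo "## Source: $brief"
    # Skip gracefully if a brief is missing this week.
    if [ -f "$brief" ]; then
      cat "$brief"
    fi
  done
} > security.md
```

The real value is still in the human pass over security.md afterwards — deduplicating findings and reconciling severity calls like the Medium-vs-Note split above.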
This isn't just overhead. It's the difference between a single-pass sanity check and a robust coverage strategy. It's not a replacement for formal verification or a professional audit, but it's significantly more reliable than a single-model pass.
The structural risk of AI blind spots is real. The solution has to be structural, too. Running a parallel process surfaced a flaw in the briefing that a single-model review never would have caught.
The contract passed. The process stayed. The next brief is already in the works.
→ The full Week 1 build — deploy experience, faucet reality, rubric scores — is in the retrospective: Week 1: Base — 56/60
→ The live app is at https://proof-of-support.pages.dev