Why an AI Agent Should Not Be Treated as Proof: Building EllipticZero Research Lab

#ai #web3 #python #security

In security review, large language models are useful, but they also create a dangerous temptation: the output often sounds more certain than the evidence behind it.

A model can summarize code, suggest review directions, produce hypotheses, and write a convincing report. That is useful work. But a sentence written by a model is not proof. In areas such as smart-contract security, cryptography, access control, asset flow, signing assumptions, and upgrade logic, an unsupported but confident answer is worse than noise: it can push a reviewer toward the wrong risk, the wrong fix, or a false sense of completion.

I built EllipticZero Research Lab around that boundary.

Project:

https://github.com/ECD5A/EllipticZero

EllipticZero Research Lab is a source-available, local-first research workflow for scoped smart-contract security review and defensive ECC research. The core idea is simple: agents can help with planning, hypotheses, critique, and report structure, but substantive claims should remain tied to local tools, traces, manifests, replayable artifacts, exportable reports, and human review.

It is not positioned as an autonomous hacking system or a replacement for auditors. The goal is narrower and more practical: help a careful reviewer see what was checked, what the local evidence supports, what still needs manual validation, and what artifact remains after the review.

The Problem With Model-Only Security Output

If you give a contract to an LLM and ask it to find bugs, it will usually produce something. Sometimes that something is a useful hypothesis. Sometimes it is a shallow detector-style warning. Sometimes it is a plausible vulnerability description that is not connected to a reachable path, local tool output, a source location, or a reproducible artifact.

That is a poor fit for security review.

A reviewer needs to know:

what code or repository surface was inspected;
which local tools or checks produced signals;
which findings are supported by evidence;
which findings are only hypotheses;
what still needs human validation;
how to repeat or compare the run later;
how to export the result without turning a weak signal into a confirmed vulnerability.

This is why I think AI-assisted security tools should be designed around evidence boundaries first, and agent reasoning second.

Evidence First, Agent Second

The working rule in EllipticZero Research Lab is:

An agent can propose hypotheses, structure the review, and criticize conclusions. Evidence remains in local checks, artifacts, traces, manifests, replay bundles, and human validation.

That changes the shape of the system.

The agent is not the final authority. It is part of the workflow. It can help decide what to inspect next, explain why a risk lane matters, and turn noisy output into a readable review queue. But a finding should not become stronger just because the text sounds strong.

The useful output is not "the AI found a bug." The useful output is a review artifact that another person can inspect.
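One way to make the "evidence first" rule concrete is to derive a finding's status only from the evidence attached to it, never from the wording of the agent text. The following is a minimal sketch under assumptions: `Status`, `Evidence`, `Finding`, `status_of`, and the source labels are hypothetical names chosen for illustration, not EllipticZero's actual data model.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    HYPOTHESIS = "hypothesis"            # agent text only, no local evidence
    LOCAL_SIGNAL = "local_signal"        # backed by at least one local artifact
    HUMAN_VALIDATED = "human_validated"  # a reviewer has confirmed it


@dataclass(frozen=True)
class Evidence:
    source: str     # illustrative labels, e.g. "static-check", "trace", "reviewer"
    reference: str  # path or id of an artifact another person can inspect


@dataclass
class Finding:
    title: str
    agent_text: str
    evidence: list[Evidence] = field(default_factory=list)


def status_of(finding: Finding) -> Status:
    """Derive status from attached evidence; agent_text is never consulted."""
    sources = {item.source for item in finding.evidence}
    if "reviewer" in sources:
        return Status.HUMAN_VALIDATED
    if sources:
        return Status.LOCAL_SIGNAL
    return Status.HYPOTHESIS
```

With this shape, a finding whose `agent_text` says "definitely critical" still reports `Status.HYPOTHESIS` until something inspectable is attached to it.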

How The Workflow Is Structured

EllipticZero Research Lab is organized around a few layers.

The first layer is local context: contract code, repository inventory, selected domain, local tool availability, synthetic cases, saved sessions, and artifacts. The workflow is designed to preserve what was actually available during a run.

The second layer is bounded agent work. Agent roles can help with math, cryptography, strategy, hypotheses, critique, and reporting. Their job is to improve the review process, not to convert an unsupported statement into proof.

The third layer is the artifact layer: sessions, traces, manifests, replay bundles, Markdown reports, SARIF exports, evidence coverage, toolchain fingerprints, and redacted JSON snapshots. If the result cannot be inspected later, it is much less useful for serious review.

The fourth layer is the reviewer. The report should make it clear what was observed, what was inferred, what evidence exists, what is weak, and what should be checked manually.
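To illustrate the artifact layer, here is a hedged sketch of a run manifest with a toolchain fingerprint, using only the Python standard library. The function names, the manifest keys, and the idea of passing tool versions in as a plain dictionary are assumptions made for this example; the project's real manifest format may look different.

```python
import hashlib
import json
import platform
import sys
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(inputs: list[Path], tool_versions: dict[str, str]) -> dict:
    """Record what was available during a run (hypothetical layout)."""
    return {
        "inputs": {str(p): sha256_file(p) for p in sorted(inputs)},
        "toolchain": {
            "python": sys.version.split()[0],
            "platform": platform.system(),
            "tools": dict(sorted(tool_versions.items())),
        },
    }


def manifest_fingerprint(manifest: dict) -> str:
    """Stable fingerprint: canonical JSON (sorted keys) hashed with SHA-256."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two runs over identical inputs and tool versions on the same machine produce the same fingerprint, so a changed fingerprint is a cheap cue that something in the inputs or toolchain moved between runs.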

Why Smart Contracts And ECC

Smart contracts are a useful first domain for this kind of workflow because they have repeatable review lanes:

access control;
upgrade and storage layout;
asset flow;
vault and share accounting;
oracle assumptions;
signatures;
rewards;
AMM and liquidity logic;
bridge and custody surfaces;
staking and treasury logic.

Those lanes are structured enough to support repeatable review, but dangerous enough that overconfident automation is risky.

For example, "there is an external call" is not the same as "there is an exploitable reentrancy bug." "There is an admin function" is not the same as "there is a critical access-control vulnerability." A useful workflow needs context, reachability, state transition reasoning, local signals, and a clear manual-review boundary.

The ECC side is included as defensive research: point formats, curve metadata, subgroup/cofactor checks, twist hygiene, encoding boundaries, and curve-family consistency. This is another area where model confidence without local computation is not enough.
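As a small example of local computation in place of model confidence, the sketch below validates an affine point against encoding bounds and the short Weierstrass equation. The secp256k1 constants are the public standardized parameters; the function name `is_valid_affine_point` is an assumption for this illustration and is not taken from the project.

```python
# secp256k1 domain parameters (public, standardized constants).
P = 2**256 - 2**32 - 977
A = 0
B = 7
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8


def is_valid_affine_point(x: int, y: int, a: int = A, b: int = B, p: int = P) -> bool:
    """Check encoding bounds first, then the curve equation (hypothetical helper)."""
    # Coordinates must be canonical field elements: values >= p are rejected
    # rather than silently reduced, which is an encoding-boundary decision.
    if not (0 <= x < p and 0 <= y < p):
        return False
    # Short Weierstrass form: y^2 == x^3 + a*x + b (mod p).
    return (y * y - (x * x * x + a * x + b)) % p == 0
```

secp256k1 has cofactor 1, so an on-curve point other than the point at infinity already lies in the prime-order group. On curves with a larger cofactor, this check alone is not sufficient and a separate subgroup check would be needed.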

What A Useful Result Looks Like

The target result is not a dramatic list of "10 critical bugs." A better result is a cautious review snapshot:

finding cards;
risk lanes;
source-line hints when available;
local tool signals;
evidence coverage;
confidence notes;
manual-review boundaries;
remediation direction;
recheck path;
reproducibility bundle.

That is less flashy than an AI-generated audit claim, but it is more useful.

If a smart-contract golden case is run, the system should show not only a potential risk, but also why it entered the review queue, what local evidence exists, what remains unconfirmed, and how a reviewer can repeat or export the result.
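A finding card along those lines could be modeled and rendered roughly as follows. This is only a sketch: `FindingCard`, its fields, and the Markdown layout produced by `render_card` are hypothetical and do not reproduce the project's actual report format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FindingCard:
    title: str
    risk_lane: str
    confidence_note: str
    source_hint: Optional[str] = None                       # e.g. "Vault.sol:42"
    tool_signals: list[str] = field(default_factory=list)   # local tool output ids
    manual_checks: list[str] = field(default_factory=list)  # what a human must verify


def render_card(card: FindingCard) -> str:
    """Render a card as Markdown, stating missing evidence explicitly."""
    signals = ", ".join(card.tool_signals) if card.tool_signals else "none (hypothesis only)"
    lines = [
        f"### {card.title}",
        "",
        f"- Risk lane: {card.risk_lane}",
        f"- Source hint: {card.source_hint or 'not available'}",
        f"- Local tool signals: {signals}",
        f"- Confidence note: {card.confidence_note}",
        "- Needs manual review:",
    ]
    lines.extend(f"  - {item}" for item in card.manual_checks or ["not yet scoped"])
    return "\n".join(lines)
```

The design choice worth noting is that absent evidence is printed rather than omitted, so a reader cannot mistake an empty field for a confirmed one.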

Why Markdown, SARIF, And Replay Matter

Security workflows need exports.

Markdown is useful because people can read it, send it, attach it to discussions, compare it with previous runs, and use it as a review packet.

SARIF is useful because it can fit into code-scanning and CI-like workflows. But SARIF output needs care. A SARIF item should not automatically become a confirmed vulnerability just because it exists in an export. In an AI-assisted workflow, an exported item may be a review item, a local signal, or a hypothesis that still requires validation.
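SARIF 2.1.0 gives each result a `kind` and a `level`, and one possible way to keep hypotheses from reading as confirmed failures is to export them with `kind` set to `"review"` and `level` set to `"none"`. The sketch below assumes that mapping; the function names and the `supported` flag are hypothetical, and the project's own SARIF export may map items differently.

```python
import json


def to_sarif_result(rule_id: str, message: str, uri: str, line: int, supported: bool) -> dict:
    """Map a review item to a SARIF 2.1.0 result without overstating it."""
    result = {
        "ruleId": rule_id,
        "message": {"text": message},
        "locations": [{
            "physicalLocation": {
                "artifactLocation": {"uri": uri},
                "region": {"startLine": line},
            }
        }],
    }
    if supported:
        # Backed by local evidence: reported as a failure at warning level.
        result["kind"] = "fail"
        result["level"] = "warning"
    else:
        # Hypothesis only: flagged for human review; level must be "none"
        # whenever kind is anything other than "fail".
        result["kind"] = "review"
        result["level"] = "none"
    return result


def to_sarif_log(results: list[dict], tool_name: str) -> str:
    """Wrap results in a minimal single-run SARIF log."""
    log = {
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": tool_name}}, "results": results}],
    }
    return json.dumps(log, indent=2)
```

A consumer that filters on `kind` can then separate review items from evidence-backed results instead of treating every exported entry the same way.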

Replay and reproducibility matter for a similar reason. If the review result cannot be revisited, compared, or explained later, it is hard to defend in front of a team, client, or auditor.
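Comparing a replayed run with an earlier one can be as simple as diffing finding identifiers and their statuses. The following sketch assumes each run is summarized as a dictionary from a finding id to a status string; that shape and the `compare_runs` name are illustrative assumptions, not the project's replay bundle format.

```python
def compare_runs(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Diff two runs keyed by finding id, with a status string as the value."""
    prev_ids, curr_ids = set(previous), set(current)
    shared = prev_ids & curr_ids
    return {
        "added": sorted(curr_ids - prev_ids),
        "removed": sorted(prev_ids - curr_ids),
        "changed": sorted(i for i in shared if previous[i] != current[i]),
        "unchanged": sorted(i for i in shared if previous[i] == current[i]),
    }
```

A reviewer can then explain a result in terms of what moved between runs, for example a hypothesis that gained a local signal, rather than re-reading two full reports.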

Why Mock Mode Matters

The project supports hosted providers when configured, but it also keeps a no-key evaluation path.

That matters because an evaluator should be able to inspect the shape of the workflow without first trusting an external model provider or sending private code anywhere. A local reviewer should be able to run a self-check, open golden cases, inspect report shape, and see export behavior before deciding whether to configure a live model.

For a security tool, mock mode is not just a convenience. It is part of the evaluation boundary.
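A no-key path can be sketched as a provider interface with a deterministic offline implementation. Everything named here (`Provider`, `MockProvider`, `plan_review`, and the canned-response lookup) is a hypothetical illustration of the idea and not the project's actual provider API.

```python
from typing import Protocol


class Provider(Protocol):
    """Anything that can turn a prompt into text (hosted or mock)."""

    def complete(self, prompt: str) -> str: ...


class MockProvider:
    """Deterministic, offline stand-in: no API key, no network access."""

    def __init__(self, canned: dict[str, str], default: str = "[mock] no canned response") -> None:
        self._canned = canned
        self._default = default
        self.prompts_seen: list[str] = []  # lets an evaluator inspect what would be sent

    def complete(self, prompt: str) -> str:
        self.prompts_seen.append(prompt)
        # Return the first canned response whose keyword appears in the prompt.
        for keyword, response in self._canned.items():
            if keyword in prompt:
                return response
        return self._default


def plan_review(provider: Provider, contract_name: str) -> str:
    """Workflow code depends only on the Provider interface."""
    return provider.complete(f"Propose review lanes for {contract_name}")
```

Because the workflow only sees the interface, an evaluator can exercise the same code path with `MockProvider` and inspect `prompts_seen` to learn exactly what a live provider would receive before any private code leaves the machine.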

Current Project Shape

The current repository includes:

an interactive CLI workflow;
scoped smart-contract review lanes;
defensive ECC research paths;
bounded agent roles;
local-first evidence handling;
evaluation guide and golden cases;
reproducibility/session artifacts;
replay bundle path;
Markdown report export;
SARIF review export;
benchmark scorecards;
security and data-handling boundaries;
commercial licensing documentation for hosted, OEM, white-label, resale, and paid platform use cases.

The project is source-available, not open source in the usual permissive sense. It can be read, evaluated, and run locally under the published license terms. Commercial productization paths require a separate commercial license.