Verivus OSS Releases

Posted on Apr 7

3 AIs Reviewed the Same Codebase. They Disagreed on 2 Findings. That is the Point.

#ai #codereview #python #opensource

We have a rule at Verivus Labs: before code ships, it gets reviewed by three AI models independently. We require unconditional approval from Claude, Codex, and Gemini before anything merges. We wrote about the mechanics of that process in The Codex Review Gate.

That process works well on our own code. We wanted to know whether it finds real things in code we did not write, code that is already well maintained and well structured.

Simon Willison's llm is one of the better-engineered CLI tools in the Python ecosystem. It has a clean architecture, a comprehensive plugin system, and parameterized SQL throughout. The reviewers independently noted the consistent SQL safety, which speaks to the care that has gone into the project. We pointed our tools at it and filed the findings that survived review.

The setup

Two of our tools did the heavy lifting.

sqry is our AST-based code analysis tool. We wrote about it in The Code Question grep Can't Answer. It parses code structurally, building function signatures, call graphs, and dependency relationships, and exposes them through an MCP server. sqry gave the reviewers a structural map of 40 Python source files containing 5,499 symbols and 7,277 edges.

llm-cli-gateway coordinated the reviews. It is our MCP server for multi-LLM orchestration. It wraps Claude, Codex, and Gemini through a single interface with retries, circuit breakers, and session management. Each reviewer got the same prompt and the same sqry access, run in separate sessions with no shared context.

We also built an llm plugin that bridges our gateway into Simon's own llm ecosystem. Install with llm install llm-cli-gateway and you get gateway-claude, gateway-codex, and gateway-gemini as models. The plugin requires Node.js 18+ for the gateway runtime. We wanted to contribute to Simon's ecosystem.

The review target was simonw/llm at commit cad03fb, reviewed on April 4, 2026.

What they found

Codex went first. 11 minutes, 307K tokens. It used sqry to navigate the call graph, then fetched source directly from GitHub to verify against specific commits. It identified 8 potential issues.

Gemini went second. 8 minutes. It used sqry hierarchical search and pattern search. It confirmed 5 of Codex's findings and identified 3 new ones.

We then sent each reviewer's unique findings to the other for cross-validation. At this point we had 11 candidate findings, all confirmed by both Codex and Gemini.

Two reviewers is good, but three is better. Claude did an independent adjudication pass over the 11 candidates, reading each relevant source file and providing line-level verdicts. Claude's role was validation. It assessed whether each finding was a genuine defect or a defensible design choice.

Claude confirmed 8 findings. It disputed 2. It marked 1 uncertain.

The disputes taught us the most.

The 2 findings Claude rejected

Uncaught hook exceptions in async tool execution. Codex and Gemini both flagged that before_call/after_call hooks in the async path run outside try/except, meaning a buggy plugin hook crashes the entire parallel tool batch.

Claude disagreed. If an after-call hook throws, that is an unexpected error and should propagate. Silently swallowing hook failures would mask plugin bugs. The current behavior is a defensible design choice.
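To make the disputed behavior concrete, here is a minimal sketch of an after-call hook that runs outside the try/except around a tool call. The names run_tool, buggy_hook, and run_batch are hypothetical stand-ins, not llm's actual code; only asyncio.gather() is taken from the finding itself.

```python
import asyncio


async def run_tool(name, after_call=None):
    """Run a stand-in tool, then an optional (hypothetical) after-call hook."""
    try:
        result = f"{name}: ok"  # stand-in for the real tool invocation
    except Exception as ex:
        result = f"{name}: error {ex}"
    # The hook runs outside the try/except, so anything it raises propagates.
    if after_call is not None:
        after_call(name, result)
    return result


def buggy_hook(name, result):
    raise RuntimeError(f"hook failed for {name}")


async def run_batch():
    # gather() re-raises the first exception, so the caller never sees
    # the result of the tool that succeeded.
    return await asyncio.gather(
        run_tool("search"),
        run_tool("fetch", after_call=buggy_hook),
    )


def batch_outcome():
    try:
        return asyncio.run(run_batch())
    except RuntimeError as ex:
        return f"batch failed: {ex}"
```

Whether that propagation is a bug or the correct way to surface a broken plugin is exactly the question the reviewers split on.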

Memory usage with large attachments. Codex and Gemini both noted that _attachment() eagerly reads entire files into memory, base64-encodes them (33% expansion), and holds everything in a JSON object simultaneously.

Claude's assessment was that this is inherent to how multimodal API calls work. The content has to be serialized to send it. There is no unnecessary duplication. It is the minimum work required by the API contract.
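The 33% figure follows from base64 mapping every 3 input bytes to 4 output characters. A small sketch of the arithmetic, where the payload shape (a "file" object holding "file_data") is an assumption based on the field name mentioned elsewhere in this article, and measure_expansion is a hypothetical helper:

```python
import base64
import json


def measure_expansion(size_bytes):
    """Return (raw, encoded, json) lengths for a dummy attachment."""
    raw = bytes(size_bytes)  # stand-in for file contents read from disk
    encoded = base64.b64encode(raw).decode("ascii")
    # Assumed payload shape, for illustration only.
    payload = json.dumps({"file": {"file_data": encoded}})
    return len(raw), len(encoded), len(payload)


raw_len, encoded_len, json_len = measure_expansion(3000)
print(f"raw={raw_len} encoded={encoded_len} ratio={encoded_len / raw_len:.3f}")
```

While the request is being built, the raw bytes, the encoded string, and the serialized JSON can all be alive at once, which is the cost Codex and Gemini flagged and Claude considered unavoidable.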

Both are reasonable arguments. This is why three-way review matters. Two models agreeing does not make something a defect. The third model asking whether something is actually wrong, or just uncomfortable, prevents filing noise.

The 1 finding Claude marked uncertain

Async tool execution racing shared Toolbox state. Codex and Gemini flagged that the async path batches tool calls into asyncio.gather(), which could race if a Toolbox instance maintains state across calls. Claude's assessment was that the framework's own state management appears safe, but whether the issue manifests depends on plugin-specific behavior. The framework does not guarantee sequential execution, and plugins may not expect parallelism.
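A sketch of the kind of plugin behavior that would make this finding real. StatefulToolbox is a hypothetical plugin class, not something in llm; it keeps per-call state on the instance, which is safe when calls run one at a time and unsafe when they are batched into asyncio.gather().

```python
import asyncio


class StatefulToolbox:
    """Hypothetical plugin toolbox that stores per-call state on self."""

    def __init__(self):
        self.current_query = None

    async def search(self, query):
        self.current_query = query
        await asyncio.sleep(0)  # any await point lets another call run
        # A concurrent call may have overwritten current_query by now.
        return f"results for {self.current_query}"


async def run_sequential():
    toolbox = StatefulToolbox()
    return [await toolbox.search("a"), await toolbox.search("b")]


async def run_parallel():
    toolbox = StatefulToolbox()
    return await asyncio.gather(toolbox.search("a"), toolbox.search("b"))
```

Run sequentially, each call sees its own query. Run through gather(), the first call resumes after the second has already replaced the shared attribute, so it answers for the wrong query.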

The 8 findings that held up

Three stood out.

PDF attachment data persisted in logs. The redact_data() function strips image_url.url and input_audio.data from logged prompt JSON, but has no case for file.file_data, where PDF attachments are stored as base64. Full PDF contents persist in logs.db. Users who share that database could inadvertently expose document contents. Filed as #1396.
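One way to picture the missing case is a recursive redaction pass driven by a table of container keys and payload fields. This is an illustrative sketch, not llm's redact_data(): redact_sketch, REDACTED_FIELDS, and the placeholder value are assumptions, and the sketch blanks every matching field without checking whether it is inline data.

```python
# Container key -> payload field, using the three names from the finding.
REDACTED_FIELDS = {
    "image_url": "url",
    "input_audio": "data",
    "file": "file_data",  # the case the finding reports as missing
}
PLACEHOLDER = "..."


def redact_sketch(node):
    """Blank out attachment payloads in a JSON-like structure, in place."""
    if isinstance(node, dict):
        for key, value in node.items():
            field = REDACTED_FIELDS.get(key)
            if field and isinstance(value, dict) and field in value:
                value[field] = PLACEHOLDER
            else:
                redact_sketch(value)
    elif isinstance(node, list):
        for item in node:
            redact_sketch(item)
    return node
```

Keeping the field names in one table makes a missing case visible at a glance, which a chain of if/elif branches does not.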

Embedding dedup comparing wrong keys. embed_multi_with_metadata() queries by content_hash but then filters by comparing incoming item IDs against returned row IDs. These are semantically different values. Duplicate content under a new ID bypasses dedup silently. Filed as #1397.
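The mismatch is easier to see with in-memory data. In this sketch the stored collection is a plain dict of ID to content hash; filter_new_buggy and filter_new_by_hash are hypothetical helpers, and SHA-256 is an arbitrary choice of hash, so none of this reflects llm's actual schema or queries.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def filter_new_buggy(items, stored):
    """items: (id, text) pairs; stored: {id: content_hash} already embedded."""
    incoming_hashes = {content_hash(text) for _, text in items}
    # Rows are matched by hash...
    matched_ids = {row_id for row_id, h in stored.items() if h in incoming_hashes}
    # ...but the filter compares incoming IDs against the matched row IDs.
    return [(i, t) for i, t in items if i not in matched_ids]


def filter_new_by_hash(items, stored):
    """Compare like with like: skip any item whose hash is already stored."""
    stored_hashes = set(stored.values())
    return [(i, t) for i, t in items if content_hash(t) not in stored_hashes]
```

With "hello" already stored under doc-1, submitting the same text as doc-2 passes the buggy filter and is embedded again, while the hash-based filter drops it.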

Stale loop variable in tool logging. In log_to_db(), the tool_instances INSERT references tool.plugin from a previous loop. Python loop variables retain their last value after the loop ends, so every tool result gets attributed to whichever toolbox was last in the list. Filed as #1398.
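The Python behavior behind this finding fits in a few lines. The Tool tuple and both functions are hypothetical and only mirror the shape of the bug described above, not the code in log_to_db().

```python
from collections import namedtuple

Tool = namedtuple("Tool", ["name", "plugin"])


def attribute_stale(tools, result_names):
    for tool in tools:
        pass  # stand-in for an earlier loop that does real work per tool
    # Bug: `tool` is still bound to the last element of `tools`.
    return [(name, tool.plugin) for name in result_names]


def attribute_by_lookup(tools, result_names):
    # One possible fix: resolve the plugin for each result explicitly.
    plugin_for = {tool.name: tool.plugin for tool in tools}
    return [(name, plugin_for[name]) for name in result_names]
```

Given tools from plugin-a and plugin-b, attribute_stale() credits every result to plugin-b, and nothing raises to signal the mistake.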

The remaining five: a possible migration race window when multiple processes start before migrations complete (commented on #789), a potential --async --usage crash with AsyncChainResponse, negative --chain-limit failing immediately, asyncio.run() called inside running event loops, and cosine_similarity() dividing by zero on zero vectors.
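As an illustration of the last item, here is a guarded cosine similarity. cosine_similarity_safe is a hypothetical name, and returning 0.0 for a zero vector is an assumption; raising a descriptive error would be an equally reasonable choice.

```python
import math


def cosine_similarity_safe(a, b):
    """Cosine similarity that does not divide by zero on zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm_a * norm_b)
```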

Severity ratings are our internal assessment. None have been confirmed by the maintainer yet.

#	Finding	Validation	Filed
1	PDF data not stripped by `redact_data()`	3/3	#1396
2	Embedding dedup compares wrong keys	3/3	#1397
3	Possible migration race window	3/3	#789
4	Async tool races shared state	2/3	--
5	`--async --usage` crash	3/3	--
6	Stale loop variable in `log_to_db()`	3/3	#1398
7	Negative `--chain-limit` fails	3/3	--
8	`asyncio.run()` in event loop	3/3	--
9	Hook exceptions crash batch	2/3	--
10	Memory with large attachments	2/3	--
11	`cosine_similarity` / zero	3/3	--
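The split between filed candidates and contested ones can be reproduced from the Validation column. A small sketch, where split_by_consensus is a hypothetical helper and the threshold of three reflects the unanimous-approval rule described at the top of this article:

```python
# Finding number -> reviewers who validated it, copied from the table above.
VALIDATION = {1: 3, 2: 3, 3: 3, 4: 2, 5: 3, 6: 3, 7: 3, 8: 3, 9: 2, 10: 2, 11: 3}


def split_by_consensus(validation, required=3):
    """Separate findings every reviewer accepted from the contested ones."""
    held = sorted(n for n, votes in validation.items() if votes >= required)
    contested = sorted(n for n, votes in validation.items() if votes < required)
    return held, contested


held, contested = split_by_consensus(VALIDATION)
print(f"held up: {held}")
print(f"contested: {contested}")
```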

What sqry contributed

sqry gave the reviewers structural navigation instead of text search:

find_cycles confirmed zero import cycles and one guarded call cycle (get_model calling get_async_model and vice versa)
complexity_metrics identified logs_list() at complexity 43 (622 lines) and prompt() at complexity 35 (450 lines, 30 parameters)
direct_callers and explain_code let Codex trace the full _attachment() to log_to_db() to redact_data() call path that exposed the PDF issue
pattern_search found the stale loop variable pattern across the codebase

Structural navigation means the reviewers could follow call paths and dependency chains rather than searching for keywords. That is the difference between asking "where is this function called" and actually knowing.
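The gap between text search and structural lookup shows up even at toy scale. The sketch below uses Python's standard ast module as a stand-in; it is not sqry and does not use its API, and find_direct_callers is a hypothetical helper that only handles plain name calls.

```python
import ast

SOURCE = '''
def redact_data(d):
    return d

def log_to_db(prompt):
    return redact_data(prompt)

def unrelated():
    note = "call redact_data later"
    return note
'''


def find_direct_callers(source, target):
    """Return names of functions whose bodies contain a call to `target`."""
    callers = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            for inner in ast.walk(node):
                if (
                    isinstance(inner, ast.Call)
                    and isinstance(inner.func, ast.Name)
                    and inner.func.id == target
                ):
                    callers.append(node.name)
                    break
    return callers


text_hits = SOURCE.count("redact_data")
callers = find_direct_callers(SOURCE, "redact_data")
```

The text search reports three hits, including the definition and a string literal. The syntax-tree walk reports the single function that actually makes the call.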

Try it

The llm plugin provides the simplest entry point. It routes through the MCP gateway under the hood. For a structural review like the one described in this article, you would also want sqry running as an MCP server so the models can navigate call graphs.

# Install the llm plugin (requires Node.js 18+)
llm install llm-cli-gateway

# Basic usage
llm -m gateway-codex "Review this file for bugs: $(cat src/main.py)"
llm -m gateway-gemini "Review this file for bugs: $(cat src/main.py)"

# For structural review with sqry, use the MCP gateway directly
npm install -g llm-cli-gateway

Gateway: github.com/verivus-oss/llm-cli-gateway
Plugin: pypi.org/project/llm-cli-gateway
sqry: github.com/verivus-oss/sqry

What we took away

The findings we filed are candidates that survived three-way review. The maintainer may disagree with some of them. The point of the exercise was to test the methodology, and we are grateful to Simon for building llm in the open where this kind of analysis is possible.

The reviewers did not find SQL injection surfaces in the paths they inspected. The issues they found are subtle: stale loop variables, key mismatches in dedup logic, missing cases in sanitization functions. These are the kinds of things that survive human review because the code reads well.

What stayed with us was the disagreements. Two models confirming something does not make it true. The third model asking whether something is actually a defect is what separates useful review from noise. That is why you review with multiple perspectives.

We will keep running this pattern. Three independent perspectives catch things that one perspective misses. That is the premise behind llm-cli-gateway, and this was a useful case study.

Werner Kasselman is a software engineer who builds open source developer tools in his spare time, including sqry and llm-cli-gateway. By day he works at ServiceNow. He lives in Australia with his family and blogs at medium.com/@wernerk. Views expressed here are his own and do not represent ServiceNow.

Top comments (5)

NEXADiag Nexa • Apr 13

This is exactly the insight most developers miss.

The disagreement between models isn't a bug — it's the most
valuable signal. If 3 models flag the same issue, it's real.
If only 1 flags it, it's probably noise.

I've been building around this exact principle with NexaVerify —
runs GPT-4, Claude, Gemini & Groq in parallel and scores each
issue by how many models agreed. The confidence scoring changed
how I think about AI code review entirely.

What threshold did you use to decide an issue was "real" vs noise?

Verivus OSS Releases • Apr 25

I have a very structured development process. Each agent knows what is supposed to be delivered and measures the other agent's work product against that. I allow them freedom to explore the entire work product to improve it and identify edge cases. Mainly, Claude produces the first draft, then Codex and Claude iterate until they 'settle' and Codex provides unconditional approval. Claude, seemingly by design, will always default to fixing only blockers and leave a lot of stuff undone, mocked, or stubbed. Codex calls that out because of the unconditional approval requirement. Once they agree, the approved product goes to Gemini. Thanks for the Q @nexadiag_nexa_312a4b5f603

NEXADiag Nexa • Apr 25

Brilliant breakdown, Werner. The transition from 'simple LLM output' to a 'structured multi-agent consensus' is exactly where the industry is heading.

I’m currently developing a new update for NexaVerify, and I’ve been working on a similar philosophy regarding the 'Arbiter' role. In my architecture, I’ve implemented what I call a 'Supervisor of Supervisors' layer.

Instead of just picking a winner, this layer analyzes the topology of the disagreement. It calculates the 'Consensus Health' by mapping how different reasoning styles (e.g., Claude’s structural logic vs. Groq’s speed-oriented pattern matching) collide. I’ve found that a disagreement between two specific models is often more informative than a consensus between three others.

Also, your point about 'design choices' vs 'bugs' is key. I’ve focused heavily on automating project context comprehension before the agents even see the code. This pre-analysis helps the Supervisor distinguish between a creative architectural choice and a genuine vulnerability based on the specific nature of the software.

This update formalizes these 'Supervision Signals'. Looking forward to seeing how your sqry approach evolves alongside these multi-model strategies!

Verivus OSS Releases • Apr 27

Thanks. The "topology of disagreement" line caught my eye, because we landed somewhere similar from the opposite direction.

Some context on where I'm coming from. I run two adjacent systems.

VRSI is a multi-agent consensus runtime focused on safety and decision quality on individual agent outputs. Five role-specialised agents (planner, critic, verifier, optimiser, sentinel) run over NATS, with OPA policy gates and a Merkle root attached to every decision. There is no arbiter that picks a winner. The sentinel can early-veto on safety, adversarial, and misinformation categories even when the critic returns APPROVE at 0.95 confidence. We added that guard after a real failure: critic-APPROVE was overriding sentinel-review on a reasoning-chain safety bypass scenario.
High-confidence agreement on the wrong axis is the failure mode a "Supervisor of Supervisors" smooths over rather than catches. Thresholds are per scenario category. Strict for safety, balanced for technical, permissive for safe baselines. The explicit category always overrides keyword heuristics, because broad keywords like "authentication" started false-positive-vetoing safe scenarios.

Atelier Studio is a different beast. Not a code reviewer. It's an end-to-end product lifecycle system: eleven councils across three domains. Business Strategy (Research, Executive Strategy, Finance & Economics, Governance & Risk), Operations (Product & BA, Architecture, Operations), and Product Development (Engineering, Security & Compliance, QA, GTM). Cross-council artefacts (ResearchDossier → PRD → ProductWorkPackage → TestPlan → LaunchPlan) carry strict schemas and explicit handoffs.

Reasoning style as a stable per-model property doesn't survive contact with model versions. "Claude-as-structural" and "Groq-as-pattern-matcher" reads cleanly on a slide, but the behaviour you're describing is mostly a function of the role you assigned, the system prompt, and the policy gating the output. Vendor is the noisy variable. We measure agent behaviour per role and per scenario category, and we don't trust per-vendor priors past a release cycle.

Pre-analysing project context to separate "design choice" from "bug" is the right instinct, but it has to be grounded in the actual code graph, not a paraphrase. We use sqry under all our systems for that. Symbol-level call graphs, dependents, type relationships, 35 languages in one graph. A supervisor that arbitrates over a summary inherits whatever the summariser missed.

Where I think we agree: disagreement carries information, and consensus on the wrong axis is its own failure mode. Where I'd nudge harder: the load-bearing piece is the structured intent, evidence, and traceable handoffs underneath, not a smarter arbiter on top.
