We have a rule at Verivus Labs: before code ships, it gets reviewed by three AI models independently. We require unconditional approval from Claude, Codex, and Gemini before anything merges. We wrote about the mechanics of that process in The Codex Review Gate.
That process works well on our own code. We wanted to know whether it finds real things in code we did not write, specifically code that is already well-maintained and well-structured.
Simon Willison's llm is one of the better-engineered CLI tools in the Python ecosystem. It has a clean architecture, a comprehensive plugin system, and parameterized SQL throughout. The reviewers independently noted the consistent SQL safety, which speaks to the care that has gone into the project. We pointed our tools at it and filed the findings that survived review.
The setup
Two of our tools did the heavy lifting.
sqry is our AST-based code analysis tool. We wrote about it in The Code Question grep Can't Answer. It parses code structurally, building function signatures, call graphs, and dependency relationships, and exposes them through an MCP server. sqry gave the reviewers a structural map of 40 Python source files containing 5,499 symbols and 7,277 edges.
llm-cli-gateway coordinated the reviews. It is our MCP server for multi-LLM orchestration. It wraps Claude, Codex, and Gemini through a single interface with retries, circuit breakers, and session management. Each reviewer got the same prompt and the same sqry access, run in separate sessions with no shared context.
We also built an llm plugin that bridges our gateway into Simon's own llm ecosystem. Install with llm install llm-cli-gateway and you get gateway-claude, gateway-codex, and gateway-gemini as models. The plugin requires Node.js 18+ for the gateway runtime. We wanted to contribute to Simon's ecosystem.
The review target was simonw/llm at commit cad03fb, reviewed on April 4, 2026.
What they found
Codex went first. 11 minutes, 307K tokens. It used sqry to navigate the call graph, then fetched source directly from GitHub to verify against specific commits. It identified 8 potential issues.
Gemini went second. 8 minutes. It used sqry hierarchical search and pattern search. It confirmed 5 of Codex's findings and identified 3 new ones.
We then sent each reviewer's unique findings to the other for cross-validation. At this point we had 11 candidate findings, all confirmed by both Codex and Gemini.
Two reviewers is good, but three is better. Claude did an independent adjudication pass over the 11 candidates, reading each relevant source file and providing line-level verdicts. Claude's role was validation. It assessed whether each finding was a genuine defect or a defensible design choice.
Claude confirmed 8 findings. It disputed 2. It marked 1 uncertain.
The disputes taught us the most.
The 2 findings Claude rejected
Uncaught hook exceptions in async tool execution. Codex and Gemini both flagged that before_call/after_call hooks in the async path run outside try/except, meaning a buggy plugin hook crashes the entire parallel tool batch.
Claude disagreed. If an after-call hook throws, that is an unexpected error and should propagate. Silently swallowing hook failures would mask plugin bugs. The current behavior is a defensible design choice.
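The behavior under dispute can be reproduced in isolation. The sketch below is not llm's code: run_tool, good_hook, and buggy_hook are hypothetical names, and the hook signature is an assumption. It only shows how asyncio.gather() treats an exception raised by one coroutine in a batch, with and without return_exceptions.

```python
import asyncio


async def run_tool(name, after_call):
    """Hypothetical tool runner: the hook is awaited outside any try/except."""
    result = f"{name}: ok"
    await after_call(name, result)  # an exception here escapes run_tool
    return result


async def good_hook(name, result):
    return None


async def buggy_hook(name, result):
    raise RuntimeError(f"hook failed for {name}")


async def run_batch(return_exceptions):
    # Two tool calls run as one batch; the second has a failing hook.
    return await asyncio.gather(
        run_tool("a", good_hook),
        run_tool("b", buggy_hook),
        return_exceptions=return_exceptions,
    )


def demo():
    # Default behavior: the first exception propagates to the caller.
    try:
        asyncio.run(run_batch(return_exceptions=False))
        propagated = False
    except RuntimeError:
        propagated = True
    # Alternative: exceptions are returned in place of results.
    collected = asyncio.run(run_batch(return_exceptions=True))
    return propagated, collected
```

With the default, the caller loses the successful result of tool "a" along with the failure. With return_exceptions=True, the caller receives both and has to decide what to do with the error. Which of these is correct for plugin hooks is the design question the reviewers disagreed on.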
Memory usage with large attachments. Codex and Gemini both noted that _attachment() eagerly reads entire files into memory, base64-encodes them (33% expansion), and holds everything in a JSON object simultaneously.
Claude's assessment was that this is inherent to how multimodal API calls work. The content has to be serialized to send it. There is no unnecessary duplication. It is the minimum work required by the API contract.
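The 33% figure follows from base64 emitting 4 output characters for every 3 input bytes. A small sketch of the arithmetic, using a hypothetical encoded_sizes helper and an illustrative payload shape rather than llm's actual request format:

```python
import base64
import json


def encoded_sizes(raw: bytes):
    """Return (raw bytes, base64 characters, JSON payload characters)."""
    encoded = base64.b64encode(raw).decode("ascii")
    # Illustrative payload shape only; real API payloads carry more fields.
    payload = json.dumps({"file": {"file_data": encoded}})
    return len(raw), len(encoded), len(payload)


raw_len, b64_len, payload_len = encoded_sizes(b"\x00" * 3000)
# 3000 raw bytes become 4000 base64 characters, a 4/3 ratio.
```

While the request is being built, the raw bytes, the encoded string, and the serialized JSON can all be alive at once, which is the peak the two reviewers were pointing at.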
Both are reasonable arguments. This is why three-way review matters. Two models agreeing does not make something a defect. The third model asking whether something is actually wrong, or just uncomfortable, prevents filing noise.
The 1 finding Claude marked uncertain
Async tool execution racing shared Toolbox state. Codex and Gemini flagged that the async path batches tool calls into asyncio.gather(), which could race if a Toolbox instance maintains state across calls. Claude's assessment was that the framework's own state management appears safe, but whether the issue manifests depends on plugin-specific behavior. The framework does not guarantee sequential execution, and plugins may not expect parallelism.
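To make the concern concrete, here is a minimal sketch of how a stateful tool could misbehave under asyncio.gather(). CounterToolbox is a hypothetical class written for this illustration, not something from llm or any plugin. It performs a read, an await, and then a write, which is the pattern that breaks when calls interleave.

```python
import asyncio


class CounterToolbox:
    """Hypothetical stateful toolbox; not part of llm."""

    def __init__(self):
        self.total = 0

    async def add(self, amount):
        current = self.total            # read shared state
        await asyncio.sleep(0)          # yield to the event loop, as real I/O would
        self.total = current + amount   # write based on a possibly stale read


async def run_parallel():
    box = CounterToolbox()
    await asyncio.gather(*(box.add(1) for _ in range(10)))
    return box.total


async def run_sequential():
    box = CounterToolbox()
    for _ in range(10):
        await box.add(1)
    return box.total
```

Run sequentially, the ten calls add up to 10. Run through gather(), updates are lost because every call reads the counter before any call writes it. A tool with no await between its read and its write would not show this, which matches the "depends on plugin-specific behavior" verdict.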
The 8 findings that held up
Three stood out.
PDF attachment data persisted in logs. The redact_data() function strips image_url.url and input_audio.data from logged prompt JSON, but has no case for file.file_data, where PDF attachments are stored as base64. Full PDF contents persist in logs.db. Users who share that database could inadvertently expose document contents. Filed as #1396.
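The shape of this class of bug is a redaction pass that enumerates known payload keys and misses one. The sketch below is a standalone illustration under assumed data shapes, not the project's redact_data() implementation. The redact function and the "..." placeholder are hypothetical; only the key names image_url.url, input_audio.data, and file.file_data come from the finding.

```python
def redact(value):
    """Recursively blank out base64 payloads in a logged prompt structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            if key == "image_url" and isinstance(child, dict) and "url" in child:
                child["url"] = "..."
            elif key == "input_audio" and isinstance(child, dict) and "data" in child:
                child["data"] = "..."
            elif key == "file" and isinstance(child, dict) and "file_data" in child:
                # The case the finding reports as missing.
                child["file_data"] = "..."
            else:
                redact(child)
    elif isinstance(value, list):
        for item in value:
            redact(item)
    return value


message = {
    "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
        {"type": "file", "file": {"filename": "report.pdf", "file_data": "JVBERi0x"}},
    ]
}
redacted = redact(message)
```

Removing the file branch from this sketch leaves the PDF payload untouched while the image payload is still stripped, which is why the gap is easy to miss when reading the function.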
Embedding dedup comparing wrong keys. embed_multi_with_metadata() queries by content_hash but then filters by comparing incoming item IDs against returned row IDs. These are semantically different values. Duplicate content under a new ID bypasses dedup silently. Filed as #1397.
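A reduced model of the mismatch, assuming a store that maps row IDs to content hashes. The function names, the in-memory dict, and the choice of SHA-256 are all assumptions made for illustration. The real function works against SQLite and may hash differently.

```python
import hashlib


def content_hash(text):
    # Hash function chosen for the sketch only.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def filter_by_id(stored, incoming):
    """Look rows up by hash, then filter incoming items by ID (the reported pattern)."""
    hashes = {content_hash(text) for _, text in incoming}
    matching_ids = {row_id for row_id, h in stored.items() if h in hashes}
    return [(item_id, text) for item_id, text in incoming if item_id not in matching_ids]


def filter_by_hash(stored, incoming):
    """Filter incoming items by the same key the lookup used."""
    stored_hashes = set(stored.values())
    return [(item_id, text) for item_id, text in incoming
            if content_hash(text) not in stored_hashes]


stored = {"doc-1": content_hash("hello")}
incoming = [("doc-2", "hello")]  # same content, new ID

still_embedded = filter_by_id(stored, incoming)
skipped = filter_by_hash(stored, incoming)
```

In this model, filter_by_id lets ("doc-2", "hello") through because "doc-2" is not among the matching row IDs, even though its content is already stored. Whether filtering purely by hash is the right fix depends on the intended semantics, which is the maintainer's call.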
Stale loop variable in tool logging. In log_to_db(), the tool_instances INSERT references tool.plugin from a previous loop. Python loop variables retain their last value after the loop ends, so every tool result gets attributed to whichever toolbox was last in the list. Filed as #1398.
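The language behavior behind this finding is easy to demonstrate on its own. In the sketch below, Tool, attribute_stale, and attribute_by_lookup are hypothetical names and the data shapes are assumptions. It is not the log_to_db() code, only the same pattern in miniature.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    plugin: str


def attribute_stale(tools, results):
    for tool in tools:
        pass  # stand-in for earlier per-tool work
    # `tool` is still bound to the last element of `tools` here.
    return [(name, tool.plugin) for name, _ in results]


def attribute_by_lookup(tools, results):
    # Resolve each result's plugin from its own tool name instead.
    plugin_by_name = {t.name: t.plugin for t in tools}
    return [(name, plugin_by_name[name]) for name, _ in results]


tools = [Tool("search", "plugin-a"), Tool("fetch", "plugin-b")]
results = [("search", "..."), ("fetch", "...")]

stale = attribute_stale(tools, results)
fixed = attribute_by_lookup(tools, results)
```

The stale version attributes both results to "plugin-b". It would also raise an UnboundLocalError if tools were empty, because the loop variable would never be assigned.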
The remaining five: a possible migration race window when multiple processes start before migrations complete (commented on #789), a potential --async --usage crash with AsyncChainResponse, negative --chain-limit failing immediately, asyncio.run() called inside running event loops, and cosine_similarity() dividing by zero on zero vectors.
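Of these five, the zero-vector case is the simplest to illustrate. The two functions below are written from scratch for this sketch and are not llm's cosine_similarity(). Returning 0.0 for a zero-magnitude vector is one possible convention, assumed here. Raising a descriptive error would be another.

```python
import math


def cosine_similarity_naive(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # ZeroDivisionError if either norm is 0


def cosine_similarity_guarded(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: a zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)
```

An all-zero embedding is unusual but not impossible, for example when a model or plugin returns a placeholder vector, so the guard costs little.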
Severity ratings are our internal assessment. None have been confirmed by the maintainer yet.
| # | Finding | Validation | Filed |
|---|---|---|---|
| 1 | PDF data not stripped by redact_data() | 3/3 | #1396 |
| 2 | Embedding dedup compares wrong keys | 3/3 | #1397 |
| 3 | Possible migration race window | 3/3 | #789 |
| 4 | Async tool races shared state | 2/3 | -- |
| 5 | --async --usage crash | 3/3 | -- |
| 6 | Stale loop variable in log_to_db() | 3/3 | #1398 |
| 7 | Negative --chain-limit fails | 3/3 | -- |
| 8 | asyncio.run() in event loop | 3/3 | -- |
| 9 | Hook exceptions crash batch | 2/3 | -- |
| 10 | Memory with large attachments | 2/3 | -- |
| 11 | cosine_similarity / zero | 3/3 | -- |
What sqry contributed
sqry gave the reviewers structural navigation instead of text search:
- find_cycles confirmed zero import cycles and one guarded call cycle (get_model calling get_async_model and vice versa)
- complexity_metrics identified logs_list() at complexity 43 (622 lines) and prompt() at complexity 35 (450 lines, 30 parameters)
- direct_callers and explain_code let Codex trace the full _attachment() to log_to_db() to redact_data() call path that exposed the PDF issue
- pattern_search found the stale loop variable pattern across the codebase
Structural navigation means the reviewers could follow call paths and dependency chains rather than searching for keywords. That is the difference between asking "where is this function called" and actually knowing.
Try it
The llm plugin provides the simplest entry point. It routes through the MCP gateway under the hood. For structural review like we describe in this article, you would also want sqry running as an MCP server so the models can navigate call graphs.
# Install the llm plugin (requires Node.js 18+)
llm install llm-cli-gateway
# Basic usage
llm -m gateway-codex "Review this file for bugs: $(cat src/main.py)"
llm -m gateway-gemini "Review this file for bugs: $(cat src/main.py)"
# For structural review with sqry, use the MCP gateway directly
npm install -g llm-cli-gateway
- Gateway: github.com/verivus-oss/llm-cli-gateway
- Plugin: pypi.org/project/llm-cli-gateway
- sqry: github.com/verivus-oss/sqry
What we took away
The findings we filed are candidates that survived three-way review. The maintainer may disagree with some of them. The point of the exercise was to test the methodology, and we are grateful to Simon for building llm in the open where this kind of analysis is possible.
The reviewers did not find SQL injection surfaces in the paths they inspected. The issues they found are subtle: stale loop variables, key mismatches in dedup logic, missing cases in sanitization functions. These are the kinds of things that survive human review because the code reads well.
What stayed with us were the disagreements. Two models confirming something does not make it true. The third model asking whether something is actually a defect is what separates useful review from noise. That is why you review with multiple perspectives.
We will keep running this pattern. Three independent perspectives catch things that one perspective misses. That is the premise behind llm-cli-gateway, and this was a useful case study.
Werner Kasselman is a software engineer who builds open source developer tools in his spare time, including sqry and llm-cli-gateway. By day he works at ServiceNow. He lives in Australia with his family and blogs at medium.com/@wernerk. Views expressed here are his own and do not represent ServiceNow.