<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Verivus OSS Releases</title>
    <description>The latest articles on DEV Community by Verivus OSS Releases (@verivusossreleases).</description>
    <link>https://dev.to/verivusossreleases</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3852608%2F7b1cf6ad-4264-4bcc-9d3f-93c157d2ef2b.png</url>
      <title>DEV Community: Verivus OSS Releases</title>
      <link>https://dev.to/verivusossreleases</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/verivusossreleases"/>
    <language>en</language>
    <item>
      <title>Multiple Agents, Multiple Workstreams, and the Parts That Still Break</title>
      <dc:creator>Verivus OSS Releases</dc:creator>
      <pubDate>Thu, 16 Apr 2026 08:12:42 +0000</pubDate>
      <link>https://dev.to/verivusossreleases/multiple-agents-multiple-workstreams-and-the-parts-that-still-break-43ep</link>
      <guid>https://dev.to/verivusossreleases/multiple-agents-multiple-workstreams-and-the-parts-that-still-break-43ep</guid>
      <description>&lt;h1&gt;
  
  
  Multiple Agents, Multiple Workstreams, and the Parts That Still Break
&lt;/h1&gt;

&lt;p&gt;I think the current debate around coding agents gets flattened too quickly.&lt;/p&gt;

&lt;p&gt;One side says multiple agents are already here. Separate worktrees, specialized roles, parallel streams of work, and a measurable boost in throughput. The other side says a lot of these systems still over-promise, stall, and leave too much coordination work on the human operator.&lt;/p&gt;

&lt;p&gt;After looking at our own repo activity and fixing a real compatibility break in &lt;code&gt;grokrs&lt;/code&gt;, I think both sides are seeing something real.&lt;/p&gt;

&lt;p&gt;The leverage is real.&lt;/p&gt;

&lt;p&gt;The fragility is real too.&lt;/p&gt;

&lt;h2&gt;
  
  
  The weak version of the debate is already over
&lt;/h2&gt;

&lt;p&gt;The weakest version of the debate is whether multiple agents and multiple workstreams can run at the same time at all. In our environment, they clearly can.&lt;/p&gt;

&lt;p&gt;I checked recent repo activity across &lt;code&gt;/srv/repos/internal/verivusai-labs&lt;/code&gt; and &lt;code&gt;/srv/repos/public&lt;/code&gt; and looked for same-hour overlap in both repository work and agent-specific metadata directories.&lt;/p&gt;

&lt;p&gt;Here is the short version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;41&lt;/code&gt; git repos scanned&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;25&lt;/code&gt; with activity in the last four weeks&lt;/li&gt;
&lt;li&gt;busiest days: &lt;code&gt;2026-03-21&lt;/code&gt; and &lt;code&gt;2026-04-05&lt;/code&gt;, both with &lt;code&gt;7&lt;/code&gt; active repos&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There were also clear same-hour overlaps between agents and repos:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Ghost&lt;/code&gt; on &lt;code&gt;2026-03-30&lt;/code&gt;: Claude, Codex, and Cursor active in the same hour&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;arctos&lt;/code&gt; on &lt;code&gt;2026-04-02&lt;/code&gt;: Claude and AIVCS active in the same hour&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GitNexus&lt;/code&gt; on &lt;code&gt;2026-04-05&lt;/code&gt;: Claude and Cursor active in the same hour&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not a vibes-based claim. It is timestamped concurrent work.&lt;/p&gt;

&lt;p&gt;So I do not think “multiple workstreams are fake” is a serious position anymore. The better question is whether multiple agents can work in parallel in a way that is reliable, observable, and cheap to integrate.&lt;/p&gt;

&lt;p&gt;That is where things get more interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually seems to break first
&lt;/h2&gt;

&lt;p&gt;From what I can see, the first failures usually happen around the agent system, not inside the basic idea of parallelism itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Isolation
&lt;/h3&gt;

&lt;p&gt;This is why worktrees keep coming up in the strongest pro-agent posts.&lt;/p&gt;

&lt;p&gt;If multiple agents share mutable state carelessly, they interfere with each other. They overwrite assumptions, pollute local context, and turn parallel work into a race condition.&lt;/p&gt;

&lt;p&gt;The useful claim is not “I launched a bunch of agents.” The useful claim is “I gave them isolated execution surfaces, so they could run without stepping on each other.”&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Visibility
&lt;/h3&gt;

&lt;p&gt;One of the better skeptical complaints I saw was from Demir Bülbüloğlu on &lt;code&gt;2026-02-22&lt;/code&gt;. The complaint was not just that a system failed. It was that the system &lt;em&gt;claimed&lt;/em&gt; to be running multiple agents and then stalled instead of finishing.&lt;/p&gt;

&lt;p&gt;That matters because it points to a gap between claimed concurrency and observable concurrency.&lt;/p&gt;

&lt;p&gt;Once a system says it is running multiple agents, the operator needs answers to a few basic questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which task is active&lt;/li&gt;
&lt;li&gt;which agent owns which workspace&lt;/li&gt;
&lt;li&gt;whether a tool call finished, failed, or retried&lt;/li&gt;
&lt;li&gt;whether output was actually produced or quietly dropped&lt;/li&gt;
&lt;/ul&gt;
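&lt;p&gt;One way to close that gap is to make every agent action emit a queryable record instead of hidden state. Here is a minimal sketch; the field names are hypothetical, not from any particular framework:&lt;/p&gt;

```python
# Minimal, illustrative event record for observable multi-agent runs.
# Each question above maps to a queryable field rather than hidden state.
from dataclasses import dataclass, field
from typing import Optional
import time

@dataclass
class AgentEvent:
    agent: str           # which agent emitted this
    workspace: str       # which workspace it owns
    task: str            # which task is active
    status: str          # "started", "finished", "failed", "retried"
    output_path: Optional[str] = None  # None means no output was produced
    ts: float = field(default_factory=time.time)

log: list[AgentEvent] = []

def record(agent, workspace, task, status, output_path=None):
    evt = AgentEvent(agent, workspace, task, status, output_path)
    log.append(evt)
    return evt

record("claude", "wt/feature-a", "refactor-parser", "started")
record("claude", "wt/feature-a", "refactor-parser", "finished", "patch.diff")

# "Was output actually produced or quietly dropped?" becomes a query:
dropped = [e for e in log if e.status == "finished" and e.output_path is None]
```

&lt;p&gt;The point is not this particular schema. It is that "which agent owns which workspace" stops being a claim and becomes a lookup.&lt;/p&gt;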

&lt;p&gt;Without that, “multiple workstreams” is not really a workflow model. It is hidden state with a strong marketing wrapper.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Protocol drift
&lt;/h3&gt;

&lt;p&gt;I got a very practical reminder of this while repairing &lt;code&gt;grokrs --x-search&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The break had nothing to do with whether X search was conceptually possible. It was a compatibility problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;grokrs&lt;/code&gt; still emitted top-level &lt;code&gt;search_parameters&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;xAI now expects search configuration on the tool objects themselves&lt;/li&gt;
&lt;li&gt;the old shape was routed to the deprecated Live Search path and returned HTTP &lt;code&gt;410&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
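&lt;p&gt;To make the drift concrete, here is a hedged sketch of the two request shapes as plain dictionaries. Only &lt;code&gt;search_parameters&lt;/code&gt; and the &lt;code&gt;x_search&lt;/code&gt; tool type come from the actual fix; the rest of the field layout is an illustration, not the real xAI schema:&lt;/p&gt;

```python
# Illustrative sketch only: the exact xAI payload layout is not shown here.
# The old client shape put search configuration at the top level.
old_request = {
    "model": "grok",
    "input": "Summarize recent posts about xAI",
    "search_parameters": {"mode": "on"},  # deprecated path, answered with HTTP 410
}

# The newer shape attaches the configuration to the tool object itself.
new_request = {
    "model": "grok",
    "input": "Summarize recent posts about xAI",
    "tools": [
        {"type": "x_search", "mode": "on"}  # hypothetical field layout
    ],
}

def uses_deprecated_shape(req):
    """Catch the stale top-level key before the request hits the wire."""
    return "search_parameters" in req
```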

&lt;p&gt;After fixing that request shape, more drift showed up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;newer Responses payloads included &lt;code&gt;output_text&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;newer tool-backed responses also carried server-side tool usage in a shape our parser did not yet accept&lt;/li&gt;
&lt;/ul&gt;
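&lt;p&gt;The repair on the parsing side amounts to tolerance: accept the newer fields when present, fall back to older shapes, and degrade gracefully on payloads you do not recognize. A minimal sketch, with assumed field names rather than the real schema:&lt;/p&gt;

```python
# A tolerant extractor: prefer the newer "output_text" field, fall back to
# an older message-style shape, and never crash on unknown usage payloads.
# Field names here are assumptions for illustration, not the real schema.
def extract_text(payload):
    if "output_text" in payload:
        return payload["output_text"]
    # fall back to an older message-style shape
    for item in payload.get("output", []):
        if item.get("type") == "message":
            return item.get("text", "")
    return ""

def extract_usage(payload):
    # Unknown or reshaped usage blocks degrade to an empty dict
    # instead of raising, so drift does not break the whole call.
    usage = payload.get("usage")
    return usage if isinstance(usage, dict) else {}

newer = {"output_text": "hello", "usage": {"total_tokens": 12}}
older = {"output": [{"type": "message", "text": "hi"}]}
```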

&lt;p&gt;That is a normal systems problem, but I think it is exactly the kind of problem that gets misread in agent discourse. A workflow can be conceptually valid and still be operationally brittle because its boundaries are stale.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Human coordination load
&lt;/h3&gt;

&lt;p&gt;This is where the skepticism from priyanka’s &lt;code&gt;2026-03-11&lt;/code&gt; post lands for me. Even if multiple workstreams are real, the human often still carries the most expensive parts of the workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;decomposing the work&lt;/li&gt;
&lt;li&gt;deciding which stream matters more&lt;/li&gt;
&lt;li&gt;reviewing partial outputs&lt;/li&gt;
&lt;li&gt;merging conflicting changes&lt;/li&gt;
&lt;li&gt;deciding what to retry and what to discard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that burden remains too high, then the system has not really achieved delegation. It has achieved assisted supervision.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I changed in &lt;code&gt;grokrs&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;To get &lt;code&gt;grokrs ... --x-search&lt;/code&gt; working again, I made a few targeted compatibility fixes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stop sending deprecated top-level &lt;code&gt;search_parameters&lt;/code&gt; for tool-backed search&lt;/li&gt;
&lt;li&gt;Move X search filters onto the &lt;code&gt;x_search&lt;/code&gt; tool object&lt;/li&gt;
&lt;li&gt;Accept newer response shapes like &lt;code&gt;output_text&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Make the usage parser more tolerant of current server-side tool usage payloads&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After that, this worked again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;grokrs &lt;span class="nt"&gt;--profile&lt;/span&gt; dev agent &lt;span class="nt"&gt;--headless&lt;/span&gt; &lt;span class="nt"&gt;--approval-mode&lt;/span&gt; allow &lt;span class="nt"&gt;--x-search&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-iterations&lt;/span&gt; 2 &lt;span class="s2"&gt;"Summarize what people are saying about xAI on X in one sentence."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also reran the package test suites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;grokrs-api&lt;/code&gt;: &lt;code&gt;906&lt;/code&gt; tests passed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-cli&lt;/code&gt;: &lt;code&gt;291&lt;/code&gt; tests passed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think this is the more useful lesson from the repair: multi-agent systems often degrade first at the boundaries. Not in the screenshot. Not in the prompt demo. At the boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  The synthesis that seems most honest
&lt;/h2&gt;

&lt;p&gt;I do not think the right conclusion is that the optimists are wrong or the skeptics are wrong.&lt;/p&gt;

&lt;p&gt;The optimistic posts are right that parallel work is already useful. The skeptical posts are right that a system which merely &lt;em&gt;claims&lt;/em&gt; parallelism is not enough.&lt;/p&gt;

&lt;p&gt;The synthesis I keep coming back to is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiple agents are real&lt;/li&gt;
&lt;li&gt;multiple workstreams are useful&lt;/li&gt;
&lt;li&gt;neither is self-validating&lt;/li&gt;
&lt;li&gt;the hard part is shifting from generation to coordination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why the interesting engineering work now seems to be moving toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;isolated worktrees and workspace boundaries&lt;/li&gt;
&lt;li&gt;explicit ownership of subtasks&lt;/li&gt;
&lt;li&gt;event and progress visibility&lt;/li&gt;
&lt;li&gt;parsers and clients that survive upstream churn&lt;/li&gt;
&lt;li&gt;review and merge loops that handle partial failure well&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think that is the part of the story that matters most. “Can an agent write code?” is no longer the whole question. “Can the system around several agents make their work dependable?” is the real one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thought
&lt;/h2&gt;

&lt;p&gt;Saying “we run multiple agents” is easy.&lt;/p&gt;

&lt;p&gt;What matters is whether those agents can work in parallel without corrupting state, whether the operator can see what is happening, whether the system survives interface drift, and whether the outputs are cheap to review and integrate.&lt;/p&gt;

&lt;p&gt;That is the line between a screenshot and an operating model.&lt;/p&gt;

&lt;p&gt;The X discourse feels like it is converging on that distinction, even when the posts sound like they disagree. One side is seeing the leverage. The other side is seeing the fragility.&lt;/p&gt;

&lt;p&gt;I think both are describing the same transition from different angles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Boris Cherny
&lt;a href="https://x.com/bcherny/status/2038454353787519164" rel="noopener noreferrer"&gt;https://x.com/bcherny/status/2038454353787519164&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Abdulmuiz Adeyemo
&lt;a href="https://x.com/AbdMuizAdeyemo/status/2025519825691283657" rel="noopener noreferrer"&gt;https://x.com/AbdMuizAdeyemo/status/2025519825691283657&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Numman Ali
&lt;a href="https://x.com/nummanali/status/2019473874455331156" rel="noopener noreferrer"&gt;https://x.com/nummanali/status/2019473874455331156&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Julian Goldie
&lt;a href="https://x.com/JulianGoldieSEO/status/2020081836240896487" rel="noopener noreferrer"&gt;https://x.com/JulianGoldieSEO/status/2020081836240896487&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Demir Bülbüloğlu
&lt;a href="https://x.com/demirbulbuloglu/status/2025598095312982249" rel="noopener noreferrer"&gt;https://x.com/demirbulbuloglu/status/2025598095312982249&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;priyanka
&lt;a href="https://x.com/pridesai/status/2031783971047051445" rel="noopener noreferrer"&gt;https://x.com/pridesai/status/2031783971047051445&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>productivity</category>
      <category>rust</category>
    </item>
    <item>
      <title>3 AIs Reviewed the Same Codebase. They Disagreed on 2 Findings. That is the Point.</title>
      <dc:creator>Verivus OSS Releases</dc:creator>
      <pubDate>Tue, 07 Apr 2026 05:51:55 +0000</pubDate>
      <link>https://dev.to/verivusossreleases/3-ais-reviewed-the-same-codebase-they-disagreed-on-2-findings-that-is-the-point-63a</link>
      <guid>https://dev.to/verivusossreleases/3-ais-reviewed-the-same-codebase-they-disagreed-on-2-findings-that-is-the-point-63a</guid>
      <description>&lt;p&gt;We have a rule at Verivus Labs: before code ships, it gets reviewed by three AI models independently. We require unconditional approval from Claude, Codex, and Gemini before anything merges. We wrote about the mechanics of that process in &lt;a href="https://medium.com/@wernerk/the-codex-review-gate-how-we-made-ai-agents-review-each-others-work-59e9ff5465f9" rel="noopener noreferrer"&gt;The Codex Review Gate&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That process works well on our own code. We wanted to know whether it finds real things in code we did not write. Code that is already well-maintained and well-structured.&lt;/p&gt;

&lt;p&gt;Simon Willison's &lt;a href="https://github.com/simonw/llm" rel="noopener noreferrer"&gt;llm&lt;/a&gt; is one of the better-engineered CLI tools in the Python ecosystem. It has a clean architecture, a comprehensive plugin system, and parameterized SQL throughout. The reviewers independently noted the consistent SQL safety, which speaks to the care that has gone into the project. We pointed our tools at it and filed the findings that survived review.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;Two of our tools did the heavy lifting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/verivus-oss/sqry" rel="noopener noreferrer"&gt;sqry&lt;/a&gt; is our AST-based code analysis tool. We wrote about it in &lt;a href="https://medium.com/@wernerk/the-code-question-grep-cant-answer-057bfc8d7fe2" rel="noopener noreferrer"&gt;The Code Question grep Can't Answer&lt;/a&gt;. It parses code structurally, building function signatures, call graphs, and dependency relationships, and exposes them through an MCP server. sqry gave the reviewers a structural map of 40 Python source files containing 5,499 symbols and 7,277 edges.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;llm-cli-gateway&lt;/a&gt; coordinated the reviews. It is our MCP server for multi-LLM orchestration. It wraps Claude, Codex, and Gemini through a single interface with retries, circuit breakers, and session management. Each reviewer got the same prompt and the same sqry access, run in separate sessions with no shared context.&lt;/p&gt;

&lt;p&gt;We also built an &lt;a href="https://pypi.org/project/llm-cli-gateway/" rel="noopener noreferrer"&gt;llm plugin&lt;/a&gt; that bridges our gateway into Simon's own &lt;code&gt;llm&lt;/code&gt; ecosystem. Install with &lt;code&gt;llm install llm-cli-gateway&lt;/code&gt; and you get &lt;code&gt;gateway-claude&lt;/code&gt;, &lt;code&gt;gateway-codex&lt;/code&gt;, and &lt;code&gt;gateway-gemini&lt;/code&gt; as models. The plugin requires Node.js 18+ for the gateway runtime. We wanted to contribute to Simon's ecosystem.&lt;/p&gt;

&lt;p&gt;The review target was &lt;code&gt;simonw/llm&lt;/code&gt; at commit &lt;code&gt;cad03fb&lt;/code&gt;, reviewed on April 4, 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  What they found
&lt;/h2&gt;

&lt;p&gt;Codex went first. 11 minutes, 307K tokens. It used sqry to navigate the call graph, then fetched source directly from GitHub to verify against specific commits. It identified 8 potential issues.&lt;/p&gt;

&lt;p&gt;Gemini went second. 8 minutes. It used sqry hierarchical search and pattern search. It confirmed 5 of Codex's findings and identified 3 new ones.&lt;/p&gt;

&lt;p&gt;We then sent each reviewer's unique findings to the other for cross-validation. At this point we had 11 candidate findings, all confirmed by both Codex and Gemini.&lt;/p&gt;

&lt;p&gt;Two reviewers is good, but three is better. Claude did an independent adjudication pass over the 11 candidates, reading each relevant source file and providing line-level verdicts. Claude's role was validation. It assessed whether each finding was a genuine defect or a defensible design choice.&lt;/p&gt;

&lt;p&gt;Claude confirmed 8 findings. It disputed 2. It marked 1 uncertain.&lt;/p&gt;

&lt;p&gt;The disputes taught us the most.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 2 findings Claude rejected
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Uncaught hook exceptions in async tool execution.&lt;/strong&gt; Codex and Gemini both flagged that &lt;code&gt;before_call&lt;/code&gt;/&lt;code&gt;after_call&lt;/code&gt; hooks in the async path run outside try/except, meaning a buggy plugin hook crashes the entire parallel tool batch.&lt;/p&gt;

&lt;p&gt;Claude disagreed. If an after-call hook throws, that is an unexpected error and should propagate. Silently swallowing hook failures would mask plugin bugs. The current behavior is a defensible design choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory usage with large attachments.&lt;/strong&gt; Codex and Gemini both noted that &lt;code&gt;_attachment()&lt;/code&gt; eagerly reads entire files into memory, base64-encodes them (33% expansion), and holds everything in a JSON object simultaneously.&lt;/p&gt;
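&lt;p&gt;The 33% figure is just base64 arithmetic: every 3 input bytes become 4 output characters. A quick check:&lt;/p&gt;

```python
# Verify the base64 expansion factor: 3 input bytes become 4 output chars,
# so an encoded attachment is 4/3 the size of the original file.
import base64

raw = bytes(3_000_000)  # a 3 MB stand-in for a large attachment
encoded = base64.b64encode(raw)

expansion = len(encoded) / len(raw)
# expansion is 4/3, i.e. roughly a 33% size increase before JSON wrapping
```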

&lt;p&gt;Claude's assessment was that this is inherent to how multimodal API calls work. The content has to be serialized to send it. There is no unnecessary duplication. It is the minimum work required by the API contract.&lt;/p&gt;

&lt;p&gt;Both are reasonable arguments, and this is why three-way review matters. Two models agreeing does not make something a defect. The third model asking whether something is actually wrong, or just uncomfortable, prevents filing noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 1 finding Claude marked uncertain
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Async tool execution racing shared Toolbox state.&lt;/strong&gt; Codex and Gemini flagged that the async path batches tool calls into &lt;code&gt;asyncio.gather()&lt;/code&gt;, which could race if a &lt;code&gt;Toolbox&lt;/code&gt; instance maintains state across calls. Claude's assessment was that the framework's own state management appears safe, but whether the issue manifests depends on plugin-specific behavior. The framework does not guarantee sequential execution, and plugins may not expect parallelism.&lt;/p&gt;
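&lt;p&gt;The failure mode is easy to reproduce generically. This is not code from &lt;code&gt;llm&lt;/code&gt;; it is a minimal demonstration of why a read-modify-write across an await point loses updates when calls run under &lt;code&gt;asyncio.gather()&lt;/code&gt;:&lt;/p&gt;

```python
# Generic demonstration (not code from simonw/llm): a read-modify-write
# across an await point loses updates when tool calls run concurrently.
import asyncio

state = {"calls": 0}

async def stateful_tool():
    seen = state["calls"]      # read shared state
    await asyncio.sleep(0)     # yield, as any real tool call would
    state["calls"] = seen + 1  # write back a now-stale value

async def main():
    await asyncio.gather(*(stateful_tool() for _ in range(5)))

asyncio.run(main())
# All five coroutines read 0 before any wrote, so only one increment survives.
```

&lt;p&gt;A plugin written with sequential execution in mind would pass its own tests and still miscount under the batched path.&lt;/p&gt;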

&lt;h2&gt;
  
  
  The 8 findings that held up
&lt;/h2&gt;

&lt;p&gt;Three stood out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PDF attachment data persisted in logs.&lt;/strong&gt; The &lt;code&gt;redact_data()&lt;/code&gt; function strips &lt;code&gt;image_url.url&lt;/code&gt; and &lt;code&gt;input_audio.data&lt;/code&gt; from logged prompt JSON, but has no case for &lt;code&gt;file.file_data&lt;/code&gt;, where PDF attachments are stored as base64. Full PDF contents persist in &lt;code&gt;logs.db&lt;/code&gt;. Users who share that database could inadvertently expose document contents. Filed as &lt;a href="https://github.com/simonw/llm/issues/1396" rel="noopener noreferrer"&gt;#1396&lt;/a&gt;.&lt;/p&gt;
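&lt;p&gt;A simplified recreation of the pattern (not the actual &lt;code&gt;llm&lt;/code&gt; source) shows how a case-by-case redactor misses a new attachment type by default:&lt;/p&gt;

```python
# Simplified recreation of the reported gap: a redactor that strips image
# and audio payloads but has no case for "file"-type attachments, so
# base64 PDF data passes through into the log untouched.
def redact(part):
    kind = part.get("type")
    if kind == "image_url":
        return {"type": kind, "image_url": {"url": "[redacted]"}}
    if kind == "input_audio":
        return {"type": kind, "input_audio": {"data": "[redacted]"}}
    return part  # "file" falls through: file_data survives

pdf_part = {"type": "file", "file": {"file_data": "JVBERi0xLjQ_base64_payload"}}
image_part = {"type": "image_url", "image_url": {"url": "data:image/png;..."}}

logged_pdf = redact(pdf_part)
logged_image = redact(image_part)
```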

&lt;p&gt;&lt;strong&gt;Embedding dedup comparing wrong keys.&lt;/strong&gt; &lt;code&gt;embed_multi_with_metadata()&lt;/code&gt; queries by &lt;code&gt;content_hash&lt;/code&gt; but then filters by comparing incoming item IDs against returned row IDs. These are semantically different values. Duplicate content under a new ID bypasses dedup silently. Filed as &lt;a href="https://github.com/simonw/llm/issues/1397" rel="noopener noreferrer"&gt;#1397&lt;/a&gt;.&lt;/p&gt;
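&lt;p&gt;The shape of that bug, reduced to a few lines (illustrative, not the actual implementation):&lt;/p&gt;

```python
# Illustrative version of the dedup bug: querying by content hash but then
# filtering by item ID compares two different keyspaces, so duplicate
# content stored under a new ID slips past the check.
import hashlib

stored = [{"id": "doc-1", "content_hash": hashlib.md5(b"same text").hexdigest()}]

def is_duplicate_buggy(item_id, content):
    h = hashlib.md5(content).hexdigest()
    rows = [r for r in stored if r["content_hash"] == h]  # query by hash...
    return any(r["id"] == item_id for r in rows)          # ...filter by ID

def is_duplicate_fixed(item_id, content):
    h = hashlib.md5(content).hexdigest()
    return any(r["content_hash"] == h for r in stored)    # compare like with like

# Same content, new ID: the buggy check reports "not a duplicate".
```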

&lt;p&gt;&lt;strong&gt;Stale loop variable in tool logging.&lt;/strong&gt; In &lt;code&gt;log_to_db()&lt;/code&gt;, the &lt;code&gt;tool_instances&lt;/code&gt; INSERT references &lt;code&gt;tool.plugin&lt;/code&gt; from a previous loop. Python loop variables retain their last value after the loop ends, so every tool result gets attributed to whichever toolbox was last in the list. Filed as &lt;a href="https://github.com/simonw/llm/issues/1398" rel="noopener noreferrer"&gt;#1398&lt;/a&gt;.&lt;/p&gt;
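&lt;p&gt;The underlying Python behavior is worth seeing in isolation, because the buggy code reads perfectly naturally:&lt;/p&gt;

```python
# The Python pitfall behind the finding: a loop variable outlives its loop,
# so code after the loop silently sees the last iteration's value.
toolboxes = [{"plugin": "alpha"}, {"plugin": "beta"}]

for tool in toolboxes:
    pass  # the first loop does its own work and ends

results = []
for result in ["r1", "r2"]:
    # BUG: "tool" still refers to the last toolbox from the loop above,
    # so every result is attributed to "beta".
    results.append({"result": result, "plugin": tool["plugin"]})
```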

&lt;p&gt;The remaining five: a possible migration race window when multiple processes start before migrations complete (&lt;a href="https://github.com/simonw/llm/issues/789#issuecomment-4188034320" rel="noopener noreferrer"&gt;commented on #789&lt;/a&gt;), a potential &lt;code&gt;--async --usage&lt;/code&gt; crash with &lt;code&gt;AsyncChainResponse&lt;/code&gt;, negative &lt;code&gt;--chain-limit&lt;/code&gt; failing immediately, &lt;code&gt;asyncio.run()&lt;/code&gt; called inside running event loops, and &lt;code&gt;cosine_similarity()&lt;/code&gt; dividing by zero on zero vectors.&lt;/p&gt;
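&lt;p&gt;The zero-vector case is the simplest of these to sketch. A guarded version of cosine similarity, where the fallback convention is our assumption and the right behavior is the maintainer's call:&lt;/p&gt;

```python
# Sketch of the zero-vector edge case: a naive cosine similarity divides
# by zero on an all-zero vector; a guarded version returns 0.0 instead.
import math

def cosine_similarity_naive(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def cosine_similarity_guarded(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention choice, not the only reasonable fallback
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
```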

&lt;p&gt;Severity ratings are our internal assessment. None have been confirmed by the maintainer yet.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Finding&lt;/th&gt;
&lt;th&gt;Validation&lt;/th&gt;
&lt;th&gt;Filed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;PDF data not stripped by &lt;code&gt;redact_data()&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/simonw/llm/issues/1396" rel="noopener noreferrer"&gt;#1396&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Embedding dedup compares wrong keys&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/simonw/llm/issues/1397" rel="noopener noreferrer"&gt;#1397&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Possible migration race window&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/simonw/llm/issues/789#issuecomment-4188034320" rel="noopener noreferrer"&gt;#789&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Async tool races shared state&lt;/td&gt;
&lt;td&gt;2/3&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;--async --usage&lt;/code&gt; crash&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Stale loop variable in &lt;code&gt;log_to_db()&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/simonw/llm/issues/1398" rel="noopener noreferrer"&gt;#1398&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Negative &lt;code&gt;--chain-limit&lt;/code&gt; fails&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;asyncio.run()&lt;/code&gt; in event loop&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Hook exceptions crash batch&lt;/td&gt;
&lt;td&gt;2/3&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Memory with large attachments&lt;/td&gt;
&lt;td&gt;2/3&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cosine_similarity&lt;/code&gt; / zero&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What sqry contributed
&lt;/h2&gt;

&lt;p&gt;sqry gave the reviewers structural navigation instead of text search:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;find_cycles&lt;/code&gt; confirmed zero import cycles and one guarded call cycle (&lt;code&gt;get_model&lt;/code&gt; calling &lt;code&gt;get_async_model&lt;/code&gt; and vice versa)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;complexity_metrics&lt;/code&gt; identified &lt;code&gt;logs_list()&lt;/code&gt; at complexity 43 (622 lines) and &lt;code&gt;prompt()&lt;/code&gt; at complexity 35 (450 lines, 30 parameters)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;direct_callers&lt;/code&gt; and &lt;code&gt;explain_code&lt;/code&gt; let Codex trace the full &lt;code&gt;_attachment()&lt;/code&gt; to &lt;code&gt;log_to_db()&lt;/code&gt; to &lt;code&gt;redact_data()&lt;/code&gt; call path that exposed the PDF issue&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pattern_search&lt;/code&gt; found the stale loop variable pattern across the codebase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Structural navigation means the reviewers could follow call paths and dependency chains rather than searching for keywords. That is the difference between asking "where is this function called" and actually knowing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;llm&lt;/code&gt; plugin provides the simplest entry point. It routes through the MCP gateway under the hood. For structural review like we describe in this article, you would also want sqry running as an MCP server so the models can navigate call graphs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the llm plugin (requires Node.js 18+)&lt;/span&gt;
llm &lt;span class="nb"&gt;install &lt;/span&gt;llm-cli-gateway

&lt;span class="c"&gt;# Basic usage&lt;/span&gt;
llm &lt;span class="nt"&gt;-m&lt;/span&gt; gateway-codex &lt;span class="s2"&gt;"Review this file for bugs: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;src/main.py&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
llm &lt;span class="nt"&gt;-m&lt;/span&gt; gateway-gemini &lt;span class="s2"&gt;"Review this file for bugs: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;src/main.py&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# For structural review with sqry, use the MCP gateway directly&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; llm-cli-gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Gateway: &lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;github.com/verivus-oss/llm-cli-gateway&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Plugin: &lt;a href="https://pypi.org/project/llm-cli-gateway/" rel="noopener noreferrer"&gt;pypi.org/project/llm-cli-gateway&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;sqry: &lt;a href="https://github.com/verivus-oss/sqry" rel="noopener noreferrer"&gt;github.com/verivus-oss/sqry&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What we took away
&lt;/h2&gt;

&lt;p&gt;The findings we filed are candidates that survived three-way review. The maintainer may disagree with some of them. The point of the exercise was to test the methodology, and we are grateful to Simon for building &lt;code&gt;llm&lt;/code&gt; in the open where this kind of analysis is possible.&lt;/p&gt;

&lt;p&gt;The reviewers did not find SQL injection surfaces in the paths they inspected. The issues they found are subtle. Stale loop variables, key mismatches in dedup logic, missing cases in sanitization functions. These are the kind of things that survive human review because the code reads well.&lt;/p&gt;

&lt;p&gt;What stayed with us were the disagreements. Two models confirming something does not make it true. The third model asking whether something is actually a defect is what separates useful review from noise. That is why you review with multiple perspectives.&lt;/p&gt;

&lt;p&gt;We will keep running this pattern. Three independent perspectives catch things that one perspective misses. That is the premise behind &lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;llm-cli-gateway&lt;/a&gt;, and this was a useful case study.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Werner Kasselman is a software engineer who builds open source developer tools in his spare time, including sqry and llm-cli-gateway. By day he works at ServiceNow. He lives in Australia with his family and blogs at &lt;a href="https://medium.com/@wernerk" rel="noopener noreferrer"&gt;medium.com/@wernerk&lt;/a&gt;. Views expressed here are his own and do not represent ServiceNow.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How We Built a Safety-First Rust Agent CLI in Two Days Without Letting the Codebase Turn to Mush</title>
      <dc:creator>Verivus OSS Releases</dc:creator>
      <pubDate>Tue, 07 Apr 2026 01:58:15 +0000</pubDate>
      <link>https://dev.to/verivusossreleases/how-we-built-a-safety-first-rust-agent-cli-in-two-days-without-letting-the-codebase-turn-to-mush-13hj</link>
      <guid>https://dev.to/verivusossreleases/how-we-built-a-safety-first-rust-agent-cli-in-two-days-without-letting-the-codebase-turn-to-mush-13hj</guid>
      <description>&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;p&gt;I think most AI-assisted software fails in one of two ways.&lt;/p&gt;

&lt;p&gt;The first failure mode is obvious. The code is sloppy, the boundaries are fuzzy, and the whole thing feels like a transcript that got committed by accident.&lt;/p&gt;

&lt;p&gt;The second failure mode is more subtle. The code is fine for a demo, but the repo has no durable planning model, no review trail, and no way to explain why one subsystem looks the way it does. A week later, nobody wants to touch it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;grokrs&lt;/code&gt; avoided both.&lt;/p&gt;

&lt;p&gt;This repo is a Rust-only scaffold for a Grok-oriented agent CLI. It is safety-first by design. More importantly, it was built fast without taking the usual shortcuts that make a codebase hard to trust. The artifact trail shows a concentrated implementation burst across &lt;code&gt;2026-04-05&lt;/code&gt; and &lt;code&gt;2026-04-06&lt;/code&gt;, but the result still has clear crate boundaries, deny-by-default policy handling, machine-readable planning, and a review system that is stronger than what I usually see in projects with much longer schedules.&lt;/p&gt;

&lt;p&gt;I want to walk through why that happened, because I think the process is as interesting as the software.&lt;/p&gt;

&lt;p&gt;I also used &lt;code&gt;grokrs&lt;/code&gt; itself to generate the article images and submit this draft to Dev.to. That felt like a fair test of whether the CLI is already useful outside its own repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  What was actually built
&lt;/h2&gt;

&lt;p&gt;At the workspace level, &lt;code&gt;grokrs&lt;/code&gt; is not a single crate with a heroic &lt;code&gt;main.rs&lt;/code&gt;. The root &lt;code&gt;Cargo.toml&lt;/code&gt; defines eight members:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;grokrs-core&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grokrs-cap&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grokrs-policy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grokrs-session&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grokrs-tool&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grokrs-cli&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grokrs-api&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grokrs-store&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That split matters. Each crate has a job. Each crate also imposes a limit.&lt;/p&gt;

&lt;p&gt;The repository currently has &lt;code&gt;130&lt;/code&gt; Rust source files under &lt;code&gt;crates/&lt;/code&gt;, &lt;code&gt;55,963&lt;/code&gt; total Rust source lines, and about &lt;code&gt;41,990&lt;/code&gt; Rust code lines once blank lines and comment-only lines are stripped out. A quick scan of test annotations turns up &lt;code&gt;1,647&lt;/code&gt; &lt;code&gt;#[test]&lt;/code&gt; and &lt;code&gt;#[tokio::test]&lt;/code&gt; markers in the crate tree. The docs side is not small either. There are &lt;code&gt;5&lt;/code&gt; top-level specs in &lt;code&gt;docs/specs/&lt;/code&gt;, and &lt;code&gt;15&lt;/code&gt; &lt;code&gt;IMPLEMENTATION_DAG.toml&lt;/code&gt; files under &lt;code&gt;docs/reviews/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is not a toy codebase pretending to be architecture.&lt;/p&gt;

&lt;p&gt;I reran a fresh full &lt;code&gt;sqry&lt;/code&gt; index on &lt;code&gt;2026-04-07&lt;/code&gt; because I wanted a better read on the codebase before publishing this. The new index came back with numbers that are hard to wave away:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;131&lt;/code&gt; indexed files&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;46,997&lt;/code&gt; indexed symbols&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;61,393&lt;/code&gt; graph edges&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;10,537&lt;/code&gt; functions and &lt;code&gt;6,760&lt;/code&gt; call sites&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0&lt;/code&gt; cycles in the current graph snapshot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The extended graph details are just as telling. &lt;code&gt;sqry&lt;/code&gt; broke the workspace down into &lt;code&gt;24,830&lt;/code&gt; variables, &lt;code&gt;1,269&lt;/code&gt; imports, &lt;code&gt;1,157&lt;/code&gt; types, &lt;code&gt;870&lt;/code&gt; macros, &lt;code&gt;509&lt;/code&gt; methods, &lt;code&gt;339&lt;/code&gt; modules, &lt;code&gt;285&lt;/code&gt; structs, and &lt;code&gt;81&lt;/code&gt; enums. It also reported &lt;code&gt;0&lt;/code&gt; cross-language edges, &lt;code&gt;4,843&lt;/code&gt; duplicate groups, and &lt;code&gt;3,898&lt;/code&gt; unused symbols. That is the kind of inventory I expect from a real codebase, not a weekend mockup.&lt;/p&gt;

&lt;p&gt;The concentration matters as much as the size. The git history in this repository shows &lt;code&gt;55,888&lt;/code&gt; net Rust source lines, nearly the entire tree counted above, landing on &lt;code&gt;2026-04-06&lt;/code&gt;, so the visible implementation burst was extremely concentrated even if the project story still spans two days.&lt;/p&gt;
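&lt;p&gt;For readers who want to reproduce the code-line counts above, the blank-and-comment strip is only a few lines. This is an illustrative sketch, not the exact rules behind the repo's numbers:&lt;/p&gt;

```python
# Approximate "code lines" the way the article does: drop blank lines
# and comment-only lines, then count what remains. The exact counting
# rules behind grokrs' figures are not published, so this is a sketch.
def count_lines(source):
    total = 0
    code = 0
    for line in source.splitlines():
        total += 1
        stripped = line.strip()
        # Skip blanks and lines that are only a // line comment.
        if stripped and not stripped.startswith("//"):
            code += 1
    return total, code

sample = "// header comment\n\nfn main() {\n    println!(\"hi\");\n}\n"
print(count_lines(sample))  # (5, 3)
```

&lt;p&gt;Running a function like this over every &lt;code&gt;.rs&lt;/code&gt; file under &lt;code&gt;crates/&lt;/code&gt; is how a "total lines versus code lines" pair falls out of a tree.&lt;/p&gt;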

&lt;p&gt;The crate breakdown is clean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;grokrs-cap&lt;/code&gt; carries rooted path handling and trust-level types&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-policy&lt;/code&gt; carries effect classification and deny-by-default evaluation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-tool&lt;/code&gt; carries tool traits, classification, and registry logic&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-api&lt;/code&gt; carries xAI and Grok transport, streaming, endpoints, and tool-loop code&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-store&lt;/code&gt; carries SQLite WAL persistence&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-session&lt;/code&gt; carries typed lifecycle state&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-core&lt;/code&gt; carries config and shared domain types&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-cli&lt;/code&gt; carries user-facing commands and orchestration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think this is one of the big reasons the repo stayed readable while moving fast. Lower-level safety primitives do not depend on the CLI. The API crate gets a policy gate injected at runtime rather than importing policy code directly. The store crate stays focused on state. Each boundary removes a class of later confusion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0shh4zlm6qn5405wuc4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0shh4zlm6qn5405wuc4.jpg" alt="Editorial illustration of crate boundaries and architecture" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the safety model works
&lt;/h2&gt;

&lt;p&gt;The top-level architecture doc says the project wants four things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;explicit trust boundaries&lt;/li&gt;
&lt;li&gt;a rooted filesystem model&lt;/li&gt;
&lt;li&gt;effects classified before execution&lt;/li&gt;
&lt;li&gt;a modular implementation that can grow without a rewrite&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code follows through on that.&lt;/p&gt;

&lt;p&gt;Trust is encoded in types. Sessions are parameterized by trust level. Path handling is rooted through &lt;code&gt;WorkspaceRoot&lt;/code&gt; and &lt;code&gt;WorkspacePath&lt;/code&gt;. The policy engine works in terms of explicit effects such as &lt;code&gt;FsRead&lt;/code&gt;, &lt;code&gt;FsWrite&lt;/code&gt;, &lt;code&gt;ProcessSpawn&lt;/code&gt;, and &lt;code&gt;NetworkConnect&lt;/code&gt;. The defaults are conservative. Network is denied by default. Shell spawning is denied by default. Workspace writes require validated relative paths.&lt;/p&gt;

&lt;p&gt;That is what I want to see in an agent CLI. The system is opinionated before the first command runs.&lt;/p&gt;
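&lt;p&gt;To make the deny-by-default posture concrete, here is a minimal sketch of effect evaluation. The effect names mirror the article; the rule table and the &lt;code&gt;evaluate&lt;/code&gt; function are hypothetical stand-ins, not grokrs code:&lt;/p&gt;

```python
# Minimal sketch of a deny-by-default effect policy. Effect names
# mirror the article (FsRead, FsWrite, ProcessSpawn, NetworkConnect);
# the rule set itself is illustrative.
DEFAULT_RULES = {
    "FsRead": "allow",         # reads inside the workspace root
    "FsWrite": "ask",          # writes need a validated relative path
    "ProcessSpawn": "deny",    # shell spawning denied by default
    "NetworkConnect": "deny",  # network denied by default
}

def evaluate(effect, rules=DEFAULT_RULES):
    # Anything the policy does not know about is denied, not allowed.
    return rules.get(effect, "deny")

print(evaluate("NetworkConnect"))  # deny
print(evaluate("SomeNewEffect"))   # deny: unknown effects fall through
```

&lt;p&gt;The important line is the fallback: an effect the policy has never heard of resolves to deny, which is the whole point of classifying effects before execution.&lt;/p&gt;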

&lt;p&gt;The tool path is especially solid. In the executor flow, a tool call gets looked up, classified, policy-checked, then executed. The approval behavior is explicit too:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;allow&lt;/code&gt; maps &lt;code&gt;Ask&lt;/code&gt; to &lt;code&gt;Allow&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deny&lt;/code&gt; maps &lt;code&gt;Ask&lt;/code&gt; to &lt;code&gt;Deny&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;interactive&lt;/code&gt; preserves &lt;code&gt;Ask&lt;/code&gt;, but current comments make clear that this is effectively a deny path until the approval broker is implemented&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That sequence matters. It means the project did not fake the approval layer just to keep the demo flowing.&lt;/p&gt;
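&lt;p&gt;The approval mapping can be sketched in a few lines. The &lt;code&gt;resolve&lt;/code&gt; function and string values here are hypothetical stand-ins for grokrs' actual types:&lt;/p&gt;

```python
# Sketch of the approval mapping described above: "allow" and "deny"
# resolve Ask decisions up front, while "interactive" preserves Ask
# but behaves as a deny path until an approval broker exists.
def resolve(decision, mode):
    if decision != "Ask":
        return decision  # Allow and Deny pass through unchanged
    if mode == "allow":
        return "Allow"
    if mode == "deny":
        return "Deny"
    # "interactive": Ask is preserved; with no broker wired in yet,
    # the executor effectively treats it as a deny.
    return "Ask"

print(resolve("Ask", "allow"))        # Allow
print(resolve("Ask", "interactive"))  # Ask
```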

&lt;h2&gt;
  
  
  The command surface is already broad
&lt;/h2&gt;

&lt;p&gt;The first spec in &lt;code&gt;docs/specs/00_SPEC.md&lt;/code&gt; is intentionally modest. It says the initial release does not promise a production agent runtime yet. It wants to establish the boundaries needed to build one safely.&lt;/p&gt;

&lt;p&gt;That makes the current command surface more interesting, not less.&lt;/p&gt;

&lt;p&gt;The repo already supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;direct API operations through &lt;code&gt;grokrs api&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;interactive REPL chat through &lt;code&gt;grokrs chat&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;tool-calling agent execution through &lt;code&gt;grokrs agent&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;management work through &lt;code&gt;collections&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;media generation through &lt;code&gt;generate&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;model discovery through &lt;code&gt;models&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;session and store inspection through &lt;code&gt;sessions&lt;/code&gt; and &lt;code&gt;store&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;runtime posture and config inspection through &lt;code&gt;doctor&lt;/code&gt; and &lt;code&gt;show_config&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture doc also calls out &lt;code&gt;voice&lt;/code&gt;, MCP client support, search integration, encrypted reasoning replay, prompt caching, and memory tools. This is well past the point where you can call it a shell around one endpoint.&lt;/p&gt;

&lt;p&gt;What I like here is the sequencing. The project did not start with a magical agent and then backfill the boring layers. It built the capability model, the policy path, the store, and the command surface together.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real story is in the documentation system
&lt;/h2&gt;

&lt;p&gt;If you only read the code, you will understand the runtime. If you read the docs tree, you understand the development method.&lt;/p&gt;

&lt;p&gt;The visible docs include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;AGENTS.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ARCHITECTURE.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/specs/00_SPEC.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/specs/01_XAI_API_CLIENT.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/specs/02_SQLITE_STORE.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/specs/03_APPROVAL_BROKER.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/specs/04_MCP_SERVER.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/design/00_ARCHITECTURE.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/design/01_SQLITE_STATE.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/development/grokrs/01_SPEC.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/development/grokrs/03_IMPLEMENTATION_PLAN.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/development/grokrs/04_XAI_API_IMPLEMENTATION_PLAN.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/development/grokrs/05_AGENT_ORCHESTRATION_PROMPT.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/ops/00_BOOTSTRAP.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/reviews/AI_SLOP_REVIEW_GUIDE.md&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a full process stack. It covers scope, architecture, implementation planning, operations, and review posture.&lt;/p&gt;

&lt;p&gt;I do not think these docs were written as decoration. They function as a control plane for the repo.&lt;/p&gt;

&lt;h3&gt;
  
  
  Specs as execution boundaries
&lt;/h3&gt;

&lt;p&gt;The subsystem specs are already separated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the product spec&lt;/li&gt;
&lt;li&gt;the xAI API client spec&lt;/li&gt;
&lt;li&gt;the SQLite store spec&lt;/li&gt;
&lt;li&gt;the approval broker spec&lt;/li&gt;
&lt;li&gt;the MCP server spec&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That turns requirements into named contract surfaces.&lt;/p&gt;

&lt;p&gt;In an AI-assisted environment, that is a bigger deal than people often admit. If you want parallel work to stay coherent, you need somewhere more durable than a chat thread to define the intended behavior of a subsystem. These spec docs do that job.&lt;/p&gt;

&lt;h3&gt;
  
  
  Review artifacts as first-class outputs
&lt;/h3&gt;

&lt;p&gt;The review tree is even more revealing.&lt;/p&gt;

&lt;p&gt;Under &lt;code&gt;docs/reviews/&lt;/code&gt;, the repo includes named review domains for bootstrap, approval broker, batch extensions, collections management, document search, MCP server, remote MCP tools, responses enrichment, security hardening, SQLite store, TTS API, xAI API client, clippy pedantic cleanup, competitive features, and competitive gap analysis.&lt;/p&gt;

&lt;p&gt;The bootstrap bundle on &lt;code&gt;2026-04-05&lt;/code&gt; contains &lt;code&gt;6&lt;/code&gt; files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CONTRACT_DECLARATION.toml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;EVIDENCE_MATRIX.toml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;IMPLEMENTATION_DAG.toml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;README.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;REVIEW_READINESS.toml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TRACEABILITY.toml&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That combination is doing serious work.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CONTRACT_DECLARATION.toml&lt;/code&gt; states the promise.&lt;br&gt;
&lt;code&gt;IMPLEMENTATION_DAG.toml&lt;/code&gt; structures the work.&lt;br&gt;
&lt;code&gt;TRACEABILITY.toml&lt;/code&gt; ties implementation back to intent.&lt;br&gt;
&lt;code&gt;EVIDENCE_MATRIX.toml&lt;/code&gt; says what proof should exist.&lt;br&gt;
&lt;code&gt;REVIEW_READINESS.toml&lt;/code&gt; says when the artifact set is actually inspectable.&lt;/p&gt;

&lt;p&gt;I think that is the right abstraction. Review is not an afterthought at the end of coding. Reviewability is part of the deliverable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F976czjw5hl19oyqn57jx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F976czjw5hl19oyqn57jx.jpg" alt="Editorial illustration of DAG, evidence, and review artifacts" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why &lt;code&gt;IMPLEMENTATION_DAG.toml&lt;/code&gt; mattered so much
&lt;/h2&gt;

&lt;p&gt;Plenty of teams keep a task list. That is not the same thing.&lt;/p&gt;

&lt;p&gt;A DAG tells you which work units can move in parallel, which ones are blocked, and where integration risk sits. That is exactly what you need when multiple agents or reviewers are touching the same repo.&lt;/p&gt;

&lt;p&gt;This repo has &lt;code&gt;15&lt;/code&gt; implementation DAG files under &lt;code&gt;docs/reviews/&lt;/code&gt;. That tells me the DAG pattern was not a bootstrap stunt. It became part of the operating model.&lt;/p&gt;

&lt;p&gt;I think the big benefits are pretty concrete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;parallel work can move without guesswork about order&lt;/li&gt;
&lt;li&gt;write scopes stay smaller&lt;/li&gt;
&lt;li&gt;review can happen against a declared node instead of a vague feature story&lt;/li&gt;
&lt;li&gt;reintegration gets easier because dependencies stay visible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point matters more than people think. AI agents are good at filling in local structure. They are not naturally good at keeping a whole repo’s execution order in their head unless you give them an artifact that does that for them.&lt;/p&gt;
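&lt;p&gt;The scheduling benefit is easy to show. Given a declared dependency map, the set of unblocked units falls out directly. The node names below are invented, and grokrs' real DAG files are TOML rather than Python dicts:&lt;/p&gt;

```python
# Why a DAG beats a flat task list for parallel work: dependencies
# make the ready set computable instead of guessed. Node names are
# illustrative, not from an actual IMPLEMENTATION_DAG.toml.
deps = {
    "policy-engine": [],
    "tool-registry": [],
    "executor": ["policy-engine", "tool-registry"],
    "cli-commands": ["executor"],
}

def ready(done):
    # A unit is ready when it is not finished and every
    # dependency it declares is finished.
    return sorted(
        unit for unit, needs in deps.items()
        if unit not in done and all(n in done for n in needs)
    )

print(ready(set()))                               # two units can start in parallel
print(ready({"policy-engine", "tool-registry"}))  # now the executor is unblocked
```

&lt;p&gt;An agent handed a ready set like this does not need the whole repo's execution order in its head; the artifact carries it.&lt;/p&gt;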

&lt;h2&gt;
  
  
  This repo used AI agents and kept its shape
&lt;/h2&gt;

&lt;p&gt;The top-level environment gives away a lot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;.aivcs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.claude&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.sqry&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The adjacent &lt;code&gt;dag-toml-templates&lt;/code&gt; repo adds &lt;code&gt;.continue&lt;/code&gt;, &lt;code&gt;.factory&lt;/code&gt;, &lt;code&gt;.windsurf&lt;/code&gt;, and &lt;code&gt;.agents&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is an AI-native build environment. I think that is obvious.&lt;/p&gt;

&lt;p&gt;What is not obvious, and what is worth learning from, is that the repo did not let the agent workflow become the architecture.&lt;/p&gt;

&lt;p&gt;The architecture stayed in crates.&lt;br&gt;
The intent stayed in specs.&lt;br&gt;
The execution order stayed in DAGs.&lt;br&gt;
The proof stayed in evidence and traceability artifacts.&lt;/p&gt;

&lt;p&gt;That changes the role of the model. The model is not only there to generate code. It is expected to leave behind planning state, review state, and proof state too.&lt;/p&gt;

&lt;p&gt;That is a much healthier arrangement.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;sqry&lt;/code&gt; was a good fit for this codebase
&lt;/h2&gt;

&lt;p&gt;I checked the semantic index while reviewing the repo. The snapshot I started from exposed &lt;code&gt;102&lt;/code&gt; files and &lt;code&gt;33,260&lt;/code&gt; indexed symbols in the &lt;code&gt;grokrs&lt;/code&gt; workspace; the &lt;code&gt;2026-04-07&lt;/code&gt; reindex described earlier grew both counts.&lt;/p&gt;

&lt;p&gt;That matters because simple text search is not enough for some of the questions you actually want to ask in a codebase like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;where does policy gating happen&lt;/li&gt;
&lt;li&gt;which commands route through the same transport bridge&lt;/li&gt;
&lt;li&gt;which tools are exposed to the model&lt;/li&gt;
&lt;li&gt;where is session state persisted&lt;/li&gt;
&lt;li&gt;which tests cover a given path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This repo has enough structure that semantic navigation pays for itself. The DAG and review system also make semantic tooling more useful because the work is already decomposed into named slices.&lt;/p&gt;

&lt;h2&gt;
  
  
  The most interesting adjacent repo is &lt;code&gt;dag-toml-templates&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;If &lt;code&gt;grokrs&lt;/code&gt; shows the current operating model, &lt;code&gt;/srv/repos/internal/verivusai-labs/dag-toml-templates&lt;/code&gt; shows where that model is going.&lt;/p&gt;

&lt;p&gt;Its &lt;code&gt;README.md&lt;/code&gt; still presents a canonical versioned release surface for template packages. That matters. The file-based layer is not being discarded.&lt;/p&gt;

&lt;p&gt;At the same time, the research and design docs are explicit about the next move.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;research/DATABASE_REPLACEMENT_RESEARCH.md&lt;/code&gt; frames the problem as &lt;code&gt;TOML DAG Templates → Structured Database&lt;/code&gt;. It evaluates database candidates for the three process-control packages.&lt;/p&gt;

&lt;p&gt;The final ranking puts &lt;code&gt;SurrealDB 3.0&lt;/code&gt; first with &lt;code&gt;100/130&lt;/code&gt;. The recommended architecture keeps database state in SurrealDB while exporting and importing through &lt;code&gt;aivcs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then &lt;code&gt;docs/superpowers/specs/2026-04-06-v2-surrealdb-adoption-design.md&lt;/code&gt; makes the transition explicit. It defines v2 as a &lt;code&gt;SurrealDB-Backed Hybrid Template Pack&lt;/code&gt;. The important word there is &lt;code&gt;Hybrid&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I think that is the correct direction.&lt;/p&gt;

&lt;p&gt;The point is not to throw away TOML. The point is to stop asking static files to act like a live workflow database.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the &lt;code&gt;dagdb&lt;/code&gt; package already proves
&lt;/h3&gt;

&lt;p&gt;The implementation under &lt;code&gt;src/dagdb/&lt;/code&gt; is not theoretical.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;migrate.py&lt;/code&gt; includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;detect_toml_type()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;import_toml_file()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;import_dag()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;import_traceability()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;import_review_readiness()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;export_dag()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means the system can ingest the three major TOML package families into a database model and, for DAGs, reconstruct TOML-compatible output on the way back out.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;history.py&lt;/code&gt; adds &lt;code&gt;unit_history&lt;/code&gt; and &lt;code&gt;edge_history&lt;/code&gt;, plus &lt;code&gt;get_dag_state_at()&lt;/code&gt; for point-in-time reconstruction.&lt;/p&gt;
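&lt;p&gt;Point-in-time reconstruction of that sort reduces to replaying history rows up to a cutoff. A sketch, with a hypothetical row shape rather than the repo's actual schema:&lt;/p&gt;

```python
# Sketch of what a get_dag_state_at()-style function has to do:
# replay history rows up to a timestamp to rebuild unit state.
# Row shape (timestamp, unit, status) is hypothetical.
unit_history = [
    (1, "executor", "pending"),
    (2, "executor", "in_progress"),
    (3, "executor", "done"),
]

def state_at(history, ts):
    state = {}
    # Rows must be sorted by timestamp for the early break to be valid.
    for when, unit, status in history:
        if when > ts:
            break
        state[unit] = status  # later rows override earlier ones
    return state

print(state_at(unit_history, 2))  # {'executor': 'in_progress'}
```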

&lt;p&gt;&lt;code&gt;invariants.py&lt;/code&gt; classifies which invariants can live in the database and which still need application code.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;schema_migration.py&lt;/code&gt; adds &lt;code&gt;schema_migrations&lt;/code&gt;, apply and rollback behavior, and explicit migration-state handling.&lt;/p&gt;

&lt;p&gt;This is the part I find most persuasive. The move from TOML to structured state is not being described only in prose. It is being built as code.&lt;/p&gt;

&lt;h3&gt;
  
  
  The prototype results are honest
&lt;/h3&gt;

&lt;p&gt;The prototype evaluation is one of the better research-to-implementation bridges I have seen in a repo like this.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;research/SURREALDB_PROTOTYPE_EVALUATION.md&lt;/code&gt; reports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;34&lt;/code&gt; total tables and edge tables across DAG, traceability, and review-readiness data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;13&lt;/code&gt; of &lt;code&gt;29&lt;/code&gt; validator checks classified as &lt;code&gt;db_enforced&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;9&lt;/code&gt; of &lt;code&gt;29&lt;/code&gt; classified as &lt;code&gt;query_checkable&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;7&lt;/code&gt; of &lt;code&gt;29&lt;/code&gt; classified as &lt;code&gt;app_required&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;75.9%&lt;/code&gt; combined DB-plus-query coverage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;347&lt;/code&gt; collected tests&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;343&lt;/code&gt; passes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;4&lt;/code&gt; xfails tied to time-travel support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most useful finding in that report is the limitation section. The &lt;code&gt;VERSION&lt;/code&gt; clause is accepted syntactically. It does not actually perform historical time travel in the tested embedded setup. The repo says that plainly and uses explicit history tables as the workaround.&lt;/p&gt;

&lt;p&gt;I trust systems more when they write down their failed assumptions.&lt;/p&gt;

&lt;p&gt;The invariant classification is just as good. The repo does not pretend a database will magically solve graph algorithms. It says computed values like entry points, leaf nodes, critical path, and maximum parallelism still belong in application code.&lt;/p&gt;

&lt;p&gt;That is the split I would want too:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;database for durable state, relations, audit, and constraints&lt;/li&gt;
&lt;li&gt;application for graph algorithms and orchestration logic&lt;/li&gt;
&lt;/ul&gt;
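&lt;p&gt;Those computed values are cheap to express in application code, which is part of why the split is reasonable. A toy version of three of them, over an invented dependency map:&lt;/p&gt;

```python
# The values the repo keeps in application code rather than the
# database: entry points, leaf nodes, maximum parallelism. The
# dependency map is illustrative.
from collections import Counter

deps = {
    "a": [],
    "b": [],
    "c": ["a", "b"],
    "d": ["c"],
}

# Entry points: units with no dependencies.
entries = sorted(u for u, needs in deps.items() if not needs)

# Leaf nodes: units nothing else depends on.
needed = {n for needs in deps.values() for n in needs}
leaves = sorted(u for u in deps if u not in needed)

# Maximum parallelism: the widest topological level.
def level(u, memo={}):
    if u not in memo:
        memo[u] = 0 if not deps[u] else 1 + max(level(n) for n in deps[u])
    return memo[u]

max_parallelism = max(Counter(level(u) for u in deps).values())

print(entries, leaves, max_parallelism)  # ['a', 'b'] ['d'] 2
```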

&lt;h2&gt;
  
  
  What dev.to readers can copy from this repo
&lt;/h2&gt;

&lt;p&gt;I do not think most teams need this exact stack. I do think more teams should steal the shape of it.&lt;/p&gt;

&lt;p&gt;If you are building an AI-heavy internal tool, these are the pieces I would copy first.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Put the safety model in types
&lt;/h3&gt;

&lt;p&gt;Do not leave trust, path safety, and effect handling as loose runtime conventions. Make them visible in the type system and module boundaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Write subsystem specs before parallel AI work starts
&lt;/h3&gt;

&lt;p&gt;You do not need long specs. You do need named contract surfaces. A short subsystem spec beats a hundred lines of prompt history.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Use a DAG when work will be parallel
&lt;/h3&gt;

&lt;p&gt;A task list is fine for one human. A DAG is better when several workers, human or model, are moving at once.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Require proof artifacts, not just diffs
&lt;/h3&gt;

&lt;p&gt;This repo’s contract, evidence, traceability, and readiness files are doing something very practical. They force work to explain itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Keep a human-reviewable export surface
&lt;/h3&gt;

&lt;p&gt;The move in &lt;code&gt;dag-toml-templates&lt;/code&gt; is not file versus database. It is file plus database. I think that is the right model for most engineering systems with both human review and live workflow state.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I think is the real lesson
&lt;/h2&gt;

&lt;p&gt;The interesting result here is not that AI models can write a lot of code quickly. We already know that.&lt;/p&gt;

&lt;p&gt;The interesting result is that a repo can move quickly, use multiple agents, accumulate real functionality, and still stay reviewable if the team is strict about where meaning lives.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;grokrs&lt;/code&gt;, meaning lives in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the crate graph&lt;/li&gt;
&lt;li&gt;the spec docs&lt;/li&gt;
&lt;li&gt;the implementation DAGs&lt;/li&gt;
&lt;li&gt;the evidence and traceability artifacts&lt;/li&gt;
&lt;li&gt;the policy model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why the repo still feels like engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evidence note
&lt;/h2&gt;

&lt;p&gt;I grounded this article in repository-visible evidence available on &lt;code&gt;2026-04-06&lt;/code&gt;, including the root workspace manifest, crate layout, architecture and spec docs, dated review artifacts, command and module structure, semantic index results, and the adjacent &lt;code&gt;dag-toml-templates&lt;/code&gt; design and research documents.&lt;/p&gt;

&lt;p&gt;Some claims about the exact outer AI/VCS orchestration layer remain inference rather than direct confirmation, because those metadata surfaces are suggestive but not fully self-describing on their own.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>ai</category>
      <category>security</category>
      <category>cli</category>
    </item>
    <item>
      <title>How We Used AI Agents to Security-Audit an Open Source Project</title>
      <dc:creator>Verivus OSS Releases</dc:creator>
      <pubDate>Sun, 05 Apr 2026 07:30:20 +0000</pubDate>
      <link>https://dev.to/verivusossreleases/how-we-used-ai-agents-to-security-audit-an-open-source-project-2g41</link>
      <guid>https://dev.to/verivusossreleases/how-we-used-ai-agents-to-security-audit-an-open-source-project-2g41</guid>
      <description>&lt;p&gt;&lt;em&gt;Using sqry's code graph, parallel audit agents, and iterative Codex review to contribute security improvements to gstack.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Garry Tan open-sourced &lt;a href="https://github.com/garrytan/gstack" rel="noopener noreferrer"&gt;gstack&lt;/a&gt; on March 11, 2026. It is a CLI toolkit for Claude Code with a headless browser, Chrome extension, skill system, and telemetry layer. The project attracted 30+ PR authors within its first few weeks.&lt;/p&gt;

&lt;p&gt;We wanted to contribute something useful. Security review seemed like the right fit. A headless browser that spawns subprocesses and handles cookies has a large attack surface, and security work tends to fall to the bottom of every fast-moving project's priority list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you haven't read our earlier posts:&lt;/strong&gt; &lt;a href="https://github.com/verivus-oss/sqry" rel="noopener noreferrer"&gt;sqry&lt;/a&gt; is an AST-based code search tool. It parses code like a compiler, building a graph of functions, classes, imports, and call relationships across 35+ languages. &lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;llm-cli-gateway&lt;/a&gt; orchestrates multiple LLMs (Claude, Codex, Gemini) through a single MCP interface. The &lt;a href="https://medium.com/@wernerk/the-codex-review-gate-how-we-made-ai-agents-review-each-others-work-59e9ff5465f9" rel="noopener noreferrer"&gt;Codex review gate&lt;/a&gt; is our practice of requiring unconditional Codex approval before shipping.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Codebase
&lt;/h2&gt;

&lt;p&gt;At the time of our audit (late March 2026, against the &lt;code&gt;main&lt;/code&gt; branch as of March 30), gstack had about 47,000 symbols across 212 files in TypeScript, JavaScript, HTML, CSS, Shell, Ruby, JSON, and SQL. The browse subsystem's &lt;code&gt;handleWriteCommand&lt;/code&gt; function was roughly 715 lines with a complexity score of 58. The Chrome extension injects into every page the user visits. The sidebar agent spawns Claude subprocesses from a JSONL queue file.&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;grep "exec"&lt;/code&gt; on this codebase returns 60+ matches. None of them look obviously wrong. Security review requires understanding relationships between functions, not just finding keywords.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why grep Falls Short
&lt;/h2&gt;

&lt;p&gt;In a &lt;a href="https://medium.com/@wernerk/the-code-question-grep-cant-answer-057bfc8d7fe2" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, I described why structural code search matters for this kind of work.&lt;/p&gt;

&lt;p&gt;Say you want to find every path from user input to a dangerous sink like &lt;code&gt;Bun.spawn()&lt;/code&gt;. grep finds the spawn calls. It does not tell you which functions call those functions, which HTTP endpoints call &lt;em&gt;those&lt;/em&gt; functions, or whether any validation sits between the endpoint and the spawn.&lt;/p&gt;

&lt;p&gt;sqry made this practical. For gstack, it built a graph of 46,837 nodes and 39,083 edges in 280ms. With all 36 language plugins enabled (including high-cost plugins like JSON and ServiceNow XML), the full graph captures 55,365 raw edges across 212 files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;sqry index . --force --include-high-cost

Files indexed:  212
Symbols:        46,837
Edges:          39,083 canonical (55,365 raw)
Plugins:        36 active
Build time:     280ms
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
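&lt;p&gt;The question a call graph answers can be sketched as a path search from an entry point to a sink. The edges below are invented for illustration; sqry derives the real ones from the AST:&lt;/p&gt;

```python
# The question grep cannot answer: which paths lead from an HTTP
# endpoint to a dangerous sink, and through which functions?
# Edges here are invented; sqry builds the real graph from the AST.
calls = {
    "POST /write": ["handleWriteCommand"],
    "handleWriteCommand": ["resolvePath", "runHelper"],
    "runHelper": ["Bun.spawn"],
    "resolvePath": [],
}

def paths_to(graph, start, sink, trail=None):
    # Depth-first enumeration of call paths. No cycle guard is
    # needed for this acyclic toy graph; a real tool tracks visits.
    trail = (trail or []) + [start]
    if start == sink:
        yield trail
        return
    for callee in graph.get(start, []):
        yield from paths_to(graph, callee, sink, trail)

for path in paths_to(calls, "POST /write", "Bun.spawn"):
    print(" -> ".join(path))
```

&lt;p&gt;Each printed path is a concrete input-to-sink route a reviewer can check for validation, which is exactly the relationship a keyword search cannot surface.&lt;/p&gt;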



&lt;h2&gt;
  
  
  Round 1: 10 Findings, 3 LLMs
&lt;/h2&gt;

&lt;p&gt;Our first audit in March used three LLMs in separate roles. Claude and Codex each independently found overlapping but non-identical sets of issues. Gemini then verified all findings against source code. The total was 10 unique security findings across gstack's browse server, Chrome extension, design CLI, and telemetry layer. We submitted &lt;a href="https://github.com/garrytan/gstack/pull/664" rel="noopener noreferrer"&gt;PR #664&lt;/a&gt; with fixes and filed 10 public security issues (#665-#670, #672-#675). We disclosed publicly because gstack is a developer tool running locally, not a production service handling user data — the risk profile favors transparency over coordinated disclosure.&lt;/p&gt;

&lt;p&gt;What gave us confidence these were real: three other contributors (stedfn, Gonzih, and mehmoodosman) independently found at least 6 of the same issues through separate analysis. Based on the public timeline, their PRs were filed after our issues and showed no references to our reports, suggesting independent discovery. Convergence from different methods and different people is strong validation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Round 2: 20 More Findings
&lt;/h2&gt;

&lt;p&gt;For the second audit, we expanded the approach. We dispatched 4 parallel audit agents instead of manually querying sqry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent 1&lt;/strong&gt;: &lt;code&gt;server.ts&lt;/code&gt;, covering HTTP endpoints, auth, and CORS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent 2&lt;/strong&gt;: &lt;code&gt;write-commands.ts&lt;/code&gt;, the highest-complexity function, covering file ops and cookie handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent 3&lt;/strong&gt;: &lt;code&gt;meta-commands.ts&lt;/code&gt;, covering command parsing, state management, and frame targeting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent 4&lt;/strong&gt;: &lt;code&gt;extension/&lt;/code&gt;, covering the Chrome extension sidepanel, inspector, and background worker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each agent had full sqry MCP access with instructions to look for issues beyond the 10 we had already reported. They returned 25 raw findings. After cross-referencing against 20+ existing community issues and the maintainer's own security work (he had already landed two security-focused PRs), 16 were new. Four more gaps turned up during implementation review. The severity classifications below are ours, based on our assessment of impact and prerequisites — the maintainer may classify them differently.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Subtle but Serious Finding
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# bin/gstack-learnings-search, lines 46-52&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FILES&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null | bun &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"
const type = '&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TYPE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;';
const query = '&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;QUERY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;'.toLowerCase();
const limit = &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;LIMIT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;;
const slug = '&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SLUG&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;';
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bash variables are interpolated directly into JavaScript string literals via &lt;code&gt;bun -e&lt;/code&gt;. A branch name containing a single quote, like &lt;code&gt;fix'; process.exit(1); //&lt;/code&gt;, would break out of the JS string and execute arbitrary code. Easy to write, hard to spot in review.&lt;/p&gt;

&lt;p&gt;The fix: pass parameters via environment variables instead of string interpolation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FILES&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;GSTACK_FILTER_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TYPE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;GSTACK_FILTER_QUERY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$QUERY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;GSTACK_FILTER_LIMIT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LIMIT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bun &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"
const type = process.env.GSTACK_FILTER_TYPE || '';
const query = (process.env.GSTACK_FILTER_QUERY || '').toLowerCase();
const limit = parseInt(process.env.GSTACK_FILTER_LIMIT || '10', 10) || 10;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Environment variables are never interpreted as code. The injection vector disappears.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Finding sqry Made Possible
&lt;/h3&gt;

&lt;p&gt;sqry's &lt;code&gt;find_cycles&lt;/code&gt; tool detected a mutual recursion between &lt;code&gt;switchChatTab&lt;/code&gt; and &lt;code&gt;pollChat&lt;/code&gt; in the Chrome extension's sidepanel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;switchChatTab -&amp;gt; pollChat -&amp;gt; switchChatTab (cycle depth: 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;pollChat&lt;/code&gt; fetches the server's active tab ID. If it differs from the client's, it calls &lt;code&gt;switchChatTab&lt;/code&gt;. &lt;code&gt;switchChatTab&lt;/code&gt; sets state and immediately calls &lt;code&gt;pollChat&lt;/code&gt;. If the server keeps returning a different tab ID during rapid switching, this creates unbounded stack recursion.&lt;/p&gt;

&lt;p&gt;grep alone will not reveal this relationship. The bug lives in the interaction between two functions, and that interaction only becomes visible in the call graph.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Full List
&lt;/h3&gt;

&lt;p&gt;We classified findings on a four-level scale: HIGH means an attacker can execute arbitrary code or exfiltrate data with minimal prerequisites. MED-HIGH means significant impact that requires local access or a specific precondition. MED means the issue requires local access or specific conditions, or produces limited impact. LOW covers hardening gaps and defense-in-depth improvements.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;Finding&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Shell injection via bash-to-JS interpolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;MED-HIGH&lt;/td&gt;
&lt;td&gt;Queue file permissions allow local prompt injection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/health&lt;/code&gt; endpoint exposes user activity without auth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;ReDoS via &lt;code&gt;new RegExp(userInput)&lt;/code&gt; in frame targeting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;chain&lt;/code&gt; command bypasses watch-mode write guard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cookie-import&lt;/code&gt; allows cross-domain cookie planting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;CSS values unvalidated at 4 injection points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;Session directory traversal via crafted &lt;code&gt;active.json&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;responsive&lt;/code&gt; screenshots skip path validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;validateOutputPath&lt;/code&gt; uses &lt;code&gt;path.resolve&lt;/code&gt;, not &lt;code&gt;realpathSync&lt;/code&gt;*&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;state load&lt;/code&gt; navigates to unvalidated URLs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;DOM serialization round-trip enables XSS on tab switch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;switchChatTab&lt;/code&gt;/&lt;code&gt;pollChat&lt;/code&gt; mutual recursion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cookie-import-browser --domain&lt;/code&gt; accepts unvalidated input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15-20&lt;/td&gt;
&lt;td&gt;LOW&lt;/td&gt;
&lt;td&gt;Info disclosure, timeout handling, bounds validation, prompt injection surface&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;*Finding 10 is a common pattern worth highlighting:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// BEFORE: resolves logically, symlinks pass through&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resolved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// /tmp/safe -&amp;gt; still "/tmp/safe"&lt;/span&gt;

&lt;span class="c1"&gt;// AFTER: resolves physically, symlinks followed to real target&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resolved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;realpathSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// /tmp/safe -&amp;gt; "/etc/shadow" (blocked!)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A symlink at &lt;code&gt;/tmp/safe&lt;/code&gt; pointing to &lt;code&gt;/etc&lt;/code&gt; would pass &lt;code&gt;path.resolve&lt;/code&gt; validation but fail &lt;code&gt;realpathSync&lt;/code&gt;, because the real path is outside the safe directory.&lt;/p&gt;
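&lt;p&gt;A hedged sketch of the corrected check. &lt;code&gt;realpathSync&lt;/code&gt; throws on paths that do not exist yet, so this version resolves the existing parent directory and re-attaches the file name; the function name follows the article, but the body is an illustration, not gstack's actual code:&lt;/p&gt;

```typescript
import { realpathSync } from "node:fs";
import path from "node:path";

// Illustrative validateOutputPath using physical resolution.
// The safeDirs allowlist parameter is an assumption for this sketch.
function validateOutputPath(filePath: string, safeDirs: string[]): string {
  // Resolve the existing parent directory physically, following symlinks,
  // then re-attach the file name (the output file may not exist yet).
  const dir = realpathSync(path.dirname(path.resolve(filePath)));
  const resolved = path.join(dir, path.basename(filePath));
  const ok = safeDirs.some((safe) => {
    const safeReal = realpathSync(safe); // normalize the allowlist too
    return resolved === safeReal || resolved.startsWith(safeReal + path.sep);
  });
  if (!ok) {
    throw new Error(`refusing to write outside safe directories: ${resolved}`);
  }
  return resolved;
}
```

&lt;p&gt;Normalizing both sides through &lt;code&gt;realpathSync&lt;/code&gt; also sidesteps the macOS &lt;code&gt;/tmp&lt;/code&gt; symlink problem discussed later in this post.&lt;/p&gt;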

&lt;h2&gt;
  
  
  The Codex Review Gate
&lt;/h2&gt;

&lt;p&gt;In a &lt;a href="https://medium.com/@wernerk/the-codex-review-gate-how-we-made-ai-agents-review-each-others-work-59e9ff5465f9" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, I described how we use Codex as a mandatory review gate. Unconditional approval or the work does not ship. Codex earned this role through specificity. Where a generic reviewer might say "consider improving error handling," Codex pinpoints "the catch block on line 47 swallows errors silently." It also has a low false-positive rate, which keeps the gate credible over time.&lt;/p&gt;

&lt;p&gt;For this security plan, the work went through &lt;strong&gt;9 rounds&lt;/strong&gt; of Codex review before approval. That says more about our work than about the tool. Three examples of what it caught:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Round 2&lt;/strong&gt;: Our queue validator used &lt;code&gt;string&lt;/code&gt; for &lt;code&gt;tabId&lt;/code&gt; when the actual writer emits &lt;code&gt;number&lt;/code&gt;. A type mismatch that would have caused the validator to reject every real queue entry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Round 5&lt;/strong&gt;: &lt;code&gt;null&lt;/code&gt; values (which the real writer produces for optional fields) would be rejected by our schema. The validator was correct in theory but wrong against the actual data format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Round 8&lt;/strong&gt;: Our test extracted a 1500-character slice from the source file to validate against. That slice bled into adjacent functions, meaning the test could pass even without the fix being applied. The final solution: a brace-walking function body extractor that isolates exactly the target function.&lt;/li&gt;
&lt;/ul&gt;
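&lt;p&gt;The round 8 fix can be sketched as a small brace counter. This is an illustrative reconstruction, not the test helper's actual code, and it deliberately ignores braces inside strings and comments:&lt;/p&gt;

```typescript
// Brace-walking extractor: isolate exactly one function body instead of
// taking a fixed-size slice that can bleed into adjacent functions.
function extractFunctionBody(source: string, name: string): string {
  const start = source.indexOf(`function ${name}`);
  if (start === -1) throw new Error(`function ${name} not found`);
  const open = source.indexOf("{", start);
  if (open === -1) throw new Error(`no body found for ${name}`);
  let depth = 0;
  for (let i = open; i < source.length; i++) {
    if (source[i] === "{") depth++;
    else if (source[i] === "}") {
      depth--;
      if (depth === 0) return source.slice(open, i + 1); // matched close
    }
  }
  throw new Error(`unbalanced braces in ${name}`);
}
```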

&lt;p&gt;Each round made the plan more precise. The full 9-round breakdown is in the &lt;a href="https://github.com/garrytan/gstack/pull/806" rel="noopener noreferrer"&gt;PR #806&lt;/a&gt; discussion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: Subagent-Driven Development
&lt;/h2&gt;

&lt;p&gt;With an approved plan, we dispatched one implementation subagent per task, 18 tasks total. Each subagent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the specific source files&lt;/li&gt;
&lt;li&gt;Created failing tests&lt;/li&gt;
&lt;li&gt;Implemented the fix&lt;/li&gt;
&lt;li&gt;Verified tests pass&lt;/li&gt;
&lt;li&gt;Committed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A mid-implementation code review by a separate review agent caught 4 additional gaps we had missed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;applyStyle&lt;/code&gt; in the extension was missing the same CSS validation added to 3 other injection points&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;snapshot.ts&lt;/code&gt; still used the old &lt;code&gt;path.resolve&lt;/code&gt; pattern&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stateFile&lt;/code&gt; in queue entries had no path traversal check&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cookie-import&lt;/code&gt;'s read path validation used the old pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All fixed before continuing. That is why you review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Results
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Security regression tests: 119 pass, 0 fail [47ms]
E2E evals (Docker + Chromium): 33 pass, 0 regressions
Previously-failing browse tests: all 3 now pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The E2E evals ran inside a Docker container (Ubuntu 24.04, Chromium 145, Playwright 1.58.2, &lt;code&gt;--cap-add SYS_ADMIN&lt;/code&gt; for the Chromium sandbox). One test outside the security suite (&lt;code&gt;qa-bootstrap&lt;/code&gt;) failed due to test infrastructure — it is not included in the 33 count above.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Landed
&lt;/h2&gt;

&lt;p&gt;On April 6, the maintainer cherry-picked both our first round (&lt;a href="https://github.com/garrytan/gstack/pull/664" rel="noopener noreferrer"&gt;PR #664&lt;/a&gt;) and second round (&lt;a href="https://github.com/garrytan/gstack/pull/806" rel="noopener noreferrer"&gt;PR #806&lt;/a&gt;) onto the &lt;code&gt;garrytan/security-wave-5&lt;/code&gt; branch with co-author credit. They are part of &lt;a href="https://github.com/garrytan/gstack/pull/847" rel="noopener noreferrer"&gt;PR #847&lt;/a&gt;, which bundles fixes from 8 community PRs across 4 contributors. That PR is open and under review at time of writing.&lt;/p&gt;

&lt;p&gt;This did not happen immediately. On April 5, the maintainer merged &lt;a href="https://github.com/garrytan/gstack/pull/810" rel="noopener noreferrer"&gt;PR #810&lt;/a&gt; ("security wave 1"), which cherry-picked fixes from Gonzih and garagon — contributors who had independently found several of the same issues we reported in our round 1 issues (#665-#670, #672-#675), filed on March 30. At that point our PRs were still open without comment.&lt;/p&gt;

&lt;p&gt;We flagged four gaps in that initial wave:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;validateOutputPath&lt;/code&gt; was only fixed in one of three copies.&lt;/strong&gt; The identical vulnerable function in &lt;code&gt;meta-commands.ts&lt;/code&gt; and inline validation in &lt;code&gt;snapshot.ts&lt;/code&gt; still used &lt;code&gt;path.resolve&lt;/code&gt; without &lt;code&gt;realpathSync&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The fix broke on macOS.&lt;/strong&gt; &lt;code&gt;SAFE_DIRECTORIES&lt;/code&gt; contained &lt;code&gt;/tmp&lt;/code&gt;, but on macOS &lt;code&gt;/tmp&lt;/code&gt; is a symlink to &lt;code&gt;/private/tmp&lt;/code&gt;. &lt;code&gt;realpathSync&lt;/code&gt; resolves through it, causing legitimate screenshots to be rejected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No queue entry schema validation.&lt;/strong&gt; File permissions were added, but queue entry contents were not validated against type checks or path traversal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/health&lt;/code&gt; still leaked user activity.&lt;/strong&gt; The unauthenticated response returned the user's current URL and sidebar AI message text.&lt;/li&gt;
&lt;/ol&gt;
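&lt;p&gt;Gap 3 is the kind of check that is cheap to add. A hypothetical queue entry validator, reflecting the Codex findings from earlier in this post (&lt;code&gt;tabId&lt;/code&gt; is a number, optional fields may be &lt;code&gt;null&lt;/code&gt;) and rejecting path traversal in &lt;code&gt;stateFile&lt;/code&gt;; the schema itself is an assumption, not gstack's actual format:&lt;/p&gt;

```typescript
// Hypothetical queue entry schema. Field names follow the article;
// the exact shape and rules are illustrative.
interface QueueEntry {
  tabId: number; // the real writer emits a number, not a string
  command: string;
  stateFile?: string | null; // optional fields may legitimately be null
}

function validateQueueEntry(raw: unknown): QueueEntry {
  if (typeof raw !== "object" || raw === null) throw new Error("not an object");
  const e = raw as Record<string, unknown>;
  if (typeof e.tabId !== "number") throw new Error("tabId must be a number");
  if (typeof e.command !== "string") throw new Error("command must be a string");
  if (e.stateFile !== undefined && e.stateFile !== null) {
    if (typeof e.stateFile !== "string") throw new Error("stateFile must be a string");
    // reject traversal and absolute paths in queue-supplied file names
    // (treating absolute paths as invalid is our assumption for this sketch)
    if (e.stateFile.includes("..") || e.stateFile.startsWith("/")) {
      throw new Error("stateFile must be a relative path without traversal");
    }
  }
  return e as unknown as QueueEntry;
}
```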

&lt;p&gt;All four gaps are addressed in the security wave 5 PR. The maintainer included garagon's #820 (symlink resolution in meta-commands), our queue validation and &lt;code&gt;/health&lt;/code&gt; fixes from #806, and the full set of CSS injection guards, cookie domain validation, reentrancy guards, and SIGKILL escalation across both our rounds.&lt;/p&gt;

&lt;p&gt;The PR summary lists 20 security fixes with 750+ lines of new regression tests, attributed jointly to "@mr-k-man, @garagon." Most of those 20 fixes came from our two PRs (#664 and #806). garagon contributed three — shell injection env vars (#819), meta-commands symlink resolution (#820), and upload path validation (#821) — two of which address issues we originally reported. The commit history in #847 shows separate cherry-picks for each source PR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The timeline is common in open source security work.&lt;/strong&gt; We filed issues and PRs on March 30. Other contributors independently found overlapping issues. The maintainer triaged and cherry-picked fixes in waves over 7 days, starting with the most urgent. Our work was picked up last but included completely, with co-author attribution. Thorough reports with working patches tend to get recognized, even when the initial response is silence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Toolkit
&lt;/h2&gt;

&lt;p&gt;Everything described here uses two open-source tools:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/verivus-oss/sqry" rel="noopener noreferrer"&gt;sqry&lt;/a&gt;&lt;/strong&gt;: AST-based semantic code search. Builds a graph of symbols and relationships across 35+ languages. Exposes 34 MCP tools for AI agents to navigate code structurally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;llm-cli-gateway&lt;/a&gt;&lt;/strong&gt;: Multi-LLM orchestration via MCP. Routes requests through Claude, Codex, and Gemini with session continuity, async job management, and approval gates.&lt;/p&gt;

&lt;p&gt;Both are MIT-licensed. sqry runs entirely locally. llm-cli-gateway runs locally but routes requests to remote LLM APIs (Claude, Codex, Gemini).&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Independent convergence validates methodology.&lt;/strong&gt; When other contributors find the same issues through completely different methods, you can trust the results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rigorous review improves your own work most of all.&lt;/strong&gt; 9 rounds of Codex review sounds like a lot. It was. Every round caught something real. The discipline of submitting to review, and actually fixing what is found, is where the quality comes from.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structural search finds what text search misses.&lt;/strong&gt; The &lt;code&gt;switchChatTab&lt;/code&gt;/&lt;code&gt;pollChat&lt;/code&gt; recursion, the &lt;code&gt;validateOutputPath&lt;/code&gt; symlink bypass, the CSS injection across 4 separate code paths — these are relationship issues. Understanding code structure is different from searching code text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security review is a good way to serve the open source community.&lt;/strong&gt; Every maintainer has more feature requests than they can handle. A thorough security review with fixes, tests, and documentation is work that helps everyone who uses the project. We are grateful gstack is open source and that we could contribute.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The full security audit report, implementation plan, and all test results are in &lt;a href="https://github.com/garrytan/gstack/pull/806" rel="noopener noreferrer"&gt;PR #806&lt;/a&gt;. The round 1 report is in &lt;a href="https://github.com/garrytan/gstack/pull/664" rel="noopener noreferrer"&gt;PR #664&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;sqry: &lt;a href="https://github.com/verivus-oss/sqry" rel="noopener noreferrer"&gt;github.com/verivus-oss/sqry&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;llm-cli-gateway: &lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;github.com/verivus-oss/llm-cli-gateway&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>opensource</category>
      <category>ai</category>
      <category>typescript</category>
    </item>
    <item>
      <title>How to Set Up Multi-LLM Code Review with Claude, Codex, and Gemini</title>
      <dc:creator>Verivus OSS Releases</dc:creator>
      <pubDate>Tue, 31 Mar 2026 02:55:56 +0000</pubDate>
      <link>https://dev.to/verivusossreleases/how-to-set-up-multi-llm-code-review-with-claude-codex-and-gemini-5d8h</link>
      <guid>https://dev.to/verivusossreleases/how-to-set-up-multi-llm-code-review-with-claude-codex-and-gemini-5d8h</guid>
      <description>&lt;p&gt;Every LLM has blind spots. Claude is strong on architecture and design patterns. Codex catches logic bugs and missing error handling. Gemini is thorough on security issues and edge cases. Using just one reviewer means you are only getting one perspective.&lt;/p&gt;

&lt;p&gt;This tutorial walks through setting up &lt;strong&gt;llm-cli-gateway&lt;/strong&gt; -- an MCP server that wraps the Claude Code, Codex, and Gemini CLIs -- and running a parallel code review that combines all three.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;You need the CLI tools installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Claude Code&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @anthropic-ai/claude-code

&lt;span class="c"&gt;# Codex&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @openai/codex
codex login

&lt;span class="c"&gt;# Gemini&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You do not need all three. The gateway works with whichever CLIs you have installed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Install the Gateway
&lt;/h2&gt;

&lt;p&gt;Add it to your MCP client configuration. If you use Claude Code, edit &lt;code&gt;~/.claude/settings.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"llm-gateway"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llm-cli-gateway"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire setup. The gateway discovers your installed CLIs automatically via PATH resolution (including &lt;code&gt;~/.local/bin&lt;/code&gt; and NVM paths).&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Verify Your Setup
&lt;/h2&gt;

&lt;p&gt;Once connected, confirm which CLIs are available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;list_models&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns the available models for each detected CLI. If a CLI is not installed, it will not appear in the output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Run a Parallel Code Review
&lt;/h2&gt;

&lt;p&gt;Here is the core workflow. You send the same codebase to all three LLMs, each with a prompt tuned to its strengths.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude -- Architecture and Quality
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;claude_request(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Review the changes in src/auth/ for architecture, design patterns, maintainability, and documentation gaps. Read the files directly. Provide specific line numbers and suggested fixes."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"optimizePrompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"optimizeResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Codex -- Logic and Correctness
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;codex_request(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Analyze src/auth/ for logic bugs, off-by-one errors, missing error handling, race conditions, and test coverage gaps. Read the files directly. Rate each finding: critical, high, medium, or low."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fullAuto"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"optimizePrompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"optimizeResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Gemini -- Security and Edge Cases
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;gemini_request(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Security audit of src/auth/: check for injection vulnerabilities, authentication bypasses, data leaks, OWASP Top 10 violations, and crash-causing edge cases. Read the files directly."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gemini-2.5-pro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"optimizePrompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"optimizeResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In an MCP client like Claude Code, you can fire all three of these as parallel tool calls in a single turn.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Handle Long-Running Reviews
&lt;/h2&gt;

&lt;p&gt;Code reviews on large files can take over a minute. The gateway handles this transparently.&lt;/p&gt;

&lt;p&gt;Any sync request that exceeds 45 seconds automatically becomes an async job. Instead of timing out, you get back a job reference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deferred"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"jobId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc-123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Running in background. Poll with llm_job_status."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;llm_job_status(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"jobId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc-123"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When status is &lt;code&gt;completed&lt;/code&gt;, fetch the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;llm_job_result(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"jobId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc-123"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a review is stuck, cancel it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;llm_job_cancel(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"jobId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc-123"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
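&lt;p&gt;If you script against the gateway rather than calling the tools interactively, the deferred-job pattern reduces to a poll loop. A sketch with the status and result calls injected as plain functions so the loop is testable; &lt;code&gt;waitForJob&lt;/code&gt; and the status strings beyond &lt;code&gt;completed&lt;/code&gt; are our assumptions, not part of the gateway:&lt;/p&gt;

```typescript
// Poll a deferred job until it completes, then fetch the result.
// status/result stand in for llm_job_status / llm_job_result calls.
async function waitForJob<T>(
  jobId: string,
  status: (id: string) => Promise<string>,
  result: (id: string) => Promise<T>,
  intervalMs = 2000,
  maxAttempts = 60,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const s = await status(jobId);
    if (s === "completed") return result(jobId); // done: fetch the review
    if (s === "failed" || s === "cancelled") throw new Error(`job ${jobId}: ${s}`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs)); // still running
  }
  throw new Error(`job ${jobId} did not finish after ${maxAttempts} polls`);
}
```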



&lt;h2&gt;
  
  
  Step 5: Synthesize the Results
&lt;/h2&gt;

&lt;p&gt;Once all three reviews come back, combine them. Here is a structured approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Deduplicate.&lt;/strong&gt; Multiple LLMs will often flag the same issue. Merge these and note which LLMs agreed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prioritize.&lt;/strong&gt; Critical findings first, then high, medium, low. If two or more LLMs flag the same thing as critical, it almost certainly is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-validate unique findings.&lt;/strong&gt; When only one LLM flags something, verify it before acting on it. In our experience, security findings unique to Gemini are usually real, while style complaints raised by a single LLM are usually noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Categorize.&lt;/strong&gt; Group by Security, Correctness, Performance, and Maintainability.&lt;/li&gt;
&lt;/ol&gt;
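&lt;p&gt;The deduplicate-and-prioritize steps are mechanical enough to script. A rough sketch follows; the finding shape &lt;code&gt;{ key, severity, reviewer }&lt;/code&gt; is an assumption for illustration, not a gateway output format:&lt;/p&gt;

```javascript
// Merge findings from several reviewers: deduplicate by issue key,
// record which reviewers agreed, and sort by severity.
// The finding shape { key, severity, reviewer } is illustrative only.
const RANK = { critical: 0, high: 1, medium: 2, low: 3 };

function synthesize(findings) {
  const merged = new Map();
  for (const f of findings) {
    const existing = merged.get(f.key);
    if (existing) {
      existing.reviewers.push(f.reviewer);
      // Keep the most severe rating any reviewer assigned.
      if (RANK[existing.severity] > RANK[f.severity]) existing.severity = f.severity;
    } else {
      merged.set(f.key, { key: f.key, severity: f.severity, reviewers: [f.reviewer] });
    }
  }
  // Critical first; within a severity band, more reviewer agreement first.
  return Array.from(merged.values()).sort((a, b) => {
    if (RANK[a.severity] !== RANK[b.severity]) return RANK[a.severity] - RANK[b.severity];
    return b.reviewers.length - a.reviewers.length;
  });
}
```

&lt;p&gt;Sorting by agreement within a severity band surfaces the "two or more LLMs flagged this" cases first, matching the prioritization rule above.&lt;/p&gt;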

&lt;p&gt;The synthesized summary should look something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Code Review Summary&lt;/span&gt;

&lt;span class="gu"&gt;### Critical (must fix)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; SQL injection in login handler (line 47) -- found by Gemini, confirmed by Codex

&lt;span class="gu"&gt;### High&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Missing error handling on token refresh (line 112) -- found by Codex
&lt;span class="p"&gt;-&lt;/span&gt; Session fixation vulnerability (line 89) -- found by Gemini

&lt;span class="gu"&gt;### Medium&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Duplicated validation logic across handlers -- found by Claude
&lt;span class="p"&gt;-&lt;/span&gt; No rate limiting on auth endpoints -- found by Gemini, noted by Claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 6: Fix and Verify
&lt;/h2&gt;

&lt;p&gt;Send the consolidated findings back through Codex for fixes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;codex_request(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Fix the following issues in src/auth/:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;1. [Critical] SQL injection in login handler, line 47 - use parameterized queries&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;2. [High] Missing error handling on token refresh, line 112&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;3. [High] Session fixation vulnerability, line 89 - regenerate session on login&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Apply fixes and update tests."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fullAuto"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"optimizePrompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run your test suite. If the tests pass, you have completed a review cycle that caught issues no single LLM would have surfaced on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Sessions for Multi-Turn Reviews
&lt;/h2&gt;

&lt;p&gt;For larger reviews that require back-and-forth, create sessions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;session_create(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cli"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Auth module review"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"setAsActive"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Subsequent &lt;code&gt;claude_request&lt;/code&gt; calls with &lt;code&gt;continueSession: true&lt;/code&gt; will use the Claude CLI's &lt;code&gt;--continue&lt;/code&gt; flag, maintaining real conversation context. Gemini sessions use &lt;code&gt;--resume&lt;/code&gt; for the same effect.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;claude_request(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Look at the token refresh logic more carefully. Is the retry backoff correct?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"continueSession"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Optional: Approval Gates
&lt;/h2&gt;

&lt;p&gt;For high-risk operations, enable approval gates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;codex_request(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Refactor the authentication module"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fullAuto"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"approvalStrategy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp_managed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"approvalPolicy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"strict"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gateway scores the operation's risk and records an approval decision before execution. Review past decisions with &lt;code&gt;approval_list()&lt;/code&gt;.&lt;/p&gt;
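&lt;p&gt;The scoring itself is internal to the gateway, but the pattern is easy to picture: score the operation, compare against the policy's threshold, and record the decision. A toy illustration of that pattern, not the gateway's actual logic:&lt;/p&gt;

```javascript
// Toy approval gate: score an operation and record the decision.
// Not llm-cli-gateway's implementation; purely illustrative.
const decisions = [];

function approve(operation, policy) {
  // Crude risk score: destructive verbs and unattended execution raise it.
  let risk = 0;
  if (/delete|drop|rm |force/i.test(operation.prompt)) risk += 2;
  if (operation.fullAuto) risk += 1;
  // A strict policy tolerates almost no risk.
  const threshold = policy === "strict" ? 1 : 3;
  const approved = threshold > risk;
  decisions.push({ prompt: operation.prompt, risk, approved });
  return approved;
}
```

&lt;p&gt;The &lt;code&gt;decisions&lt;/code&gt; log plays the role that &lt;code&gt;approval_list()&lt;/code&gt; plays in the gateway: an audit trail you can review after the fact.&lt;/p&gt;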

&lt;h2&gt;
  
  
  What This Is (and Is Not)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;llm-cli-gateway wraps CLI binaries, not APIs.&lt;/strong&gt; It spawns &lt;code&gt;claude&lt;/code&gt;, &lt;code&gt;codex&lt;/code&gt;, and &lt;code&gt;gemini&lt;/code&gt; as child processes. You get the full CLI experience -- tool use, sandboxing, file access, your existing authentication and billing. There is no API key to configure for the gateway itself.&lt;/p&gt;

&lt;p&gt;This also means it is not an API proxy in the LiteLLM mold: it cannot run in a cloud environment unless the CLIs are installed there. It is designed for local development machines where you already have these tools.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Without consultation, plans are frustrated, but with many counselors they succeed."&lt;/em&gt; -- Proverbs 15:22&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;npm:&lt;/strong&gt; &lt;a href="https://npmjs.com/package/llm-cli-gateway" rel="noopener noreferrer"&gt;llm-cli-gateway&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;verivus-oss/llm-cli-gateway&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Registry:&lt;/strong&gt; &lt;code&gt;io.github.verivus-oss/llm-cli-gateway&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT, by &lt;a href="https://github.com/verivus-oss" rel="noopener noreferrer"&gt;VerivusAI Labs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>mcp</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
