<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Verivus OSS Releases</title>
    <description>The latest articles on DEV Community by Verivus OSS Releases (@verivusossreleases).</description>
    <link>https://dev.to/verivusossreleases</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3852608%2F7b1cf6ad-4264-4bcc-9d3f-93c157d2ef2b.png</url>
      <title>DEV Community: Verivus OSS Releases</title>
      <link>https://dev.to/verivusossreleases</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/verivusossreleases"/>
    <language>en</language>
    <item>
      <title>Multiple Agents, Multiple Workstreams, and the Parts That Still Break</title>
      <dc:creator>Verivus OSS Releases</dc:creator>
      <pubDate>Thu, 16 Apr 2026 08:12:42 +0000</pubDate>
      <link>https://dev.to/verivusossreleases/multiple-agents-multiple-workstreams-and-the-parts-that-still-break-43ep</link>
      <guid>https://dev.to/verivusossreleases/multiple-agents-multiple-workstreams-and-the-parts-that-still-break-43ep</guid>
      <description>&lt;h1&gt;
  
  
  Multiple Agents, Multiple Workstreams, and the Parts That Still Break
&lt;/h1&gt;

&lt;p&gt;I think the current debate around coding agents gets flattened too quickly.&lt;/p&gt;

&lt;p&gt;One side says multiple agents are already here. Separate worktrees, specialized roles, parallel streams of work, and a measurable boost in throughput. The other side says a lot of these systems still over-promise, stall, and leave too much coordination work on the human operator.&lt;/p&gt;

&lt;p&gt;After looking at our own repo activity and fixing a real compatibility break in &lt;code&gt;grokrs&lt;/code&gt;, I think both sides are seeing something real.&lt;/p&gt;

&lt;p&gt;The leverage is real.&lt;/p&gt;

&lt;p&gt;The fragility is real too.&lt;/p&gt;

&lt;h2&gt;
  
  
  The weak version of the debate is already over
&lt;/h2&gt;

&lt;p&gt;The weakest version of the debate is whether multiple agents and multiple workstreams can run at the same time at all. In our environment, they clearly can.&lt;/p&gt;

&lt;p&gt;I checked recent repo activity across &lt;code&gt;/srv/repos/internal/verivusai-labs&lt;/code&gt; and &lt;code&gt;/srv/repos/public&lt;/code&gt; and looked for same-hour overlap in both repository work and agent-specific metadata directories.&lt;/p&gt;

&lt;p&gt;Here is the short version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;41&lt;/code&gt; git repos scanned&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;25&lt;/code&gt; with activity in the last four weeks&lt;/li&gt;
&lt;li&gt;busiest days: &lt;code&gt;2026-03-21&lt;/code&gt; and &lt;code&gt;2026-04-05&lt;/code&gt;, both with &lt;code&gt;7&lt;/code&gt; active repos&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There were also clear same-hour overlaps between agents and repos:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Ghost&lt;/code&gt; on &lt;code&gt;2026-03-30&lt;/code&gt;: Claude, Codex, and Cursor active in the same hour&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;arctos&lt;/code&gt; on &lt;code&gt;2026-04-02&lt;/code&gt;: Claude and AIVCS active in the same hour&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GitNexus&lt;/code&gt; on &lt;code&gt;2026-04-05&lt;/code&gt;: Claude and Cursor active in the same hour&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not a vibes-based claim. It is timestamped concurrent work.&lt;/p&gt;

&lt;p&gt;So I do not think “multiple workstreams are fake” is a serious position anymore. The better question is whether multiple agents can work in parallel in a way that is reliable, observable, and cheap to integrate.&lt;/p&gt;

&lt;p&gt;That is where things get more interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually seems to break first
&lt;/h2&gt;

&lt;p&gt;From what I can see, the first failures usually happen around the agent system, not inside the basic idea of parallelism itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Isolation
&lt;/h3&gt;

&lt;p&gt;This is why worktrees keep coming up in the strongest pro-agent posts.&lt;/p&gt;

&lt;p&gt;If multiple agents share mutable state carelessly, they interfere with each other. They overwrite assumptions, pollute local context, and turn parallel work into a race condition.&lt;/p&gt;

&lt;p&gt;The useful claim is not “I launched a bunch of agents.” The useful claim is “I gave them isolated execution surfaces, so they could run without stepping on each other.”&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Visibility
&lt;/h3&gt;

&lt;p&gt;One of the better skeptical complaints I saw was from Demir Bülbüloğlu on &lt;code&gt;2026-02-22&lt;/code&gt;. The complaint was not just that a system failed. It was that the system &lt;em&gt;claimed&lt;/em&gt; to be running multiple agents and then stalled instead of finishing.&lt;/p&gt;

&lt;p&gt;That matters because it points to a gap between claimed concurrency and observable concurrency.&lt;/p&gt;

&lt;p&gt;Once a system says it is running multiple agents, the operator needs answers to a few basic questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which task is active&lt;/li&gt;
&lt;li&gt;which agent owns which workspace&lt;/li&gt;
&lt;li&gt;whether a tool call finished, failed, or retried&lt;/li&gt;
&lt;li&gt;whether output was actually produced or quietly dropped&lt;/li&gt;
&lt;/ul&gt;
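&lt;p&gt;One way to close that gap is to make every agent action emit a queryable record instead of hidden state. Here is a minimal sketch; the field names are hypothetical, not from any particular framework:&lt;/p&gt;

```python
# Minimal, illustrative event record for observable multi-agent runs.
# Each question above maps to a queryable field rather than hidden state.
from dataclasses import dataclass, field
from typing import Optional
import time

@dataclass
class AgentEvent:
    agent: str           # which agent emitted this
    workspace: str       # which workspace it owns
    task: str            # which task is active
    status: str          # "started", "finished", "failed", "retried"
    output_path: Optional[str] = None  # None means no output was produced
    ts: float = field(default_factory=time.time)

log: list[AgentEvent] = []

def record(agent, workspace, task, status, output_path=None):
    evt = AgentEvent(agent, workspace, task, status, output_path)
    log.append(evt)
    return evt

record("claude", "wt/feature-a", "refactor-parser", "started")
record("claude", "wt/feature-a", "refactor-parser", "finished", "patch.diff")

# "Was output actually produced or quietly dropped?" becomes a query:
dropped = [e for e in log if e.status == "finished" and e.output_path is None]
```

&lt;p&gt;The point is not this particular schema. It is that "which agent owns which workspace" stops being a claim and becomes a lookup.&lt;/p&gt;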

&lt;p&gt;Without that, “multiple workstreams” is not really a workflow model. It is hidden state with a strong marketing wrapper.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Protocol drift
&lt;/h3&gt;

&lt;p&gt;I got a very practical reminder of this while repairing &lt;code&gt;grokrs --x-search&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The break had nothing to do with whether X search was conceptually possible. It was a compatibility problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;grokrs&lt;/code&gt; still emitted top-level &lt;code&gt;search_parameters&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;xAI now expects search configuration on the tool objects themselves&lt;/li&gt;
&lt;li&gt;the old shape was routed to the deprecated Live Search path and returned HTTP &lt;code&gt;410&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
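&lt;p&gt;To make the drift concrete, here is a hedged sketch of the two request shapes as plain dictionaries. Only &lt;code&gt;search_parameters&lt;/code&gt; and the &lt;code&gt;x_search&lt;/code&gt; tool type come from the actual fix; the rest of the field layout is an illustration, not the real xAI schema:&lt;/p&gt;

```python
# Illustrative sketch only: the exact xAI payload layout is not shown here.
# The old client shape put search configuration at the top level.
old_request = {
    "model": "grok",
    "input": "Summarize recent posts about xAI",
    "search_parameters": {"mode": "on"},  # deprecated path, answered with HTTP 410
}

# The newer shape attaches the configuration to the tool object itself.
new_request = {
    "model": "grok",
    "input": "Summarize recent posts about xAI",
    "tools": [
        {"type": "x_search", "mode": "on"}  # hypothetical field layout
    ],
}

def uses_deprecated_shape(req):
    """Catch the stale top-level key before the request hits the wire."""
    return "search_parameters" in req
```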

&lt;p&gt;After fixing that request shape, more drift showed up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;newer Responses payloads included &lt;code&gt;output_text&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;newer tool-backed responses also carried server-side tool usage in a shape our parser did not yet accept&lt;/li&gt;
&lt;/ul&gt;
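&lt;p&gt;The repair on the parsing side amounts to tolerance: accept the newer fields when present, fall back to older shapes, and degrade gracefully on payloads you do not recognize. A minimal sketch, with assumed field names rather than the real schema:&lt;/p&gt;

```python
# A tolerant extractor: prefer the newer "output_text" field, fall back to
# an older message-style shape, and never crash on unknown usage payloads.
# Field names here are assumptions for illustration, not the real schema.
def extract_text(payload):
    if "output_text" in payload:
        return payload["output_text"]
    # fall back to an older message-style shape
    for item in payload.get("output", []):
        if item.get("type") == "message":
            return item.get("text", "")
    return ""

def extract_usage(payload):
    # Unknown or reshaped usage blocks degrade to an empty dict
    # instead of raising, so drift does not break the whole call.
    usage = payload.get("usage")
    return usage if isinstance(usage, dict) else {}

newer = {"output_text": "hello", "usage": {"total_tokens": 12}}
older = {"output": [{"type": "message", "text": "hi"}]}
```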

&lt;p&gt;That is a normal systems problem, but I think it is exactly the kind of problem that gets misread in agent discourse. A workflow can be conceptually valid and still be operationally brittle because its boundaries are stale.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Human coordination load
&lt;/h3&gt;

&lt;p&gt;This is where the skepticism from priyanka’s &lt;code&gt;2026-03-11&lt;/code&gt; post lands for me. Even if multiple workstreams are real, the human often still carries the most expensive parts of the workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;decomposing the work&lt;/li&gt;
&lt;li&gt;deciding which stream matters more&lt;/li&gt;
&lt;li&gt;reviewing partial outputs&lt;/li&gt;
&lt;li&gt;merging conflicting changes&lt;/li&gt;
&lt;li&gt;deciding what to retry and what to discard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that burden remains too high, then the system has not really achieved delegation. It has achieved assisted supervision.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I changed in &lt;code&gt;grokrs&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;To get &lt;code&gt;grokrs ... --x-search&lt;/code&gt; working again, I made a few targeted compatibility fixes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stop sending deprecated top-level &lt;code&gt;search_parameters&lt;/code&gt; for tool-backed search&lt;/li&gt;
&lt;li&gt;Move X search filters onto the &lt;code&gt;x_search&lt;/code&gt; tool object&lt;/li&gt;
&lt;li&gt;Accept newer response shapes like &lt;code&gt;output_text&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Make the usage parser more tolerant of current server-side tool usage payloads&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After that, this worked again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;grokrs &lt;span class="nt"&gt;--profile&lt;/span&gt; dev agent &lt;span class="nt"&gt;--headless&lt;/span&gt; &lt;span class="nt"&gt;--approval-mode&lt;/span&gt; allow &lt;span class="nt"&gt;--x-search&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-iterations&lt;/span&gt; 2 &lt;span class="s2"&gt;"Summarize what people are saying about xAI on X in one sentence."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also reran the package test suites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;grokrs-api&lt;/code&gt;: &lt;code&gt;906&lt;/code&gt; tests passed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-cli&lt;/code&gt;: &lt;code&gt;291&lt;/code&gt; tests passed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think this is the more useful lesson from the repair: multi-agent systems often degrade first at the boundaries. Not in the screenshot. Not in the prompt demo. At the boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  The synthesis that seems most honest
&lt;/h2&gt;

&lt;p&gt;I do not think the right conclusion is that the optimists are wrong or the skeptics are wrong.&lt;/p&gt;

&lt;p&gt;The optimistic posts are right that parallel work is already useful. The skeptical posts are right that a system which merely &lt;em&gt;claims&lt;/em&gt; parallelism is not enough.&lt;/p&gt;

&lt;p&gt;The synthesis I keep coming back to is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiple agents are real&lt;/li&gt;
&lt;li&gt;multiple workstreams are useful&lt;/li&gt;
&lt;li&gt;neither is self-validating&lt;/li&gt;
&lt;li&gt;the hard part is shifting from generation to coordination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why the interesting engineering work now seems to be moving toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;isolated worktrees and workspace boundaries&lt;/li&gt;
&lt;li&gt;explicit ownership of subtasks&lt;/li&gt;
&lt;li&gt;event and progress visibility&lt;/li&gt;
&lt;li&gt;parsers and clients that survive upstream churn&lt;/li&gt;
&lt;li&gt;review and merge loops that handle partial failure well&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think that is the part of the story that matters most. “Can an agent write code?” is no longer the whole question. “Can the system around several agents make their work dependable?” is the real one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thought
&lt;/h2&gt;

&lt;p&gt;Saying “we run multiple agents” is easy.&lt;/p&gt;

&lt;p&gt;What matters is whether those agents can work in parallel without corrupting state, whether the operator can see what is happening, whether the system survives interface drift, and whether the outputs are cheap to review and integrate.&lt;/p&gt;

&lt;p&gt;That is the line between a screenshot and an operating model.&lt;/p&gt;

&lt;p&gt;The X discourse feels like it is converging on that distinction, even when the posts sound like they disagree. One side is seeing the leverage. The other side is seeing the fragility.&lt;/p&gt;

&lt;p&gt;I think both are describing the same transition from different angles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Boris Cherny
&lt;a href="https://x.com/bcherny/status/2038454353787519164" rel="noopener noreferrer"&gt;https://x.com/bcherny/status/2038454353787519164&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Abdulmuiz Adeyemo
&lt;a href="https://x.com/AbdMuizAdeyemo/status/2025519825691283657" rel="noopener noreferrer"&gt;https://x.com/AbdMuizAdeyemo/status/2025519825691283657&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Numman Ali
&lt;a href="https://x.com/nummanali/status/2019473874455331156" rel="noopener noreferrer"&gt;https://x.com/nummanali/status/2019473874455331156&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Julian Goldie
&lt;a href="https://x.com/JulianGoldieSEO/status/2020081836240896487" rel="noopener noreferrer"&gt;https://x.com/JulianGoldieSEO/status/2020081836240896487&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Demir Bülbüloğlu
&lt;a href="https://x.com/demirbulbuloglu/status/2025598095312982249" rel="noopener noreferrer"&gt;https://x.com/demirbulbuloglu/status/2025598095312982249&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;priyanka
&lt;a href="https://x.com/pridesai/status/2031783971047051445" rel="noopener noreferrer"&gt;https://x.com/pridesai/status/2031783971047051445&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>productivity</category>
      <category>rust</category>
    </item>
    <item>
      <title>3 AIs Reviewed the Same Codebase. They Disagreed on 2 Findings. That is the Point.</title>
      <dc:creator>Verivus OSS Releases</dc:creator>
      <pubDate>Tue, 07 Apr 2026 05:51:55 +0000</pubDate>
      <link>https://dev.to/verivusossreleases/3-ais-reviewed-the-same-codebase-they-disagreed-on-2-findings-that-is-the-point-63a</link>
      <guid>https://dev.to/verivusossreleases/3-ais-reviewed-the-same-codebase-they-disagreed-on-2-findings-that-is-the-point-63a</guid>
      <description>&lt;p&gt;We have a rule at Verivus Labs: before code ships, it gets reviewed by three AI models independently. We require unconditional approval from Claude, Codex, and Gemini before anything merges. We wrote about the mechanics of that process in &lt;a href="https://medium.com/@wernerk/the-codex-review-gate-how-we-made-ai-agents-review-each-others-work-59e9ff5465f9" rel="noopener noreferrer"&gt;The Codex Review Gate&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That process works well on our own code. We wanted to know whether it finds real things in code we did not write. Code that is already well-maintained and well-structured.&lt;/p&gt;

&lt;p&gt;Simon Willison's &lt;a href="https://github.com/simonw/llm" rel="noopener noreferrer"&gt;llm&lt;/a&gt; is one of the better-engineered CLI tools in the Python ecosystem. It has a clean architecture, a comprehensive plugin system, and parameterized SQL throughout. The reviewers independently noted the consistent SQL safety, which speaks to the care that has gone into the project. We pointed our tools at it and filed the findings that survived review.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;Two of our tools did the heavy lifting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/verivus-oss/sqry" rel="noopener noreferrer"&gt;sqry&lt;/a&gt; is our AST-based code analysis tool. We wrote about it in &lt;a href="https://medium.com/@wernerk/the-code-question-grep-cant-answer-057bfc8d7fe2" rel="noopener noreferrer"&gt;The Code Question grep Can't Answer&lt;/a&gt;. It parses code structurally, building function signatures, call graphs, and dependency relationships, and exposes them through an MCP server. sqry gave the reviewers a structural map of 40 Python source files containing 5,499 symbols and 7,277 edges.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;llm-cli-gateway&lt;/a&gt; coordinated the reviews. It is our MCP server for multi-LLM orchestration. It wraps Claude, Codex, and Gemini through a single interface with retries, circuit breakers, and session management. Each reviewer got the same prompt and the same sqry access, run in separate sessions with no shared context.&lt;/p&gt;

&lt;p&gt;We also built an &lt;a href="https://pypi.org/project/llm-cli-gateway/" rel="noopener noreferrer"&gt;llm plugin&lt;/a&gt; that bridges our gateway into Simon's own &lt;code&gt;llm&lt;/code&gt; ecosystem. Install with &lt;code&gt;llm install llm-cli-gateway&lt;/code&gt; and you get &lt;code&gt;gateway-claude&lt;/code&gt;, &lt;code&gt;gateway-codex&lt;/code&gt;, and &lt;code&gt;gateway-gemini&lt;/code&gt; as models. The plugin requires Node.js 18+ for the gateway runtime. We wanted to contribute to Simon's ecosystem.&lt;/p&gt;

&lt;p&gt;The review target was &lt;code&gt;simonw/llm&lt;/code&gt; at commit &lt;code&gt;cad03fb&lt;/code&gt;, reviewed on April 4, 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  What they found
&lt;/h2&gt;

&lt;p&gt;Codex went first. 11 minutes, 307K tokens. It used sqry to navigate the call graph, then fetched source directly from GitHub to verify against specific commits. It identified 8 potential issues.&lt;/p&gt;

&lt;p&gt;Gemini went second. 8 minutes. It used sqry hierarchical search and pattern search. It confirmed 5 of Codex's findings and identified 3 new ones.&lt;/p&gt;

&lt;p&gt;We then sent each reviewer's unique findings to the other for cross-validation. At this point we had 11 candidate findings, all confirmed by both Codex and Gemini.&lt;/p&gt;

&lt;p&gt;Two reviewers is good, but three is better. Claude did an independent adjudication pass over the 11 candidates, reading each relevant source file and providing line-level verdicts. Claude's role was validation. It assessed whether each finding was a genuine defect or a defensible design choice.&lt;/p&gt;

&lt;p&gt;Claude confirmed 8 findings. It disputed 2. It marked 1 uncertain.&lt;/p&gt;

&lt;p&gt;The disputes taught us the most.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 2 findings Claude rejected
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Uncaught hook exceptions in async tool execution.&lt;/strong&gt; Codex and Gemini both flagged that &lt;code&gt;before_call&lt;/code&gt;/&lt;code&gt;after_call&lt;/code&gt; hooks in the async path run outside try/except, meaning a buggy plugin hook crashes the entire parallel tool batch.&lt;/p&gt;

&lt;p&gt;Claude disagreed. If an after-call hook throws, that is an unexpected error and should propagate. Silently swallowing hook failures would mask plugin bugs. The current behavior is a defensible design choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory usage with large attachments.&lt;/strong&gt; Codex and Gemini both noted that &lt;code&gt;_attachment()&lt;/code&gt; eagerly reads entire files into memory, base64-encodes them (33% expansion), and holds everything in a JSON object simultaneously.&lt;/p&gt;
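&lt;p&gt;The 33% figure is just base64 arithmetic: every 3 input bytes become 4 output characters. A quick check:&lt;/p&gt;

```python
# Verify the base64 expansion factor: 3 input bytes become 4 output chars,
# so an encoded attachment is 4/3 the size of the original file.
import base64

raw = bytes(3_000_000)  # a 3 MB stand-in for a large attachment
encoded = base64.b64encode(raw)

expansion = len(encoded) / len(raw)
# expansion is 4/3, i.e. roughly a 33% size increase before JSON wrapping
```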

&lt;p&gt;Claude's assessment was that this is inherent to how multimodal API calls work. The content has to be serialized to send it. There is no unnecessary duplication. It is the minimum work required by the API contract.&lt;/p&gt;

&lt;p&gt;Both are reasonable arguments, and this is why three-way review matters. Two models agreeing does not make something a defect. The third model asking whether something is actually wrong, or just uncomfortable, prevents filing noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 1 finding Claude marked uncertain
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Async tool execution racing shared Toolbox state.&lt;/strong&gt; Codex and Gemini flagged that the async path batches tool calls into &lt;code&gt;asyncio.gather()&lt;/code&gt;, which could race if a &lt;code&gt;Toolbox&lt;/code&gt; instance maintains state across calls. Claude's assessment was that the framework's own state management appears safe, but whether the issue manifests depends on plugin-specific behavior. The framework does not guarantee sequential execution, and plugins may not expect parallelism.&lt;/p&gt;
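&lt;p&gt;The failure mode is easy to reproduce generically. This is not code from &lt;code&gt;llm&lt;/code&gt;; it is a minimal demonstration of why a read-modify-write across an await point loses updates when calls run under &lt;code&gt;asyncio.gather()&lt;/code&gt;:&lt;/p&gt;

```python
# Generic demonstration (not code from simonw/llm): a read-modify-write
# across an await point loses updates when tool calls run concurrently.
import asyncio

state = {"calls": 0}

async def stateful_tool():
    seen = state["calls"]      # read shared state
    await asyncio.sleep(0)     # yield, as any real tool call would
    state["calls"] = seen + 1  # write back a now-stale value

async def main():
    await asyncio.gather(*(stateful_tool() for _ in range(5)))

asyncio.run(main())
# All five coroutines read 0 before any wrote, so only one increment survives.
```

&lt;p&gt;A plugin written with sequential execution in mind would pass its own tests and still miscount under the batched path.&lt;/p&gt;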

&lt;h2&gt;
  
  
  The 8 findings that held up
&lt;/h2&gt;

&lt;p&gt;Three stood out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PDF attachment data persisted in logs.&lt;/strong&gt; The &lt;code&gt;redact_data()&lt;/code&gt; function strips &lt;code&gt;image_url.url&lt;/code&gt; and &lt;code&gt;input_audio.data&lt;/code&gt; from logged prompt JSON, but has no case for &lt;code&gt;file.file_data&lt;/code&gt;, where PDF attachments are stored as base64. Full PDF contents persist in &lt;code&gt;logs.db&lt;/code&gt;. Users who share that database could inadvertently expose document contents. Filed as &lt;a href="https://github.com/simonw/llm/issues/1396" rel="noopener noreferrer"&gt;#1396&lt;/a&gt;.&lt;/p&gt;
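&lt;p&gt;A simplified recreation of the pattern (not the actual &lt;code&gt;llm&lt;/code&gt; source) shows how a case-by-case redactor misses a new attachment type by default:&lt;/p&gt;

```python
# Simplified recreation of the reported gap: a redactor that strips image
# and audio payloads but has no case for "file"-type attachments, so
# base64 PDF data passes through into the log untouched.
def redact(part):
    kind = part.get("type")
    if kind == "image_url":
        return {"type": kind, "image_url": {"url": "[redacted]"}}
    if kind == "input_audio":
        return {"type": kind, "input_audio": {"data": "[redacted]"}}
    return part  # "file" falls through: file_data survives

pdf_part = {"type": "file", "file": {"file_data": "JVBERi0xLjQ_base64_payload"}}
image_part = {"type": "image_url", "image_url": {"url": "data:image/png;..."}}

logged_pdf = redact(pdf_part)
logged_image = redact(image_part)
```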

&lt;p&gt;&lt;strong&gt;Embedding dedup comparing wrong keys.&lt;/strong&gt; &lt;code&gt;embed_multi_with_metadata()&lt;/code&gt; queries by &lt;code&gt;content_hash&lt;/code&gt; but then filters by comparing incoming item IDs against returned row IDs. These are semantically different values. Duplicate content under a new ID bypasses dedup silently. Filed as &lt;a href="https://github.com/simonw/llm/issues/1397" rel="noopener noreferrer"&gt;#1397&lt;/a&gt;.&lt;/p&gt;
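&lt;p&gt;The shape of that bug, reduced to a few lines (illustrative, not the actual implementation):&lt;/p&gt;

```python
# Illustrative version of the dedup bug: querying by content hash but then
# filtering by item ID compares two different keyspaces, so duplicate
# content stored under a new ID slips past the check.
import hashlib

stored = [{"id": "doc-1", "content_hash": hashlib.md5(b"same text").hexdigest()}]

def is_duplicate_buggy(item_id, content):
    h = hashlib.md5(content).hexdigest()
    rows = [r for r in stored if r["content_hash"] == h]  # query by hash...
    return any(r["id"] == item_id for r in rows)          # ...filter by ID

def is_duplicate_fixed(item_id, content):
    h = hashlib.md5(content).hexdigest()
    return any(r["content_hash"] == h for r in stored)    # compare like with like

# Same content, new ID: the buggy check reports "not a duplicate".
```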

&lt;p&gt;&lt;strong&gt;Stale loop variable in tool logging.&lt;/strong&gt; In &lt;code&gt;log_to_db()&lt;/code&gt;, the &lt;code&gt;tool_instances&lt;/code&gt; INSERT references &lt;code&gt;tool.plugin&lt;/code&gt; from a previous loop. Python loop variables retain their last value after the loop ends, so every tool result gets attributed to whichever toolbox was last in the list. Filed as &lt;a href="https://github.com/simonw/llm/issues/1398" rel="noopener noreferrer"&gt;#1398&lt;/a&gt;.&lt;/p&gt;
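&lt;p&gt;The underlying Python behavior is worth seeing in isolation, because the buggy code reads perfectly naturally:&lt;/p&gt;

```python
# The Python pitfall behind the finding: a loop variable outlives its loop,
# so code after the loop silently sees the last iteration's value.
toolboxes = [{"plugin": "alpha"}, {"plugin": "beta"}]

for tool in toolboxes:
    pass  # the first loop does its own work and ends

results = []
for result in ["r1", "r2"]:
    # BUG: "tool" still refers to the last toolbox from the loop above,
    # so every result is attributed to "beta".
    results.append({"result": result, "plugin": tool["plugin"]})
```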

&lt;p&gt;The remaining five: a possible migration race window when multiple processes start before migrations complete (&lt;a href="https://github.com/simonw/llm/issues/789#issuecomment-4188034320" rel="noopener noreferrer"&gt;commented on #789&lt;/a&gt;), a potential &lt;code&gt;--async --usage&lt;/code&gt; crash with &lt;code&gt;AsyncChainResponse&lt;/code&gt;, negative &lt;code&gt;--chain-limit&lt;/code&gt; failing immediately, &lt;code&gt;asyncio.run()&lt;/code&gt; called inside running event loops, and &lt;code&gt;cosine_similarity()&lt;/code&gt; dividing by zero on zero vectors.&lt;/p&gt;
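&lt;p&gt;The zero-vector case is the simplest of these to sketch. A guarded version of cosine similarity, where the fallback convention is our assumption and the right behavior is the maintainer's call:&lt;/p&gt;

```python
# Sketch of the zero-vector edge case: a naive cosine similarity divides
# by zero on an all-zero vector; a guarded version returns 0.0 instead.
import math

def cosine_similarity_naive(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def cosine_similarity_guarded(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention choice, not the only reasonable fallback
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
```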

&lt;p&gt;Severity ratings are our internal assessment. None have been confirmed by the maintainer yet.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Finding&lt;/th&gt;
&lt;th&gt;Validation&lt;/th&gt;
&lt;th&gt;Filed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;PDF data not stripped by &lt;code&gt;redact_data()&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/simonw/llm/issues/1396" rel="noopener noreferrer"&gt;#1396&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Embedding dedup compares wrong keys&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/simonw/llm/issues/1397" rel="noopener noreferrer"&gt;#1397&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Possible migration race window&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/simonw/llm/issues/789#issuecomment-4188034320" rel="noopener noreferrer"&gt;#789&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Async tool races shared state&lt;/td&gt;
&lt;td&gt;2/3&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;--async --usage&lt;/code&gt; crash&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Stale loop variable in &lt;code&gt;log_to_db()&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/simonw/llm/issues/1398" rel="noopener noreferrer"&gt;#1398&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Negative &lt;code&gt;--chain-limit&lt;/code&gt; fails&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;asyncio.run()&lt;/code&gt; in event loop&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Hook exceptions crash batch&lt;/td&gt;
&lt;td&gt;2/3&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Memory with large attachments&lt;/td&gt;
&lt;td&gt;2/3&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cosine_similarity&lt;/code&gt; / zero&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What sqry contributed
&lt;/h2&gt;

&lt;p&gt;sqry gave the reviewers structural navigation instead of text search:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;find_cycles&lt;/code&gt; confirmed zero import cycles and one guarded call cycle (&lt;code&gt;get_model&lt;/code&gt; calling &lt;code&gt;get_async_model&lt;/code&gt; and vice versa)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;complexity_metrics&lt;/code&gt; identified &lt;code&gt;logs_list()&lt;/code&gt; at complexity 43 (622 lines) and &lt;code&gt;prompt()&lt;/code&gt; at complexity 35 (450 lines, 30 parameters)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;direct_callers&lt;/code&gt; and &lt;code&gt;explain_code&lt;/code&gt; let Codex trace the full &lt;code&gt;_attachment()&lt;/code&gt; to &lt;code&gt;log_to_db()&lt;/code&gt; to &lt;code&gt;redact_data()&lt;/code&gt; call path that exposed the PDF issue&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pattern_search&lt;/code&gt; found the stale loop variable pattern across the codebase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Structural navigation means the reviewers could follow call paths and dependency chains rather than searching for keywords. That is the difference between asking "where is this function called" and actually knowing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;llm&lt;/code&gt; plugin provides the simplest entry point. It routes through the MCP gateway under the hood. For structural review like we describe in this article, you would also want sqry running as an MCP server so the models can navigate call graphs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the llm plugin (requires Node.js 18+)&lt;/span&gt;
llm &lt;span class="nb"&gt;install &lt;/span&gt;llm-cli-gateway

&lt;span class="c"&gt;# Basic usage&lt;/span&gt;
llm &lt;span class="nt"&gt;-m&lt;/span&gt; gateway-codex &lt;span class="s2"&gt;"Review this file for bugs: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;src/main.py&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
llm &lt;span class="nt"&gt;-m&lt;/span&gt; gateway-gemini &lt;span class="s2"&gt;"Review this file for bugs: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;src/main.py&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# For structural review with sqry, use the MCP gateway directly&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; llm-cli-gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Gateway: &lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;github.com/verivus-oss/llm-cli-gateway&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Plugin: &lt;a href="https://pypi.org/project/llm-cli-gateway/" rel="noopener noreferrer"&gt;pypi.org/project/llm-cli-gateway&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;sqry: &lt;a href="https://github.com/verivus-oss/sqry" rel="noopener noreferrer"&gt;github.com/verivus-oss/sqry&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What we took away
&lt;/h2&gt;

&lt;p&gt;The findings we filed are candidates that survived three-way review. The maintainer may disagree with some of them. The point of the exercise was to test the methodology, and we are grateful to Simon for building &lt;code&gt;llm&lt;/code&gt; in the open where this kind of analysis is possible.&lt;/p&gt;

&lt;p&gt;The reviewers did not find SQL injection surfaces in the paths they inspected. The issues they found are subtle. Stale loop variables, key mismatches in dedup logic, missing cases in sanitization functions. These are the kind of things that survive human review because the code reads well.&lt;/p&gt;

&lt;p&gt;What stayed with us were the disagreements. Two models confirming something does not make it true. The third model asking whether something is actually a defect is what separates useful review from noise. That is why you review with multiple perspectives.&lt;/p&gt;

&lt;p&gt;We will keep running this pattern. Three independent perspectives catch things that one perspective misses. That is the premise behind &lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;llm-cli-gateway&lt;/a&gt;, and this was a useful case study.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Werner Kasselman is a software engineer who builds open source developer tools in his spare time, including sqry and llm-cli-gateway. By day he works at ServiceNow. He lives in Australia with his family and blogs at &lt;a href="https://medium.com/@wernerk" rel="noopener noreferrer"&gt;medium.com/@wernerk&lt;/a&gt;. Views expressed here are his own and do not represent ServiceNow.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How We Built a Safety-First Rust Agent CLI in Two Days Without Letting the Codebase Turn to Mush</title>
      <dc:creator>Verivus OSS Releases</dc:creator>
      <pubDate>Tue, 07 Apr 2026 01:58:15 +0000</pubDate>
      <link>https://dev.to/verivusossreleases/how-we-built-a-safety-first-rust-agent-cli-in-two-days-without-letting-the-codebase-turn-to-mush-13hj</link>
      <guid>https://dev.to/verivusossreleases/how-we-built-a-safety-first-rust-agent-cli-in-two-days-without-letting-the-codebase-turn-to-mush-13hj</guid>
      <description>&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;p&gt;I think most AI-assisted software fails in one of two ways.&lt;/p&gt;

&lt;p&gt;The first failure mode is obvious. The code is sloppy, the boundaries are fuzzy, and the whole thing feels like a transcript that got committed by accident.&lt;/p&gt;

&lt;p&gt;The second failure mode is more subtle. The code is fine for a demo, but the repo has no durable planning model, no review trail, and no way to explain why one subsystem looks the way it does. A week later, nobody wants to touch it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;grokrs&lt;/code&gt; avoided both.&lt;/p&gt;

&lt;p&gt;This repo is a Rust-only scaffold for a Grok-oriented agent CLI. It is safety-first by design. More importantly, it was built fast without taking the usual shortcuts that make a codebase hard to trust. The artifact trail shows a concentrated implementation burst across &lt;code&gt;2026-04-05&lt;/code&gt; and &lt;code&gt;2026-04-06&lt;/code&gt;, but the result still has clear crate boundaries, deny-by-default policy handling, machine-readable planning, and a review system that is stronger than what I usually see in projects with much longer schedules.&lt;/p&gt;

&lt;p&gt;I want to walk through why that happened, because I think the process is as interesting as the software.&lt;/p&gt;

&lt;p&gt;I also used &lt;code&gt;grokrs&lt;/code&gt; itself to generate the article images and submit this draft to Dev.to. That felt like a fair test of whether the CLI is already useful outside its own repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  What was actually built
&lt;/h2&gt;

&lt;p&gt;At the workspace level, &lt;code&gt;grokrs&lt;/code&gt; is not a single crate with a heroic &lt;code&gt;main.rs&lt;/code&gt;. The root &lt;code&gt;Cargo.toml&lt;/code&gt; defines eight members:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;grokrs-core&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grokrs-cap&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grokrs-policy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grokrs-session&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grokrs-tool&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grokrs-cli&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grokrs-api&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grokrs-store&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That split matters. Each crate has a job. Each crate also imposes a limit.&lt;/p&gt;

&lt;p&gt;The repository currently has &lt;code&gt;130&lt;/code&gt; Rust source files under &lt;code&gt;crates/&lt;/code&gt;, &lt;code&gt;55,963&lt;/code&gt; total Rust source lines, and about &lt;code&gt;41,990&lt;/code&gt; Rust code lines once blank lines and comment-only lines are stripped out. A quick scan of test annotations turns up &lt;code&gt;1,647&lt;/code&gt; &lt;code&gt;#[test]&lt;/code&gt; and &lt;code&gt;#[tokio::test]&lt;/code&gt; markers in the crate tree. The docs side is not small either. There are &lt;code&gt;5&lt;/code&gt; top-level specs in &lt;code&gt;docs/specs/&lt;/code&gt;, and &lt;code&gt;15&lt;/code&gt; &lt;code&gt;IMPLEMENTATION_DAG.toml&lt;/code&gt; files under &lt;code&gt;docs/reviews/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is not a toy codebase pretending to be architecture.&lt;/p&gt;

&lt;p&gt;I reran a fresh full &lt;code&gt;sqry&lt;/code&gt; index on &lt;code&gt;2026-04-07&lt;/code&gt; because I wanted a better read on the codebase before publishing this. The new index came back with numbers that are hard to wave away:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;131&lt;/code&gt; indexed files&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;46,997&lt;/code&gt; indexed symbols&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;61,393&lt;/code&gt; graph edges&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;10,537&lt;/code&gt; functions and &lt;code&gt;6,760&lt;/code&gt; call sites&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0&lt;/code&gt; cycles in the current graph snapshot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The extended graph details are just as telling. &lt;code&gt;sqry&lt;/code&gt; broke the workspace down into &lt;code&gt;24,830&lt;/code&gt; variables, &lt;code&gt;1,269&lt;/code&gt; imports, &lt;code&gt;1,157&lt;/code&gt; types, &lt;code&gt;870&lt;/code&gt; macros, &lt;code&gt;509&lt;/code&gt; methods, &lt;code&gt;339&lt;/code&gt; modules, &lt;code&gt;285&lt;/code&gt; structs, and &lt;code&gt;81&lt;/code&gt; enums. It also reported &lt;code&gt;0&lt;/code&gt; cross-language edges, &lt;code&gt;4,843&lt;/code&gt; duplicate groups, and &lt;code&gt;3,898&lt;/code&gt; unused symbols. That is the kind of inventory I expect from a real codebase, not a weekend mockup.&lt;/p&gt;

&lt;p&gt;The concentration matters as much as the size. The git history in this repository shows &lt;code&gt;55,888&lt;/code&gt; net Rust source lines, nearly the entire tree counted above, landing on &lt;code&gt;2026-04-06&lt;/code&gt;, so the visible implementation burst was extremely concentrated even if the project story still spans two days.&lt;/p&gt;
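&lt;p&gt;For readers who want to reproduce the code-line counts above, the blank-and-comment strip is only a few lines. This is an illustrative sketch, not the exact rules behind the repo's numbers:&lt;/p&gt;

```python
# Approximate "code lines" the way the article does: drop blank lines
# and comment-only lines, then count what remains. The exact counting
# rules behind grokrs' figures are not published, so this is a sketch.
def count_lines(source):
    total = 0
    code = 0
    for line in source.splitlines():
        total += 1
        stripped = line.strip()
        # Skip blanks and lines that are only a // line comment.
        if stripped and not stripped.startswith("//"):
            code += 1
    return total, code

sample = "// header comment\n\nfn main() {\n    println!(\"hi\");\n}\n"
print(count_lines(sample))  # (5, 3)
```

&lt;p&gt;Running a function like this over every &lt;code&gt;.rs&lt;/code&gt; file under &lt;code&gt;crates/&lt;/code&gt; is how a "total lines versus code lines" pair falls out of a tree.&lt;/p&gt;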

&lt;p&gt;The crate breakdown is clean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;grokrs-cap&lt;/code&gt; carries rooted path handling and trust-level types&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-policy&lt;/code&gt; carries effect classification and deny-by-default evaluation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-tool&lt;/code&gt; carries tool traits, classification, and registry logic&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-api&lt;/code&gt; carries xAI and Grok transport, streaming, endpoints, and tool-loop code&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-store&lt;/code&gt; carries SQLite WAL persistence&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-session&lt;/code&gt; carries typed lifecycle state&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-core&lt;/code&gt; carries config and shared domain types&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grokrs-cli&lt;/code&gt; carries user-facing commands and orchestration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think this is one of the big reasons the repo stayed readable while moving fast. Lower-level safety primitives do not depend on the CLI. The API crate gets a policy gate injected at runtime rather than importing policy code directly. The store crate stays focused on state. Each boundary removes a class of later confusion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0shh4zlm6qn5405wuc4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0shh4zlm6qn5405wuc4.jpg" alt="Editorial illustration of crate boundaries and architecture" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the safety model works
&lt;/h2&gt;

&lt;p&gt;The top-level architecture doc says the project wants four things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;explicit trust boundaries&lt;/li&gt;
&lt;li&gt;a rooted filesystem model&lt;/li&gt;
&lt;li&gt;effects classified before execution&lt;/li&gt;
&lt;li&gt;a modular implementation that can grow without a rewrite&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code follows through on that.&lt;/p&gt;

&lt;p&gt;Trust is encoded in types. Sessions are parameterized by trust level. Path handling is rooted through &lt;code&gt;WorkspaceRoot&lt;/code&gt; and &lt;code&gt;WorkspacePath&lt;/code&gt;. The policy engine works in terms of explicit effects such as &lt;code&gt;FsRead&lt;/code&gt;, &lt;code&gt;FsWrite&lt;/code&gt;, &lt;code&gt;ProcessSpawn&lt;/code&gt;, and &lt;code&gt;NetworkConnect&lt;/code&gt;. The defaults are conservative. Network is denied by default. Shell spawning is denied by default. Workspace writes require validated relative paths.&lt;/p&gt;

&lt;p&gt;That is what I want to see in an agent CLI. The system is opinionated before the first command runs.&lt;/p&gt;
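&lt;p&gt;To make the deny-by-default posture concrete, here is a minimal sketch of effect evaluation. The effect names mirror the article; the rule table and the &lt;code&gt;evaluate&lt;/code&gt; function are hypothetical stand-ins, not grokrs code:&lt;/p&gt;

```python
# Minimal sketch of a deny-by-default effect policy. Effect names
# mirror the article (FsRead, FsWrite, ProcessSpawn, NetworkConnect);
# the rule set itself is illustrative.
DEFAULT_RULES = {
    "FsRead": "allow",         # reads inside the workspace root
    "FsWrite": "ask",          # writes need a validated relative path
    "ProcessSpawn": "deny",    # shell spawning denied by default
    "NetworkConnect": "deny",  # network denied by default
}

def evaluate(effect, rules=DEFAULT_RULES):
    # Anything the policy does not know about is denied, not allowed.
    return rules.get(effect, "deny")

print(evaluate("NetworkConnect"))  # deny
print(evaluate("SomeNewEffect"))   # deny: unknown effects fall through
```

&lt;p&gt;The important line is the fallback: an effect the policy has never heard of resolves to deny, which is the whole point of classifying effects before execution.&lt;/p&gt;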

&lt;p&gt;The tool path is especially solid. In the executor flow, a tool call gets looked up, classified, policy-checked, then executed. The approval behavior is explicit too:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;allow&lt;/code&gt; maps &lt;code&gt;Ask&lt;/code&gt; to &lt;code&gt;Allow&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deny&lt;/code&gt; maps &lt;code&gt;Ask&lt;/code&gt; to &lt;code&gt;Deny&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;interactive&lt;/code&gt; preserves &lt;code&gt;Ask&lt;/code&gt;, but current comments make clear that this is effectively a deny path until the approval broker is implemented&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That sequence matters. It means the project did not fake the approval layer just to keep the demo flowing.&lt;/p&gt;
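&lt;p&gt;The approval mapping can be sketched in a few lines. The &lt;code&gt;resolve&lt;/code&gt; function and string values here are hypothetical stand-ins for grokrs' actual types:&lt;/p&gt;

```python
# Sketch of the approval mapping described above: "allow" and "deny"
# resolve Ask decisions up front, while "interactive" preserves Ask
# but behaves as a deny path until an approval broker exists.
def resolve(decision, mode):
    if decision != "Ask":
        return decision  # Allow and Deny pass through unchanged
    if mode == "allow":
        return "Allow"
    if mode == "deny":
        return "Deny"
    # "interactive": Ask is preserved; with no broker wired in yet,
    # the executor effectively treats it as a deny.
    return "Ask"

print(resolve("Ask", "allow"))        # Allow
print(resolve("Ask", "interactive"))  # Ask
```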

&lt;h2&gt;
  
  
  The command surface is already broad
&lt;/h2&gt;

&lt;p&gt;The first spec in &lt;code&gt;docs/specs/00_SPEC.md&lt;/code&gt; is intentionally modest. It says the initial release does not promise a production agent runtime yet. It wants to establish the boundaries needed to build one safely.&lt;/p&gt;

&lt;p&gt;That makes the current command surface more interesting, not less.&lt;/p&gt;

&lt;p&gt;The repo already supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;direct API operations through &lt;code&gt;grokrs api&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;interactive REPL chat through &lt;code&gt;grokrs chat&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;tool-calling agent execution through &lt;code&gt;grokrs agent&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;management work through &lt;code&gt;collections&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;media generation through &lt;code&gt;generate&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;model discovery through &lt;code&gt;models&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;session and store inspection through &lt;code&gt;sessions&lt;/code&gt; and &lt;code&gt;store&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;runtime posture and config inspection through &lt;code&gt;doctor&lt;/code&gt; and &lt;code&gt;show_config&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture doc also calls out &lt;code&gt;voice&lt;/code&gt;, MCP client support, search integration, encrypted reasoning replay, prompt caching, and memory tools. This is well past the point where you can call it a shell around one endpoint.&lt;/p&gt;

&lt;p&gt;What I like here is the sequencing. The project did not start with a magical agent and then backfill the boring layers. It built the capability model, the policy path, the store, and the command surface together.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real story is in the documentation system
&lt;/h2&gt;

&lt;p&gt;If you only read the code, you will understand the runtime. If you read the docs tree, you understand the development method.&lt;/p&gt;

&lt;p&gt;The visible docs include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;AGENTS.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ARCHITECTURE.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/specs/00_SPEC.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/specs/01_XAI_API_CLIENT.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/specs/02_SQLITE_STORE.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/specs/03_APPROVAL_BROKER.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/specs/04_MCP_SERVER.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/design/00_ARCHITECTURE.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/design/01_SQLITE_STATE.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/development/grokrs/01_SPEC.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/development/grokrs/03_IMPLEMENTATION_PLAN.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/development/grokrs/04_XAI_API_IMPLEMENTATION_PLAN.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/development/grokrs/05_AGENT_ORCHESTRATION_PROMPT.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/ops/00_BOOTSTRAP.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs/reviews/AI_SLOP_REVIEW_GUIDE.md&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a full process stack. It covers scope, architecture, implementation planning, operations, and review posture.&lt;/p&gt;

&lt;p&gt;I do not think these docs were written as decoration. They function as a control plane for the repo.&lt;/p&gt;

&lt;h3&gt;
  
  
  Specs as execution boundaries
&lt;/h3&gt;

&lt;p&gt;The subsystem specs are already separated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the product spec&lt;/li&gt;
&lt;li&gt;the xAI API client spec&lt;/li&gt;
&lt;li&gt;the SQLite store spec&lt;/li&gt;
&lt;li&gt;the approval broker spec&lt;/li&gt;
&lt;li&gt;the MCP server spec&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That turns requirements into named contract surfaces.&lt;/p&gt;

&lt;p&gt;In an AI-assisted environment, that is a bigger deal than people often admit. If you want parallel work to stay coherent, you need somewhere more durable than a chat thread to define the intended behavior of a subsystem. These spec docs do that job.&lt;/p&gt;

&lt;h3&gt;
  
  
  Review artifacts as first-class outputs
&lt;/h3&gt;

&lt;p&gt;The review tree is even more revealing.&lt;/p&gt;

&lt;p&gt;Under &lt;code&gt;docs/reviews/&lt;/code&gt;, the repo includes named review domains for bootstrap, approval broker, batch extensions, collections management, document search, MCP server, remote MCP tools, responses enrichment, security hardening, SQLite store, TTS API, xAI API client, clippy pedantic cleanup, competitive features, and competitive gap analysis.&lt;/p&gt;

&lt;p&gt;The bootstrap bundle on &lt;code&gt;2026-04-05&lt;/code&gt; contains &lt;code&gt;6&lt;/code&gt; files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CONTRACT_DECLARATION.toml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;EVIDENCE_MATRIX.toml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;IMPLEMENTATION_DAG.toml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;README.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;REVIEW_READINESS.toml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TRACEABILITY.toml&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That combination is doing serious work.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CONTRACT_DECLARATION.toml&lt;/code&gt; states the promise.&lt;br&gt;
&lt;code&gt;IMPLEMENTATION_DAG.toml&lt;/code&gt; structures the work.&lt;br&gt;
&lt;code&gt;TRACEABILITY.toml&lt;/code&gt; ties implementation back to intent.&lt;br&gt;
&lt;code&gt;EVIDENCE_MATRIX.toml&lt;/code&gt; says what proof should exist.&lt;br&gt;
&lt;code&gt;REVIEW_READINESS.toml&lt;/code&gt; says when the artifact set is actually inspectable.&lt;/p&gt;

&lt;p&gt;I think that is the right abstraction. Review is not an afterthought at the end of coding. Reviewability is part of the deliverable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F976czjw5hl19oyqn57jx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F976czjw5hl19oyqn57jx.jpg" alt="Editorial illustration of DAG, evidence, and review artifacts" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why &lt;code&gt;IMPLEMENTATION_DAG.toml&lt;/code&gt; mattered so much
&lt;/h2&gt;

&lt;p&gt;Plenty of teams keep a task list. That is not the same thing.&lt;/p&gt;

&lt;p&gt;A DAG tells you which work units can move in parallel, which ones are blocked, and where integration risk sits. That is exactly what you need when multiple agents or reviewers are touching the same repo.&lt;/p&gt;

&lt;p&gt;This repo has &lt;code&gt;15&lt;/code&gt; implementation DAG files under &lt;code&gt;docs/reviews/&lt;/code&gt;. That tells me the DAG pattern was not a bootstrap stunt. It became part of the operating model.&lt;/p&gt;

&lt;p&gt;I think the big benefits are pretty concrete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;parallel work can move without guesswork about order&lt;/li&gt;
&lt;li&gt;write scopes stay smaller&lt;/li&gt;
&lt;li&gt;review can happen against a declared node instead of a vague feature story&lt;/li&gt;
&lt;li&gt;reintegration gets easier because dependencies stay visible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point matters more than people think. AI agents are good at filling in local structure. They are not naturally good at keeping a whole repo’s execution order in their head unless you give them an artifact that does that for them.&lt;/p&gt;
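&lt;p&gt;The scheduling benefit is easy to show. Given a declared dependency map, the set of unblocked units falls out directly. The node names below are invented, and grokrs' real DAG files are TOML rather than Python dicts:&lt;/p&gt;

```python
# Why a DAG beats a flat task list for parallel work: dependencies
# make the ready set computable instead of guessed. Node names are
# illustrative, not from an actual IMPLEMENTATION_DAG.toml.
deps = {
    "policy-engine": [],
    "tool-registry": [],
    "executor": ["policy-engine", "tool-registry"],
    "cli-commands": ["executor"],
}

def ready(done):
    # A unit is ready when it is not finished and every
    # dependency it declares is finished.
    return sorted(
        unit for unit, needs in deps.items()
        if unit not in done and all(n in done for n in needs)
    )

print(ready(set()))                               # two units can start in parallel
print(ready({"policy-engine", "tool-registry"}))  # now the executor is unblocked
```

&lt;p&gt;An agent handed a ready set like this does not need the whole repo's execution order in its head; the artifact carries it.&lt;/p&gt;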

&lt;h2&gt;
  
  
  This repo used AI agents and kept its shape
&lt;/h2&gt;

&lt;p&gt;The top-level environment gives away a lot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;.aivcs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.claude&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.sqry&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The adjacent &lt;code&gt;dag-toml-templates&lt;/code&gt; repo adds &lt;code&gt;.continue&lt;/code&gt;, &lt;code&gt;.factory&lt;/code&gt;, &lt;code&gt;.windsurf&lt;/code&gt;, and &lt;code&gt;.agents&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is an AI-native build environment. I think that is obvious.&lt;/p&gt;

&lt;p&gt;What is not obvious, and what is worth learning from, is that the repo did not let the agent workflow become the architecture.&lt;/p&gt;

&lt;p&gt;The architecture stayed in crates.&lt;br&gt;
The intent stayed in specs.&lt;br&gt;
The execution order stayed in DAGs.&lt;br&gt;
The proof stayed in evidence and traceability artifacts.&lt;/p&gt;

&lt;p&gt;That changes the role of the model. The model is not only there to generate code. It is expected to leave behind planning state, review state, and proof state too.&lt;/p&gt;

&lt;p&gt;That is a much healthier arrangement.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;sqry&lt;/code&gt; was a good fit for this codebase
&lt;/h2&gt;

&lt;p&gt;I checked the semantic index while reviewing the repo. The snapshot I started from exposed &lt;code&gt;102&lt;/code&gt; files and &lt;code&gt;33,260&lt;/code&gt; indexed symbols in the &lt;code&gt;grokrs&lt;/code&gt; workspace; the &lt;code&gt;2026-04-07&lt;/code&gt; reindex described earlier grew both counts.&lt;/p&gt;

&lt;p&gt;That matters because simple text search is not enough for some of the questions you actually want to ask in a codebase like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;where does policy gating happen&lt;/li&gt;
&lt;li&gt;which commands route through the same transport bridge&lt;/li&gt;
&lt;li&gt;which tools are exposed to the model&lt;/li&gt;
&lt;li&gt;where is session state persisted&lt;/li&gt;
&lt;li&gt;which tests cover a given path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This repo has enough structure that semantic navigation pays for itself. The DAG and review system also make semantic tooling more useful because the work is already decomposed into named slices.&lt;/p&gt;

&lt;h2&gt;
  
  
  The most interesting adjacent repo is &lt;code&gt;dag-toml-templates&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;If &lt;code&gt;grokrs&lt;/code&gt; shows the current operating model, &lt;code&gt;/srv/repos/internal/verivusai-labs/dag-toml-templates&lt;/code&gt; shows where that model is going.&lt;/p&gt;

&lt;p&gt;Its &lt;code&gt;README.md&lt;/code&gt; still presents a canonical versioned release surface for template packages. That matters. The file-based layer is not being discarded.&lt;/p&gt;

&lt;p&gt;At the same time, the research and design docs are explicit about the next move.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;research/DATABASE_REPLACEMENT_RESEARCH.md&lt;/code&gt; frames the problem as &lt;code&gt;TOML DAG Templates → Structured Database&lt;/code&gt;. It evaluates database candidates for the three process-control packages.&lt;/p&gt;

&lt;p&gt;The final ranking puts &lt;code&gt;SurrealDB 3.0&lt;/code&gt; first with &lt;code&gt;100/130&lt;/code&gt;. The recommended architecture keeps database state in SurrealDB while exporting and importing through &lt;code&gt;aivcs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then &lt;code&gt;docs/superpowers/specs/2026-04-06-v2-surrealdb-adoption-design.md&lt;/code&gt; makes the transition explicit. It defines v2 as a &lt;code&gt;SurrealDB-Backed Hybrid Template Pack&lt;/code&gt;. The important word there is &lt;code&gt;Hybrid&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I think that is the correct direction.&lt;/p&gt;

&lt;p&gt;The point is not to throw away TOML. The point is to stop asking static files to act like a live workflow database.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the &lt;code&gt;dagdb&lt;/code&gt; package already proves
&lt;/h3&gt;

&lt;p&gt;The implementation under &lt;code&gt;src/dagdb/&lt;/code&gt; is not theoretical.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;migrate.py&lt;/code&gt; includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;detect_toml_type()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;import_toml_file()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;import_dag()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;import_traceability()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;import_review_readiness()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;export_dag()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means the system can ingest the three major TOML package families into a database model and, for DAGs, reconstruct TOML-compatible output on the way back out.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;history.py&lt;/code&gt; adds &lt;code&gt;unit_history&lt;/code&gt; and &lt;code&gt;edge_history&lt;/code&gt;, plus &lt;code&gt;get_dag_state_at()&lt;/code&gt; for point-in-time reconstruction.&lt;/p&gt;
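&lt;p&gt;Point-in-time reconstruction of that sort reduces to replaying history rows up to a cutoff. A sketch, with a hypothetical row shape rather than the repo's actual schema:&lt;/p&gt;

```python
# Sketch of what a get_dag_state_at()-style function has to do:
# replay history rows up to a timestamp to rebuild unit state.
# Row shape (timestamp, unit, status) is hypothetical.
unit_history = [
    (1, "executor", "pending"),
    (2, "executor", "in_progress"),
    (3, "executor", "done"),
]

def state_at(history, ts):
    state = {}
    # Rows must be sorted by timestamp for the early break to be valid.
    for when, unit, status in history:
        if when > ts:
            break
        state[unit] = status  # later rows override earlier ones
    return state

print(state_at(unit_history, 2))  # {'executor': 'in_progress'}
```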

&lt;p&gt;&lt;code&gt;invariants.py&lt;/code&gt; classifies which invariants can live in the database and which still need application code.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;schema_migration.py&lt;/code&gt; adds &lt;code&gt;schema_migrations&lt;/code&gt;, apply and rollback behavior, and explicit migration-state handling.&lt;/p&gt;

&lt;p&gt;This is the part I find most persuasive. The move from TOML to structured state is not being described only in prose. It is being built as code.&lt;/p&gt;

&lt;h3&gt;
  
  
  The prototype results are honest
&lt;/h3&gt;

&lt;p&gt;The prototype evaluation is one of the better research-to-implementation bridges I have seen in a repo like this.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;research/SURREALDB_PROTOTYPE_EVALUATION.md&lt;/code&gt; reports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;34&lt;/code&gt; total tables and edge tables across DAG, traceability, and review-readiness data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;13&lt;/code&gt; of &lt;code&gt;29&lt;/code&gt; validator checks classified as &lt;code&gt;db_enforced&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;9&lt;/code&gt; of &lt;code&gt;29&lt;/code&gt; classified as &lt;code&gt;query_checkable&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;7&lt;/code&gt; of &lt;code&gt;29&lt;/code&gt; classified as &lt;code&gt;app_required&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;75.9%&lt;/code&gt; combined DB-plus-query coverage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;347&lt;/code&gt; collected tests&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;343&lt;/code&gt; passes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;4&lt;/code&gt; xfails tied to time-travel support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most useful finding in that report is the limitation section. The &lt;code&gt;VERSION&lt;/code&gt; clause is accepted syntactically. It does not actually perform historical time travel in the tested embedded setup. The repo says that plainly and uses explicit history tables as the workaround.&lt;/p&gt;

&lt;p&gt;I trust systems more when they write down their failed assumptions.&lt;/p&gt;

&lt;p&gt;The invariant classification is just as good. The repo does not pretend a database will magically solve graph algorithms. It says computed values like entry points, leaf nodes, critical path, and maximum parallelism still belong in application code.&lt;/p&gt;

&lt;p&gt;That is the split I would want too:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;database for durable state, relations, audit, and constraints&lt;/li&gt;
&lt;li&gt;application for graph algorithms and orchestration logic&lt;/li&gt;
&lt;/ul&gt;
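&lt;p&gt;Those computed values are cheap to express in application code, which is part of why the split is reasonable. A toy version of three of them, over an invented dependency map:&lt;/p&gt;

```python
# The values the repo keeps in application code rather than the
# database: entry points, leaf nodes, maximum parallelism. The
# dependency map is illustrative.
from collections import Counter

deps = {
    "a": [],
    "b": [],
    "c": ["a", "b"],
    "d": ["c"],
}

# Entry points: units with no dependencies.
entries = sorted(u for u, needs in deps.items() if not needs)

# Leaf nodes: units nothing else depends on.
needed = {n for needs in deps.values() for n in needs}
leaves = sorted(u for u in deps if u not in needed)

# Maximum parallelism: the widest topological level.
def level(u, memo={}):
    if u not in memo:
        memo[u] = 0 if not deps[u] else 1 + max(level(n) for n in deps[u])
    return memo[u]

max_parallelism = max(Counter(level(u) for u in deps).values())

print(entries, leaves, max_parallelism)  # ['a', 'b'] ['d'] 2
```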

&lt;h2&gt;
  
  
  What dev.to readers can copy from this repo
&lt;/h2&gt;

&lt;p&gt;I do not think most teams need this exact stack. I do think more teams should steal the shape of it.&lt;/p&gt;

&lt;p&gt;If you are building an AI-heavy internal tool, these are the pieces I would copy first.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Put the safety model in types
&lt;/h3&gt;

&lt;p&gt;Do not leave trust, path safety, and effect handling as loose runtime conventions. Make them visible in the type system and module boundaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Write subsystem specs before parallel AI work starts
&lt;/h3&gt;

&lt;p&gt;You do not need long specs. You do need named contract surfaces. A short subsystem spec beats a hundred lines of prompt history.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Use a DAG when work will be parallel
&lt;/h3&gt;

&lt;p&gt;A task list is fine for one human. A DAG is better when several workers, human or model, are moving at once.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Require proof artifacts, not just diffs
&lt;/h3&gt;

&lt;p&gt;This repo’s contract, evidence, traceability, and readiness files are doing something very practical. They force work to explain itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Keep a human-reviewable export surface
&lt;/h3&gt;

&lt;p&gt;The move in &lt;code&gt;dag-toml-templates&lt;/code&gt; is not file versus database. It is file plus database. I think that is the right model for most engineering systems with both human review and live workflow state.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I think is the real lesson
&lt;/h2&gt;

&lt;p&gt;The interesting result here is not that AI models can write a lot of code quickly. We already know that.&lt;/p&gt;

&lt;p&gt;The interesting result is that a repo can move quickly, use multiple agents, accumulate real functionality, and still stay reviewable if the team is strict about where meaning lives.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;grokrs&lt;/code&gt;, meaning lives in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the crate graph&lt;/li&gt;
&lt;li&gt;the spec docs&lt;/li&gt;
&lt;li&gt;the implementation DAGs&lt;/li&gt;
&lt;li&gt;the evidence and traceability artifacts&lt;/li&gt;
&lt;li&gt;the policy model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why the repo still feels like engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evidence note
&lt;/h2&gt;

&lt;p&gt;I grounded this article in repository-visible evidence available on &lt;code&gt;2026-04-06&lt;/code&gt;, including the root workspace manifest, crate layout, architecture and spec docs, dated review artifacts, command and module structure, semantic index results, and the adjacent &lt;code&gt;dag-toml-templates&lt;/code&gt; design and research documents.&lt;/p&gt;

&lt;p&gt;Some claims about the exact outer AI/VCS orchestration layer remain inference rather than direct confirmation, because those metadata surfaces are suggestive but not fully self-describing on their own.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>ai</category>
      <category>security</category>
      <category>cli</category>
    </item>
    <item>
      <title>How We Used AI Agents to Security-Audit an Open Source Project</title>
      <dc:creator>Verivus OSS Releases</dc:creator>
      <pubDate>Sun, 05 Apr 2026 07:30:20 +0000</pubDate>
      <link>https://dev.to/verivusossreleases/how-we-used-ai-agents-to-security-audit-an-open-source-project-2g41</link>
      <guid>https://dev.to/verivusossreleases/how-we-used-ai-agents-to-security-audit-an-open-source-project-2g41</guid>
      <description>&lt;p&gt;&lt;em&gt;Using sqry's code graph, parallel audit agents, and iterative Codex review to contribute security improvements to gstack.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Garry Tan open-sourced &lt;a href="https://github.com/garrytan/gstack" rel="noopener noreferrer"&gt;gstack&lt;/a&gt; on March 11, 2026. It is a CLI toolkit for Claude Code with a headless browser, Chrome extension, skill system, and telemetry layer. The project attracted 30+ PR authors within its first few weeks.&lt;/p&gt;

&lt;p&gt;We wanted to contribute something useful. Security review seemed like the right fit. A headless browser that spawns subprocesses and handles cookies has a large attack surface, and security work tends to fall to the bottom of every fast-moving project's priority list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you haven't read our earlier posts:&lt;/strong&gt; &lt;a href="https://github.com/verivus-oss/sqry" rel="noopener noreferrer"&gt;sqry&lt;/a&gt; is an AST-based code search tool. It parses code like a compiler, building a graph of functions, classes, imports, and call relationships across 35+ languages. &lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;llm-cli-gateway&lt;/a&gt; orchestrates multiple LLMs (Claude, Codex, Gemini) through a single MCP interface. The &lt;a href="https://medium.com/@wernerk/the-codex-review-gate-how-we-made-ai-agents-review-each-others-work-59e9ff5465f9" rel="noopener noreferrer"&gt;Codex review gate&lt;/a&gt; is our practice of requiring unconditional Codex approval before shipping.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Codebase
&lt;/h2&gt;

&lt;p&gt;At the time of our audit (late March 2026, against the &lt;code&gt;main&lt;/code&gt; branch as of March 30), gstack had about 47,000 symbols across 212 files in TypeScript, JavaScript, HTML, CSS, Shell, Ruby, JSON, and SQL. The browse subsystem's &lt;code&gt;handleWriteCommand&lt;/code&gt; function was roughly 715 lines with a complexity score of 58. The Chrome extension injects into every page the user visits. The sidebar agent spawns Claude subprocesses from a JSONL queue file.&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;grep "exec"&lt;/code&gt; on this codebase returns 60+ matches. None of them look obviously wrong. Security review requires understanding relationships between functions, not just finding keywords.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why grep Falls Short
&lt;/h2&gt;

&lt;p&gt;In a &lt;a href="https://medium.com/@wernerk/the-code-question-grep-cant-answer-057bfc8d7fe2" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, I described why structural code search matters for this kind of work.&lt;/p&gt;

&lt;p&gt;Say you want to find every path from user input to a dangerous sink like &lt;code&gt;Bun.spawn()&lt;/code&gt;. grep finds the spawn calls. It does not tell you which functions call those functions, which HTTP endpoints call &lt;em&gt;those&lt;/em&gt; functions, or whether any validation sits between the endpoint and the spawn.&lt;/p&gt;

&lt;p&gt;sqry made this practical. For gstack, it built a graph of 46,837 nodes and 39,083 edges in 280ms. With all 36 language plugins enabled (including high-cost plugins like JSON and ServiceNow XML), the full graph captures 55,365 raw edges across 212 files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;sqry index . --force --include-high-cost

Files indexed:  212
Symbols:        46,837
Edges:          39,083 canonical (55,365 raw)
Plugins:        36 active
Build time:     280ms
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
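&lt;p&gt;The question a call graph answers can be sketched as a path search from an entry point to a sink. The edges below are invented for illustration; sqry derives the real ones from the AST:&lt;/p&gt;

```python
# The question grep cannot answer: which paths lead from an HTTP
# endpoint to a dangerous sink, and through which functions?
# Edges here are invented; sqry builds the real graph from the AST.
calls = {
    "POST /write": ["handleWriteCommand"],
    "handleWriteCommand": ["resolvePath", "runHelper"],
    "runHelper": ["Bun.spawn"],
    "resolvePath": [],
}

def paths_to(graph, start, sink, trail=None):
    # Depth-first enumeration of call paths. No cycle guard is
    # needed for this acyclic toy graph; a real tool tracks visits.
    trail = (trail or []) + [start]
    if start == sink:
        yield trail
        return
    for callee in graph.get(start, []):
        yield from paths_to(graph, callee, sink, trail)

for path in paths_to(calls, "POST /write", "Bun.spawn"):
    print(" -> ".join(path))
```

&lt;p&gt;Each printed path is a concrete input-to-sink route a reviewer can check for validation, which is exactly the relationship a keyword search cannot surface.&lt;/p&gt;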



&lt;h2&gt;
  
  
  Round 1: 10 Findings, 3 LLMs
&lt;/h2&gt;

&lt;p&gt;Our first audit in March used three LLMs in separate roles. Claude and Codex each independently found overlapping but non-identical sets of issues. Gemini then verified all findings against source code. The total was 10 unique security findings across gstack's browse server, Chrome extension, design CLI, and telemetry layer. We submitted &lt;a href="https://github.com/garrytan/gstack/pull/664" rel="noopener noreferrer"&gt;PR #664&lt;/a&gt; with fixes and filed 10 public security issues (#665-#670, #672-#675). We disclosed publicly because gstack is a developer tool running locally, not a production service handling user data — the risk profile favors transparency over coordinated disclosure.&lt;/p&gt;

&lt;p&gt;What gave us confidence these were real: three other contributors (stedfn, Gonzih, and mehmoodosman) independently found at least 6 of the same issues through separate analysis. Based on the public timeline, their PRs were filed after our issues and showed no references to our reports, suggesting independent discovery. Convergence from different methods and different people is strong validation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Round 2: 20 More Findings
&lt;/h2&gt;

&lt;p&gt;For the second audit, we expanded the approach. We dispatched 4 parallel audit agents instead of manually querying sqry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent 1&lt;/strong&gt;: &lt;code&gt;server.ts&lt;/code&gt;, covering HTTP endpoints, auth, and CORS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent 2&lt;/strong&gt;: &lt;code&gt;write-commands.ts&lt;/code&gt;, the highest-complexity function, covering file ops and cookie handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent 3&lt;/strong&gt;: &lt;code&gt;meta-commands.ts&lt;/code&gt;, covering command parsing, state management, and frame targeting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent 4&lt;/strong&gt;: &lt;code&gt;extension/&lt;/code&gt;, covering the Chrome extension sidepanel, inspector, and background worker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each agent had full sqry MCP access with instructions to look for issues beyond the 10 we had already reported. They returned 25 raw findings. After cross-referencing against 20+ existing community issues and the maintainer's own security work (he had already landed two security-focused PRs), 16 were new. Four more gaps turned up during implementation review. The severity classifications below are ours, based on our assessment of impact and prerequisites — the maintainer may classify them differently.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Subtle but Serious Finding
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# bin/gstack-learnings-search, lines 46-52&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FILES&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null | bun &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"
const type = '&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TYPE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;';
const query = '&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;QUERY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;'.toLowerCase();
const limit = &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;LIMIT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;;
const slug = '&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SLUG&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;';
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bash variables are interpolated directly into JavaScript string literals via &lt;code&gt;bun -e&lt;/code&gt;. A branch name containing a single quote, like &lt;code&gt;fix'; process.exit(1); //&lt;/code&gt;, would break out of the JS string and execute arbitrary code. Easy to write, hard to spot in review.&lt;/p&gt;

&lt;p&gt;The fix: pass parameters via environment variables instead of string interpolation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FILES&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;GSTACK_FILTER_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TYPE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;GSTACK_FILTER_QUERY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$QUERY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;GSTACK_FILTER_LIMIT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LIMIT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bun &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"
const type = process.env.GSTACK_FILTER_TYPE || '';
const query = (process.env.GSTACK_FILTER_QUERY || '').toLowerCase();
const limit = parseInt(process.env.GSTACK_FILTER_LIMIT || '10', 10) || 10;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Environment variables are never interpreted as code. The injection vector disappears.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Finding sqry Made Possible
&lt;/h3&gt;

&lt;p&gt;sqry's &lt;code&gt;find_cycles&lt;/code&gt; tool detected a mutual recursion between &lt;code&gt;switchChatTab&lt;/code&gt; and &lt;code&gt;pollChat&lt;/code&gt; in the Chrome extension's sidepanel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;switchChatTab -&amp;gt; pollChat -&amp;gt; switchChatTab (cycle depth: 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;pollChat&lt;/code&gt; fetches the server's active tab ID. If it differs from the client's, it calls &lt;code&gt;switchChatTab&lt;/code&gt;. &lt;code&gt;switchChatTab&lt;/code&gt; sets state and immediately calls &lt;code&gt;pollChat&lt;/code&gt;. If the server keeps returning a different tab ID during rapid switching, this creates unbounded stack recursion.&lt;/p&gt;

&lt;p&gt;grep alone will not reveal this relationship. The bug lives in the interaction between two functions, and that interaction only becomes visible in the call graph.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Full List
&lt;/h3&gt;

&lt;p&gt;We classified findings on a four-level scale: HIGH means an attacker can execute arbitrary code or exfiltrate data with minimal prerequisites. MED-HIGH means significant impact that requires local access or a specific precondition. MED means the issue requires local access or specific conditions, or produces limited impact. LOW covers hardening gaps and defense-in-depth improvements.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;Finding&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Shell injection via bash-to-JS interpolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;MED-HIGH&lt;/td&gt;
&lt;td&gt;Queue file permissions allow local prompt injection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/health&lt;/code&gt; endpoint exposes user activity without auth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;ReDoS via &lt;code&gt;new RegExp(userInput)&lt;/code&gt; in frame targeting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;chain&lt;/code&gt; command bypasses watch-mode write guard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cookie-import&lt;/code&gt; allows cross-domain cookie planting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;CSS values unvalidated at 4 injection points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;Session directory traversal via crafted &lt;code&gt;active.json&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;responsive&lt;/code&gt; screenshots skip path validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;validateOutputPath&lt;/code&gt; uses &lt;code&gt;path.resolve&lt;/code&gt;, not &lt;code&gt;realpathSync&lt;/code&gt;*&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;state load&lt;/code&gt; navigates to unvalidated URLs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;DOM serialization round-trip enables XSS on tab switch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;switchChatTab&lt;/code&gt;/&lt;code&gt;pollChat&lt;/code&gt; mutual recursion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;MED&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cookie-import-browser --domain&lt;/code&gt; accepts unvalidated input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15-20&lt;/td&gt;
&lt;td&gt;LOW&lt;/td&gt;
&lt;td&gt;Info disclosure, timeout handling, bounds validation, prompt injection surface&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;*Finding 10 is a common pattern worth highlighting:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// BEFORE: resolves logically, symlinks pass through&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resolved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// /tmp/safe -&amp;gt; still "/tmp/safe"&lt;/span&gt;

&lt;span class="c1"&gt;// AFTER: resolves physically, symlinks followed to real target&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resolved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;realpathSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// /tmp/safe -&amp;gt; "/etc/shadow" (blocked!)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A symlink at &lt;code&gt;/tmp/safe&lt;/code&gt; pointing to &lt;code&gt;/etc&lt;/code&gt; would pass &lt;code&gt;path.resolve&lt;/code&gt; validation but fail &lt;code&gt;realpathSync&lt;/code&gt;, because the real path is outside the safe directory.&lt;/p&gt;
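&lt;p&gt;A hedged sketch of the corrected check. &lt;code&gt;realpathSync&lt;/code&gt; throws on paths that do not exist yet, so this version resolves the existing parent directory and re-attaches the file name; the function name follows the article, but the body is an illustration, not gstack's actual code:&lt;/p&gt;

```typescript
import { realpathSync } from "node:fs";
import path from "node:path";

// Illustrative validateOutputPath using physical resolution.
// The safeDirs allowlist parameter is an assumption for this sketch.
function validateOutputPath(filePath: string, safeDirs: string[]): string {
  // Resolve the existing parent directory physically, following symlinks,
  // then re-attach the file name (the output file may not exist yet).
  const dir = realpathSync(path.dirname(path.resolve(filePath)));
  const resolved = path.join(dir, path.basename(filePath));
  const ok = safeDirs.some((safe) => {
    const safeReal = realpathSync(safe); // normalize the allowlist too
    return resolved === safeReal || resolved.startsWith(safeReal + path.sep);
  });
  if (!ok) {
    throw new Error(`refusing to write outside safe directories: ${resolved}`);
  }
  return resolved;
}
```

&lt;p&gt;Normalizing both sides through &lt;code&gt;realpathSync&lt;/code&gt; also sidesteps the macOS &lt;code&gt;/tmp&lt;/code&gt; symlink problem discussed later in this post.&lt;/p&gt;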

&lt;h2&gt;
  
  
  The Codex Review Gate
&lt;/h2&gt;

&lt;p&gt;In a &lt;a href="https://medium.com/@wernerk/the-codex-review-gate-how-we-made-ai-agents-review-each-others-work-59e9ff5465f9" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, I described how we use Codex as a mandatory review gate. Unconditional approval or the work does not ship. Codex earned this role through specificity. Where a generic reviewer might say "consider improving error handling," Codex pinpoints "the catch block on line 47 swallows errors silently." It also has a low false-positive rate, which keeps the gate credible over time.&lt;/p&gt;

&lt;p&gt;For this security plan, the work went through &lt;strong&gt;9 rounds&lt;/strong&gt; of Codex review before approval. That says more about our work than about the tool. Three examples of what it caught:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Round 2&lt;/strong&gt;: Our queue validator used &lt;code&gt;string&lt;/code&gt; for &lt;code&gt;tabId&lt;/code&gt; when the actual writer emits &lt;code&gt;number&lt;/code&gt;. A type mismatch that would have caused the validator to reject every real queue entry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Round 5&lt;/strong&gt;: &lt;code&gt;null&lt;/code&gt; values (which the real writer produces for optional fields) would be rejected by our schema. The validator was correct in theory but wrong against the actual data format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Round 8&lt;/strong&gt;: Our test extracted a 1500-character slice from the source file to validate against. That slice bled into adjacent functions, meaning the test could pass even without the fix being applied. The final solution: a brace-walking function body extractor that isolates exactly the target function.&lt;/li&gt;
&lt;/ul&gt;
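&lt;p&gt;The round 8 fix can be sketched as a small brace counter. This is an illustrative reconstruction, not the test helper's actual code, and it deliberately ignores braces inside strings and comments:&lt;/p&gt;

```typescript
// Brace-walking extractor: isolate exactly one function body instead of
// taking a fixed-size slice that can bleed into adjacent functions.
function extractFunctionBody(source: string, name: string): string {
  const start = source.indexOf(`function ${name}`);
  if (start === -1) throw new Error(`function ${name} not found`);
  const open = source.indexOf("{", start);
  if (open === -1) throw new Error(`no body found for ${name}`);
  let depth = 0;
  for (let i = open; i < source.length; i++) {
    if (source[i] === "{") depth++;
    else if (source[i] === "}") {
      depth--;
      if (depth === 0) return source.slice(open, i + 1); // matched close
    }
  }
  throw new Error(`unbalanced braces in ${name}`);
}
```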

&lt;p&gt;Each round made the plan more precise. The full 9-round breakdown is in the &lt;a href="https://github.com/garrytan/gstack/pull/806" rel="noopener noreferrer"&gt;PR #806&lt;/a&gt; discussion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: Subagent-Driven Development
&lt;/h2&gt;

&lt;p&gt;With an approved plan, we dispatched one implementation subagent per task, 18 tasks total. Each subagent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the specific source files&lt;/li&gt;
&lt;li&gt;Created failing tests&lt;/li&gt;
&lt;li&gt;Implemented the fix&lt;/li&gt;
&lt;li&gt;Verified tests pass&lt;/li&gt;
&lt;li&gt;Committed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A mid-implementation code review by a separate review agent caught 4 additional gaps we had missed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;applyStyle&lt;/code&gt; in the extension was missing the same CSS validation added to 3 other injection points&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;snapshot.ts&lt;/code&gt; still used the old &lt;code&gt;path.resolve&lt;/code&gt; pattern&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stateFile&lt;/code&gt; in queue entries had no path traversal check&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cookie-import&lt;/code&gt;'s read path validation used the old pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All fixed before continuing. That is why you review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Results
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Security regression tests: 119 pass, 0 fail [47ms]
E2E evals (Docker + Chromium): 33 pass, 0 regressions
Previously-failing browse tests: all 3 now pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The E2E evals ran inside a Docker container (Ubuntu 24.04, Chromium 145, Playwright 1.58.2, &lt;code&gt;--cap-add SYS_ADMIN&lt;/code&gt; for the Chromium sandbox). One test outside the security suite (&lt;code&gt;qa-bootstrap&lt;/code&gt;) failed due to test infrastructure — it is not included in the 33 count above.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Landed
&lt;/h2&gt;

&lt;p&gt;On April 6, the maintainer cherry-picked both our first round (&lt;a href="https://github.com/garrytan/gstack/pull/664" rel="noopener noreferrer"&gt;PR #664&lt;/a&gt;) and second round (&lt;a href="https://github.com/garrytan/gstack/pull/806" rel="noopener noreferrer"&gt;PR #806&lt;/a&gt;) onto the &lt;code&gt;garrytan/security-wave-5&lt;/code&gt; branch with co-author credit. They are part of &lt;a href="https://github.com/garrytan/gstack/pull/847" rel="noopener noreferrer"&gt;PR #847&lt;/a&gt;, which bundles fixes from 8 community PRs across 4 contributors. That PR is open and under review at time of writing.&lt;/p&gt;

&lt;p&gt;This did not happen immediately. On April 5, the maintainer merged &lt;a href="https://github.com/garrytan/gstack/pull/810" rel="noopener noreferrer"&gt;PR #810&lt;/a&gt; ("security wave 1"), which cherry-picked fixes from Gonzih and garagon — contributors who had independently found several of the same issues we reported in our round 1 issues (#665-#670, #672-#675), filed on March 30. At that point our PRs were still open without comment.&lt;/p&gt;

&lt;p&gt;We flagged four gaps in that initial wave:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;validateOutputPath&lt;/code&gt; was only fixed in one of three copies.&lt;/strong&gt; The identical vulnerable function in &lt;code&gt;meta-commands.ts&lt;/code&gt; and inline validation in &lt;code&gt;snapshot.ts&lt;/code&gt; still used &lt;code&gt;path.resolve&lt;/code&gt; without &lt;code&gt;realpathSync&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The fix broke on macOS.&lt;/strong&gt; &lt;code&gt;SAFE_DIRECTORIES&lt;/code&gt; contained &lt;code&gt;/tmp&lt;/code&gt;, but on macOS &lt;code&gt;/tmp&lt;/code&gt; is a symlink to &lt;code&gt;/private/tmp&lt;/code&gt;. &lt;code&gt;realpathSync&lt;/code&gt; resolves through it, causing legitimate screenshots to be rejected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No queue entry schema validation.&lt;/strong&gt; File permissions were added, but queue entry contents were not validated against type checks or path traversal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/health&lt;/code&gt; still leaked user activity.&lt;/strong&gt; The unauthenticated response returned the user's current URL and sidebar AI message text.&lt;/li&gt;
&lt;/ol&gt;
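&lt;p&gt;Gap 3 is the kind of check that is cheap to add. A hypothetical queue entry validator, reflecting the Codex findings from earlier in this post (&lt;code&gt;tabId&lt;/code&gt; is a number, optional fields may be &lt;code&gt;null&lt;/code&gt;) and rejecting path traversal in &lt;code&gt;stateFile&lt;/code&gt;; the schema itself is an assumption, not gstack's actual format:&lt;/p&gt;

```typescript
// Hypothetical queue entry schema. Field names follow the article;
// the exact shape and rules are illustrative.
interface QueueEntry {
  tabId: number; // the real writer emits a number, not a string
  command: string;
  stateFile?: string | null; // optional fields may legitimately be null
}

function validateQueueEntry(raw: unknown): QueueEntry {
  if (typeof raw !== "object" || raw === null) throw new Error("not an object");
  const e = raw as Record<string, unknown>;
  if (typeof e.tabId !== "number") throw new Error("tabId must be a number");
  if (typeof e.command !== "string") throw new Error("command must be a string");
  if (e.stateFile !== undefined && e.stateFile !== null) {
    if (typeof e.stateFile !== "string") throw new Error("stateFile must be a string");
    // reject traversal and absolute paths in queue-supplied file names
    // (treating absolute paths as invalid is our assumption for this sketch)
    if (e.stateFile.includes("..") || e.stateFile.startsWith("/")) {
      throw new Error("stateFile must be a relative path without traversal");
    }
  }
  return e as unknown as QueueEntry;
}
```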

&lt;p&gt;All four gaps are addressed in the security wave 5 PR. The maintainer included garagon's #820 (symlink resolution in meta-commands), our queue validation and &lt;code&gt;/health&lt;/code&gt; fixes from #806, and the full set of CSS injection guards, cookie domain validation, reentrancy guards, and SIGKILL escalation across both our rounds.&lt;/p&gt;

&lt;p&gt;The PR summary lists 20 security fixes with 750+ lines of new regression tests, attributed jointly to "@mr-k-man, @garagon." Most of those 20 fixes came from our two PRs (#664 and #806). garagon contributed three — shell injection env vars (#819), meta-commands symlink resolution (#820), and upload path validation (#821) — two of which address issues we originally reported. The commit history in #847 shows separate cherry-picks for each source PR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The timeline is common in open source security work.&lt;/strong&gt; We filed issues and PRs on March 30. Other contributors independently found overlapping issues. The maintainer triaged and cherry-picked fixes in waves over 7 days, starting with the most urgent. Our work was picked up last but included completely, with co-author attribution. Thorough reports with working patches tend to get recognized, even when the initial response is silence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Toolkit
&lt;/h2&gt;

&lt;p&gt;Everything described here uses two open-source tools:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/verivus-oss/sqry" rel="noopener noreferrer"&gt;sqry&lt;/a&gt;&lt;/strong&gt;: AST-based semantic code search. Builds a graph of symbols and relationships across 35+ languages. Exposes 34 MCP tools for AI agents to navigate code structurally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;llm-cli-gateway&lt;/a&gt;&lt;/strong&gt;: Multi-LLM orchestration via MCP. Routes requests through Claude, Codex, and Gemini with session continuity, async job management, and approval gates.&lt;/p&gt;

&lt;p&gt;Both are MIT-licensed. sqry runs entirely locally. llm-cli-gateway runs locally but routes requests to remote LLM APIs (Claude, Codex, Gemini).&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Independent convergence validates methodology.&lt;/strong&gt; When other contributors find the same issues through completely different methods, you can trust the results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rigorous review improves your own work most of all.&lt;/strong&gt; 9 rounds of Codex review sounds like a lot. It was. Every round caught something real. The discipline of submitting to review, and actually fixing what is found, is where the quality comes from.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structural search finds what text search misses.&lt;/strong&gt; The &lt;code&gt;switchChatTab&lt;/code&gt;/&lt;code&gt;pollChat&lt;/code&gt; recursion, the &lt;code&gt;validateOutputPath&lt;/code&gt; symlink bypass, the CSS injection across 4 separate code paths — these are relationship issues. Understanding code structure is different from searching code text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security review is a good way to serve the open source community.&lt;/strong&gt; Every maintainer has more feature requests than they can handle. A thorough security review with fixes, tests, and documentation is work that helps everyone who uses the project. We are grateful gstack is open source and that we could contribute.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The full security audit report, implementation plan, and all test results are in &lt;a href="https://github.com/garrytan/gstack/pull/806" rel="noopener noreferrer"&gt;PR #806&lt;/a&gt;. The round 1 report is in &lt;a href="https://github.com/garrytan/gstack/pull/664" rel="noopener noreferrer"&gt;PR #664&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;sqry: &lt;a href="https://github.com/verivus-oss/sqry" rel="noopener noreferrer"&gt;github.com/verivus-oss/sqry&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;llm-cli-gateway: &lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;github.com/verivus-oss/llm-cli-gateway&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>opensource</category>
      <category>ai</category>
      <category>typescript</category>
    </item>
    <item>
      <title>How to Set Up Multi-LLM Code Review with Claude, Codex, and Gemini</title>
      <dc:creator>Verivus OSS Releases</dc:creator>
      <pubDate>Tue, 31 Mar 2026 02:55:56 +0000</pubDate>
      <link>https://dev.to/verivusossreleases/how-to-set-up-multi-llm-code-review-with-claude-codex-and-gemini-5d8h</link>
      <guid>https://dev.to/verivusossreleases/how-to-set-up-multi-llm-code-review-with-claude-codex-and-gemini-5d8h</guid>
      <description>&lt;p&gt;Every LLM has blind spots. Claude is strong on architecture and design patterns. Codex catches logic bugs and missing error handling. Gemini is thorough on security issues and edge cases. Using just one reviewer means you are only getting one perspective.&lt;/p&gt;

&lt;p&gt;This tutorial walks through setting up &lt;strong&gt;llm-cli-gateway&lt;/strong&gt; -- an MCP server that wraps the Claude Code, Codex, and Gemini CLIs -- and running a parallel code review that combines all three.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;You need the CLI tools installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Claude Code&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @anthropic-ai/claude-code

&lt;span class="c"&gt;# Codex&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @openai/codex
codex login

&lt;span class="c"&gt;# Gemini&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You do not need all three. The gateway works with whichever CLIs you have installed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Install the Gateway
&lt;/h2&gt;

&lt;p&gt;Add it to your MCP client configuration. If you use Claude Code, edit &lt;code&gt;~/.claude/settings.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"llm-gateway"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llm-cli-gateway"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire setup. The gateway discovers your installed CLIs automatically via PATH resolution (including &lt;code&gt;~/.local/bin&lt;/code&gt; and NVM paths).&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Verify Your Setup
&lt;/h2&gt;

&lt;p&gt;Once connected, confirm which CLIs are available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;list_models&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns the available models for each detected CLI. If a CLI is not installed, it will not appear in the output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Run a Parallel Code Review
&lt;/h2&gt;

&lt;p&gt;Here is the core workflow. You send the same codebase to all three LLMs, each with a prompt tuned to its strengths.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude -- Architecture and Quality
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;claude_request(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Review the changes in src/auth/ for architecture, design patterns, maintainability, and documentation gaps. Read the files directly. Provide specific line numbers and suggested fixes."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"optimizePrompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"optimizeResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Codex -- Logic and Correctness
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;codex_request(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Analyze src/auth/ for logic bugs, off-by-one errors, missing error handling, race conditions, and test coverage gaps. Read the files directly. Rate each finding: critical, high, medium, or low."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fullAuto"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"optimizePrompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"optimizeResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Gemini -- Security and Edge Cases
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;gemini_request(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Security audit of src/auth/: check for injection vulnerabilities, authentication bypasses, data leaks, OWASP Top 10 violations, and crash-causing edge cases. Read the files directly."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gemini-2.5-pro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"optimizePrompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"optimizeResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In an MCP client like Claude Code, you can fire all three of these as parallel tool calls in a single turn.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Handle Long-Running Reviews
&lt;/h2&gt;

&lt;p&gt;Code reviews on large files can take over a minute. The gateway handles this transparently.&lt;/p&gt;

&lt;p&gt;Any sync request that exceeds 45 seconds automatically becomes an async job. Instead of timing out, you get back a job reference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deferred"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"jobId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc-123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Running in background. Poll with llm_job_status."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;llm_job_status(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"jobId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc-123"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When status is &lt;code&gt;completed&lt;/code&gt;, fetch the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;llm_job_result(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"jobId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc-123"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a review is stuck, cancel it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;llm_job_cancel(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"jobId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc-123"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
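&lt;p&gt;If you script against the gateway rather than calling the tools interactively, the deferred-job pattern reduces to a poll loop. A sketch with the status and result calls injected as plain functions so the loop is testable; &lt;code&gt;waitForJob&lt;/code&gt; and the status strings beyond &lt;code&gt;completed&lt;/code&gt; are our assumptions, not part of the gateway:&lt;/p&gt;

```typescript
// Poll a deferred job until it completes, then fetch the result.
// status/result stand in for llm_job_status / llm_job_result calls.
async function waitForJob<T>(
  jobId: string,
  status: (id: string) => Promise<string>,
  result: (id: string) => Promise<T>,
  intervalMs = 2000,
  maxAttempts = 60,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const s = await status(jobId);
    if (s === "completed") return result(jobId); // done: fetch the review
    if (s === "failed" || s === "cancelled") throw new Error(`job ${jobId}: ${s}`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs)); // still running
  }
  throw new Error(`job ${jobId} did not finish after ${maxAttempts} polls`);
}
```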



&lt;h2&gt;
  
  
  Step 5: Synthesize the Results
&lt;/h2&gt;

&lt;p&gt;Once all three reviews come back, combine them. Here is a structured approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Deduplicate.&lt;/strong&gt; Multiple LLMs will often flag the same issue. Merge these and note which LLMs agreed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prioritize.&lt;/strong&gt; Critical findings first, then high, medium, low. If two or more LLMs flag the same thing as critical, it almost certainly is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-validate unique findings.&lt;/strong&gt; When only one LLM flags something, verify it before acting on it. In our experience, security findings unique to Gemini are usually real, while style complaints raised by a single LLM are usually noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Categorize.&lt;/strong&gt; Group by Security, Correctness, Performance, and Maintainability.&lt;/li&gt;
&lt;/ol&gt;
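&lt;p&gt;The deduplicate-and-prioritize steps are mechanical enough to script. A rough sketch follows; the finding shape &lt;code&gt;{ key, severity, reviewer }&lt;/code&gt; is an assumption for illustration, not a gateway output format:&lt;/p&gt;

```javascript
// Merge findings from several reviewers: deduplicate by issue key,
// record which reviewers agreed, and sort by severity.
// The finding shape { key, severity, reviewer } is illustrative only.
const RANK = { critical: 0, high: 1, medium: 2, low: 3 };

function synthesize(findings) {
  const merged = new Map();
  for (const f of findings) {
    const existing = merged.get(f.key);
    if (existing) {
      existing.reviewers.push(f.reviewer);
      // Keep the most severe rating any reviewer assigned.
      if (RANK[existing.severity] > RANK[f.severity]) existing.severity = f.severity;
    } else {
      merged.set(f.key, { key: f.key, severity: f.severity, reviewers: [f.reviewer] });
    }
  }
  // Critical first; within a severity band, more reviewer agreement first.
  return Array.from(merged.values()).sort((a, b) => {
    if (RANK[a.severity] !== RANK[b.severity]) return RANK[a.severity] - RANK[b.severity];
    return b.reviewers.length - a.reviewers.length;
  });
}
```

&lt;p&gt;Sorting by agreement within a severity band surfaces the "two or more LLMs flagged this" cases first, matching the prioritization rule above.&lt;/p&gt;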

&lt;p&gt;The synthesized summary should look something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Code Review Summary&lt;/span&gt;

&lt;span class="gu"&gt;### Critical (must fix)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; SQL injection in login handler (line 47) -- found by Gemini, confirmed by Codex

&lt;span class="gu"&gt;### High&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Missing error handling on token refresh (line 112) -- found by Codex
&lt;span class="p"&gt;-&lt;/span&gt; Session fixation vulnerability (line 89) -- found by Gemini

&lt;span class="gu"&gt;### Medium&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Duplicated validation logic across handlers -- found by Claude
&lt;span class="p"&gt;-&lt;/span&gt; No rate limiting on auth endpoints -- found by Gemini, noted by Claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 6: Fix and Verify
&lt;/h2&gt;

&lt;p&gt;Send the consolidated findings back through Codex for fixes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;codex_request(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Fix the following issues in src/auth/:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;1. [Critical] SQL injection in login handler, line 47 - use parameterized queries&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;2. [High] Missing error handling on token refresh, line 112&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;3. [High] Session fixation vulnerability, line 89 - regenerate session on login&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Apply fixes and update tests."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fullAuto"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"optimizePrompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run your test suite. If the tests pass, you have completed a review cycle that caught issues no single LLM would have surfaced on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Sessions for Multi-Turn Reviews
&lt;/h2&gt;

&lt;p&gt;For larger reviews that require back-and-forth, create sessions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;session_create(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cli"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Auth module review"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"setAsActive"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Subsequent &lt;code&gt;claude_request&lt;/code&gt; calls with &lt;code&gt;continueSession: true&lt;/code&gt; will use the Claude CLI's &lt;code&gt;--continue&lt;/code&gt; flag, maintaining real conversation context. Gemini sessions use &lt;code&gt;--resume&lt;/code&gt; for the same effect.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;claude_request(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Look at the token refresh logic more carefully. Is the retry backoff correct?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"continueSession"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Optional: Approval Gates
&lt;/h2&gt;

&lt;p&gt;For high-risk operations, enable approval gates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;codex_request(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Refactor the authentication module"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fullAuto"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"approvalStrategy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp_managed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"approvalPolicy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"strict"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gateway scores the operation's risk and records an approval decision before execution. Review past decisions with &lt;code&gt;approval_list()&lt;/code&gt;.&lt;/p&gt;
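&lt;p&gt;The scoring itself is internal to the gateway, but the pattern is easy to picture: score the operation, compare against the policy's threshold, and record the decision. A toy illustration of that pattern, not the gateway's actual logic:&lt;/p&gt;

```javascript
// Toy approval gate: score an operation and record the decision.
// Not llm-cli-gateway's implementation; purely illustrative.
const decisions = [];

function approve(operation, policy) {
  // Crude risk score: destructive verbs and unattended execution raise it.
  let risk = 0;
  if (/delete|drop|rm |force/i.test(operation.prompt)) risk += 2;
  if (operation.fullAuto) risk += 1;
  // A strict policy tolerates almost no risk.
  const threshold = policy === "strict" ? 1 : 3;
  const approved = threshold > risk;
  decisions.push({ prompt: operation.prompt, risk, approved });
  return approved;
}
```

&lt;p&gt;The &lt;code&gt;decisions&lt;/code&gt; log plays the role that &lt;code&gt;approval_list()&lt;/code&gt; plays in the gateway: an audit trail you can review after the fact.&lt;/p&gt;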

&lt;h2&gt;
  
  
  What This Is (and Is Not)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;llm-cli-gateway wraps CLI binaries, not APIs.&lt;/strong&gt; It spawns &lt;code&gt;claude&lt;/code&gt;, &lt;code&gt;codex&lt;/code&gt;, and &lt;code&gt;gemini&lt;/code&gt; as child processes. You get the full CLI experience -- tool use, sandboxing, file access, your existing authentication and billing. There is no API key to configure for the gateway itself.&lt;/p&gt;

&lt;p&gt;This also means it is not an API proxy in the LiteLLM mold: it cannot run in a cloud environment unless the CLIs are installed there. It is designed for local development machines where you already have these tools.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Without consultation, plans are frustrated, but with many counselors they succeed."&lt;/em&gt; -- Proverbs 15:22&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;npm:&lt;/strong&gt; &lt;a href="https://npmjs.com/package/llm-cli-gateway" rel="noopener noreferrer"&gt;llm-cli-gateway&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;verivus-oss/llm-cli-gateway&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Registry:&lt;/strong&gt; &lt;code&gt;io.github.verivus-oss/llm-cli-gateway&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT, by &lt;a href="https://github.com/verivus-oss" rel="noopener noreferrer"&gt;VerivusAI Labs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>mcp</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
