Verivus OSS Releases

How We Built a Safety-First Rust Agent CLI in Two Days Without Letting the Codebase Turn to Mush

The short version

I think most AI-assisted software fails in one of two ways.

The first failure mode is obvious. The code is sloppy, the boundaries are fuzzy, and the whole thing feels like a transcript that got committed by accident.

The second failure mode is more subtle. The code is fine for a demo, but the repo has no durable planning model, no review trail, and no way to explain why one subsystem looks the way it does. A week later, nobody wants to touch it.

grokrs avoided both.

This repo is a Rust-only scaffold for a Grok-oriented agent CLI. It is safety-first by design. More importantly, it was built fast without taking the usual shortcuts that make a codebase hard to trust. The artifact trail shows a concentrated implementation burst across 2026-04-05 and 2026-04-06, but the result still has clear crate boundaries, deny-by-default policy handling, machine-readable planning, and a review system that is stronger than what I usually see in projects with much longer schedules.

I want to walk through why that happened, because I think the process is as interesting as the software.

I also used grokrs itself to generate the article images and submit this draft to Dev.to. That felt like a fair test of whether the CLI is already useful outside its own repository.

What was actually built

At the workspace level, grokrs is not a single crate with a heroic main.rs. The root Cargo.toml defines eight members:

  • grokrs-core
  • grokrs-cap
  • grokrs-policy
  • grokrs-session
  • grokrs-tool
  • grokrs-cli
  • grokrs-api
  • grokrs-store

That split matters. Each crate has a job. Each crate also imposes a limit.
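
For readers who have not seen a multi-crate workspace, the member list implies a root manifest along these lines. This is a sketch: the crates/ directory layout and the resolver setting are my assumptions, not copied from the repo.

```toml
# Hypothetical root Cargo.toml implied by the eight listed members;
# actual paths and settings in the repository may differ.
[workspace]
resolver = "2"
members = [
    "crates/grokrs-core",
    "crates/grokrs-cap",
    "crates/grokrs-policy",
    "crates/grokrs-session",
    "crates/grokrs-tool",
    "crates/grokrs-cli",
    "crates/grokrs-api",
    "crates/grokrs-store",
]
```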

The repository currently has 130 Rust source files under crates/, 55,963 total Rust source lines, and about 41,990 Rust code lines once blank lines and comment-only lines are stripped out. A quick scan of test annotations turns up 1,647 #[test] and #[tokio::test] markers in the crate tree. The docs side is not small either. There are 5 top-level specs in docs/specs/, and 15 IMPLEMENTATION_DAG.toml files under docs/reviews/.

This is not a toy codebase pretending to be architecture.

I reran a fresh full sqry index on 2026-04-07 because I wanted a better read on the codebase before publishing this. The new index came back with numbers that are hard to wave away:

  • 131 indexed files
  • 46,997 indexed symbols
  • 61,393 graph edges
  • 10,537 functions and 6,760 call sites
  • 0 cycles in the current graph snapshot

The extended graph details are just as telling. sqry broke the workspace down into 24,830 variables, 1,269 imports, 1,157 types, 870 macros, 509 methods, 339 modules, 285 structs, and 81 enums. It also reported 0 cross-language edges, 4,843 duplicate groups, and 3,898 unused symbols. That is the kind of inventory I expect from a real codebase, not a weekend mockup.

For pacing, the more telling number is in version control: the git history in this repository shows 55,888 net Rust source lines landing on 2026-04-06, so the visible implementation burst was extremely concentrated even though the project story spans two days.

The crate breakdown is clean:

  • grokrs-cap carries rooted path handling and trust-level types
  • grokrs-policy carries effect classification and deny-by-default evaluation
  • grokrs-tool carries tool traits, classification, and registry logic
  • grokrs-api carries xAI and Grok transport, streaming, endpoints, and tool-loop code
  • grokrs-store carries SQLite WAL persistence
  • grokrs-session carries typed lifecycle state
  • grokrs-core carries config and shared domain types
  • grokrs-cli carries user-facing commands and orchestration

I think this is one of the big reasons the repo stayed readable while moving fast. Lower-level safety primitives do not depend on the CLI. The API crate gets a policy gate injected at runtime rather than importing policy code directly. The store crate stays focused on state. Each boundary removes a class of later confusion.
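
That injection pattern is worth a sketch. Everything below is illustrative: `PolicyGate`, `Decision`, `ApiClient`, and `DenyByDefault` are names I made up to show the shape of runtime injection, not the repo's actual types.

```rust
// Hypothetical sketch: the API layer depends only on a small trait object,
// not on the policy crate's internals. All names here are illustrative.

#[derive(Debug, Clone, PartialEq)]
pub enum Effect { FsRead, FsWrite, ProcessSpawn, NetworkConnect }

#[derive(Debug, PartialEq)]
pub enum Decision { Allow, Deny }

/// The only policy surface the API crate sees.
pub trait PolicyGate: Send + Sync {
    fn check(&self, effect: &Effect) -> Decision;
}

/// The API client holds a gate injected by the caller (e.g. the CLI).
pub struct ApiClient {
    gate: Box<dyn PolicyGate>,
}

impl ApiClient {
    pub fn new(gate: Box<dyn PolicyGate>) -> Self {
        Self { gate }
    }

    pub fn try_effect(&self, effect: Effect) -> Result<(), String> {
        match self.gate.check(&effect) {
            Decision::Allow => Ok(()),
            Decision::Deny => Err(format!("denied: {effect:?}")),
        }
    }
}

/// A deny-by-default gate: only effects on an explicit allow list pass.
pub struct DenyByDefault {
    allowed: Vec<Effect>,
}

impl PolicyGate for DenyByDefault {
    fn check(&self, effect: &Effect) -> Decision {
        if self.allowed.contains(effect) { Decision::Allow } else { Decision::Deny }
    }
}

fn main() {
    let gate = DenyByDefault { allowed: vec![Effect::FsRead] };
    let client = ApiClient::new(Box::new(gate));
    assert!(client.try_effect(Effect::FsRead).is_ok());
    assert!(client.try_effect(Effect::NetworkConnect).is_err());
}
```

The design point is that the API crate can be compiled, tested, and reasoned about without knowing which policy implementation will be injected.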

[Image: editorial illustration of crate boundaries and architecture]

Why the safety model works

The top-level architecture doc says the project wants four things:

  • explicit trust boundaries
  • a rooted filesystem model
  • effects classified before execution
  • a modular implementation that can grow without a rewrite

The code follows through on that.

Trust is encoded in types. Sessions are parameterized by trust level. Path handling is rooted through WorkspaceRoot and WorkspacePath. The policy engine works in terms of explicit effects such as FsRead, FsWrite, ProcessSpawn, and NetworkConnect. The defaults are conservative. Network is denied by default. Shell spawning is denied by default. Workspace writes require validated relative paths.

That is what I want to see in an agent CLI. The system is opinionated before the first command runs.
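
As a sketch of the rooted model, here is one way a `WorkspaceRoot` / `WorkspacePath` pair could enforce validated relative paths. The method names and error handling are my assumptions, not the repo's code.

```rust
// Hypothetical sketch in the spirit of WorkspaceRoot / WorkspacePath:
// writes go through a validated relative path that cannot escape the root.
use std::path::{Component, Path, PathBuf};

pub struct WorkspaceRoot(PathBuf);

#[allow(dead_code)]
pub struct WorkspacePath(PathBuf); // always relative to, and inside, the root

impl WorkspaceRoot {
    pub fn new(root: impl Into<PathBuf>) -> Self {
        Self(root.into())
    }

    /// Validate a caller-supplied relative path: reject absolute paths
    /// and any `..` component, so the result cannot escape the root.
    pub fn join(&self, rel: &str) -> Result<WorkspacePath, String> {
        let p = Path::new(rel);
        if p.is_absolute() {
            return Err("absolute paths are rejected".into());
        }
        for c in p.components() {
            match c {
                Component::Normal(_) | Component::CurDir => {}
                _ => return Err(format!("rejected component in {rel:?}")),
            }
        }
        Ok(WorkspacePath(self.0.join(p)))
    }
}

fn main() {
    let root = WorkspaceRoot::new("/srv/work");
    assert!(root.join("src/main.rs").is_ok());
    assert!(root.join("../etc/passwd").is_err());
    assert!(root.join("/etc/passwd").is_err());
}
```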

The tool path is especially solid. In the executor flow, a tool call gets looked up, classified, policy-checked, then executed. The approval behavior is explicit too:

  • allow maps Ask to Allow
  • deny maps Ask to Deny
  • interactive preserves Ask, but current comments make clear that this is effectively a deny path until the approval broker is implemented

That sequence matters. It means the project did not fake the approval layer just to keep the demo flowing.
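
The mapping in those three bullets is small enough to state exactly. This is a hypothetical reconstruction of the approval resolution, with made-up type names:

```rust
// Hypothetical sketch of the approval mapping described above. The policy
// engine returns Allow / Deny / Ask, and the approval mode decides what
// happens to Ask. Type names are illustrative, not the repo's API.

#[derive(Debug, Clone, Copy, PartialEq)]
enum PolicyDecision { Allow, Deny, Ask }

#[derive(Debug, Clone, Copy)]
enum ApprovalMode { Allow, Deny, Interactive }

#[derive(Debug, PartialEq)]
enum FinalDecision { Allow, Deny }

fn resolve(decision: PolicyDecision, mode: ApprovalMode) -> FinalDecision {
    match (decision, mode) {
        (PolicyDecision::Allow, _) => FinalDecision::Allow,
        (PolicyDecision::Deny, _) => FinalDecision::Deny,
        // `allow` maps Ask to Allow; `deny` maps Ask to Deny.
        (PolicyDecision::Ask, ApprovalMode::Allow) => FinalDecision::Allow,
        (PolicyDecision::Ask, ApprovalMode::Deny) => FinalDecision::Deny,
        // Interactive preserves Ask, but with no approval broker yet it
        // falls through to Deny rather than faking a prompt.
        (PolicyDecision::Ask, ApprovalMode::Interactive) => FinalDecision::Deny,
    }
}

fn main() {
    assert_eq!(resolve(PolicyDecision::Ask, ApprovalMode::Allow), FinalDecision::Allow);
    assert_eq!(resolve(PolicyDecision::Ask, ApprovalMode::Interactive), FinalDecision::Deny);
}
```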

The command surface is already broad

The first spec in docs/specs/00_SPEC.md is intentionally modest. It says the initial release does not promise a production agent runtime yet. It wants to establish the boundaries needed to build one safely.

That makes the current command surface more interesting, not less.

The repo already supports:

  • direct API operations through grokrs api
  • interactive REPL chat through grokrs chat
  • tool-calling agent execution through grokrs agent
  • management work through collections
  • media generation through generate
  • model discovery through models
  • session and store inspection through sessions and store
  • runtime posture and config inspection through doctor and show_config

The architecture doc also calls out voice, MCP client support, search integration, encrypted reasoning replay, prompt caching, and memory tools. This is well past the point where you can call it a shell around one endpoint.

What I like here is the sequencing. The project did not start with a magical agent and then backfill the boring layers. It built the capability model, the policy path, the store, and the command surface together.

The real story is in the documentation system

If you only read the code, you will understand the runtime. If you read the docs tree, you understand the development method.

The visible docs include:

  • AGENTS.md
  • CLAUDE.md
  • ARCHITECTURE.md
  • docs/specs/00_SPEC.md
  • docs/specs/01_XAI_API_CLIENT.md
  • docs/specs/02_SQLITE_STORE.md
  • docs/specs/03_APPROVAL_BROKER.md
  • docs/specs/04_MCP_SERVER.md
  • docs/design/00_ARCHITECTURE.md
  • docs/design/01_SQLITE_STATE.md
  • docs/development/grokrs/01_SPEC.md
  • docs/development/grokrs/03_IMPLEMENTATION_PLAN.md
  • docs/development/grokrs/04_XAI_API_IMPLEMENTATION_PLAN.md
  • docs/development/grokrs/05_AGENT_ORCHESTRATION_PROMPT.md
  • docs/ops/00_BOOTSTRAP.md
  • docs/reviews/AI_SLOP_REVIEW_GUIDE.md

That is a full process stack. It covers scope, architecture, implementation planning, operations, and review posture.

I do not think these docs were written as decoration. They function as a control plane for the repo.

Specs as execution boundaries

The subsystem specs are already separated:

  • the product spec
  • the xAI API client spec
  • the SQLite store spec
  • the approval broker spec
  • the MCP server spec

That turns requirements into named contract surfaces.

In an AI-assisted environment, that is a bigger deal than people often admit. If you want parallel work to stay coherent, you need somewhere more durable than a chat thread to define the intended behavior of a subsystem. These spec docs do that job.

Review artifacts as first-class outputs

The review tree is even more revealing.

Under docs/reviews/, the repo includes named review domains for bootstrap, approval broker, batch extensions, collections management, document search, MCP server, remote MCP tools, responses enrichment, security hardening, SQLite store, TTS API, xAI API client, clippy pedantic cleanup, competitive features, and competitive gap analysis.

The bootstrap bundle on 2026-04-05 contains 6 files:

  • CONTRACT_DECLARATION.toml
  • EVIDENCE_MATRIX.toml
  • IMPLEMENTATION_DAG.toml
  • README.md
  • REVIEW_READINESS.toml
  • TRACEABILITY.toml

That combination is doing serious work.

CONTRACT_DECLARATION.toml states the promise.
IMPLEMENTATION_DAG.toml structures the work.
TRACEABILITY.toml ties implementation back to intent.
EVIDENCE_MATRIX.toml says what proof should exist.
REVIEW_READINESS.toml says when the artifact set is actually inspectable.

I think that is the right abstraction. Review is not an afterthought at the end of coding. Reviewability is part of the deliverable.

[Image: editorial illustration of DAG, evidence, and review artifacts]

Why IMPLEMENTATION_DAG.toml mattered so much

Plenty of teams keep a task list. That is not the same thing.

A DAG tells you which work units can move in parallel, which ones are blocked, and where integration risk sits. That is exactly what you need when multiple agents or reviewers are touching the same repo.

This repo has 15 implementation DAG files under docs/reviews/. That tells me the DAG pattern was not a bootstrap stunt. It became part of the operating model.

I think the big benefits are pretty concrete:

  • parallel work can move without guesswork about order
  • write scopes stay smaller
  • review can happen against a declared node instead of a vague feature story
  • reintegration gets easier because dependencies stay visible

That last point matters more than people think. AI agents are good at filling in local structure. They are not naturally good at keeping a whole repo’s execution order in their head unless you give them an artifact that does that for them.
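
To make that concrete, here is a minimal Kahn-style layering that turns a dependency list into parallel waves. The node names are invented; this is the question an IMPLEMENTATION_DAG.toml answers, not code from the repo.

```rust
// Sketch: group work units into waves that can run in parallel,
// given edges (from, to) meaning `to` depends on `from` finishing first.
use std::collections::HashMap;

fn parallel_waves(nodes: &[&str], edges: &[(&str, &str)]) -> Vec<Vec<String>> {
    let mut indegree: HashMap<&str, usize> = nodes.iter().map(|n| (*n, 0)).collect();
    for (_, to) in edges {
        *indegree.get_mut(to).unwrap() += 1;
    }
    let mut waves = Vec::new();
    let mut remaining: Vec<&str> = nodes.to_vec();
    while !remaining.is_empty() {
        // Everything with no unmet dependency can move in parallel now.
        let ready: Vec<&str> = remaining.iter().copied()
            .filter(|n| indegree[*n] == 0)
            .collect();
        assert!(!ready.is_empty(), "cycle detected");
        for r in &ready {
            for (from, to) in edges {
                if from == r {
                    *indegree.get_mut(to).unwrap() -= 1;
                }
            }
        }
        remaining.retain(|n| !ready.contains(n));
        waves.push(ready.into_iter().map(String::from).collect());
    }
    waves
}

fn main() {
    // Invented units roughly echoing the crate names.
    let nodes = ["cap", "policy", "tool", "cli"];
    let edges = [("cap", "policy"), ("policy", "tool"), ("cap", "tool"), ("tool", "cli")];
    let waves = parallel_waves(&nodes, &edges);
    assert_eq!(waves, vec![vec!["cap"], vec!["policy"], vec!["tool"], vec!["cli"]]);
}
```

A flat task list gives you none of this: the wave structure is exactly the "which work can move now" answer that keeps parallel agents from guessing.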

This repo used AI agents and kept its shape

The top-level environment gives away a lot:

  • .aivcs
  • .claude
  • .sqry

The adjacent dag-toml-templates repo adds .continue, .factory, .windsurf, and .agents.

That is an AI-native build environment. I think that is obvious.

What is not obvious, and what is worth learning from, is that the repo did not let the agent workflow become the architecture.

The architecture stayed in crates.
The intent stayed in specs.
The execution order stayed in DAGs.
The proof stayed in evidence and traceability artifacts.

That changes the role of the model. The model is not only there to generate code. It is expected to leave behind planning state, review state, and proof state too.

That is a much healthier arrangement.

sqry was a good fit for this codebase

An earlier pass over the semantic index, taken while I was first reviewing the repo, exposed 102 files and 33,260 indexed symbols in the grokrs workspace; the fresh 2026-04-07 index grew well beyond that.

That matters because simple text search is not enough for some of the questions you actually want to ask in a codebase like this:

  • where does policy gating happen
  • which commands route through the same transport bridge
  • which tools are exposed to the model
  • where is session state persisted
  • which tests cover a given path

This repo has enough structure that semantic navigation pays for itself. The DAG and review system also make semantic tooling more useful because the work is already decomposed into named slices.

The most interesting adjacent repo is dag-toml-templates

If grokrs shows the current operating model, /srv/repos/internal/verivusai-labs/dag-toml-templates shows where that model is going.

Its README.md still presents a canonical versioned release surface for template packages. That matters. The file-based layer is not being discarded.

At the same time, the research and design docs are explicit about the next move.

research/DATABASE_REPLACEMENT_RESEARCH.md frames the problem as TOML DAG Templates → Structured Database. It evaluates database candidates for the three process-control packages.

The final ranking puts SurrealDB 3.0 first with 100/130. The recommended architecture keeps database state in SurrealDB while exporting and importing through aivcs.

Then docs/superpowers/specs/2026-04-06-v2-surrealdb-adoption-design.md makes the transition explicit. It defines v2 as a SurrealDB-Backed Hybrid Template Pack. The important word there is Hybrid.

I think that is the correct direction.

The point is not to throw away TOML. The point is to stop asking static files to act like a live workflow database.

What the dagdb package already proves

The implementation under src/dagdb/ is not theoretical.

migrate.py includes:

  • detect_toml_type()
  • import_toml_file()
  • import_dag()
  • import_traceability()
  • import_review_readiness()
  • export_dag()

That means the system can ingest the three major TOML package families into a database model and, for DAGs, reconstruct TOML-compatible output on the way back out.

history.py adds unit_history and edge_history, plus get_dag_state_at() for point-in-time reconstruction.
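
The point-in-time idea behind get_dag_state_at() is simple to sketch. The actual module is Python; this Rust sketch, with invented names, only shows the replay-to-a-cutoff mechanism, assuming events are appended in timestamp order:

```rust
// Sketch: an append-only history of unit-status events, folded up to a
// cutoff timestamp into a point-in-time snapshot (last write per unit wins,
// assuming events are stored in timestamp order).
use std::collections::BTreeMap;

#[derive(Debug, Clone)]
struct UnitEvent {
    at: u64,        // event timestamp (e.g. unix seconds)
    unit: String,   // work-unit id from the DAG
    status: String, // e.g. "pending", "in_progress", "done"
}

fn dag_state_at(history: &[UnitEvent], cutoff: u64) -> BTreeMap<String, String> {
    let mut state = BTreeMap::new();
    for ev in history.iter().filter(|e| e.at <= cutoff) {
        state.insert(ev.unit.clone(), ev.status.clone());
    }
    state
}

fn main() {
    let history = vec![
        UnitEvent { at: 10, unit: "store".into(), status: "pending".into() },
        UnitEvent { at: 20, unit: "store".into(), status: "in_progress".into() },
        UnitEvent { at: 30, unit: "store".into(), status: "done".into() },
    ];
    let snapshot = dag_state_at(&history, 25);
    assert_eq!(snapshot["store"], "in_progress");
}
```

Explicit history tables plus replay is also exactly the workaround the repo reaches for once the database's VERSION clause turns out not to deliver real time travel.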

invariants.py classifies which invariants can live in the database and which still need application code.

schema_migration.py adds schema_migrations, apply and rollback behavior, and explicit migration-state handling.

This is the part I find most persuasive. The move from TOML to structured state is not being described only in prose. It is being built as code.

The prototype results are honest

The prototype evaluation is one of the better research-to-implementation bridges I have seen in a repo like this.

research/SURREALDB_PROTOTYPE_EVALUATION.md reports:

  • 34 total tables and edge tables across DAG, traceability, and review-readiness data
  • 13 of 29 validator checks classified as db_enforced
  • 9 of 29 classified as query_checkable
  • 7 of 29 classified as app_required
  • 75.9% combined DB-plus-query coverage
  • 347 collected tests
  • 343 passes
  • 4 xfails tied to time-travel support

The most useful finding in that report is the limitation section. The VERSION clause is accepted syntactically. It does not actually perform historical time travel in the tested embedded setup. The repo says that plainly and uses explicit history tables as the workaround.

I trust systems more when they write down their failed assumptions.

The invariant classification is just as good. The repo does not pretend a database will magically solve graph algorithms. It says computed values like entry points, leaf nodes, critical path, and maximum parallelism still belong in application code.

That is the split I would want too:

  • database for durable state, relations, audit, and constraints
  • application for graph algorithms and orchestration logic
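
Those computed values really are ordinary application code. Here is a hedged Rust sketch, with made-up node names, of entry points, leaf nodes, and critical-path length:

```rust
// Sketch of DAG metrics that belong in application code, not DB constraints.
use std::collections::{HashMap, HashSet};

/// Nodes that nothing points at: valid starting points for work.
fn entry_points<'a>(nodes: &[&'a str], edges: &[(&str, &'a str)]) -> Vec<&'a str> {
    let targets: HashSet<&str> = edges.iter().map(|(_, to)| *to).collect();
    nodes.iter().copied().filter(|n| !targets.contains(n)).collect()
}

/// Nodes with no outgoing edges: nothing depends on them.
fn leaf_nodes<'a>(nodes: &[&'a str], edges: &[(&'a str, &str)]) -> Vec<&'a str> {
    let sources: HashSet<&str> = edges.iter().map(|(from, _)| *from).collect();
    nodes.iter().copied().filter(|n| !sources.contains(n)).collect()
}

/// Longest dependency chain, counted in nodes, assuming the graph is acyclic.
fn critical_path_len<'a>(nodes: &[&'a str], edges: &[(&'a str, &'a str)]) -> usize {
    fn depth<'a>(
        n: &'a str,
        edges: &[(&'a str, &'a str)],
        memo: &mut HashMap<&'a str, usize>,
    ) -> usize {
        if let Some(&d) = memo.get(n) {
            return d;
        }
        let d = 1 + edges
            .iter()
            .filter(|(from, _)| *from == n)
            .map(|(_, to)| depth(to, edges, memo))
            .max()
            .unwrap_or(0);
        memo.insert(n, d);
        d
    }
    let mut memo = HashMap::new();
    nodes.iter().copied().map(|n| depth(n, edges, &mut memo)).max().unwrap_or(0)
}

fn main() {
    let nodes = ["spec", "impl", "review", "docs"];
    let edges = [("spec", "impl"), ("impl", "review"), ("spec", "docs")];
    assert_eq!(entry_points(&nodes, &edges), vec!["spec"]);
    assert_eq!(leaf_nodes(&nodes, &edges), vec!["review", "docs"]);
    assert_eq!(critical_path_len(&nodes, &edges), 3);
}
```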

What dev.to readers can copy from this repo

I do not think most teams need this exact stack. I do think more teams should steal the shape of it.

If you are building an AI-heavy internal tool, these are the pieces I would copy first.

1. Put the safety model in types

Do not leave trust, path safety, and effect handling as loose runtime conventions. Make them visible in the type system and module boundaries.

2. Write subsystem specs before parallel AI work starts

You do not need long specs. You do need named contract surfaces. A short subsystem spec beats a hundred lines of prompt history.

3. Use a DAG when work will be parallel

A task list is fine for one human. A DAG is better when several workers, human or model, are moving at once.

4. Require proof artifacts, not just diffs

This repo’s contract, evidence, traceability, and readiness files are doing something very practical. They force work to explain itself.

5. Keep a human-reviewable export surface

The move in dag-toml-templates is not file versus database. It is file plus database. I think that is the right model for most engineering systems with both human review and live workflow state.

What I think is the real lesson

The interesting result here is not that AI models can write a lot of code quickly. We already know that.

The interesting result is that a repo can move quickly, use multiple agents, accumulate real functionality, and still stay reviewable if the team is strict about where meaning lives.

In grokrs, meaning lives in:

  • the crate graph
  • the spec docs
  • the implementation DAGs
  • the evidence and traceability artifacts
  • the policy model

That is why the repo still feels like engineering.

Evidence note

I grounded this article in repository-visible evidence available on 2026-04-06, including the root workspace manifest, crate layout, architecture and spec docs, dated review artifacts, command and module structure, semantic index results, and the adjacent dag-toml-templates design and research documents.

Some claims about the exact outer AI/VCS orchestration layer remain inference rather than direct confirmation, because those metadata surfaces are suggestive but not fully self-describing on their own.
