Verivus OSS Releases

How We Built a Safety-First Rust Agent CLI in Two Days Without Letting the Codebase Turn to Mush

The short version

I think most AI-assisted software fails in one of two ways.

The first failure mode is obvious. The code is sloppy, the boundaries are fuzzy, and the whole thing feels like a transcript that got committed by accident.

The second failure mode is more subtle. The code is fine for a demo, but the repo has no durable planning model, no review trail, and no way to explain why one subsystem looks the way it does. A week later, nobody wants to touch it.

grokrs avoided both.

This repo is a Rust-only scaffold for a Grok-oriented agent CLI. It is safety-first by design. More importantly, it was built fast without taking the usual shortcuts that make a codebase hard to trust. The artifact trail shows a concentrated implementation burst across 2026-04-05 and 2026-04-06, but the result still has clear crate boundaries, deny-by-default policy handling, machine-readable planning, and a review system that is stronger than what I usually see in projects with much longer schedules.

I want to walk through why that happened, because I think the process is as interesting as the software.

I also used grokrs itself to generate the article images and submit this draft to Dev.to. That felt like a fair test of whether the CLI is already useful outside its own repository.

What was actually built

At the workspace level, grokrs is not a single crate with a heroic main.rs. The root Cargo.toml defines eight members:

  • grokrs-core
  • grokrs-cap
  • grokrs-policy
  • grokrs-session
  • grokrs-tool
  • grokrs-cli
  • grokrs-api
  • grokrs-store

That split matters. Each crate has a job. Each crate also imposes a limit.
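
For readers who have not seen a multi-crate workspace, the member list implies a root manifest along these lines. This is a sketch: the crates/ directory layout and the resolver setting are my assumptions, not copied from the repo.

```toml
# Hypothetical root Cargo.toml implied by the eight listed members;
# actual paths and settings in the repository may differ.
[workspace]
resolver = "2"
members = [
    "crates/grokrs-core",
    "crates/grokrs-cap",
    "crates/grokrs-policy",
    "crates/grokrs-session",
    "crates/grokrs-tool",
    "crates/grokrs-cli",
    "crates/grokrs-api",
    "crates/grokrs-store",
]
```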

The repository currently has 130 Rust source files under crates/, 55,963 total Rust source lines, and about 41,990 Rust code lines once blank lines and comment-only lines are stripped out. A quick scan of test annotations turns up 1,647 #[test] and #[tokio::test] markers in the crate tree. The docs side is not small either. There are 5 top-level specs in docs/specs/, and 15 IMPLEMENTATION_DAG.toml files under docs/reviews/.

This is not a toy codebase pretending to be architecture.

I reran a fresh full sqry index on 2026-04-07 because I wanted a better read on the codebase before publishing this. The new index came back with numbers that are hard to wave away:

  • 131 indexed files
  • 46,997 indexed symbols
  • 61,393 graph edges
  • 10,537 functions and 6,760 call sites
  • 0 cycles in the current graph snapshot

The extended graph details are just as telling. sqry broke the workspace down into 24,830 variables, 1,269 imports, 1,157 types, 870 macros, 509 methods, 339 modules, 285 structs, and 81 enums. It also reported 0 cross-language edges, 4,843 duplicate groups, and 3,898 unused symbols. That is the kind of inventory I expect from a real codebase, not a weekend mockup.

For pacing, the more telling number is in version control: the git history in this repository shows 55,888 net Rust source lines landing on 2026-04-06, so the visible implementation burst was extremely concentrated even though the project story spans two days.

The crate breakdown is clean:

  • grokrs-cap carries rooted path handling and trust-level types
  • grokrs-policy carries effect classification and deny-by-default evaluation
  • grokrs-tool carries tool traits, classification, and registry logic
  • grokrs-api carries xAI and Grok transport, streaming, endpoints, and tool-loop code
  • grokrs-store carries SQLite WAL persistence
  • grokrs-session carries typed lifecycle state
  • grokrs-core carries config and shared domain types
  • grokrs-cli carries user-facing commands and orchestration

I think this is one of the big reasons the repo stayed readable while moving fast. Lower-level safety primitives do not depend on the CLI. The API crate gets a policy gate injected at runtime rather than importing policy code directly. The store crate stays focused on state. Each boundary removes a class of later confusion.
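
That injection pattern is worth a sketch. Everything below is illustrative: `PolicyGate`, `Decision`, `ApiClient`, and `DenyByDefault` are names I made up to show the shape of runtime injection, not the repo's actual types.

```rust
// Hypothetical sketch: the API layer depends only on a small trait object,
// not on the policy crate's internals. All names here are illustrative.

#[derive(Debug, Clone, PartialEq)]
pub enum Effect { FsRead, FsWrite, ProcessSpawn, NetworkConnect }

#[derive(Debug, PartialEq)]
pub enum Decision { Allow, Deny }

/// The only policy surface the API crate sees.
pub trait PolicyGate: Send + Sync {
    fn check(&self, effect: &Effect) -> Decision;
}

/// The API client holds a gate injected by the caller (e.g. the CLI).
pub struct ApiClient {
    gate: Box<dyn PolicyGate>,
}

impl ApiClient {
    pub fn new(gate: Box<dyn PolicyGate>) -> Self {
        Self { gate }
    }

    pub fn try_effect(&self, effect: Effect) -> Result<(), String> {
        match self.gate.check(&effect) {
            Decision::Allow => Ok(()),
            Decision::Deny => Err(format!("denied: {effect:?}")),
        }
    }
}

/// A deny-by-default gate: only effects on an explicit allow list pass.
pub struct DenyByDefault {
    allowed: Vec<Effect>,
}

impl PolicyGate for DenyByDefault {
    fn check(&self, effect: &Effect) -> Decision {
        if self.allowed.contains(effect) { Decision::Allow } else { Decision::Deny }
    }
}

fn main() {
    let gate = DenyByDefault { allowed: vec![Effect::FsRead] };
    let client = ApiClient::new(Box::new(gate));
    assert!(client.try_effect(Effect::FsRead).is_ok());
    assert!(client.try_effect(Effect::NetworkConnect).is_err());
}
```

The design point is that the API crate can be compiled, tested, and reasoned about without knowing which policy implementation will be injected.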

[Image: editorial illustration of crate boundaries and architecture]

Why the safety model works

The top-level architecture doc says the project wants four things:

  • explicit trust boundaries
  • a rooted filesystem model
  • effects classified before execution
  • a modular implementation that can grow without a rewrite

The code follows through on that.

Trust is encoded in types. Sessions are parameterized by trust level. Path handling is rooted through WorkspaceRoot and WorkspacePath. The policy engine works in terms of explicit effects such as FsRead, FsWrite, ProcessSpawn, and NetworkConnect. The defaults are conservative. Network is denied by default. Shell spawning is denied by default. Workspace writes require validated relative paths.

That is what I want to see in an agent CLI. The system is opinionated before the first command runs.
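
As a sketch of the rooted model, here is one way a `WorkspaceRoot` / `WorkspacePath` pair could enforce validated relative paths. The method names and error handling are my assumptions, not the repo's code.

```rust
// Hypothetical sketch in the spirit of WorkspaceRoot / WorkspacePath:
// writes go through a validated relative path that cannot escape the root.
use std::path::{Component, Path, PathBuf};

pub struct WorkspaceRoot(PathBuf);

#[allow(dead_code)]
pub struct WorkspacePath(PathBuf); // always relative to, and inside, the root

impl WorkspaceRoot {
    pub fn new(root: impl Into<PathBuf>) -> Self {
        Self(root.into())
    }

    /// Validate a caller-supplied relative path: reject absolute paths
    /// and any `..` component, so the result cannot escape the root.
    pub fn join(&self, rel: &str) -> Result<WorkspacePath, String> {
        let p = Path::new(rel);
        if p.is_absolute() {
            return Err("absolute paths are rejected".into());
        }
        for c in p.components() {
            match c {
                Component::Normal(_) | Component::CurDir => {}
                _ => return Err(format!("rejected component in {rel:?}")),
            }
        }
        Ok(WorkspacePath(self.0.join(p)))
    }
}

fn main() {
    let root = WorkspaceRoot::new("/srv/work");
    assert!(root.join("src/main.rs").is_ok());
    assert!(root.join("../etc/passwd").is_err());
    assert!(root.join("/etc/passwd").is_err());
}
```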

The tool path is especially solid. In the executor flow, a tool call gets looked up, classified, policy-checked, then executed. The approval behavior is explicit too:

  • allow maps Ask to Allow
  • deny maps Ask to Deny
  • interactive preserves Ask, but current comments make clear that this is effectively a deny path until the approval broker is implemented

That sequence matters. It means the project did not fake the approval layer just to keep the demo flowing.
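
The mapping in those three bullets is small enough to state exactly. This is a hypothetical reconstruction of the approval resolution, with made-up type names:

```rust
// Hypothetical sketch of the approval mapping described above. The policy
// engine returns Allow / Deny / Ask, and the approval mode decides what
// happens to Ask. Type names are illustrative, not the repo's API.

#[derive(Debug, Clone, Copy, PartialEq)]
enum PolicyDecision { Allow, Deny, Ask }

#[derive(Debug, Clone, Copy)]
enum ApprovalMode { Allow, Deny, Interactive }

#[derive(Debug, PartialEq)]
enum FinalDecision { Allow, Deny }

fn resolve(decision: PolicyDecision, mode: ApprovalMode) -> FinalDecision {
    match (decision, mode) {
        (PolicyDecision::Allow, _) => FinalDecision::Allow,
        (PolicyDecision::Deny, _) => FinalDecision::Deny,
        // `allow` maps Ask to Allow; `deny` maps Ask to Deny.
        (PolicyDecision::Ask, ApprovalMode::Allow) => FinalDecision::Allow,
        (PolicyDecision::Ask, ApprovalMode::Deny) => FinalDecision::Deny,
        // Interactive preserves Ask, but with no approval broker yet it
        // falls through to Deny rather than faking a prompt.
        (PolicyDecision::Ask, ApprovalMode::Interactive) => FinalDecision::Deny,
    }
}

fn main() {
    assert_eq!(resolve(PolicyDecision::Ask, ApprovalMode::Allow), FinalDecision::Allow);
    assert_eq!(resolve(PolicyDecision::Ask, ApprovalMode::Interactive), FinalDecision::Deny);
}
```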

The command surface is already broad

The first spec in docs/specs/00_SPEC.md is intentionally modest. It says the initial release does not promise a production agent runtime yet. It wants to establish the boundaries needed to build one safely.

That makes the current command surface more interesting, not less.

The repo already supports:

  • direct API operations through grokrs api
  • interactive REPL chat through grokrs chat
  • tool-calling agent execution through grokrs agent
  • management work through collections
  • media generation through generate
  • model discovery through models
  • session and store inspection through sessions and store
  • runtime posture and config inspection through doctor and show_config

The architecture doc also calls out voice, MCP client support, search integration, encrypted reasoning replay, prompt caching, and memory tools. This is well past the point where you can call it a shell around one endpoint.

What I like here is the sequencing. The project did not start with a magical agent and then backfill the boring layers. It built the capability model, the policy path, the store, and the command surface together.

The real story is in the documentation system

If you only read the code, you will understand the runtime. If you read the docs tree, you understand the development method.

The visible docs include:

  • AGENTS.md
  • CLAUDE.md
  • ARCHITECTURE.md
  • docs/specs/00_SPEC.md
  • docs/specs/01_XAI_API_CLIENT.md
  • docs/specs/02_SQLITE_STORE.md
  • docs/specs/03_APPROVAL_BROKER.md
  • docs/specs/04_MCP_SERVER.md
  • docs/design/00_ARCHITECTURE.md
  • docs/design/01_SQLITE_STATE.md
  • docs/development/grokrs/01_SPEC.md
  • docs/development/grokrs/03_IMPLEMENTATION_PLAN.md
  • docs/development/grokrs/04_XAI_API_IMPLEMENTATION_PLAN.md
  • docs/development/grokrs/05_AGENT_ORCHESTRATION_PROMPT.md
  • docs/ops/00_BOOTSTRAP.md
  • docs/reviews/AI_SLOP_REVIEW_GUIDE.md

That is a full process stack. It covers scope, architecture, implementation planning, operations, and review posture.

I do not think these docs were written as decoration. They function as a control plane for the repo.

Specs as execution boundaries

The subsystem specs are already separated:

  • the product spec
  • the xAI API client spec
  • the SQLite store spec
  • the approval broker spec
  • the MCP server spec

That turns requirements into named contract surfaces.

In an AI-assisted environment, that is a bigger deal than people often admit. If you want parallel work to stay coherent, you need somewhere more durable than a chat thread to define the intended behavior of a subsystem. These spec docs do that job.

Review artifacts as first-class outputs

The review tree is even more revealing.

Under docs/reviews/, the repo includes named review domains for bootstrap, approval broker, batch extensions, collections management, document search, MCP server, remote MCP tools, responses enrichment, security hardening, SQLite store, TTS API, xAI API client, clippy pedantic cleanup, competitive features, and competitive gap analysis.

The bootstrap bundle on 2026-04-05 contains 6 files:

  • CONTRACT_DECLARATION.toml
  • EVIDENCE_MATRIX.toml
  • IMPLEMENTATION_DAG.toml
  • README.md
  • REVIEW_READINESS.toml
  • TRACEABILITY.toml

That combination is doing serious work.

CONTRACT_DECLARATION.toml states the promise.
IMPLEMENTATION_DAG.toml structures the work.
TRACEABILITY.toml ties implementation back to intent.
EVIDENCE_MATRIX.toml says what proof should exist.
REVIEW_READINESS.toml says when the artifact set is actually inspectable.

I think that is the right abstraction. Review is not an afterthought at the end of coding. Reviewability is part of the deliverable.

[Image: editorial illustration of DAG, evidence, and review artifacts]

Why IMPLEMENTATION_DAG.toml mattered so much

Plenty of teams keep a task list. That is not the same thing.

A DAG tells you which work units can move in parallel, which ones are blocked, and where integration risk sits. That is exactly what you need when multiple agents or reviewers are touching the same repo.

This repo has 15 implementation DAG files under docs/reviews/. That tells me the DAG pattern was not a bootstrap stunt. It became part of the operating model.

I think the big benefits are pretty concrete:

  • parallel work can move without guesswork about order
  • write scopes stay smaller
  • review can happen against a declared node instead of a vague feature story
  • reintegration gets easier because dependencies stay visible

That last point matters more than people think. AI agents are good at filling in local structure. They are not naturally good at keeping a whole repo’s execution order in their head unless you give them an artifact that does that for them.
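
To make that concrete, here is a minimal Kahn-style layering that turns a dependency list into parallel waves. The node names are invented; this is the question an IMPLEMENTATION_DAG.toml answers, not code from the repo.

```rust
// Sketch: group work units into waves that can run in parallel,
// given edges (from, to) meaning `to` depends on `from` finishing first.
use std::collections::HashMap;

fn parallel_waves(nodes: &[&str], edges: &[(&str, &str)]) -> Vec<Vec<String>> {
    let mut indegree: HashMap<&str, usize> = nodes.iter().map(|n| (*n, 0)).collect();
    for (_, to) in edges {
        *indegree.get_mut(to).unwrap() += 1;
    }
    let mut waves = Vec::new();
    let mut remaining: Vec<&str> = nodes.to_vec();
    while !remaining.is_empty() {
        // Everything with no unmet dependency can move in parallel now.
        let ready: Vec<&str> = remaining.iter().copied()
            .filter(|n| indegree[*n] == 0)
            .collect();
        assert!(!ready.is_empty(), "cycle detected");
        for r in &ready {
            for (from, to) in edges {
                if from == r {
                    *indegree.get_mut(to).unwrap() -= 1;
                }
            }
        }
        remaining.retain(|n| !ready.contains(n));
        waves.push(ready.into_iter().map(String::from).collect());
    }
    waves
}

fn main() {
    // Invented units roughly echoing the crate names.
    let nodes = ["cap", "policy", "tool", "cli"];
    let edges = [("cap", "policy"), ("policy", "tool"), ("cap", "tool"), ("tool", "cli")];
    let waves = parallel_waves(&nodes, &edges);
    assert_eq!(waves, vec![vec!["cap"], vec!["policy"], vec!["tool"], vec!["cli"]]);
}
```

A flat task list gives you none of this: the wave structure is exactly the "which work can move now" answer that keeps parallel agents from guessing.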

This repo used AI agents and kept its shape

The top-level environment gives away a lot:

  • .aivcs
  • .claude
  • .sqry

The adjacent dag-toml-templates repo adds .continue, .factory, .windsurf, and .agents.

That is an AI-native build environment. I think that is obvious.

What is not obvious, and what is worth learning from, is that the repo did not let the agent workflow become the architecture.

The architecture stayed in crates.
The intent stayed in specs.
The execution order stayed in DAGs.
The proof stayed in evidence and traceability artifacts.

That changes the role of the model. The model is not only there to generate code. It is expected to leave behind planning state, review state, and proof state too.

That is a much healthier arrangement.

sqry was a good fit for this codebase

An earlier pass over the semantic index, taken while I was first reviewing the repo, exposed 102 files and 33,260 indexed symbols in the grokrs workspace; the fresh 2026-04-07 index grew well beyond that.

That matters because simple text search is not enough for some of the questions you actually want to ask in a codebase like this:

  • where does policy gating happen
  • which commands route through the same transport bridge
  • which tools are exposed to the model
  • where is session state persisted
  • which tests cover a given path

This repo has enough structure that semantic navigation pays for itself. The DAG and review system also make semantic tooling more useful because the work is already decomposed into named slices.

The most interesting adjacent repo is dag-toml-templates

If grokrs shows the current operating model, /srv/repos/internal/verivusai-labs/dag-toml-templates shows where that model is going.

Its README.md still presents a canonical versioned release surface for template packages. That matters. The file-based layer is not being discarded.

At the same time, the research and design docs are explicit about the next move.

research/DATABASE_REPLACEMENT_RESEARCH.md frames the problem as TOML DAG Templates → Structured Database. It evaluates database candidates for the three process-control packages.

The final ranking puts SurrealDB 3.0 first with 100/130. The recommended architecture keeps database state in SurrealDB while exporting and importing through aivcs.

Then docs/superpowers/specs/2026-04-06-v2-surrealdb-adoption-design.md makes the transition explicit. It defines v2 as a SurrealDB-Backed Hybrid Template Pack. The important word there is Hybrid.

I think that is the correct direction.

The point is not to throw away TOML. The point is to stop asking static files to act like a live workflow database.

What the dagdb package already proves

The implementation under src/dagdb/ is not theoretical.

migrate.py includes:

  • detect_toml_type()
  • import_toml_file()
  • import_dag()
  • import_traceability()
  • import_review_readiness()
  • export_dag()

That means the system can ingest the three major TOML package families into a database model and, for DAGs, reconstruct TOML-compatible output on the way back out.

history.py adds unit_history and edge_history, plus get_dag_state_at() for point-in-time reconstruction.
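
The point-in-time idea behind get_dag_state_at() is simple to sketch. The actual module is Python; this Rust sketch, with invented names, only shows the replay-to-a-cutoff mechanism, assuming events are appended in timestamp order:

```rust
// Sketch: an append-only history of unit-status events, folded up to a
// cutoff timestamp into a point-in-time snapshot (last write per unit wins,
// assuming events are stored in timestamp order).
use std::collections::BTreeMap;

#[derive(Debug, Clone)]
struct UnitEvent {
    at: u64,        // event timestamp (e.g. unix seconds)
    unit: String,   // work-unit id from the DAG
    status: String, // e.g. "pending", "in_progress", "done"
}

fn dag_state_at(history: &[UnitEvent], cutoff: u64) -> BTreeMap<String, String> {
    let mut state = BTreeMap::new();
    for ev in history.iter().filter(|e| e.at <= cutoff) {
        state.insert(ev.unit.clone(), ev.status.clone());
    }
    state
}

fn main() {
    let history = vec![
        UnitEvent { at: 10, unit: "store".into(), status: "pending".into() },
        UnitEvent { at: 20, unit: "store".into(), status: "in_progress".into() },
        UnitEvent { at: 30, unit: "store".into(), status: "done".into() },
    ];
    let snapshot = dag_state_at(&history, 25);
    assert_eq!(snapshot["store"], "in_progress");
}
```

Explicit history tables plus replay is also exactly the workaround the repo reaches for once the database's VERSION clause turns out not to deliver real time travel.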

invariants.py classifies which invariants can live in the database and which still need application code.

schema_migration.py adds schema_migrations, apply and rollback behavior, and explicit migration-state handling.

This is the part I find most persuasive. The move from TOML to structured state is not being described only in prose. It is being built as code.

The prototype results are honest

The prototype evaluation is one of the better research-to-implementation bridges I have seen in a repo like this.

research/SURREALDB_PROTOTYPE_EVALUATION.md reports:

  • 34 total tables and edge tables across DAG, traceability, and review-readiness data
  • 13 of 29 validator checks classified as db_enforced
  • 9 of 29 classified as query_checkable
  • 7 of 29 classified as app_required
  • 75.9% combined DB-plus-query coverage
  • 347 collected tests
  • 343 passes
  • 4 xfails tied to time-travel support

The most useful finding in that report is the limitation section. The VERSION clause is accepted syntactically. It does not actually perform historical time travel in the tested embedded setup. The repo says that plainly and uses explicit history tables as the workaround.

I trust systems more when they write down their failed assumptions.

The invariant classification is just as good. The repo does not pretend a database will magically solve graph algorithms. It says computed values like entry points, leaf nodes, critical path, and maximum parallelism still belong in application code.

That is the split I would want too:

  • database for durable state, relations, audit, and constraints
  • application for graph algorithms and orchestration logic
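
Those computed values really are ordinary application code. Here is a hedged Rust sketch, with made-up node names, of entry points, leaf nodes, and critical-path length:

```rust
// Sketch of DAG metrics that belong in application code, not DB constraints.
use std::collections::{HashMap, HashSet};

/// Nodes that nothing points at: valid starting points for work.
fn entry_points<'a>(nodes: &[&'a str], edges: &[(&str, &'a str)]) -> Vec<&'a str> {
    let targets: HashSet<&str> = edges.iter().map(|(_, to)| *to).collect();
    nodes.iter().copied().filter(|n| !targets.contains(n)).collect()
}

/// Nodes with no outgoing edges: nothing depends on them.
fn leaf_nodes<'a>(nodes: &[&'a str], edges: &[(&'a str, &str)]) -> Vec<&'a str> {
    let sources: HashSet<&str> = edges.iter().map(|(from, _)| *from).collect();
    nodes.iter().copied().filter(|n| !sources.contains(n)).collect()
}

/// Longest dependency chain, counted in nodes, assuming the graph is acyclic.
fn critical_path_len<'a>(nodes: &[&'a str], edges: &[(&'a str, &'a str)]) -> usize {
    fn depth<'a>(
        n: &'a str,
        edges: &[(&'a str, &'a str)],
        memo: &mut HashMap<&'a str, usize>,
    ) -> usize {
        if let Some(&d) = memo.get(n) {
            return d;
        }
        let d = 1 + edges
            .iter()
            .filter(|(from, _)| *from == n)
            .map(|(_, to)| depth(to, edges, memo))
            .max()
            .unwrap_or(0);
        memo.insert(n, d);
        d
    }
    let mut memo = HashMap::new();
    nodes.iter().copied().map(|n| depth(n, edges, &mut memo)).max().unwrap_or(0)
}

fn main() {
    let nodes = ["spec", "impl", "review", "docs"];
    let edges = [("spec", "impl"), ("impl", "review"), ("spec", "docs")];
    assert_eq!(entry_points(&nodes, &edges), vec!["spec"]);
    assert_eq!(leaf_nodes(&nodes, &edges), vec!["review", "docs"]);
    assert_eq!(critical_path_len(&nodes, &edges), 3);
}
```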

What dev.to readers can copy from this repo

I do not think most teams need this exact stack. I do think more teams should steal the shape of it.

If you are building an AI-heavy internal tool, these are the pieces I would copy first.

1. Put the safety model in types

Do not leave trust, path safety, and effect handling as loose runtime conventions. Make them visible in the type system and module boundaries.

2. Write subsystem specs before parallel AI work starts

You do not need long specs. You do need named contract surfaces. A short subsystem spec beats a hundred lines of prompt history.

3. Use a DAG when work will be parallel

A task list is fine for one human. A DAG is better when several workers, human or model, are moving at once.

4. Require proof artifacts, not just diffs

This repo’s contract, evidence, traceability, and readiness files are doing something very practical. They force work to explain itself.

5. Keep a human-reviewable export surface

The move in dag-toml-templates is not file versus database. It is file plus database. I think that is the right model for most engineering systems with both human review and live workflow state.

What I think is the real lesson

The interesting result here is not that AI models can write a lot of code quickly. We already know that.

The interesting result is that a repo can move quickly, use multiple agents, accumulate real functionality, and still stay reviewable if the team is strict about where meaning lives.

In grokrs, meaning lives in:

  • the crate graph
  • the spec docs
  • the implementation DAGs
  • the evidence and traceability artifacts
  • the policy model

That is why the repo still feels like engineering.

Evidence note

I grounded this article in repository-visible evidence available on 2026-04-06, including the root workspace manifest, crate layout, architecture and spec docs, dated review artifacts, command and module structure, semantic index results, and the adjacent dag-toml-templates design and research documents.

Some claims about the exact outer AI/VCS orchestration layer remain inference rather than direct confirmation, because those metadata surfaces are suggestive but not fully self-describing on their own.
