DEV Community: Werner Kasselman

Going Remote, Without Going Reckless: Multi-LLM Orchestration and the New Front Door in llm-cli-gateway 2.9.0

Werner Kasselman — Sun, 14 Jun 2026 12:35:40 +0000

The earlier posts in this series were about what the gateway lets you call (cache-aware spawning across five providers, the Codex review gate, the CLI-versus-API argument) and the one before this was about the parts that do not show up as a tool, the upstream-tracking and the website that became the project's front door. This one is about a different kind of front door: the gateway can now listen over HTTP, behind real authentication, and serve more than one caller without those callers being able to read each other's work. That sounds like a small toggle. It is not. Moving an MCP server off a local pipe and onto a network port changes the trust boundary completely, and 2.9.0 is the release where I sat down and remediated all seventeen findings from a multi-LLM red-team of exactly that surface before telling anyone the remote path was ready.

Short version: llm-cli-gateway is one Model Context Protocol server that wraps five vendor CLIs (Claude Code 2.1.177, Codex 0.139.0, Google's Antigravity agy 1.0.8, Grok 0.2.51, and Mistral Vibe 2.14.1) behind a single, uniform tool surface, so one orchestrating agent can fan a task out to several models, collect independent opinions, run a red-team or a consensus check, and keep durable session and job state across all of it. Until recently that only made sense on localhost over stdio. As of 2.9.0 the same server runs over HTTP with a static bearer token or a built-in OAuth 2.0 authorisation server (PKCE on by default, an opt-in human-consent gate, and a trusted-principal-header seam for when you front it with your own identity-aware proxy). Every session, job, and stored request is stamped with an owner principal, and access is enforced per principal; remote provider calls are refused unless a workspace is registered; and the whole thing fails closed rather than open when the configuration is dangerous.

Long version is below, same shape as always: what it enables, what the remote options actually are, a stack of worked scenarios, and the caveats named up front rather than buried at the bottom.

What the gateway actually enables

Before the remote story, it is worth being precise about what the thing does, because "multi-LLM orchestration" is the kind of phrase that means nothing until you name the tools.

Each provider gets two tools, a synchronous one and an asynchronous one: claude_request and claude_request_async, and the matching pairs for codex, gemini, grok, and mistral. Codex additionally exposes codex_fork_session, because forking a reasoning session into a branch is a Codex-specific capability worth surfacing rather than hiding. An orchestrating agent (Claude Code on your laptop, say) calls these the way it would call any other MCP tool, and the gateway handles the spawn, the timeout, the output-size cap, the retry-with-circuit-breaker, the token accounting, and the structured error.

On top of the raw provider calls there is a validation layer, and this is the part I reach for most: validate_with_models sends the same claim to several models independently and reports where they agree and where they do not; second_opinion is the one-model version; red_team_review runs an adversarial pass; consensus_check asks whether a set of models actually agrees on a specific claim rather than vaguely nodding along; compare_answers does a local diff with no provider calls at all; synthesize_validation runs a judge over the collected results; and ask_model is the deliberately simple "just ask one model this" entry point. The reason this matters is that a single model's confident answer is a sample of one, and on anything that carries risk (a migration, a security claim, an architecture decision) I would rather pay three to four times the tokens and get three independent reads, two of them from a different vendor family entirely, than trust one well-phrased paragraph.

Underneath all of that sits the state. Sessions persist conversation continuity per CLI (Claude's --continue, Codex's exec resume, Grok's --resume, and the rest), stored minimally with no conversation content on disk. Async requests are durable: a synchronous call auto-defers at 45 seconds (configurable via SYNC_DEADLINE_MS) and hands you a job id, the job runs to completion in the background, and you collect it later with llm_job_status and llm_job_result, or cancel it with llm_job_cancel. The job store is SQLite by default (Node's built-in node:sqlite, no native module to compile), results are retained for thirty days, and an identical re-issued call inside a one-hour window reattaches to the live job rather than starting a second one. Frankly, that durability is the unglamorous feature that saves the most grief: a long red-team sweep that runs for twelve minutes does not vanish because your polling wrapper timed out or your orchestrator restarted. Every request, sync or async, is also logged to a flight recorder (logs.db) with timing, token usage, cache statistics, the approval decision, and the exit code, with secrets redacted before they are written.
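The reattach-on-identical-call behaviour can be pictured as a lookup keyed on a fingerprint of the request. What follows is a minimal Python sketch of that idea, not the gateway's code: the gateway is a Node project, and the function names, the SHA-256 fingerprint, and the in-memory dict are all assumptions made for illustration. It also skips the check that the earlier job is still live.

```python
import hashlib
import json
import time
import uuid
from typing import Optional

DEDUP_WINDOW_SECONDS = 3600  # the one-hour window described above

# Hypothetical in-memory index: request fingerprint -> (job_id, created_at)
_jobs_by_fingerprint: dict = {}


def fingerprint(tool: str, params: dict) -> str:
    """Stable hash of a tool call; key order must not change the result."""
    canonical = json.dumps({"tool": tool, "params": params}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def submit(tool: str, params: dict, now: Optional[float] = None) -> str:
    """Return the job id of an identical recent call, or mint a new one."""
    now = time.time() if now is None else now
    key = fingerprint(tool, params)
    existing = _jobs_by_fingerprint.get(key)
    if existing is not None and now - existing[1] < DEDUP_WINDOW_SECONDS:
        return existing[0]  # reattach instead of starting a second job
    job_id = uuid.uuid4().hex
    _jobs_by_fingerprint[key] = (job_id, now)
    return job_id
```

Two calls with the same tool and parameters inside the window return the same id; the same call after the window has passed gets a fresh one.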

Token optimisation (opt-in prompt and response compression) and an approval gate (approvalStrategy:"mcp_managed", which scores each request and can deny it) round it out. None of this is exotic. It is the plumbing you would build anyway if you wanted to orchestrate five CLIs seriously, written once, in one place, so you do not rebuild it five times.

Why remote is the genuinely hard part

On localhost over stdio, the trust model is trivial: there is one user, it is you, the pipe is private to your process tree, and the only principal that exists is "the person sitting at this machine." The gateway has always called that principal local, and a local caller can see everything, because everything is theirs.

The moment the server binds a TCP port, every one of those assumptions evaporates. Now there can be more than one caller, the callers may not trust each other, the bytes arriving on the socket are untrusted until proven otherwise, and "can this caller read this session?" stops being a question with an obvious answer. I know that the temptation, and I have seen plenty of projects give in to it, is to ship the HTTP transport with a "set a token if you like" note and call authorisation someone else's problem. I did not want to do that, because the gateway spawns real CLIs that read and write real files, and a remote surface that leaks one tenant's jobs into another tenant's view, or lets an unauthenticated caller trigger a provider spawn against your repository, is not a convenience, it is a liability with a friendly README.

So before documenting the remote path as ready, I ran an extensive red-team of the external and internal MCP surface across four models (Claude orchestrating, with Codex, Grok, and Mistral reviewing independently, each verifying claims against the code rather than accepting a summary), consolidated it down to seventeen confirmed findings, and remediated every one of them. 2.9.0 is that remediation. The numbered findings below (F1, F3, F14, and so on) are the internal labels, kept here only because they map cleanly onto the features.

The new remote options, concretely

Transports. The default is still stdio, and if you never set anything you get exactly the local behaviour you had before. Set the transport to HTTP and the gateway stands up a Streamable-HTTP MCP endpoint, binding to 127.0.0.1 on port 3333 at path /mcp by default (LLM_GATEWAY_HTTP_HOST, LLM_GATEWAY_HTTP_PORT, LLM_GATEWAY_HTTP_PATH). There is also a dormant Agent Client Protocol transport whose foundations shipped in 2.8.0, off by default, which I will write about separately once the later phases land.

Static bearer authentication. The simplest remote auth: set LLM_GATEWAY_AUTH_TOKEN, and every request must present it as a bearer token, compared in constant time so the check does not leak the token through its own timing. This is the right choice for a single-user remote setup (your own server, reachable only by you).
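For readers who have not met a constant-time comparison before, here is a short Python sketch of the idea using the standard library's hmac.compare_digest. The function name and the header parsing are illustrative assumptions; the gateway's own check is written for Node and may differ in detail.

```python
import hmac


def bearer_matches(authorization_header: str, expected_token: str) -> bool:
    """Check an 'Authorization: Bearer <token>' value against the configured token."""
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not expected_token:
        return False  # wrong scheme, or no token configured: refuse
    # compare_digest is designed so its running time does not reveal
    # where the two values first differ.
    return hmac.compare_digest(
        presented.encode("utf-8"), expected_token.encode("utf-8")
    )
```

A plain == on strings can return as soon as it meets the first differing character, which is the timing signal this comparison is meant to avoid.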

A built-in OAuth 2.0 authorisation server. For anything multi-client, the gateway can be its own OAuth server, configured under [http.oauth] in ~/.llm-cli-gateway/config.toml. It exposes /oauth/authorize, /oauth/token, /oauth/register, and the /.well-known/oauth-protected-resource metadata an MCP client reads to configure itself. PKCE is required by default (require_pkce = true), plain PKCE is off, public clients are off, the client-registration policy defaults to the conservative static_clients (with shared_secret and a dev-only open_dev available when you opt in), and tokens live for an hour by default. The point of the built-in server is that an MCP client which speaks OAuth can discover the metadata, register or use a pre-configured client, run the authorisation-code flow, and start calling tools, without you bolting on a separate identity product just to try the remote path.
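As background on what "PKCE is required" means in practice, the S256 method from RFC 7636 derives a challenge from a client-held verifier, and the server recomputes it at the token endpoint. The Python sketch below shows only that derivation; the function names are hypothetical and it says nothing about how the gateway stores or validates codes.

```python
import base64
import hashlib
import hmac


def s256_challenge(code_verifier: str) -> str:
    """code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), without padding."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verifier_matches(code_verifier: str, stored_challenge: str) -> bool:
    """Token-endpoint check: does the presented verifier hash to the stored challenge?"""
    return hmac.compare_digest(s256_challenge(code_verifier), stored_challenge)
```

The client sends the challenge with the authorisation request and the verifier with the token request, so an intercepted authorisation code is useless without the verifier.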

An opt-in human-consent gate (F14b). An authorisation-code flow that mints a token the instant a client asks, with no human in the loop, is the kind of thing that looks fine until someone points out that "the client asked nicely" is not authentication of the resource owner. So there is now an optional consent gate: set require_consent = true (or LLM_GATEWAY_OAUTH_REQUIRE_CONSENT=1) with a dedicated, scrypt-hashed consent password, and before any code is issued the operator has to approve at a small consent screen, protected against cross-site request forgery with a double-submit cookie. It is off by default, because not every deployment wants a human in the loop, but it is there for the ones that do.
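The two mechanisms named here, a scrypt-hashed password and a double-submit cookie, are both small enough to sketch. The Python below is an illustration under assumptions: the cost parameters, salt length, and function names are mine, not the gateway's, and a real consent screen would also need the cookie attributes and session handling omitted here.

```python
import hashlib
import hmac
import os
import secrets

# Assumed scrypt cost parameters for this sketch; tune for the deployment.
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1, "dklen": 32}


def hash_consent_password(password: str):
    """Return (salt, derived_key); these are stored instead of the password."""
    salt = os.urandom(16)
    key = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return salt, key


def consent_password_ok(password: str, salt: bytes, stored_key: bytes) -> bool:
    """Re-derive the key from the attempt and compare in constant time."""
    candidate = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return hmac.compare_digest(candidate, stored_key)


def new_csrf_token() -> str:
    """Random value set both as a cookie and as a hidden form field."""
    return secrets.token_urlsafe(32)


def csrf_ok(cookie_value: str, form_value: str) -> bool:
    """Double-submit check: the posted form must echo the cookie's token."""
    if not cookie_value or not form_value:
        return False
    return hmac.compare_digest(cookie_value, form_value)
```

The double-submit check works because a cross-site page can cause the browser to send the cookie but cannot read it, so it cannot copy the same value into the form body.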

A trusted-principal-header seam (F14a). Plenty of people already run a proper identity-aware proxy in front of their services (an OAuth2 proxy, an identity-aware gateway, your own front door) and would rather terminate identity there than have the gateway reinvent it. For them there is a seam: set LLM_GATEWAY_TRUSTED_PRINCIPAL_HEADER to the name of a header your proxy injects, and the gateway will adopt that header's value as the calling principal, but only when the request also authenticated as the static gateway bearer, and only when the value matches a tight character set so it cannot smuggle anything into the logs. This is the deliberately IdP-agnostic part of the design: the gateway ships the seam, you bring whatever identity provider you already trust, and the gateway never needs to know which one it is.
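The two conditions on the header (bearer-authenticated hop, tight character set) can be written as a small decision function. This Python sketch is hypothetical throughout: the allowed character set, the length limit, and the choice to reject a malformed header outright are assumptions, since the post does not spell them out.

```python
import re
from typing import Optional

# Assumed allow-list for this sketch; the gateway's real character set may differ.
PRINCIPAL_PATTERN = re.compile(r"[A-Za-z0-9._@-]{1,64}")
STATIC_BEARER_PRINCIPAL = "gateway-bearer"


def resolve_principal(bearer_ok: bool, header_value: Optional[str]) -> Optional[str]:
    """Return the calling principal, or None when the request should be refused."""
    if not bearer_ok:
        return None  # the header is never trusted on an unauthenticated hop
    if header_value is None:
        return STATIC_BEARER_PRINCIPAL  # plain static-bearer caller
    if PRINCIPAL_PATTERN.fullmatch(header_value) is None:
        return None  # assumption: a malformed value is rejected, not ignored
    return header_value
```

Because fullmatch must consume the whole value, a header carrying a newline or other control character never reaches the logs as a principal name.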

Fail-closed posture (F17). If you enable OAuth with public clients or the dev-only open registration and then try to bind the server to a non-loopback address, the gateway refuses to start without an explicit operator override. It will not let you accidentally expose an open registration endpoint to the internet, and it no longer trusts the Host header to decide whether it is "really" on loopback. The default for a dangerous combination is to stop, not to shrug and serve.
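A fail-closed startup check of this shape is straightforward to express. The following Python sketch assumes hypothetical names (check_startup, UnsafeConfiguration, operator_override) and decides loopback from the configured bind address alone, never from a request's Host header.

```python
import ipaddress


class UnsafeConfiguration(RuntimeError):
    """Raised when a dangerous combination is detected at startup."""


def is_loopback(bind_host: str) -> bool:
    """Decide from the bind address itself whether the server is loopback-only."""
    if bind_host == "localhost":
        return True
    try:
        return ipaddress.ip_address(bind_host).is_loopback
    except ValueError:
        return False  # anything unparseable is treated as non-loopback


def check_startup(bind_host: str, public_clients: bool,
                  open_dev_registration: bool, operator_override: bool = False) -> None:
    """Refuse to start when risky OAuth options meet a non-loopback bind."""
    risky = public_clients or open_dev_registration
    if risky and not is_loopback(bind_host) and not operator_override:
        raise UnsafeConfiguration(
            "public clients or open registration on a non-loopback address "
            "requires an explicit operator override"
        )
```

The design point is the direction of the default: an ambiguous host string counts as exposed, so a typo cannot quietly downgrade the check.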

Per-principal isolation, and an honest limit (F3)

Authentication tells you who is calling. Isolation decides what they can see, and the two are not the same thing. In 2.9.0 every session, every async job, and every persisted request is stamped with an owner principal at creation, and the read and mutate paths enforce it: a caller can see its own rows, a local caller can additionally see the legacy rows that predate isolation (those have no owner and would otherwise become invisible), and that is all. session_list is filtered to the caller, session_get and session_delete and the job and request-readback tools resolve the owner and return "not found" rather than someone else's data, and a remote OAuth client never sees a pre-isolation row at all.
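The visibility rule in that paragraph reduces to a two-line predicate. Here it is as a Python sketch, with rows modelled as plain dicts and an owner of None standing in for a pre-isolation row; those representations are assumptions for illustration, not the gateway's schema.

```python
from typing import Optional

LOCAL_PRINCIPAL = "local"


def can_see(caller: str, owner: Optional[str]) -> bool:
    """A caller sees its own rows; only the local caller sees ownerless legacy rows."""
    if owner is None:
        return caller == LOCAL_PRINCIPAL
    return owner == caller


def session_list(caller: str, rows: list) -> list:
    """Filter stored rows down to the ones this caller may see."""
    return [row for row in rows if can_see(caller, row.get("owner"))]
```

A lookup by id would apply the same predicate and answer "not found" on failure, so a caller cannot tell a missing row from someone else's.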

Here is the honest limit, named up front: a single shared static bearer token is one principal, by design. If five engineers all authenticate with the same LLM_GATEWAY_AUTH_TOKEN, the gateway correctly treats them as one principal (gateway-bearer) and they share a view, because from the server's side they are indistinguishable. Genuine multi-tenancy, where each human is their own principal, needs either an OAuth client per user or the trusted-principal-header seam behind an identity-aware proxy. The static bearer is not a tenancy boundary, and 2.9.0 documents that.

Workspace gating for remote calls

A local stdio caller can pass any workingDir it likes, because it is you, on your machine, and the working directory is your own. A remote caller is a different proposition entirely, and letting an HTTP request name an arbitrary directory to spawn a provider against is precisely the sort of thing a red-team finding is made of. So remote provider requests (HTTP transport, or an OAuth-authenticated call) are refused unless the work is anchored to a registered workspace: a named alias from the [workspaces] registry, a workspace bound to the session, or a configured [workspaces].default. The registry itself is operator-controlled, and the tools that mint new workspaces (workspace_create, workspace_register_existing_repo) require both an environment opt-in (LLM_GATEWAY_WORKSPACE_ADMIN=1) and a workspace:admin OAuth scope, and can only operate inside explicitly allowed roots. Local stdio keeps the prior unrestricted behaviour, because there the restriction would buy nothing.
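As a rough model of the gating rule, the resolution can be sketched as "first registered alias wins, otherwise refuse". The Python below is hypothetical: the function and parameter names are invented, the registry is a plain dict of alias to path, and the precedence order (requested alias, then session binding, then default) is an assumption.

```python
from typing import Optional


def resolve_workspace(
    is_remote: bool,
    registry: dict,
    requested_alias: Optional[str] = None,
    session_alias: Optional[str] = None,
    default_alias: Optional[str] = None,
    working_dir: Optional[str] = None,
) -> Optional[str]:
    """Return the directory a provider may run in, or raise for an unanchored remote call."""
    if not is_remote:
        return working_dir  # local stdio keeps the prior unrestricted behaviour
    for alias in (requested_alias, session_alias, default_alias):
        if alias is not None:
            if alias not in registry:
                raise PermissionError(f"unknown workspace alias: {alias!r}")
            return registry[alias]
    # A remote caller's raw working_dir is never consulted.
    raise PermissionError("remote call refused: no registered workspace")
```

Note that the remote branch never reads working_dir at all, which is the property that stops an HTTP request from naming an arbitrary directory.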

What the same release did to the approval gate (F15)

Two more things landed in 2.9.0 that are worth a paragraph, because they change a default. First, under approvalStrategy:"mcp_managed" a full permission or sandbox bypass request is now denied by default regardless of how the heuristic scored it, unless the operator explicitly opts back in. Second, and this is the behaviour change to be aware of, the managed strategy no longer force-sets each provider to its most permissive mode. It used to push Claude to bypassPermissions, Gemini to yolo, Grok to --always-approve, and Mistral to auto-approve on every single managed request, which meant a child process could obtain a fully unlocked provider through the managed path without ever asking for a bypass. Now each provider defaults to an accept-edits-level mode instead: Claude and Grok to --permission-mode acceptEdits, Mistral to --agent accept-edits, and Gemini to its prompted default. The agy CLI has no accept-edits rung at all, so the safe default is prompted execution, and that is the honest trade-off: Gemini cannot auto-approve mutating tools under the managed strategy unless you opt in. Setting LLM_GATEWAY_APPROVAL_ALLOW_BYPASS=1 restores the old full-auto-approve behaviour across the board, for the headless operator who genuinely wants it. Secrets are also redacted from the flight recorder now (F4), so prompts and responses stored for tracing no longer carry tokens or keys in the clear.
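The before-and-after defaults are easier to compare side by side. This Python sketch records them as mode labels taken from the paragraph above, deliberately not as exact CLI flags, and it assumes that the same LLM_GATEWAY_APPROVAL_ALLOW_BYPASS opt-in governs both the default mode and explicit bypass requests. The function name and return shape are hypothetical, and the heuristic scoring step is left out.

```python
# Mode labels as described in the text; these are not exact command-line flags.
SAFE_DEFAULT = {
    "claude": "acceptEdits",
    "grok": "acceptEdits",
    "mistral": "accept-edits",
    "gemini": "prompted",  # no accept-edits rung, so prompted execution
}
LEGACY_BYPASS = {
    "claude": "bypassPermissions",
    "gemini": "yolo",
    "grok": "always-approve",
    "mistral": "auto-approve",
}


def managed_mode(provider: str, requested_bypass: bool, env: dict):
    """Return (decision, mode) for a request under the managed strategy."""
    allow_bypass = env.get("LLM_GATEWAY_APPROVAL_ALLOW_BYPASS") == "1"
    if requested_bypass and not allow_bypass:
        return ("deny", None)  # denied by default, whatever the heuristic scored
    table = LEGACY_BYPASS if allow_bypass else SAFE_DEFAULT
    return ("allow", table[provider])
```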

Worked scenarios

Abstract capability lists are easy to nod along to and hard to act on, so here are concrete situations where this actually earns its keep.

Scenario one: the solo developer's three-reviewer loop, all local. You are writing something security-sensitive on your laptop, an auth handler, say. You have Claude Code open as your orchestrator, with the gateway configured over stdio. You implement with codex_request (full-auto, sandboxed workspace-write), then fan the result out to three independent reviewers in parallel with claude_request_async, gemini_request_async, and grok_request_async, each told to end with PASS or FAIL and findings. You poll every sixty seconds, union the findings, and treat anything two reviewers independently flag as high-confidence. No network, no auth, no tenancy questions, just five CLIs orchestrated from one place, and the durable jobs mean a fifteen-minute Codex pass survives you wandering off to make coffee. This is the case the gateway has always served, and remote changes nothing about it.
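The "union the findings, trust what two reviewers flag" step is orchestrator-side logic rather than a gateway feature, and it fits in a few lines. A minimal Python sketch, assuming each reviewer's output has already been parsed into a set of normalised finding strings (that parsing is the hard part and is not shown):

```python
from collections import Counter


def merge_findings(reviews: dict, threshold: int = 2):
    """Return (all findings, findings flagged by at least `threshold` reviewers)."""
    # Each reviewer contributes a set, so one reviewer cannot count twice.
    counts = Counter(f for findings in reviews.values() for f in set(findings))
    union = sorted(counts)
    high_confidence = sorted(f for f, n in counts.items() if n >= threshold)
    return union, high_confidence
```

For example, if two of three reviewers report "token compared with ==", that finding lands in the high-confidence list while single-reviewer findings stay in the union for manual triage.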

Scenario two: a small team, each engineer their own principal. Now there are four of you, and you want a shared gateway running on a box in the corner so everyone benefits from one warmed-up configuration and one flight recorder, but you absolutely do not want Jess's sessions showing up in Soumik's session_list. You front the gateway with the identity-aware proxy you already run, terminate each engineer's identity there, and have the proxy inject the authenticated username into the header named by LLM_GATEWAY_TRUSTED_PRINCIPAL_HEADER, with the proxy-to-gateway hop authenticated by the static bearer. Now each engineer is their own owner principal, isolation is enforced per person, and the gateway never had to learn the first thing about your identity provider. The bit that makes this safe rather than for show is that the header is only trusted on the bearer-authenticated hop, so a caller cannot simply set the header themselves and impersonate a colleague.

Scenario three: calling the gateway from a phone, over OAuth, with the consent gate on. You want to reach your own orchestration server from an MCP client on a tablet while you are away from the desk. You enable the built-in OAuth server, set LLM_GATEWAY_PUBLIC_URL so the issuer metadata is correct, turn on require_consent with a consent password only you know, and put the server behind TLS (the fail-closed posture will stop you binding an open-registration server to a public address by accident anyway). The client discovers the metadata, runs the PKCE authorisation-code flow, and pauses at the consent screen; you approve it once, from your phone, and the token it receives lives for an hour. Every job that client creates is owned by that client's principal, so even if you later add a second client it cannot read the first one's history. This is the scenario the consent gate exists for: a token is not minted just because something on the network asked politely, a human approved it.

Scenario four: durable async jobs from CI or a cron, headless. You want a nightly job that asks three models to review the day's merged diffs and files anything serious. A headless runner has no human to approve per-action prompts, so you set LLM_GATEWAY_APPROVAL_ALLOW_BYPASS=1 deliberately (you are the operator, you know the box is yours, and you want full-auto), fire the three *_request_async calls, and let the run end. The jobs are durable for thirty days, so a separate collector step hours later pulls the results by job id, and if the nightly job and a manual run happen to issue an identical request inside the dedup window they reattach to the same job rather than paying for it twice. The flight recorder gives you the token spend and timing per call afterwards, with no secrets in it.

Scenario five: a full red and blue cycle on a risky change. You have written something you are nervous about. You run the red-team-assessment flow: dispatch the same target to Claude (architecture and trust boundaries), Codex (logic bugs and races), Gemini (known vulnerability classes and dependency CVEs via the research tools), and Grok (the independent perspective that contradicts the other three's shared assumptions), each ending in PASS or FAIL. On a FAIL you send the findings to a different model for the blue-team response (the defender should not be the model that found the attack), implement the fixes, and re-dispatch to the original red-teamers until they all pass. Worth knowing in 2.9.0: under the managed strategy these reviewers now get accept-edits-level access by default, which is plenty for reading code with Read, Grep, and Glob plus the sqry, exa, and ref_tools MCP servers, but a reviewer that needs to run shell commands headlessly needs the operator opt-in. For pure analysis, the safer default is the right default.

Scenario six: cross-vendor consensus before a one-way-door decision. You are about to commit to a database migration that is genuinely hard to reverse. Rather than ask one model and convince yourself, you put the migration plan through validate_with_models across all five providers and then consensus_check on the specific claim that matters ("this migration is online-safe and reversible within the deploy window"). Where the five agree, you have a real signal; where one dissents, you have found the question you had not asked yet. synthesize_validation gives you the judge's-eye summary over the lot. I know this sounds like over-engineering a decision, but the asymmetry is the whole point: the consensus pass costs a few minutes and some tokens, and the migration going wrong costs a weekend, so the trade is not close.

Caveats, named up front

The static bearer is one principal, not a tenancy boundary; if you need per-user isolation you need OAuth clients per user or the proxy-plus-header seam, and sharing one token means sharing one view, full stop. The built-in OAuth server is a real authorisation server but it is not a substitute for an enterprise identity platform, and the trusted-principal-header seam exists precisely so you can put your real one in front; the gateway is opinionated about being IdP-agnostic. The consent gate and the bypass opt-in are both real on/off switches with real consequences, so read what they do before flipping them. The Postgres persistence backend is an interface only today, not a shipped implementation; SQLite is the durable default and memory is the ephemeral one you must explicitly acknowledge. And the accept-edits default under the managed strategy is a behaviour change from 2.8.0, so if you were relying on managed mode handing your providers a full bypass, that now needs the explicit opt-in. These are the sharp edges, and I would rather you meet them in a list here than in production.

What's next

The Agent Client Protocol transport is in the tree but dormant, and the later phases of that work are where my attention goes next; the per-provider accept-edits tightening is complete across all five; and the remote surface, having been red-teamed down to zero open findings, is now the thing I would actually trust to face a network. As ever, the honest position is that wrapping five vendor CLIs that each move on their own cadence is a moving target, and the work is never finished so much as kept current. But the front door is real now, it locks, and it knows who is knocking.

The gateway is on npm as llm-cli-gateway (2.9.0 is current), the source and signed installer artefacts are on the public mirror, and the website at llm-cli-gateway.dev is built so an MCP client can read one URL and configure itself. If you orchestrate more than one model, or you are about to, I think it is worth the ten minutes.

Hardening API Scan Boundaries in skill-scanner, with sqry as the Review Map

Werner Kasselman — Sun, 14 Jun 2026 12:24:42 +0000

On 14 June 2026 I cloned cisco-ai-defense/skill-scanner, set up the locked uv environment, and worked through one small but important question: what does it take to make the REST API safer when the API can scan local directories, accept uploaded ZIP files, run optional analyzers, and queue batch work in the background?

I am not pretending this is a universal API security methodology, or that one branch makes a whole product "secure" in the abstract. This is a narrower story, and I think the narrowness is the useful part: a concrete pass over one public Python repository, with a hardening branch called codex/harden-api-scan-boundaries, ending in commit 2cfa313 and draft PR #119, where the evidence was code, tests, docs, and a graph of the repository rather than a confident read of the obvious files.

The branch changed 24 files, with 1186 insertions and 210 deletions. The main implementation files were skill_scanner/api/router.py, skill_scanner/core/analyzer_factory.py, skill_scanner/core/extractors/content_extractor.py, skill_scanner/core/loader.py, and skill_scanner/core/scanner.py, plus two new shared modules: skill_scanner/core/archive_limits.py and skill_scanner/core/fs_limits.py.

The target: an API wrapped around local scanning work

skill-scanner scans Agent Skill packages. It has CLI paths, Python library paths, eval paths, pre-commit hook paths, and a FastAPI router that exposes endpoints for direct skill scans, uploaded ZIP scans, batch scans, batch-result polling, health checks, and analyzer listing.

That matters because the REST API does not sit in front of a simple database lookup. It sits in front of local filesystem access, archive extraction, analyzer construction, optional remote-service analyzers such as VirusTotal and Cisco AI Defense, LLM-backed analysis, scanner traversal, loader discovery, and report generation. A bug in one visible route handler can be obvious. A missing bound in a shared loader, reached through API, CLI, evals, tests, and scanner methods, is much easier to miss.

The first setup step was boring and necessary:

uv sync --frozen --all-extras --dev

That gave the API dependencies, analyzer extras, pytest, lint tooling, and the project commands needed to move from reading code to running it. The repository also had clear contribution constraints in CONTRIBUTING.md: include tests for changed behaviour, update docs where behaviour or configuration changes, use a conventional commit, keep the uv.lock model intact, and verify with the repository's normal commands.

The hardening target became four broad risk classes:

API callers should not be able to turn server-side path handling into arbitrary filesystem access.
Uploaded archive names and archive contents should not control where the server writes or what it follows.
Request-controlled expensive work needs caps, especially batch scans, traversal, archive expansion, loader discovery, and LLM consensus runs.
Operator-side configuration should remain operator-side configuration, especially remote analyzer endpoints.

There is also the basic API boundary: scan work and scan-result retrieval now require X-API-Key, and the expensive endpoints have process-local rate limiting. Root, health, and analyzer-listing endpoints remain informational.

Why sqry changed the shape of the review

The tool that changed the review was sqry, version 20.0.5. sqry uses "semantic" in the compiler sense: it parses code into ASTs, builds a graph of symbols and relationships, and answers structural questions from that graph. It is not an embedding search tool, and it is not just grep with better ranking.

The local index for this repository had 20,445 symbols across 202 files, with relation support enabled. The graph manifest recorded 26,120 edges across 200 Python files, one Ruby file, and one shell file. That is the practical reason it helped here: the API hardening problem crossed API request models, FastAPI handlers, shared scan implementation, analyzer construction, scanner traversal, loader discovery, archive extraction, documentation, and tests.

The first useful query was not clever:

sqry query 'path:skill_scanner/api/router.py AND kind:function'

It returned 98 function symbols from skill_scanner/api/router.py in about 35 ms on this checkout. More importantly, it produced a checklist that included scan_skill, _scan_skill_impl, scan_uploaded_skill, scan_batch, get_batch_scan_result, run_batch_scan, _validate_path, _count_batch_candidates, and _build_analyzers.

That sounds mundane until you compare it with a manual route read. A manual read tends to start from decorators and then follow the code that looks important. sqry gave me the public route handlers and the helpers in one structural inventory, before I had decided which parts mattered.

The scanner side was the same:

sqry query 'path:skill_scanner/core/scanner.py AND kind:function'

That returned 76 function symbols in about 31 ms, including SkillScanner.scan_skill, SkillScanner.scan_directory, and _find_skill_directories. The useful distinction was between single-skill scanning, directory discovery, and module-level convenience functions. For a hardening pass, that distinction is load-bearing.

Then the review shifted from "where is this string?" to "what code can reach this behaviour?"

sqry graph direct-callers _validate_path --json

sqry reported four direct callers: _resolve_policy, _scan_skill_impl, scan_batch, and run_batch_scan. That made the path gate concrete. It was not enough to harden the direct /scan path. The same gate needed to cover policy paths, direct skill paths, batch roots before queuing, and batch execution inside the background task.

The loader trace was the bigger warning:

sqry graph direct-callers 'SkillLoader.load_skill' --json

That returned 92 direct callers across evals, API code, CLI code, scanner code, and tests. This is where plain text search is weak. You can find load_skill text matches, but you still have to reason manually about which are method calls, convenience wrappers, test helpers, and shared execution paths. sqry made the broad shared surface visible, which is why the fix did not stop at the API router. The loader itself needed a bounded contract.

The same pattern showed up in analyzer construction. build_analyzers had 11 direct callers across API, CLI, hooks, evals, and tests. That meant llm_consensus_runs needed two checks: request-model validation at the API edge, and a second cap inside the analyzer factory so non-API callers get the same invariant.

For LLMAnalyzer._consensus_analyze, sqry reported one direct caller, LLMAnalyzer.analyze_async, which kept the execution-side analysis focussed. The cap belongs before construction reaches the analyzer loop.

Plain rg still had a place for exact strings, route decorators, docs, and final sanity checks. The difference is that sqry gave the graph-backed layer: functions and methods instead of arbitrary text, same-name symbols separated across API, CLI, hooks, evals and tests, and caller/callee traces for security-sensitive helpers.

What was fixed

The API path boundary now fails closed. _validate_path rejects null bytes, resolves the supplied path, and denies access unless SKILL_SCANNER_ALLOWED_ROOTS is configured and the resolved path is inside one of those roots. If no roots are configured, API filesystem access is denied.
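The shape of a fail-closed path gate can be sketched as follows. This is not the branch's _validate_path: the function name, the exception types, and the assumption that SKILL_SCANNER_ALLOWED_ROOTS is an os.pathsep-separated list are all mine, chosen only to illustrate the three checks in order.

```python
import os
from pathlib import Path


def validate_path_sketch(raw: str, allowed_roots: str) -> Path:
    """Resolve `raw` and return it only if it sits inside a configured root."""
    if "\x00" in raw:
        raise ValueError("null byte in path")
    # Assumption: roots arrive as one os.pathsep-separated string.
    roots = [Path(r).resolve() for r in allowed_roots.split(os.pathsep) if r]
    if not roots:
        raise PermissionError("no allowed roots configured; filesystem access denied")
    resolved = Path(raw).resolve()  # collapses '..' and follows symlinks
    if not any(resolved == root or root in resolved.parents for root in roots):
        raise PermissionError("path is outside the allowed roots")
    return resolved
```

Resolving before comparing is the important ordering: a check on the raw string would accept something like an allowed root followed by "/../..".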

That is a deliberate posture. An API that scans local paths should not assume that "current working directory" is a sensible trust boundary, and it should not silently accept arbitrary absolute paths because the caller knows them.

The upload path changed in a similarly blunt way. /scan-upload still checks the client-provided filename to require a .zip upload, but the server no longer uses that filename for the staging path. Uploaded bytes are written to:

zip_path = temp_dir / "upload.zip"

That small line removes an entire class of filename-controlled staging behaviour. Around it, the upload flow now streams in 1 MB chunks, enforces a 50 MB upload limit, reads ZIP EOCD metadata before constructing ZipFile, rejects ZIPs over 500 entries, rejects uncompressed ZIP contents over 200 MB, rejects path traversal entries by resolving each destination under the extraction root, rejects symlink entries, checks again after extraction that no symlink appeared on disk, and only then searches the extracted tree for SKILL.md using a bounded walk.
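Several of those archive checks can be done on ZIP metadata before a single byte is extracted. The Python sketch below shows one way to do that with the standard zipfile module; it is an illustration rather than the branch's code. The function name is hypothetical, the limits are written as binary megabytes by assumption, and unlike the real flow it counts entries after opening the archive rather than from the EOCD record.

```python
import stat
import zipfile
from pathlib import Path

MAX_ENTRIES = 500
MAX_UNCOMPRESSED_BYTES = 200 * 1024 * 1024  # assumed interpretation of "200 MB"


def check_archive(zf: zipfile.ZipFile, extract_root: Path) -> None:
    """Raise ValueError if the archive breaks a limit or tries to escape the root."""
    infos = zf.infolist()
    if len(infos) > MAX_ENTRIES:
        raise ValueError("too many entries")
    # file_size is the size declared in the archive metadata.
    if sum(info.file_size for info in infos) > MAX_UNCOMPRESSED_BYTES:
        raise ValueError("uncompressed contents too large")
    root = extract_root.resolve()
    for info in infos:
        # For archives made on Unix, the file mode is in the high 16 bits.
        if stat.S_ISLNK(info.external_attr >> 16):
            raise ValueError(f"symlink entry rejected: {info.filename}")
        dest = (root / info.filename).resolve()
        if dest != root and root not in dest.parents:
            raise ValueError(f"path traversal rejected: {info.filename}")
```

Declared sizes can be forged, so a metadata check like this complements, and does not replace, enforcing the limit while bytes are actually written.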

The EOCD preflight lives in skill_scanner/core/archive_limits.py as read_zip_member_count. It reads the ZIP end-of-central-directory metadata, including the ZIP64 case, before the code has to build a ZipFile object and iterate the archive. The same helper is used by the API upload handler and by ContentExtractor, so archive member-count limits are not two unrelated implementations that can drift.
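To make the preflight concrete, here is a small Python sketch that reads the entry count straight from the end-of-central-directory record of an in-memory archive. It is not read_zip_member_count: the function name and the bytes-in interface are assumptions, and it stops with an error at the ZIP64 sentinel instead of following the ZIP64 record as the real helper is described as doing.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_FIXED_SIZE = 22      # record size without the trailing comment
MAX_COMMENT_SIZE = 0xFFFF  # the comment length field is 16 bits


def eocd_member_count(data: bytes) -> int:
    """Return the total entry count recorded in a ZIP's EOCD record."""
    # The record sits at the end, followed only by an optional comment.
    tail = data[-(EOCD_FIXED_SIZE + MAX_COMMENT_SIZE):]
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0 or len(tail) - pos < EOCD_FIXED_SIZE:
        raise ValueError("end-of-central-directory record not found")
    # After the signature: disk number, disk with central directory,
    # entries on this disk, total entries, CD size, CD offset, comment length.
    fields = struct.unpack("<HHHHIIH", tail[pos + 4:pos + EOCD_FIXED_SIZE])
    total_entries = fields[3]
    if total_entries == 0xFFFF:
        raise ValueError("count is held in the ZIP64 record; not handled in this sketch")
    return total_entries
```

The value of reading this first is that an oversized archive can be rejected from a few dozen bytes at the end of the file, before any parser walks the central directory.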

The traversal helpers live in skill_scanner/core/fs_limits.py:

iter_directory_bounded
walk_directory_bounded

Both are based on os.scandir, and both count entries as they are yielded rather than first materialising a whole tree. They are now used by API batch preflight, scanner directory discovery, loader file discovery, lenient markdown synthesis, and uploaded-tree search. That is the kind of change that looks less exciting than a route patch, but it is exactly where the graph evidence mattered. If the loader has 92 direct callers, the loader cannot depend on the API being the only adult in the room.
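A generator over os.scandir that counts as it yields might look like the sketch below. The names here (bounded_walk, TraversalLimitExceeded) are hypothetical and the real iter_directory_bounded and walk_directory_bounded may differ in signature and error handling; the point is only the counting discipline.

```python
import os
from typing import Iterator


class TraversalLimitExceeded(Exception):
    """Raised when a walk visits more entries than the caller allowed."""


def bounded_walk(root: str, max_entries: int) -> Iterator[os.DirEntry]:
    """Yield directory entries under `root`, stopping hard at `max_entries`."""
    visited = 0
    pending = [root]
    while pending:
        current = pending.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                visited += 1
                if visited > max_entries:
                    raise TraversalLimitExceeded(
                        f"visited more than {max_entries} entries"
                    )
                yield entry
                # Symlinked directories are yielded but never descended into.
                if entry.is_dir(follow_symlinks=False):
                    pending.append(entry.path)
```

Because the count is checked before each yield, a caller that breaks out early pays only for what it consumed, and a hostile tree cannot force the whole listing into memory first.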

Batch scanning now validates the batch root, counts candidates before queueing background work, rejects requests over the configured candidate limit, and passes bounds into SkillScanner.scan_directory:

max_candidates=MAX_BATCH_SKILLS
max_entries_visited=MAX_BATCH_PATHS_VISITED

The default values in the API are 100 candidate skills and 10,000 filesystem entries. The scanner then passes loader bounds into SkillLoader.load_skill, which means the per-skill load step is part of the same bounded execution path rather than an unbounded second phase.

The analyzer boundary changed too. llm_consensus_runs is capped in the API request models with Pydantic, and again in build_analyzers. The API no longer exposes a remote-callable Cisco AI Defense URL override; the analyzer factory can still use operator-controlled arguments and environment configuration, including AI_DEFENSE_API_URL, but the public request model does not let a caller pick the remote endpoint for the server.
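
The two-boundary cap can be sketched as follows. To stay dependency-free this uses a plain dataclass rather than Pydantic, and the cap value, the `ScanRequest` shape and `build_analyzer_config` are all hypothetical. Whether the real factory clamps or rejects is not stated in this post; the sketch clamps.

```python
from dataclasses import dataclass

MAX_CONSENSUS_RUNS = 5  # hypothetical cap, for illustration only


@dataclass(frozen=True)
class ScanRequest:
    llm_consensus_runs: int = 1

    def __post_init__(self) -> None:
        # Request boundary: reject out-of-range values outright.
        if not 1 <= self.llm_consensus_runs <= MAX_CONSENSUS_RUNS:
            raise ValueError("llm_consensus_runs out of range")


def build_analyzer_config(llm_consensus_runs: int) -> dict:
    # Factory boundary: callers that bypass the request model
    # (CLI, evals, tests) still end up with a bounded value.
    bounded = max(1, min(llm_consensus_runs, MAX_CONSENSUS_RUNS))
    return {"llm_consensus_runs": bounded}
```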

Finally, scan endpoints now require X-API-Key backed by SKILL_SCANNER_API_KEY. /scan, /scan-upload, /scan-batch, and /scan-batch/{scan_id} all check it. The result cache for batch scans is also bounded: 1,000 entries, with a 3600 second TTL. The rate limiter is deliberately process-local, configurable through SKILL_SCANNER_API_RATE_LIMIT_REQUESTS and SKILL_SCANNER_API_RATE_LIMIT_WINDOW_SECONDS; that is useful for this server, but it is not a distributed quota system, and the docs should make that kind of caveat visible.
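
To show why "process-local" is a caveat worth documenting, here is a sketch of an in-memory sliding-window limiter. It is an assumed design, not the server's limiter: the class name and the injectable clock are illustrative.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory limiter: all state lives in this one process."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Two worker processes each hold their own `_hits` table, so the effective quota is multiplied by the number of workers. That is the behaviour the documentation caveat needs to spell out.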

Tests and docs closed the loop

The branch did not stop at implementation. Tests were added or updated across:

tests/test_api_endpoints.py
tests/test_api_deep.py
tests/test_analyzer_factory.py
tests/test_loader.py
tests/test_scanner.py
tests/test_extractors.py
tests/test_cli_tui_api_fixes.py

The focussed verification command was:

uv run pytest \
  tests/test_api_endpoints.py \
  tests/test_api_deep.py \
  tests/test_analyzer_factory.py \
  tests/test_loader.py \
  tests/test_scanner.py \
  tests/test_extractors.py \
  tests/test_cli_tui_api_fixes.py \
  -q

On the current checkout, that collected 216 tests and returned 215 passed, 1 skipped on Python 3.13.13, with only third-party deprecation warnings. The process report also records a broader non-integration, non-LLM, non-e2e run at 1308 passed, 5 skipped, 7 deselected, plus ruff check . and git diff --check during the contribution.

The documentation updates matter because this is not only a code contract. .env.example, API docs, operations docs, endpoint detail pages, and generated reference docs now describe SKILL_SCANNER_API_KEY, SKILL_SCANNER_ALLOWED_ROOTS, rate limits, traversal limits, archive limits, batch limits, and the LLM consensus cap. A security control that exists only in code is easier to bypass operationally than one that is named in the configuration surface people actually read.

What this says about AI-assisted code review

The useful lesson here is not "AI found security bugs". That is too vague, and frankly not the interesting part.

The useful lesson is that AI-assisted review gets much better when the agent is forced to work from repository facts that can be rerun: symbol inventories, caller traces, callee traces, exact changed files, test names, and concrete verification commands. A model can read the most obvious route handler and sound convincing. A graph can show that the helper under discussion has four direct callers, or that a loader method has 92 direct callers, and that changes the review from opinion to coverage.

That is where sqry was valuable. It made the review faster, but the speed was not the main win. The main win was not having to trust a first-pass mental map of the codebase. The map was queryable, and when the map said the loader was shared across API, CLI, eval, scanner, and tests, the fix moved down into the loader. When the map said analyzer construction was shared, the consensus cap moved into the factory as well as the API request model.

This is also why I do not like abstract claims about "secure by design" unless the design names the boundary and the evidence. In this branch, the claims are more modest and more useful: API path access fails closed without configured roots; uploaded filenames no longer control staging paths; archive expansion has member, size, traversal, and symlink checks; batch discovery and scanner traversal have explicit limits; loader discovery has explicit limits; LLM consensus runs are capped at both the request and factory boundary; the focussed suite passes.

Those are claims a maintainer can inspect.

A note from adjacent SkillSpector work

The same pattern showed up while working through issues in NVIDIA SkillSpector: Stage 2 LLM batch failures, retry and concurrency behaviour, unanalyzed findings, ingest-layer bounds, and whitespace-padding detection all ended up being boundary questions. Different repository, different implementation, same shape of problem.

This is the part that feels important to me. AI-assisted development can help us ship faster, but faster shipping also means we can expose larger attack surfaces sooner: more API entry points, more archive and clone paths, more model calls, more background work, more places where a scanner accepts untrusted input. The answer is not to slow everything down by default; it is to make boundary review part of the shipping motion, with concrete limits, tests, and code-graph evidence before the surface gets too wide to reason about.

Takeaways

Start with the execution surface, not the file you happen to be reading. For this branch, the API surface crossed router, scanner, loader, extractor, analyzer factory, docs, and tests.
Use text search for strings, but use AST and graph search for structure. Same-name symbols across API, CLI, hooks, evals, and tests are not one behaviour.
Put limits where shared code is reached, not only where public requests enter. The loader trace is the obvious example here.
Make fail-closed behaviour explicit. SKILL_SCANNER_ALLOWED_ROOTS being absent means no API path access, not "scan whatever path was supplied".
Treat docs as part of the control surface. If an operator must set API keys, allowed roots, traversal caps, or archive limits, the docs need to say so in the places operators read.

Thanks for reading this far, I hope this is useful if you are hardening an API that wraps local filesystem work, archive extraction, or other expensive scanner-style behaviour. The bit I would reuse first is not any single line of code, it is the habit of asking the repository graph where the boundary actually runs before deciding where the fix belongs.

llm-cli-gateway 2.5.0: OAuth for remote MCP connectors and safer workspaces

Werner Kasselman — Mon, 08 Jun 2026 11:30:47 +0000

llm-cli-gateway 2.0.0 was the quiet supply-chain release. It moved persistence to Node's built-in node:sqlite, removed the production better-sqlite3 native install path, and made the package simpler to install and easier to audit.

That was intentionally not a flashy release. It was about removing risk.

The releases since then have been about the product surface: making the gateway easier for MCP clients to understand, keeping provider contracts current, adding a direct xAI API path alongside the existing Grok CLI provider, and now making remote MCP connector setup use OAuth instead of credential-shaped URL shortcuts.

The short version: llm-cli-gateway@2.5.0 is now published on npm, the GitHub release has signed installer artifacts, and the gateway has a safer remote-connector story than it had at 2.0.0.

2.5.0 adds OAuth for remote MCP connectors

The biggest change in 2.5.0 is the remote connector auth model.

The gateway now exposes public-ready MCP OAuth metadata and an authorization-code flow for remote MCP clients. That means clients such as ChatGPT custom connectors can discover the authorization server, request a code, exchange it for an opaque bearer token, and call the MCP endpoint without relying on a static bearer header pasted into a provider UI.

The setup shape is deliberately conservative:

static OAuth clients can be configured with hashed client secrets;
dynamic client registration is not open by default;
dynamic registration, when enabled, is gated by either explicit public-client policy or a shared registration secret;
shared secrets and client secrets are stored only as hashes;
secrets are never accepted in query strings;
generated client secrets are copy-once local output;
doctor, setup JSON, and default CLI output redact secret-bearing fields.

The practical result is that the public /mcp endpoint can support remote web connectors through OAuth while local bearer-token clients keep working.

The old ChatGPT no-auth URL path is deprecated

Earlier HTTP setup work created a separate high-entropy ChatGPT connector URL because ChatGPT connector setup could not rely on arbitrary static Authorization headers.

2.5.0 replaces that new-setup path with OAuth.

The current ChatGPT setup flow is:

llm-cli-gateway tunnel start
llm-cli-gateway oauth client add chatgpt --redirect-uri <ChatGPT callback URL> --print-once
llm-cli-gateway print-client-config

In ChatGPT, use the verified public /mcp URL with Authentication: OAuth, plus the authorization and token URLs from print-client-config or the setup UI.

The old high-entropy no-auth URL is retained only as a deprecated compatibility surface. New setup docs, the setup UI, and assistant runbooks no longer recommend it. Doctor output also redacts old persisted no-auth connector URLs instead of reconstructing them.

Workspaces are now registered aliases, not arbitrary paths

Remote MCP clients should not be able to browse or select arbitrary local filesystem paths. 2.5.0 adds a workspace registry so provider requests can target a named workspace alias instead.

The registry supports:

workspace aliases;
configured allowed roots;
default workspace selection;
provider request workspace input across sync and async request tools;
session metadata so a selected workspace can carry through provider-owned sessions;
workspace-aware async dedup keys, so the same argv in two different workspaces does not collide.
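
The last item is easy to picture with a small sketch. It is written in Python for brevity (the gateway itself is a Node package), and `dedup_key` and its inputs are hypothetical rather than the gateway's actual key format.

```python
import hashlib
import json
from typing import List


def dedup_key(provider: str, workspace: str, argv: List[str]) -> str:
    """Hash provider, workspace alias and argv into one stable key."""
    # JSON keeps element boundaries unambiguous, unlike joining with spaces.
    payload = json.dumps([provider, workspace, argv], separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With the workspace folded into the key, the same argv submitted against two aliases yields two different jobs instead of one deduplicated result.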

For local administration there are also workspace creation tools, but they are intentionally narrow. A workspace admin can create a new folder or initialize a new local Git repository under a configured allowed root. The gateway rejects absolute remote paths, traversal, denied directory names, symlink escapes, and existing non-empty targets. There is no network clone in this release.

That last point is important. This is not a remote filesystem browser and not a general "clone this URL into my machine" tool. It is a controlled local workspace registry.

Remote provider requests fail closed before spawning

The security invariant for 2.5.0 is simple: a remote OAuth-authenticated provider request must resolve to a registered workspace before any provider CLI is spawned.

That applies to the normal provider tools:

claude_request
codex_request
gemini_request
grok_request
mistral_request
the async variants

It also applies to codex_fork_session, which matters because forking a Codex session is still a provider spawn path.

Local bearer/stdio callers keep the existing local behavior unless they explicitly ask for unsafe workingDir or addDir values. Remote OAuth callers, by contrast, need an explicit workspace, a session-associated workspace, or a configured default workspace. Otherwise the gateway fails before the child process starts.

That closes off the bad fallback where a remote request silently inherits the gateway process cwd or ends up running in ~/.llm-cli-gateway.
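
The invariant can be sketched as a resolution step that runs before any spawn. This is an illustration in Python, not the gateway's code: the names are hypothetical, and the precedence (explicit, then session, then default) is an assumption based on the order the options are listed above.

```python
from typing import Dict, Optional


class WorkspaceRequired(Exception):
    """Raised before any provider process is started."""


def resolve_workspace(
    is_remote: bool,
    explicit: Optional[str],
    session_workspace: Optional[str],
    default_workspace: Optional[str],
    registry: Dict[str, str],
) -> Optional[str]:
    """Return the directory to run in, or raise before spawning."""
    alias = explicit or session_workspace or default_workspace
    if alias is None:
        if is_remote:
            # Fail closed: never fall back to the gateway's own cwd.
            raise WorkspaceRequired("remote request has no registered workspace")
        return None  # local callers keep their existing behaviour
    if alias not in registry:
        raise WorkspaceRequired(f"unknown workspace alias: {alias}")
    return registry[alias]
```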

2.4.0 still matters: direct Grok API and provider-owned sessions

The 2.5.0 release builds on the 2.4.0 product work.

2.4.0 added a separate direct API provider for xAI: grok-api.

This is not a transport flag on grok_request. It is a distinct provider type and a distinct tool, grok_api_request, because the API path has a different contract from an agentic CLI:

no sandbox or approval-mode flags;
no CLI process to spawn;
no grok local login requirement;
session continuity through xAI Responses API metadata rather than CLI resume flags;
API-only request parameters such as xAI Responses fields.

Configuration is isolated under [providers.xai]. The gateway stores the name of the API-key environment variable, not the secret itself. The tool is only registered when [providers.xai] is configured and the named environment variable is present.

Adding grok-api also forced a useful cleanup: stored gateway sessions are now owned by a provider, not treated as generic strings that any handler might try to resume.

The wider provider set now includes:

claude
codex
gemini
grok
mistral
grok-api

Wrong-provider session reuse is rejected across request handlers instead of failing later in a provider-specific way. A grok-api session should not be passed to grok_request, and a Codex session should not be passed to claude_request.

This is a boring invariant until it saves you from debugging a bad resume id at the wrong layer.
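
A sketch of the ownership check, in Python for brevity and with hypothetical names. The idea is only that the stored session records its provider and every handler compares before resuming.

```python
from typing import Dict


class WrongProviderSession(Exception):
    """A session was passed to a handler for a different provider."""


def check_session_owner(sessions: Dict[str, str], session_id: str,
                        provider: str) -> None:
    """Raise unless session_id exists and is owned by provider."""
    owner = sessions.get(session_id)
    if owner is None:
        raise KeyError(f"unknown session: {session_id}")
    if owner != provider:
        raise WrongProviderSession(
            f"session {session_id} belongs to {owner}, not {provider}"
        )
```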

MCP tools are clearer and safer for clients

The 2.1.0, 2.2.0, and 2.3.0 releases were mostly about improving the MCP surface itself.

2.1.0 added Grok Build 0.2.32 support, including the leaderSocket parameter for grok_request and grok_request_async. It also improved upstream contract drift handling: the gateway can now distinguish hidden upstream flags from true missing flags, and it can acknowledge upstream-only flags that the gateway intentionally does not emit.

2.2.0 made all tools self-describing. Before that, clients saw tool names and schemas, but not much action-level description. Now the tool descriptions explain what each tool does, when sync requests can defer, why job_status differs from llm_job_status, and which tools are local-only.

2.3.0 added MCP tool annotations:

display titles;
readOnlyHint;
destructiveHint;
idempotentHint;
openWorldHint.

Those annotations let MCP clients build better confirmation UX. A read-only local status tool can be treated differently from a provider-spawning request that may cause an agentic CLI to modify files.

The important bit is not that the metadata exists. The important bit is that the metadata is tested as an invariant: exact read-only, destructive, and open-world sets are pinned, and contradictory read-only plus destructive annotations are rejected.
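
As an illustration of what "tested as an invariant" can mean, the sketch below pins the exact read-only set and rejects contradictory hints. It is Python rather than the gateway's own test code, and the tool names in the usage are hypothetical.

```python
from typing import Dict, Set


def check_annotations(tools: Dict[str, Dict[str, bool]],
                      expected_read_only: Set[str]) -> None:
    """Raise AssertionError if hints contradict or the pinned set drifts."""
    for name, hints in tools.items():
        if hints.get("readOnlyHint") and hints.get("destructiveHint"):
            raise AssertionError(f"{name} is both read-only and destructive")
    read_only = {n for n, h in tools.items() if h.get("readOnlyHint")}
    if read_only != expected_read_only:
        raise AssertionError(
            f"read-only set drifted: {sorted(read_only ^ expected_read_only)}"
        )
```

Pinning the exact set means a newly added tool fails the test until someone decides, on purpose, which side of the line it sits on.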

Resource URIs now use valid schemes

MCP Inspector caught a concrete interoperability bug in the resource surface.

The gateway had advertised resource URIs like:

cache_state://global
provider_subcommands://catalog

Those look readable to a human, but underscores are not valid in URI schemes. Standard URL parsing rejected them.

2.4.0 fixed the advertised resources to use hyphenated schemes:

cache-state://global
cache-state://session/{sessionId}
cache-state://prefix/{hash}
provider-subcommands://catalog
provider-subcommands://{provider}/{commandPath}

Legacy direct provider_subcommands://... reads are still accepted internally for compatibility tests and older direct callers, but standard MCP clients should use the advertised hyphenated forms.
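
The rule behind the bug is short enough to check directly. RFC 3986 defines a scheme as a letter followed by letters, digits, "+", "-" or "."; the sketch below encodes that grammar. The function name is illustrative.

```python
import re

# RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def has_valid_scheme(uri: str) -> bool:
    """True if the text before the first ':' is a valid URI scheme."""
    scheme, sep, _rest = uri.partition(":")
    return bool(sep) and SCHEME_RE.fullmatch(scheme) is not None
```

Running it over the two forms shows the difference: `cache_state://global` fails on the underscore, `cache-state://global` passes.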

After the fix, MCP Inspector successfully read every advertised resource: skills, sessions, models, metrics, cache state, provider subcommand catalog, and process health.

Provider subcommand contracts are visible

The gateway tracks upstream CLI contracts so it can reject unsupported flags before spawning a provider CLI. 2.4.0 extended the planning and resource side of that work.

There are now provider subcommand catalog and detail resources, plus tools for listing provider subcommands, reading a subcommand contract, and checking drift.

This is intentionally CLI-only. The direct grok-api provider is not a spawnable CLI and does not belong in the same subcommand contract path. That split is explicit.

The practical value: an MCP client can inspect the provider command surface instead of relying only on prose docs or hardcoded assumptions.

Host auto-upgrade operations landed

2.4.0 also added an operational path for machines that run the gateway as a local appliance.

The scripts/host-upgrade.sh flow stages npm releases into versioned directories, verifies the staged binary, applies upgrades atomically, and supports rollback. There are also user systemd service and timer units for scheduled upgrade checks.

This is not a replacement for the signed GitHub installer artifacts. It is for hosts where npm is the chosen install channel and you want a managed, reversible upgrade loop rather than an ad hoc global install command.

What changed from the 2.0.0 story

2.0.0 made the package safer to install.

2.1.0 through 2.5.0 made the gateway better to operate and easier for MCP clients to reason about:

Grok CLI support stayed current with upstream.
Tool descriptions and annotations now describe the real behavior of every MCP tool.
Direct xAI API access exists alongside the Grok CLI path.
Sessions are provider-owned, so cross-provider resume mistakes fail early.
Cache and provider-subcommand resources use valid URI schemes.
Provider subcommand contracts are inspectable through MCP.
Remote web connector setup now uses MCP OAuth instead of no-auth connector URLs.
Workspace aliases give remote clients a bounded way to select where provider CLIs run.
Local workspace creation is constrained to configured allowed roots and local git init.
Host upgrade operations have a staged and rollback-capable path.

The gateway is still what it has been from the start: one MCP endpoint that wraps provider CLIs and exposes durable jobs, sessions, validation, review, and provider orchestration.

The difference is that the surface is now less ambiguous. Clients can see which tools exist, what they do, how risky they are, which resources can be read, which provider owns a session, and which workspace a remote request is allowed to use.

That is the kind of functionality work that matters after the supply-chain story is handled. Fewer surprises at install time, fewer surprises at runtime.

Release evidence

2.5.0 shipped through the public mirror release path:

npm publishes with GitHub Actions provenance;
release installer artifacts are signed and uploaded;
public mirror CI, security, OpenSSF Scorecard, and CodeQL passed on the release commit;
the local release gate passed go test ./..., npm run build, npm run lint, npm run format:check, npm test, and npm run upstream:contracts;
the full test suite passed at 1,152 tests.

As always, MIT licensed.

Reviewing Patrick Collison's Ask for an LLM Workflow Tool

Werner Kasselman — Sun, 07 Jun 2026 01:21:41 +0000

Patrick Collison (https://x.com/patrickc) recently outlined the LLM workflow tool he actually wants.
I know pointing at my own work can read as self-promotion, but what I am actually trying to do is stress-test the production model I've been running under the vap umbrella in verivus-oss.
That model lands right in the gap Patrick describes, and the evidence from real runs (including public X threads and the recent ledger distribution review) is there.

Patrick wants:

Ability to manage a set of input files (Markdown or similar), plus other general-purpose context.
Real-time collaboration, with some concept of snapshots or VCS integration.
The ability to create and manage inference workflows and a stored set of prompts.
Access to general-purpose coding agents (not just chat models).
Some concept of compiled outputs or inference results that can be shared externally.

He summarised the desired feeling as "GNU Autotools × Notion", a system for a body of material that you want to process iteratively, where certain artifacts are important enough to preserve, version, govern, and reason about across time.

The diagnosis is accurate. For many of us the generation bottleneck has moved. The dominant remaining problems are semantic state that survives many iterations and participants, coordination that doesn't collapse under mixed human and agent work, evidence that actually travels with the work, and governance that keeps intent explicit rather than dissolving into chat history or ad-hoc folders.

vap and the living studio surface

vap is the Verivus Assurance Platform, the umbrella under which the open verivus-oss work sits (and under which the deeper substrate in verivusai-labs is being built). The part that directly answers Patrick's friction is the living theatrical production studio, implemented as the agentassurance component.
Every body of work (a product, an initiative, even a single X reply series) becomes a zoomable Production inside the studio. The layout is the interface:

Productions live in the left sidebar as the hierarchy. I can sit at the full Verivus portfolio level or zoom down to a 22-unit DAG-TOML remediation plan. The same rules and ijbCRUD pane apply at every zoom.

Workspaces fill the centre: Storyboard for the typed DAGs that hold intent declarations, depends_on and blocks relations, acceptance criteria, and evidence requirements as first-class versionable artifacts; Scene for the current focused rehearsal; Explore for semantic cartography; Working On for the live messy iteration surface.

Exhibition sits in the right sidebar: the compiled outputs worth preserving and sharing, carrying full chain of custody.
Shared Resources run along the bottom (Props, Cast, Timeline), with Next in Line holding the queued pipeline.

The central operating verb across every layer is ijbCRUD, provenance-aware and evidence-backed by construction. Closure roots travel with the artifacts. Assertions live in the canon. This is what makes the state survive iterations and participants instead of collapsing back into chat or untrusted folders.

This is Autotools × Notion lifted into a full production process, grounded in DAG-TOML plus the Agent Assurance specification. Explicit intent, evidence via closure roots, cryptographic provenance, IJB assertions as substrate, runtime-neutral by design.

How it maps to the requirements

Input files plus general-purpose context: sqry (the semantic/living graph and memory layer). Soon to be called scrub on integration.
Real-time collaboration plus snapshots/VCS: weave (CRDT multi-actor rehearsal system with structural operations) plus ledger (evidence-rich semantic episodes that replace brittle file/branch/commit records).
Stored prompts plus inference workflows: storyboard (Director’s Planning Board) using typed DAGs as first-class artifacts, dependencies, acceptance criteria, evidence requirements, tiered ranking, and status all explicit.
General-purpose coding agents: agentfederator (Casting Director). Deliberate multi-LLM routing, frontier models for high-intent planning, quantized open models on capable hardware for execution velocity and cost.
Compiled outputs that can be shared: Exhibition layer (ledger episodes plus structural codec plus assurance substrate for provenance, signing, and attribution).
Supporting roles round it out: ingestor, ijb (the Master Script/Canon), arctos (Production Runtime), bulwark plus vault, and meter.

What's live under verivus-oss today

I've already run this evolving model in public: X replies and crossposts with multi-LLM consensus and evidence traces, the ledger distribution review governed by a living 22-unit DAG-TOML plan.
The plan and its evidence became the shareable Exhibition record. sqry itself has been used in real audits. Earlier articles and repo briefs on dag-toml and the production model are out there too.
These are real, usable artifacts.

What's still behind the curtain

The fuller vision lives in the internal verivusai-labs work under vap: the complete substrate (ijb, vault, ledger, integrity, meter and related crates), deeper studio refinements, and day-to-day use on larger efforts. I'm surfacing pieces as they stabilise. The published verivus-oss artifacts are the current on-ramp. This is nights and weekends alongside the day job, completely disconnected, with learnings feeding one way only (#ihaveadayjob).

Invitation

If Patrick's description matches the friction you feel doing serious long-running agentic work (context that survives iteration, workflows that are versioned and governed, agents deliberately cast, outputs that can be exhibited with real provenance), this is the direction under vap in verivus-oss.

The published artifacts are the on-ramp.

Concrete experiments (running sqry on a real stack, authoring a typed DAG, using the pipeline for output) and precise evidence-based feedback on what would make you want to direct or act in a real Production are especially welcome.

Repo links and contact in profile. Early collaborators willing to engage the ontology and run real Productions are welcome.
The underlying conviction is that tools of this kind function as cognitive co-processors, common grace that removes a significant portion of the grinding burden of semantic entropy and coordination so the remaining human work can be higher-order direction and faithful stewardship of Productions.

The City-State and the Federation: Two Governance Models for AI Coding Agents

Werner Kasselman — Thu, 04 Jun 2026 11:04:56 +0000

Why I am writing this

This is the third piece in an accidental series about convergent evolution in agent tooling, and I think it is the most useful one, because this time the two systems being compared are not merely neighbours in the same field, they are the same species of thing: governance systems for AI coding agents, built in the same quarter, by people who have never spoken, with overlapping mechanisms and almost perfectly complementary blind spots. In the first article I described my DAG TOML stack, plans as machine-checkable claims with validators and a fleet control plane behind them, and in the second I compared two orchestrators. This one is about dgov by James H. Gearon, which describes itself as a "deterministic kernel for multi-agent orchestration via git worktrees". I should be straight about my method: I did not read the source line by line myself. I had my agents clone it and do the close reading (roughly 20,000 lines of Python across 70 modules, with 70 test files and a benchmarks document) and I worked from their structured analysis, the project's own documentation and the schema excerpts they pulled, which, given the subject of this article, feels less like a shortcut and more like a demonstration.

The usual disclaimer applies, doubled: I built one of the two systems, I have neither run nor personally read the other end to end, and any misreadings of dgov are mine (or my agents', which contractually is still mine). Take this as one practitioner reading a rival constitution with admiration, a highlighter and a research staff, nothing more.

Two metaphors, both load-bearing

The first thing that struck me reading dgov is that it is built on a legal metaphor, and the metaphor is structural rather than decorative. There is a governor charter (governor.md, "Plan first. Respect file claims. Fail closed."), standard operating procedures as statute, an append-only ledger whose entries include a category literally called case law, prompt sections injected into workers under the heading of probation, an error type named ConstitutionalViolation, and ten documented design pillars covering separation of powers and fail-closed defaults. The probabilistic worker implements; the deterministic governor plans, validates, reviews and merges. It is a constitution with an enforcement arm.

My stack runs on a different metaphor, scientific audit: plans are claims, validators attempt to refute them, completion requires evidence, and a control plane above many repositories evaluates everything against policy. Law versus science, enforcement versus refutation. Both metaphors earn their keep, and the differences between the two systems fall out of the metaphors with surprising neatness.

What a plan is

In dgov, a plan is a TOML tree compiled to a DAG, and each task carries its own prompt, the actual work order, alongside file claims (files.create, files.edit, files.read and so on), dependencies, a test command, a role (worker, researcher or reviewer), an iteration budget and a set of tag-matched SOPs that get prepended to the prompt. The plan is directly dispatchable: compile it, and workers in isolated git worktrees start executing it. Compilation is fail-closed: cycles, unreachable units and malformed sections are rejected before anything runs.

In my stack the plan deliberately contains no prompt at all. A unit carries contracts instead: acceptance criteria, constraints, failure modes, critical decisions, produced and consumed artefacts, and a [computed] section in which the author must commit to derived claims (critical path, per-layer parallelism, totals) that a validator independently recomputes and diffs. The plan is not a work order, it is a reviewable artefact that can be refuted before anyone executes it.
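
A sketch of what "independently recomputes and diffs" can look like for two derived claims. The field names `critical_path_length` and `total_units` are hypothetical, and the real validator covers more than this.

```python
from typing import Dict, List


def longest_path_length(depends_on: Dict[str, List[str]]) -> int:
    """Number of units on the longest dependency chain of an acyclic plan."""
    depth: Dict[str, int] = {}

    def visit(unit: str, stack: tuple = ()) -> int:
        if unit in stack:
            raise ValueError(f"cycle through {unit}")
        if unit not in depth:
            deps = depends_on.get(unit, [])
            depth[unit] = 1 + max(
                (visit(dep, stack + (unit,)) for dep in deps), default=0
            )
        return depth[unit]

    return max((visit(unit) for unit in depends_on), default=0)


def diff_computed(declared: Dict[str, int],
                  depends_on: Dict[str, List[str]]) -> Dict[str, tuple]:
    """Map each mismatched claim to (declared value, recomputed value)."""
    recomputed = {
        "critical_path_length": longest_path_length(depends_on),
        "total_units": len(depends_on),
    }
    return {key: (declared.get(key), value)
            for key, value in recomputed.items()
            if declared.get(key) != value}
```

An empty diff means the author's committed claims agree with the graph; anything else is a refutation with the two numbers side by side.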

So dgov closes the loop from plan to execution, and mine closes the loop from plan to review, and neither closes both. That asymmetry runs through everything else.

The thing dgov does that I do not

Credit first, because this is the part that made me sit up. At settlement time, dgov diffs the worktree and compares the files an agent actually touched against the files the task claimed it would touch, and the comparison is merciless: unclaimed paths reject the merge, reserved paths fail closed, and even reading outside the declared read scope is caught and surfaced. Git is the source of truth, and the claim is checked against reality mechanically, every time, with no human in the loop.
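
Reduced to its core, that settlement check is a set comparison. The sketch below is my illustration, not dgov's code: `settle` is a hypothetical name, and the touched set is assumed to come from something like `git diff --name-only` on the worktree.

```python
from typing import List, Set


def settle(claimed_write: Set[str], reserved: Set[str],
           touched: Set[str]) -> List[str]:
    """Return violations; an empty list means the merge may proceed."""
    violations = []
    for path in sorted(touched):
        if path in reserved:
            violations.append(f"reserved path touched: {path}")
        elif path not in claimed_write:
            violations.append(f"unclaimed path touched: {path}")
    return violations
```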

I have to concede this carefully, because the first draft of this paragraph conceded it wrongly. My plan runtime does not do that: my validators refute a plan's self-consistency (a declared critical path that is not the longest path fails, an artefact with two producers fails), and my evidence matrices require completion claims to name a proof with declared scope and known exclusions, but when a unit is marked done, nothing mechanically diffs the declared file claims against what actually changed. The honest complication is that the mechanism does exist elsewhere in my stack: my version-control layer, aivcs, records the symbols actually touched in each Episode and attaches evidence with a freshness lifecycle, which is claim-versus-reality binding at symbol granularity, finer than dgov's file granularity. What I am missing is not the mechanism, it is the wiring: the plan runtime and the version-control layer do not yet check each other. dgov verifies what happened against what was claimed in one continuous motion; I have both halves of that theorem proved in separate buildings. Those are different failure modes, and his is the better one.

dgov has two more mechanisms worth respecting. Its semantic settlement layer does AST-level analysis of integration candidates before merging, with a failure taxonomy of its own (text conflicts, concurrent edits to the same symbol, duplicate definitions, signature drift, ordering conflicts, and a category called behavioural mismatch), which I found quietly delightful, because building a failure taxonomy and then mechanising it is exactly the move my whole stack came from, except he aimed it at merge integration whilst I aimed it at review iteration. I will come back to that taxonomy below, because when I checked it against my own cupboard the comparison surprised me in both directions. And the kernel itself is a pure function from state and event to new state and actions, no I/O, explicit dispatch table, everything event-sourced to SQLite and an append-only deploy log, which means a run is deterministically replayable in a way my live-database runtime is not. There is even an autofix phase (mechanical lint fixes applied before the validation gates run), which saves the expensive kind of retry where an agent burns an iteration fixing a formatting complaint.
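
The replayability claim follows from the shape of the kernel, which a few lines can illustrate. This is a sketch of the pattern, not dgov's kernel: the event types, state layout and action strings are all hypothetical.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]


def _on_dispatched(state: State, event: dict) -> Tuple[State, List[str]]:
    return {**state, event["task"]: "running"}, []


def _on_tests_passed(state: State, event: dict) -> Tuple[State, List[str]]:
    return {**state, event["task"]: "passed"}, [f"merge:{event['task']}"]


# Explicit dispatch table: unknown event types are errors, not no-ops.
DISPATCH = {"task_dispatched": _on_dispatched, "tests_passed": _on_tests_passed}


def step(state: State, event: dict) -> Tuple[State, List[str]]:
    """Pure transition: no I/O, returns a new state plus actions to run."""
    handler = DISPATCH.get(event["type"])
    if handler is None:
        raise ValueError(f"unknown event type: {event['type']}")
    return handler(state, event)


def replay(events: List[dict]) -> Tuple[State, List[str]]:
    """Fold the event log through step; same log in, same result out."""
    state: State = {}
    actions: List[str] = []
    for event in events:
        state, new_actions = step(state, event)
        actions.extend(new_actions)
    return state, actions
```

Because `step` never mutates its input or touches the outside world, replaying a stored event log reconstructs the same state and the same action list every time.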

The thing I do that dgov does not

The complementary gaps are just as clean. dgov has no recomputable derived claims, so a plan whose declared structure is internally wrong in ways a topological check cannot see (an inflated parallelism story, a schedule that ignores the true critical path) executes anyway. It has no artefact dataflow, no produces and consumes with single-producer ownership, so the failure class where two units quietly both own the canonical definition (the one that once cost me thirteen review iterations) has no mechanical guard. Its reviewer role is explicitly bounded to the diffs of dependency tasks, one model provider, no multi-model adversarial review, where my process was born precisely from independent reviewers (Codex, Gemini and Claude) disagreeing productively. Its acceptance story is a test command's exit code, and as I wrote in the first article, half of my December pain came from tests that existed but could not fail, which is exactly the weakness an exit-code gate cannot see and an evidence matrix with known exclusions is built to catch.

And dgov is constitutionally a city-state. One repository, one .dgov/ directory, one governor. It governs its territory completely and stops at the border. My control plane is the federation layer: policy packs and requirement profiles defined once, per-repository agents pushing signed snapshots, evaluation history, exception lifecycles, release trains across many repositories. dgov has no analogue, and frankly does not claim to want one, but the moment you run agents across a fleet the federation question arrives whether you invited it or not.

The convergence list grows

With the previous article's comparison included, there are now three solo builders (wpank with Bardo, Gearon with dgov, and me) who independently arrived at: declarative task units with explicit dependencies, file claims per task as the precondition for safe parallelism, fail-closed validation before execution, topological ordering, per-task verification commands, an append-only event history, and failure memory carried forward into future attempts (his ledger case law, Bardo's do-not-retry lists, my deficiency taxonomy). One small coincidence I cannot resist recording: the day dgov's git history was re-bootstrapped for worktree isolation is the same day I authored my first DAG TOML. Nothing connects the two events except the season, which is rather the point.

When isolated builders keep meeting at the same mechanisms, the mechanisms are telling you something about the problem, not about the builders. File claims, fail-closed gates and forwarded failure memory now look to me like the arch and the keystone of this field, the parts every serious system will have because the load demands them.

What I am taking home

I finished reading dgov with a shopping list, which is the highest compliment I know how to pay another person's codebase:

Claim-versus-reality settlement in the plan runtime. My runtime should refuse to mark a unit done while the actual touched files disagree with the unit's declared file sets, exactly as dgov's review sandbox does, and since my version-control layer already records touched symbols and attached evidence, the work here is plumbing rather than invention. Still the single highest-value import.
The placement of merge analysis, not the taxonomy itself. My first draft of this list said I should import his merge taxonomy, and then I went and audited my own shelves: my semantic merge engine already covers his categories and more (manifest-driven conflict policy per language, tiered degradation down to plain git merge when parsing fails, and a commutativity algebra that formalises what he calls ordering conflicts), and my code-graph layer detects signature drift and duplicate definitions independently. What dgov actually taught me is where to stand: he runs merge analysis as a settlement gate inside the plan runtime, every task, every time, whilst my deeper machinery sits in a separate layer that the plan runtime never consults. The import is the wiring, his architecture carrying my components.
Fail-closed policy parsing. dgov rejects malformed SOPs at compile time, required front matter, required sections, no exceptions, and my template ecosystem should hold its own policy documents to the same standard it already holds plans.
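The first item on that list reduces to a set comparison. A minimal sketch, assuming a unit declares a set of file paths and the runtime can obtain the set of paths that actually changed (for example from git diff --name-only against the unit's base commit). The function names are hypothetical, and treating declared-but-untouched files as a failure is an assumption about how strict the gate should be.

```python
def settle(declared_files: set[str], touched_files: set[str]) -> list[str]:
    """Compare a unit's declared file claims with the files that really changed.

    Returns a list of human-readable mismatches; an empty list means the
    claim and the reality agree.
    """
    problems = []
    for path in sorted(touched_files - declared_files):
        problems.append(f"undeclared change: {path}")
    for path in sorted(declared_files - touched_files):
        problems.append(f"declared but untouched: {path}")
    return problems


def can_mark_done(declared_files: set[str], touched_files: set[str]) -> bool:
    """Fail closed: a unit may only be marked done when there are no mismatches."""
    return not settle(declared_files, touched_files)
```

The same comparison would work at symbol granularity by passing symbol identifiers instead of file paths.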

And one observation rather than an import. The most interesting entry in his failure taxonomy is behavioural mismatch, the case where two changes merge cleanly and disagree only at runtime, which is exactly the failure I wrote up in an earlier piece (a pricing path quietly depending on a field another agent had removed, both sides compiling, both passing their tests, git merging without a murmur). dgov's taxonomy names that crime but cannot yet detect it, because detection needs a relationship graph (which callers depend on which symbols) rather than a diff, and that graph is precisely what the symbol-indexing and predicate layers of my stack exist to provide. The city-state names the crime; the federation has the forensics. Neither system has secured a conviction yet, and I suspect whoever gets there first gets there with both halves.

If Gearon ever reads my side of this, the reciprocal list is above: refutable derived claims, artefact ownership, evidence with declared exclusions, and a story for the day dgov needs to govern more than one city. And since the comparison should be checkable rather than taken on trust, my side of the format is a public draft specification at agent-assurance.dev, with independent Rust, Go and Python validators, should anyone (including him) want to implement against it.

Thanks for reading this far, I hope you find some value in the comparison. If you are building agent governance of your own, whether it leans towards law or towards science, I would genuinely like to hear which theorems you chose to prove mechanically, and which ones you are still taking on trust.

The Machine That Builds the Machine, and the Studio That Runs Itself: Two Ways to Organise an Agent Swarm

Werner Kasselman — Thu, 04 Jun 2026 10:12:18 +0000

Why I am writing this

I thought people might find this comparison useful, because it is rare to get two fully built agent-orchestration systems, designed in complete isolation from each other, solving the same class of problem with enough written detail on both sides to compare them honestly, and rarer still to catch the differences while both are still warm. Shortly after publishing my DAG TOML article I went looking for neighbours and found wpank's write-up, Building the Machine That Builds the Machine, which describes Bardo: a meta-system that takes a 234,657-line specification across 343 files and turns it into 26 compiled Rust crates through coordinated agent swarms. I have my own horse in this race, a system called atelier-studio (roughly 80,000 lines of Rust, built across about five months), and reading his post was the strange experience of recognising my own decisions in a stranger's codebase, and then, more usefully, recognising the places where he and I made opposite calls.

I am not a neutral reviewer here, I built one of the two systems being compared, so please take this as nothing more than one practitioner reading another practitioner's work with respect and an honest ruler. Where I describe Bardo I am working from the write-up alone, not the code, and any misreadings are mine.

The factory: Bardo

Bardo is project-shaped. It exists to finish one enormous build: a 26-crate Rust workspace implementing autonomous agents with mortality, dreaming, emotion and economic incentives, specified down to the academic citations (467 of them, Hans Jonas on metabolic freedom and Damasio's somatic markers, to name a few). The orchestrator, bardo-ctl, is 42,744 lines of Rust, and the part I admire most is around 2,000 lines of bash.

The bash is a three-stage context engineering pipeline, and frankly it is the heart of the whole design. Stage one extracts specification sections using a two-source weighted model (inline spec references get double weight over crate-mapped directories). Stage two decomposes a plan into ordered steps under a 102.4KB context cap, with the rule that each step must compile when combined with all previous steps. Stage three distils each step down to a 5 to 15KB context slice, carrying forward a one-line summary of what previous steps accomplished, so the agent implementing step 7 never sees the scaffolding from step 1. The design came, in his words, from watching agents drown in 80KB payloads where maybe 12KB was relevant.

Above that sits a genuinely complete orchestration layer: around 100 task TOML files declaring files, acceptance criteria, cross-plan dependencies (a task can depend on "17:T1", task T1 of plan 17, which lets the scheduler extract parallelism across plan boundaries) and exclusive file claims; a dual-layer DAG with wave scheduling via Kahn's algorithm; a next_runnable() check that refuses to start any task whose files overlap an in-flight task; 25 agent roles routed to three backends by competence (Codex for refactoring and diagnosis, Cursor for review verdicts, Claude for orchestration and implementation); a gate gauntlet (compile, dependency-deny, test, spec compliance) with a three-failure halt; a parallel three-reviewer panel synthesised by a Critic; git worktrees per plan with a shared sccache so parallel builds cache-hit each other; and a Conductor that nudges silent agents at 300 seconds, restarts stalled ones at 600, and never lets itself starve an Implementer of a spawn slot.
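Two of those mechanisms are small enough to sketch. The Python below is not Bardo's code (which the write-up describes as Rust); the task dictionary shape and field names are hypothetical. It shows wave scheduling with Kahn's algorithm and a next_runnable check that refuses any task whose files overlap a task already in flight.

```python
def waves(tasks: dict[str, dict]) -> list[list[str]]:
    """Kahn's algorithm, grouped into waves of tasks whose dependencies are all met."""
    indegree = {tid: len(task["depends_on"]) for tid, task in tasks.items()}
    dependents: dict[str, list[str]] = {tid: [] for tid in tasks}
    for tid, task in tasks.items():
        for dep in task["depends_on"]:
            dependents[dep].append(tid)

    wave = sorted(tid for tid, degree in indegree.items() if degree == 0)
    result = []
    while wave:
        result.append(wave)
        following = []
        for tid in wave:
            for child in dependents[tid]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    following.append(child)
        wave = sorted(following)

    if sum(len(w) for w in result) != len(tasks):
        raise ValueError("dependency cycle detected")
    return result


def next_runnable(tasks: dict[str, dict], done: set[str], in_flight: set[str]) -> list[str]:
    """Tasks whose dependencies are done and whose files nobody else is holding."""
    claimed: set[str] = set()
    for tid in in_flight:
        claimed |= set(tasks[tid]["files"])

    runnable = []
    for tid in sorted(tasks):
        if tid in done or tid in in_flight:
            continue
        task = tasks[tid]
        if not set(task["depends_on"]) <= done:
            continue  # a dependency has not finished yet
        if set(task["files"]) & claimed:
            continue  # exclusive file claim held by another task
        runnable.append(tid)
        claimed |= set(task["files"])  # keep the returned batch conflict-free too
    return runnable
```

Note that the two functions answer different questions: waves says what the dependency graph allows in principle, while next_runnable narrows that to what is safe to start right now given the file claims.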

Two smaller mechanisms deserve a nod because they encode real scars. The iteration memory builds cumulative DO NOT RETRY lists from compiler errors and review blockers, born from watching an agent hit the same type mismatch four iterations running, each time "fixing" it differently and wrongly. And the golden-path index records plans that succeeded on the first attempt, categorised, so future decompositions are shown up to two worked examples of the same category. Failure memory and success memory, both fed forward.

The studio: atelier-studio

Atelier-studio is institution-shaped. Where Bardo exists to finish a build, atelier exists to keep running: a set of standing councils (research, engineering, QA, go-to-market, product and operations) that take a product idea through the whole lifecycle, from market analysis and competitive intelligence through work package decomposition, test planning, service level objectives and launch messaging, backed by a local knowledge graph of around 23,000 ingested items (papers, standards, bodies of knowledge, model registries).

The design bet is different, and the difference matters. Bardo diversifies its agents by skill, routing each role to the backend best at that job. Atelier diversifies by perspective: each council runs multiple independent planner "flavours" against the same inputs, a Conservative Analyst worrying about risk and compliance, an Optimistic Explorer chasing emerging technology, a Pragmatic Synthesizer weighing cost against time to market (the engineering council has its own trio along minimalism, scalability and maintainability lines), and the outputs are merged through critique and ranking rather than simple voting. Bardo never argues with itself. Atelier is built to argue with itself, because in business strategy work the failure mode is not a type mismatch, it is a confident plan that nobody stress-tested from a hostile angle.

The memory systems differ the same way. Bardo's learning is textual and rule-shaped, DO NOT RETRY lists an agent must read. Atelier's is statistical: an attempt tracker feeding a failure oracle that forecasts the probability the next attempt fails (Dirichlet modelling), and a calibration tracker (isotonic regression and Platt scaling) that keeps the system's confidence honest against its actual hit rate. One remembers what failed, the other models how likely failure is. Atelier also crosses a line Bardo never attempts: a self-improvement subsystem that proposes changes to atelier's own code, which is exactly why it carries a human-approval safety gate and adversarial review, because a system that rewrites itself needs governance in a way a build factory does not.
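The difference between remembering failures and modelling them can be shown with the simplest Dirichlet model available. This is an illustrative sketch only, not atelier's failure oracle: the outcome categories, the symmetric prior and the function names are all assumptions, and the calibration step (isotonic regression, Platt scaling) is not shown.

```python
def dirichlet_posterior_mean(counts: dict[str, int], prior: float = 1.0) -> dict[str, float]:
    """Posterior mean of a Dirichlet-multinomial model with a symmetric prior.

    counts maps each outcome category to how often it has been observed.
    """
    total = sum(counts.values()) + prior * len(counts)
    return {outcome: (n + prior) / total for outcome, n in counts.items()}


def failure_probability(counts: dict[str, int], success: str = "pass", prior: float = 1.0) -> float:
    """Forecast that the next attempt lands in any category other than success."""
    return 1.0 - dirichlet_posterior_mean(counts, prior)[success]


# Hypothetical attempt history for one kind of task.
history = {"pass": 6, "compile_fail": 3, "review_fail": 1}
forecast = failure_probability(history)  # (3 + 1 + 1 + 1) / 13 = 6/13
```

With no history at all the same model still answers (two failure categories out of three, so 2/3 under the uniform prior), which is the practical appeal of a prior over a plain hit rate.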

Where two strangers built the same parts

The convergence list is long enough that I stopped finding it spooky and started finding it instructive. Both systems independently arrived at: atomic work units carrying their own acceptance criteria and file sets; explicit dependency DAGs over those units; file-level conflict detection as the precondition for safe parallel agents (Bardo's exclusive-files check is functionally identical to the conflict groups in my DAG TOML runtime); a panel of reviewers with a synthesising verdict; a three-strikes failure budget; failure memory fed forward into the next attempt; success exemplars fed forward as worked examples (his golden paths are, almost word for word, the clean one-pass approvals I used as a negative class when mining my review archive); and isolation of parallel writers via separate working copies.

None of this was copied. I found his write-up after building mine, his post does not reference any of my work, and yet the load-bearing safety mechanisms match almost one for one. When two builders who have never met converge on file-level conflict detection and cumulative do-not-retry memory, that is not fashion, that is the problem itself dictating the shape of the solution, the same way every culture that builds bridges discovers the arch.

Where the philosophies split

Three genuine divergences, and each one traces back to the shape of the work rather than to taste.

First, static distillation versus living retrieval. Bardo can precompute context slices because the specification is frozen; the spec is the territory and the pipeline is a map-making exercise done once. Atelier cannot freeze anything, the knowledge graph keeps growing and the councils query it at run time through a librarian layer with per-council token budgets. Bardo compiles context, atelier retrieves it. His closing line, that context engineering is the whole game, the right 12KB delivered at the right time, is the frozen-world statement of the same conviction that made me build the knowledge graph for the unfrozen one.

Second, skill diversity versus perspective diversity, which I described above and will not repeat, except to note the consequence: Bardo's review panel exists to catch defects, atelier's flavour consensus exists to catch blind spots, and a mature swarm probably needs both.

Third, the cockpit versus the control plane. His attempt at headless operation was, in his words, like driving blindfolded, an agent stuck in a compile-fix loop for 15 of 20 unobserved minutes, and his answer was a terminal dashboard with 26 widgets, pause and force-advance controls, and per-role colour coding. My answer to the same pain was structured event streaming and, eventually, an external control plane that evaluates fleet state from data rather than from watching. An interactive cockpit against a queryable instrument panel, and I suspect his converts stuck agents into intervention faster, whilst mine scales past the number of screens one person can watch.

What I take from it

The safety mechanisms converge, the strategy layers do not. Conflict detection, acceptance criteria, failure budgets and iteration memory showed up in both systems unprompted, whilst context strategy, diversity strategy and observability strategy split cleanly along the grain of each system's purpose. If you are building an orchestrator, copy the first list with confidence and choose the second list deliberately.
Project-shaped and institution-shaped systems want different memory. A factory can carry its lessons as text, an institution needs calibration, because the institution will still be making forecasts long after any individual lesson has gone stale.
Context engineering keeps winning. Two systems, opposite architectures, same conclusion: not better models, not longer windows, but the right small context at the right moment.
Synchronicity is evidence. When isolated builders keep meeting at the same mechanisms, those mechanisms are probably load-bearing for the whole field, and they are the parts I would now least want to be without.

Credit to wpank for a write-up generous enough with internals to make a real comparison possible, that generosity is rarer than the engineering. Thanks for reading this far, I hope you find some value in my reading of the two machines. If you have built your own orchestrator and recognise these mechanisms (or, better, if you made a third set of choices entirely), I would genuinely like to hear how the wall pushed back on you.

DAG TOML: How I Turned Four Months of Code-Review Pain into a Machine-Checkable Planning Format

Werner Kasselman — Thu, 04 Jun 2026 08:47:24 +0000

Everything below is date-anchored, because the dates matter to the story: I first put agent rules in TOML in October 2025, the failure data runs from December 2025 to March 2026, the first DAG TOML was authored on 2 April 2026, the archive analysis that justified it ran on 4 April 2026, and the database-backed runtime followed across April and May 2026.

Why I am sharing this

I thought people might find this interesting, and hopefully it saves somebody else a few wasted review rounds, because the cost of the problem I am about to describe is mostly invisible until you sit down and add it up. I run a multi-agent development process where LLM agents (Claude, Codex CLI and Gemini CLI, to name a few) plan, implement and cross-review each other's work on a Rust codebase, and every work product goes through independent review by at least two different model families before it merges.

I am not a process-methodology researcher and I have no business publishing failure taxonomies, so please take this as nothing more than me sharing what I found in my own review archive, and what I changed because of it.

The system works, frankly better than I expected when I started, but through late 2025 it had a churn problem: work kept bouncing back for rereview, and every bounce burned a full review round across multiple models. So in April 2026 I did something slightly unusual, I treated my own review archive (roughly 2,400 review documents) as a dataset and asked the obvious question: why does work actually bounce?

This article shows one real chain from that dataset (the December one), the taxonomy that fell out of the analysis, and the fix: implementation plans written as TOML DAGs with mechanical validators, so that an entire class of review findings became exit 1 instead of a week of iteration.

Exhibit A: the project-persistence chain (5 and 6 December 2025)

The feature was unglamorous: persist a code-index project's in-memory state (repo index, file table, symbol index) to disk on teardown and reload it on startup, the kind of thing that should be a one-pass review.

The paper trail, fully dated:

5 December 2025 - Spec written and approved, with a full planning pack behind it: spec, design, implementation plan and test plan. Concrete targets: warm restore after restart, persist in under 750 ms for a 50k-symbol index, at least 80% module coverage.
5 December 2025 - Nine pre-implementation review iterations across three models (3 by Codex, 2 by Gemini, 4 by Claude) before a single line of code was written.
6 December 2025 - Implementation done. Two independent post-implementation reviews. Both returned REQUEST CHANGES.
6 December 2025 - Fix iteration, second review round, approved the same day.

What did two reviewers find on 6 December, after all that planning?

Severity	Finding
HIGH	The restore path overwrote every file's repo ID with `NONE`, the persisted ID was simply ignored, so reloaded state was detached from its repositories. The feature's entire purpose silently didn't work, and a `TODO` in the code acknowledged it.
HIGH	The cache directory from config was trusted verbatim, which meant absolute paths and `..` segments could write state outside the project root. Path traversal, despite the spec explicitly constraining writes to the project root.
MEDIUM	The config fingerprint (used to invalidate stale persisted state) hashed only 4 of the 7 config fields that affect indexing, so changing the others silently reused stale state.
MEDIUM	The "concurrency test" spawned four threads on four separate directories. Same-root races: untested.
MEDIUM	No test ever persisted and restored an actual symbol index, so the headline requirement was unverified.
LOW	The file was fsynced but the containing directory was not, so a crash after rename could lose the file after logging success.

Both reviewers, independently and from different model families, converged on the same top finding. The second round on 6 December fixed everything with a verification table mapping each finding to specific code and a named test, and it was approved same-day.

Here is the uncomfortable part: the planning was thorough, the planning reviews were thorough, and the implementation still shipped with its core feature non-functional and a path-traversal hole. Plans written in prose don't bind implementations, and reviews of prose can't be rerun.
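The path-traversal finding in the table above is the kind of boundary rule that is easy to state and easy to forget to enforce. The project in question is Rust; the sketch below uses Python's pathlib purely to illustrate the check, and the function name is hypothetical. It assumes Python 3.9 or later for Path.is_relative_to.

```python
from pathlib import Path


def confine(project_root: Path, configured: str) -> Path:
    """Resolve a configured cache directory and refuse anything outside the root.

    Joining an absolute path onto the root discards the root, and '..'
    segments can climb out of it, so both are caught by resolving first
    and comparing afterwards.
    """
    root = project_root.resolve()
    candidate = (root / configured).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"cache directory escapes project root: {configured!r}")
    return candidate
```

Resolving before comparing also follows symlinks that already exist on disk, so a link pointing outside the root is rejected along with the textual cases.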

Mining the archive (4 April 2026)

I analysed seven full "iteration chains" (initial request, blocking reviews, rereviews, final approval) spanning December 2025 to March 2026, plus nine clean one-pass approvals as a control group:

December 2025 - project persistence (above); plugin polish across 4 language plugins ("production-ready" claimed whilst the test matrix said otherwise); and a follow-up where the tests existed but couldn't fail, because non-strict assertions passed even with the feature absent
December 2025 to January 2026 - a privacy-sensitive planning pack that took 13 iterations, mostly because no single canonical schema existed early and definitions drifted across documents
10 February 2026 - a policy standard blocked on MUST/SHOULD conflicts and a precedence model that let task instructions override security controls
February 2026 - a C++ language feature claiming "complete support" whilst its own status docs still described failing tests
10 March 2026 - a planning pack that burned review rounds 6 and 7 on a missing artefact family and an "ordering is deterministic" claim with no stated ordering rule

Every rereview cause fit one of six categories:

Missing artefact completeness - required docs absent, found by the reviewer
Unstated contracts - "deterministic", "compatible", "safe", with no rule written anywhere
Drifted contracts - the same concept defined differently across documents
Evidence gaps - claims broader than tests, and "resolved" without proof
Boundary rules missing from the design - no privacy, security or filesystem constraints stated
Boundary rules stated but not enforced - the December path-traversal case, exactly

And the clean one-pass approvals (all nine of them) shared four traits: bounded scope, already-explicit contracts, evidence matched to claims, and reviewer comments that were refinements rather than prerequisites.

Notice what the six categories have in common: almost none of them are code bugs. They are plan-shaped defects, and they are checkable before a reviewer ever looks.

The fix: plans as DAGs, in TOML, with a validator (2 April 2026)

The first DAG TOML was authored on 2 April 2026, and the extracted templates and validators followed on 4 April, the same day as the archive analysis. TOML itself was not new to me, I had been putting agent rules in TOML since 12 October 2025 (a [rules] never/always prompt policy in one of my Rust projects, with trigger-activated context sections and token budgets), but all through the December-to-March churn the plans themselves stayed in prose, and April was when the plans became TOML too. I know that a TOML schema for plans might sound like process for the sake of process, but the format makes every plan claim one of three things: a required field, a recomputable assertion, or a gated state transition.

A plan is a set of units:

[units.U02]
name = "extract-initial-chain-set"
layer = 1
tier = 1
status = "done"             # pending | in_progress | done | blocked | deferred
depends_on = ["U01"]
blocks = ["U04"]
estimated_loc = 160
files_modify = ["research/ANALYSIS_FINDINGS.md"]
acceptance = [
  "At least five completed chains are analysed with explicit rereview causes.",
]
produces = ["ART:initial-chain-findings"]
consumes = ["ART:batch-scope"]
critical_decisions = ["Distinguish content defects from process defects."]
constraints = ["Only count deficiencies that materially forced another iteration."]
failure_modes = ["If extraction drifts into generic summaries, the taxonomy loses causal value."]

acceptance, constraints, failure_modes and critical_decisions are required, per unit. Category 2 (unstated contracts) stops being something a reviewer must notice by absence, it becomes a missing required field.

Then the plan must declare its own derived properties:

[computed]
entry_points = ["U01"]
leaf_nodes = ["U05"]
critical_path = ["U01", "U02", "U04", "U05"]
critical_path_loc = 420
[computed.max_parallel]
layer1 = 2

And here is the entire trick: a roughly 500-line Python validator (standard library only, tomllib does the parsing) recomputes every one of those claims from the units table and diffs them.

blocks must be the exact inverse of depends_on, so editing one side of a dependency and forgetting the other fails validation with the exact mismatch
cycles are detected and printed as the actual cycle path
every ART: artefact must have exactly one producer, so the "who owns the canonical definition" drift that cost 13 iterations in January becomes a one-line error
every consumes must match an existing produces, so hidden dependencies surface as holes in the plan
a depender must sit in a strictly higher layer than its dependencies, so overstated parallelism fails
the declared critical path must be a chain of real edges, start at an entry point, end at a leaf, and match the true longest weighted path (recomputed via toposort), so schedule fantasy fails
units sharing files must be declared in conflict groups, so two parallel agents about to edit the same file is caught at plan time
files_modify paths must exist in the repo, so plans written against an imagined codebase fail
placeholders (<fill-in-later>) are rejected outright

A wrong plan claim is no longer a reviewer judgement call, it is a failed assertion with a one-line diff.

What it changed in review (4 April 2026, first live use)

Two days after the format existed, the first DAG-reviewed plan went through: a plugin cost-tiering feature. The reviewer's scope line was the TOML file itself, and the verdict was APPROVED in one pass with zero blocking issues, where all four reviewer comments were genuine domain risks (legacy manifest fallback semantics and plugin ID stability, to name a few) rather than structural gaps.

That is the mechanism working as intended: the structural questions reviewers used to burn rounds on (is anything missing, do the dependencies make sense, what can actually run in parallel, does the timeline claim hold) are pre-answered by the validator before the review is even requested, which leaves the reviewer's whole attention for the hard semantic findings, and frankly that is the only thing humans and frontier models should be spending review rounds on.

And the December bug class? Gates and evidence matrices

To be clear, the DAG validator alone would not have caught the December path traversal, the reviewers did that, and that finding is category 6 (boundary stated but not enforced in code). Two companion formats target it:

Contract declarations - any plan touching filesystems, ordering, compatibility or fallback must declare the contract explicitly (path-root confinement, traversal handling, atomicity), and each contract names what verifies it.
Evidence matrices - a "finding resolved" or "feature complete" claim must bind a claim ID to an evidence path plus declared scope plus known exclusions, and the validator checks the evidence file actually exists. You mechanically cannot say "resolved" without naming a proof that could fail, and if you remember the December tests that couldn't fail, that is exactly the failure mode this kills.
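The evidence-matrix rule can be sketched as a small check over rows. The field names here (claim_id, evidence_path, scope, known_exclusions) are hypothetical stand-ins for whatever the real format uses; the only behaviour taken from the description above is that every claim must carry all four parts and that the evidence file must actually exist.

```python
from pathlib import Path

# Hypothetical field names for one row of an evidence matrix.
REQUIRED = ("claim_id", "evidence_path", "scope", "known_exclusions")


def check_evidence(rows: list[dict], repo_root: str) -> list[str]:
    """Return one error per row that is incomplete or points at missing evidence."""
    errors = []
    for index, row in enumerate(rows):
        label = row.get("claim_id", f"row {index}")
        missing = [field for field in REQUIRED if field not in row]
        if missing:
            errors.append(f"{label}: missing fields {missing}")
            continue
        if not (Path(repo_root) / row["evidence_path"]).is_file():
            errors.append(f"{label}: evidence file {row['evidence_path']} does not exist")
    return errors
```

An empty known_exclusions list is accepted, but an absent one is not: the author has to state that there are no exclusions rather than stay silent about them. Checking that the named test can genuinely fail is a separate problem that a file-existence check does not solve.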

The December chain's second review (the one that passed) was already an informal evidence matrix, every prior finding mapped to specific code lines and a named test. The format just makes that table mandatory, machine-checked, and required before the review is requested instead of produced during round 2.

Where it went next (April and May 2026)

Static validation only catches problems when someone runs it. In April and May 2026 four of the validator's invariants moved into a database-backed runtime, where agents import the TOML once and all state lives in the database:

a unit is only offered to an agent when every dependency is done
status changes are guarded transitions with history, not string edits
the inverse-edge, single-producer, consumes-has-producer and layer-ordering invariants are enforced at mutation time
readiness gates are a query, "is this bundle reviewable?", answered from data before a review request is ever sent
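The first two behaviours in that list can be illustrated with an in-memory SQLite database. This is a sketch under stated assumptions, not the internal runtime: the table layout, the function names and the set of allowed transitions are all hypothetical, and only the status values themselves come from the unit format shown earlier.

```python
import sqlite3

# Assumed transition rules; the real runtime's rules are not described in the text.
ALLOWED = {
    ("pending", "in_progress"),
    ("pending", "deferred"),
    ("in_progress", "done"),
    ("in_progress", "blocked"),
    ("blocked", "in_progress"),
}


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE units (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE deps (unit TEXT NOT NULL, depends_on TEXT NOT NULL);
        CREATE TABLE history (unit TEXT, old_status TEXT, new_status TEXT);
    """)
    return db


def ready_units(db: sqlite3.Connection) -> list[str]:
    """Pending units with no dependency in any state other than done."""
    rows = db.execute("""
        SELECT u.id FROM units u
        WHERE u.status = 'pending'
          AND NOT EXISTS (
              SELECT 1 FROM deps d JOIN units p ON p.id = d.depends_on
              WHERE d.unit = u.id AND p.status != 'done')
        ORDER BY u.id
    """)
    return [row[0] for row in rows]


def transition(db: sqlite3.Connection, unit: str, new_status: str) -> None:
    """Guarded status change: validated against ALLOWED and recorded in history."""
    row = db.execute("SELECT status FROM units WHERE id = ?", (unit,)).fetchone()
    if row is None:
        raise KeyError(unit)
    old_status = row[0]
    if (old_status, new_status) not in ALLOWED:
        raise ValueError(f"{unit}: {old_status} -> {new_status} is not allowed")
    if old_status == "pending" and new_status == "in_progress" and unit not in ready_units(db):
        raise ValueError(f"{unit}: dependencies are not all done")
    db.execute("UPDATE units SET status = ? WHERE id = ?", (new_status, unit))
    db.execute("INSERT INTO history VALUES (?, ?, ?)", (unit, old_status, new_status))
```

The point of the design is that an agent never edits a status string: it asks for a transition, and the database either records it with history or refuses it.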

A nod to the neighbours

After publishing the first version of this piece I went looking for who else had walked this road, and the honest answer is that I was not alone, and in some respects I was not first either. gptme (Erik Bjäreholt's terminal agent) was putting agent context and workspace configuration into a project-level gptme.toml long before I wrote my first agent rule, and its agent workspaces (tasks, journal, lessons, all git-tracked) are a thoughtful take on the same persistence problem my runtime addresses. lok defines declarative multi-backend LLM workflows in TOML, [[steps]] with depends_on, retries and consensus thresholds, which is DAG-in-TOML for orchestration, done cleanly. dgov (James H. Gearon) is the closest cousin of the lot: TOML plan trees with task dependencies, compiled to DAGs and dispatched to agents in isolated git worktrees with settlement gates on the way back in. The Bardo write-up ("Building the Machine That Builds the Machine") describes 115 dependency-chained plans and around a hundred task TOMLs feeding agent swarms, the same shape at a scale that makes mine look modest. And aura from the Mezmo team composes whole agents from declarative TOML.

What strikes me most is the synchronicity of it. None of these projects reference each other, and I found them only after building mine, yet several teams independently reached for the same move within the same season: take the parts of agent work that used to live in prose and conversation, and push them into a declarative, diffable, machine-readable format. I do not think that is coincidence, I think it is convergence, because anyone running agents at volume eventually collides with the same wall (plans and claims that read beautifully and bind nothing), and TOML happens to sit in the sweet spot of human-writable and machine-checkable. Credit where it is due to all of these teams for getting there on their own paths. If my contribution adds anything on top, it is the validator-first posture: not just expressing the DAG in TOML, but making the plan declare claims that a validator can independently recompute and refute.

Takeaways

Your review archive is a dataset. Seven failure chains and nine clean approvals were enough to find six stable failure categories, and they were stable across different reviewer models, which was the signal that they were real.
Most rereview causes are plan defects, not code defects. Plans in prose can't be validated, plans as data can.
Force derived claims, then recompute them. The [computed] section is the idea that pays for everything else here, because making the author commit to parallelism, critical path and totals turns optimism into a checkable assertion.
"Resolved" must name a proof that could fail. Half of December's pain was tests that existed but couldn't catch the bug they claimed to cover.
Spend reviewer rounds only on what machines can't check. After the switch, my first DAG-reviewed plan went through in one pass, with the reviewer's whole budget spent on real domain risk.
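
The "force derived claims, then recompute them" takeaway can be sketched in a few lines. This is a minimal illustration, assuming the plan has already been parsed from TOML into plain objects; the field names (id, dependsOn, estimate, total, criticalPath) are hypothetical and are not taken from the DAG-TOML spec. The point is only that the validator derives the numbers from the graph itself and reports every place where the author's declared values disagree.

```typescript
// Hypothetical plan shape; field names are illustrative, not the DAG-TOML schema.
interface Task {
  id: string;
  dependsOn: string[];
  estimate: number; // e.g. hours
}

interface ComputedClaims {
  total: number;        // sum of all estimates
  criticalPath: number; // longest dependency chain, by estimate
}

// Recompute the derived values from the task graph alone.
function recompute(tasks: Task[]): ComputedClaims {
  const byId = new Map<string, Task>();
  for (const t of tasks) byId.set(t.id, t);
  const memo = new Map<string, number>();
  const visiting = new Set<string>();

  // Longest path ending at `id`, including its own estimate.
  const finish = (id: string): number => {
    const cached = memo.get(id);
    if (cached !== undefined) return cached;
    const task = byId.get(id);
    if (!task) throw new Error(`unknown dependency: ${id}`);
    if (visiting.has(id)) throw new Error(`cycle through: ${id}`);
    visiting.add(id);
    let longestDep = 0;
    for (const dep of task.dependsOn) {
      longestDep = Math.max(longestDep, finish(dep));
    }
    visiting.delete(id);
    const value = longestDep + task.estimate;
    memo.set(id, value);
    return value;
  };

  let total = 0;
  let criticalPath = 0;
  for (const t of tasks) {
    total += t.estimate;
    criticalPath = Math.max(criticalPath, finish(t.id));
  }
  return { total, criticalPath };
}

// Compare what the author declared with what the graph implies.
function refute(tasks: Task[], declared: ComputedClaims): string[] {
  const actual = recompute(tasks);
  const problems: string[] = [];
  if (declared.total !== actual.total) {
    problems.push(`total: declared ${declared.total}, recomputed ${actual.total}`);
  }
  if (declared.criticalPath !== actual.criticalPath) {
    problems.push(
      `criticalPath: declared ${declared.criticalPath}, recomputed ${actual.criticalPath}`
    );
  }
  return problems;
}
```

An empty array from refute means the declared claims survived recomputation; anything else is a plan defect caught before a reviewer round is spent on it.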

The format described here is no longer internal: DAG-TOML is now a public draft specification at agent-assurance.dev, with independent Rust, Go and Python validators, worked examples, and profile extension points, released under the verivus-oss/agent-assurance repository. The database runtime and the fleet control plane remain internal for now, but the schema ideas (required contract fields, recomputed [computed] sections, single-producer artefacts, evidence matrices, closure roots) are all in the spec, and you can validate a file against it today.

Thanks for reading this far, I hope you find some value in my story. If you have mined your own review archive (and specifically the rereview causes), I would genuinely like to hear what categories you found.

llm-cli-gateway 2.0.0: the quiet supply-chain release that matters

Werner Kasselman — Thu, 04 Jun 2026 08:26:01 +0000

llm-cli-gateway 2.0.0 went out on 4 June 2026. npm now reports 2.0.0 as the latest version, and the public GitHub release carries the platform binaries, bundled installers, SHA256 checksums, release manifest, and Sigstore bundles.

The headline change is simple: production persistence no longer depends on better-sqlite3. The gateway now uses Node's built-in node:sqlite, behind a single adapter in src/sqlite-driver.ts, and that one architectural change removes an entire class of install-time supply-chain risk from the consumer tree.

That matters because the recent 1.17.x work was not really about SQLite as a database. It was about the native-module install path around better-sqlite3, specifically the prebuild-install, tar-fs, and tar-stream chain. In 2.0.0 that chain is not patched, worked around, or hidden behind an advisory. It is absent from production installs. The release verification now asserts that consumers get no better-sqlite3, no prebuild-install, and no tar-stream in the installed tree.

The cost is a real breaking change: Node >=24.4.0 is now required. That is not arbitrary. The gateway's persistence layer binds plain objects like { id: ... } to @id SQL placeholders, and Node 24.4 is the point where node:sqlite has the bare named parameter behaviour this code relies on. The test suite pins that behaviour so future changes fail loudly rather than turning into quiet persistence bugs.

The adapter itself is intentionally small. openDatabase, openReadOnly, GatewayDatabase, and GatewayStatement are now the surface area, with flight-recorder.ts and job-store.ts using that surface instead of touching SQLite directly. The release security audit enforces that node:sqlite is referenced only by the adapter, which keeps the persistence boundary clear and reviewable.

There is one security detail in the read-only path that I particularly like. queryRequests now opens a dedicated read-only SQLite connection, so row mutations fail at the SQLite engine level with SQLITE_READONLY. During review, one exception was found: VACUUM INTO can create a new file even on a read-only connection. The adapter now rejects VACUUM and VACUUM INTO on read-only connections, including comment-prefixed and multi-statement forms. That is the sort of fix that looks small in code but matters in a release claim, because it keeps "read-only" from becoming mostly read-only.
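
A rough sketch of what a guard of that kind might look like, assuming the check runs on the SQL text before it reaches a read-only connection. The names stripSqlComments, containsVacuum and assertReadOnlySafe are hypothetical, and the comment stripping is deliberately naive (it does not understand quoted string literals, so it errs towards rejecting). Treat it as an illustration of the comment-prefixed and multi-statement cases, not as the adapter's actual implementation.

```typescript
// Strip `/* block */` and `-- line` comments. Naive on purpose: a real
// guard would also need to respect quoted string literals.
function stripSqlComments(sql: string): string {
  return sql
    .replace(/\/\*[\s\S]*?\*\//g, " ")
    .replace(/--[^\n]*/g, " ");
}

// True if any statement in the text begins with VACUUM, which also
// covers VACUUM INTO.
function containsVacuum(sql: string): boolean {
  return stripSqlComments(sql)
    .split(";")
    .map((statement) => statement.trim())
    .some((statement) => /^vacuum\b/i.test(statement));
}

// Refuse the statement before it is handed to the read-only connection.
function assertReadOnlySafe(sql: string): void {
  if (containsVacuum(sql)) {
    throw new Error("VACUUM is not allowed on a read-only connection");
  }
}
```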

2.0.0 also raises the standard for migration confidence. The repo now has cross-engine WAL crash-recovery fixtures in both directions: databases written by better-sqlite3 are opened through node:sqlite, and the rollback direction is tested as well. That is a better claim than "the schema did not change". It proves the practical case users care about, namely that existing logs.db and jobs databases survive the engine change.

The rest of the current product surface is still there, and it is worth remembering what that surface has become. llm-cli-gateway is now a single MCP endpoint for Claude Code, Codex, Gemini, Grok, and Mistral Vibe. It supports sync requests, durable async jobs, restart-safe result collection, job deduplication, cancellation, real CLI session resume paths, cache-aware promptParts, and gateway-managed git worktrees for isolated multi-agent workflows.

The personal-appliance side has also filled out. There is streamable HTTP transport with bearer-token auth, doctor --json, provider setup snippets, Docker fallback, and release bundles for Windows, macOS, and Linux. The GitHub release assets for 2.0.0 include platform binaries, platform bundles, SHA256SUMS, release-manifest.json, and Sigstore signature bundles for verification.

The result is a cleaner distribution story. npm publishes use provenance through GitHub Actions. GitHub release installer artifacts are signed. The production dependency graph is smaller. Native SQLite is gone from consumer installs because SQLite is now supplied by Node itself. The release is not flashy, but it is a serious hardening release: fewer moving parts, fewer install scripts, a narrower persistence boundary, and stronger evidence around upgrade and rollback behaviour.


Tracking Five Upstreams, Fuzzing the Parsers, and a Front Door: What Changed in llm-cli-gateway

Werner Kasselman — Sat, 30 May 2026 04:23:44 +0000

The last two posts were about features you can call: cache-aware spawning across five providers, and the round before that. This one is mostly about the parts that do not show up as a tool. When you wrap five vendor CLIs that each ship on their own cadence, the interesting failure mode is not a bug in your code, it is one of those five CLIs quietly changing a flag underneath you. So the work that landed this week is about keeping pace with upstreams that move, hardening the bits that parse untrusted output, and finally, giving the project a front door. v1.16.0 through v1.16.2 are tagged and out; the upstream-tracking and Socket-hardening work (changelogged as v1.17.0 and v1.17.1), plus a fast-check fuzzing pass and a dependency-floor bump, have landed on main and go out in the next cut; and the website is now live at llm-cli-gateway.dev, the project's new front door.

Short version: the gateway now tracks each provider CLI's upstream contract as a checked-in artefact. The contract table is pinned by tests that run in CI, an offline npm run upstream:contracts gate re-validates it on demand, and an advisory npm run upstream:scan -- --live reaches out to the upstream changelogs to flag where reality may have moved, so drift surfaces in a check I run rather than as a failed request on a user's machine. A fast-check fuzzing pass now hammers the three parsers that touch untrusted bytes, provider JSON/JSONL, Linux /proc, and the CLI argument sanitizer. Release tags can be Sigstore-signed through a dedicated workflow, the optional Redis layer is gone, and on main the dependency floor has moved to Zod 4 / TypeScript 6 / ESLint 10. And there is now a real website at llm-cli-gateway.dev, built agent-first: an MCP client can read one URL and configure itself.

Long version is below, same shape as last time, problem, what changed, what it now does, caveats named up front rather than buried.

Five upstreams that move (the contract-tracking slice)

The motivating incident is worth naming because it is the whole argument. Mistral's Vibe CLI dropped --output-format in favour of --output text|json|streaming. Nothing in the gateway's own code was wrong; the flag it had been emitting for weeks simply stopped existing on the other side of the spawn. v1.16.1 fixed the call (and kept the legacy MCP aliases mapping plain → text and stream-json → streaming so nobody's saved config broke), but a one-line flag rename that only surfaces as a runtime failure on a user's machine is exactly the class of problem I would rather catch in CI.
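
The alias handling described above can be sketched as a small lookup. This is an illustration under assumptions: resolveVibeOutput and vibeOutputArgs are hypothetical names, and only the two legacy mappings and the three --output values named in this post are taken from the source.

```typescript
// Values accepted by Vibe's newer `--output` flag, per the rename described above.
type VibeOutput = "text" | "json" | "streaming";

// Legacy names still accepted from saved MCP configs. A Map avoids
// accidental hits on Object.prototype keys such as "constructor".
const LEGACY_OUTPUT_ALIASES = new Map<string, VibeOutput>([
  ["plain", "text"],
  ["stream-json", "streaming"],
]);

function resolveVibeOutput(requested: string): VibeOutput {
  const alias = LEGACY_OUTPUT_ALIASES.get(requested);
  if (alias !== undefined) return alias;
  if (requested === "text" || requested === "json" || requested === "streaming") {
    return requested;
  }
  throw new Error(`unsupported output format: ${requested}`);
}

// Build the argv fragment; the flag and its value stay separate array entries.
function vibeOutputArgs(requested: string): string[] {
  return ["--output", resolveVibeOutput(requested)];
}
```

Keeping the alias table next to the flag construction means a future rename touches one place, and an unknown value fails in the gateway instead of inside the spawned CLI.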

So the upstream-tracking work (changelogged as v1.17.0, landed on main) makes the contract a first-class, checked-in thing:

Each supported CLI (claude, codex, gemini, grok, mistral) gets a maintenance skill describing where its truth lives (Claude Code's markdown changelog, Codex's GitHub releases feed plus product changelog, the Gemini CLI changelog, the xAI markdown release notes, and so on).
The single source of truth for each provider's argv/env behaviour (flags, output modes, session/resume rules, forbidden flags) is the contract table in src/upstream-contracts.ts, exercised by the argument and env validators. Alongside it, docs/upstream/provider-sources.dag.toml is the scanner's source map: which changelog/release pages to watch, and how. The two are deliberately separate, and a test (upstream-sources.test.ts) pins that separation. The source map stays byte-for-byte in sync with the contract table's metadata, and the TOML is asserted not to re-encode the mechanical contract surface. Drift in the source map is a red build; the TOML is never the thing a flag rename has to round-trip through.
scripts/upstream-scan.mjs backs two npm scripts. npm run upstream:contracts is an offline gate: it re-runs the bundled fixtures and the report/TOML-sync check with no network access. npm run upstream:scan is network-free by default too; pass --live (npm run upstream:scan -- --live) and it fetches the tracked upstream changelogs and flags, advisorily, where reality may have moved ahead of us. (Neither is wired into the CI gate today; they're tools I run. The TS-contract-vs-source-map sync, however, is a CI test.)

The honest caveat: the live scan is advisory, not authoritative. It tells me where to look; it does not auto-patch a renamed flag, and it never will, because a CLI changing its surface is a thing a human should read and reason about, not a thing a script should silently adapt to. What changed is that the looking is now systematic instead of "wait for a user to file an issue."

Fuzzing the three parsers that touch untrusted bytes

A gateway that spawns five CLIs and reads back their output has a clear trust boundary: everything coming back over stdout/stderr is, from the gateway's point of view, untrusted. Most of it is well-formed. The interesting question is what happens when it is not. So fast-check is now wired into the suite (src/__tests__/fuzz.test.ts), and it targets the three places where malformed input would actually hurt:

Provider JSON / JSONL parsers: fuzzed with mixed valid-and-garbage JSONL streams, asserting the parser never throws and never leaks an invalid result shape. A provider emitting a half-written line during a crash should degrade, not propagate a malformed object upward.
Linux /proc parsers: the process-health monitor reads /proc/<pid>/stat (state and CPU ticks) and /proc/<pid>/status (VmRSS) to track a spawned child's health. The property here is that no garbage /proc content ever produces a NaN process metric.
CLI argument sanitizer: the property is blunt and important: a dash-prefixed value is always rejected. That is the argument-injection guard. The gateway never invokes a CLI with shell: true, but a caller-supplied value that starts with - and slips into the argv array could still be read by the child as a flag rather than a value. The fuzzer's job is to make sure there is no input string that gets past that check.

These are properties, not examples: fast-check generates the adversarial inputs rather than me guessing them, which is the point. I am not claiming the parsers are now proven correct; I am claiming the obvious classes of malformed input are exercised on every run instead of on the day a provider ships a bad build.
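
To make "properties, not examples" concrete, here is a dependency-free sketch of two of the three properties. The real suite uses fast-check; this version swaps in a small seeded generator so the block stands alone. sanitizeArg, parseJsonl, makeRandom and checkProperties are hypothetical stand-ins, not the gateway's functions.

```typescript
// Hypothetical sanitizer: refuse any value the child CLI could read as a flag.
function sanitizeArg(value: string): string {
  if (value.startsWith("-")) {
    throw new Error("argument values must not start with '-'");
  }
  return value;
}

// Hypothetical tolerant JSONL parser: keep well-formed object lines, skip the rest.
function parseJsonl(stream: string): Array<Record<string, unknown>> {
  const results: Array<Record<string, unknown>> = [];
  for (const line of stream.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      const value: unknown = JSON.parse(trimmed);
      if (typeof value === "object" && value !== null && !Array.isArray(value)) {
        results.push(value as Record<string, unknown>);
      }
    } catch {
      // Half-written or garbage line: degrade by dropping it.
    }
  }
  return results;
}

// Seeded linear congruential generator, standing in for fast-check's
// arbitraries so every run is reproducible.
function makeRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 0x100000000;
  };
}

function randomString(next: () => number, maxLength: number): string {
  const alphabet = "-{}[]\":,\\ \nabc019";
  const length = Math.floor(next() * maxLength);
  let out = "";
  for (let i = 0; i < length; i++) {
    out += alphabet.charAt(Math.floor(next() * alphabet.length));
  }
  return out;
}

// Property 1: every dash-prefixed value is rejected.
// Property 2: the parser never throws and only returns plain objects.
function checkProperties(runs: number): void {
  const next = makeRandom(42);
  for (let i = 0; i < runs; i++) {
    const dashed = "-" + randomString(next, 16);
    let rejected = false;
    try {
      sanitizeArg(dashed);
    } catch {
      rejected = true;
    }
    if (!rejected) throw new Error(`sanitizer accepted ${JSON.stringify(dashed)}`);

    const stream = '{"ok":true}\n' + randomString(next, 64);
    for (const row of parseJsonl(stream)) {
      if (typeof row !== "object" || row === null || Array.isArray(row)) {
        throw new Error("parser leaked a non-object result");
      }
    }
  }
}
```

A hand-rolled generator like this does not shrink failing inputs the way fast-check does, which is a large part of why a property-testing library earns its place in the real suite.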

Signed tags, a smaller surface, a newer floor

A few things in the supply-chain and dependency layer, none of which is a feature, all of which is worth naming.

Sigstore tag signing. The npm publishes already carry sigstore provenance via the OIDC publish path. Since the 1.16.0 cycle the release tags themselves can get the same treatment through a dedicated, manually-triggered sigstore-tag.yml workflow (a workflow_dispatch, run deliberately against a named tag rather than firing automatically on every release) that recreates the tag with a gitsign signature, pinned to the exact commit SHA it must continue to point at, and run in offline Rekor mode. The git history of a release can be made as verifiable as the published artefact.

Socket shellAccess, documented rather than waved away. The gateway's entire reason to exist is launching child processes, so Socket flags it on every release. Rather than ignore the alert, v1.17.1 suppresses it in socket.yml with a written rationale and keeps the bounded shell-access explanation in the README, so a reviewer still sees the reasoning without seeing the same noisy alert on every version bump. The distinction matters: a suppressed alert with a checked-in justification is auditable; a suppressed alert with no paper trail is just hidden.

One fewer optional dependency. v1.16.0 removed the optional Redis/ioredis layer from the PostgreSQL-backed session manager. It was a lever almost nobody pulled, and every optional dependency is a maintenance and supply-chain cost you pay whether or not you use it. The Postgres path is simpler and the dependency surface is smaller.

A newer floor. On main, ahead of the next release, the toolchain moved up in lock-step: Zod 4, TypeScript 6, ESLint 10 (with the lint-config migration that 10 forces), and @types/node 25, plus a dead-code sweep that the new compiler and lint settings surfaced. (These are not in the v1.17.x packages yet; they go out in the next cut.) Unglamorous, and exactly the kind of thing that rots if you let it slide for two majors.

A front door (the website)

Until this week the project's front door was a GitHub README and an npm page. Now there is llm-cli-gateway.dev, live as of this post, and the interesting design decision is that it is built agent-first.

The premise: increasingly the thing evaluating whether to install an MCP server is not a human reading marketing copy, it is an agent reading a URL. So the site treats that as the primary path, not an afterthought:

/install.md is agent-readable install instructions in plain markdown; the homepage's headline call to action is literally "Read https://llm-cli-gateway.dev/install.md and configure yourself to use llm-cli-gateway as an MCP server."
/llms.txt is the compact retrieval entry point, and /.well-known/agent.json is structured metadata (registry name io.github.verivus-oss/llm-cli-gateway, transport, launch command) that a tool can parse without scraping HTML.
A /sitemap.md ties the three together for anything doing retrieval.

The human-facing side is deliberately boring: it is a static Cloudflare Pages site (wrangler.toml, output dir site/), ships a strict Content-Security-Policy with script-src 'self', frame-ancestors 'none' and friends in _headers, and the JavaScript makes no external or network calls: no analytics, no third-party fonts loaded at runtime, nothing phoning home. For a project whose whole pitch is "the CLIs keep their native credentials and run locally," a marketing site that quietly loaded a tracker would have undercut the argument. So it does not.

The project also picked up its first proper mark this week: a gold gateway "G" drawn out of a terminal prompt (the >_ you spawn everything else from), wrapped in an @-style ring. It is the site favicon, and it anchors the social card at the top of this post.

Caveat, because there is always one: the site is new, and the agent-install path is only as good as the install spec behind it. npx -y llm-cli-gateway over stdio is the whole launch surface, and the install doc is versioned in the repo alongside the code, so it moves when the code moves.

What's next

More providers will drift, so the next iteration of the upstream scan is making the advisory live check something a scheduled job runs and reports, rather than something I remember to run. And the fuzzing pass is deliberately narrow right now (three parsers); the session-store and config-loader paths are the obvious next targets once the current properties have a few weeks of green runs behind them.

The bigger item on the board is an XState Store integration (@xstate/store): a small, durable, inspectable piece of workflow state that an orchestrating agent can read and drive through declared events, sitting alongside the sessions and the flight recorder and surviving a restart the way the async jobs already do. It is a plan on disk right now (under docs/plans/), not a shipped tool, and there are a couple of design questions I want to settle (how the state is stored, and how an agent is allowed to change it) before any of it lands.

Thanks for reading this far. As always, MIT licensed.

llm-cli-gateway is MIT licensed. Website: llm-cli-gateway.dev | npm: llm-cli-gateway | GitHub: verivus-oss/llm-cli-gateway

Cache-Aware Spawning: What Changed in llm-cli-gateway, a Week On

Werner Kasselman — Tue, 26 May 2026 07:42:37 +0000

If your multi-LLM workload sends the same long system prompt or file dump to Claude / Codex / Gemini ten times an hour, you are paying for the same input tokens ten times. Each provider has a cache for exactly this case, and each one expresses the cache differently. This post is about how llm-cli-gateway now uses those caches for you, across all five providers, without you having to re-implement the per-provider cache APIs yourself. I covered the previous round of changes last week, and I closed that piece with a teaser, that Mistral Vibe was next on the list. A week later, Mistral is in, and a much larger change has landed alongside it, which is what most of this follow-up is about.

The new shape of the gateway: it now understands prompt caching as a first-class concern, across all five providers. That is claude, codex, gemini, grok, and mistral (Vibe). v1.6.0 shipped today and contains the lot.

Short version: every *_request and *_request_async tool now accepts a structured promptParts shape, the gateway concatenates the parts in a canonical order so the stable bytes precede the volatile tail unchanged across calls, three new cache_state:// MCP resources expose hit-rate / hit-count / estimated-savings aggregates back to the orchestrating agent, session_get projects a compact cacheState view at read time, and a cache_ttl_expiring_soon warning fires on Claude resumes when the Anthropic cache breakpoint is within 30 seconds of expiry. All of it is opt-in (every flag defaults off in 1.x), all of it observes the per-provider cache mechanism rather than fighting it, and none of it adds conversation content to gateway storage.

Long version is below, organised the same way I organised last week's post, problem - what changed - what it now does, with the caveats named up front rather than buried.

Mistral Vibe makes five (closing last week's loop)

Mistral shipped Vibe, their open-source CLI coding agent powered by Devstral 2. The gateway now wires mistral_request and mistral_request_async alongside the other four providers. Same shape as the rest: sessions through --resume / --continue (which requires [session_logging] enabled = true in ~/.vibe/config.toml; the doctor surfaces this so you do not get an opaque failure), model registry entries, self-update via the vibe binary itself, and the same circuit-breaker, approval-gate, flight recorder, metrics, dedup, and durable-job-store plumbing as the others.

The model alias resolution is slightly different. Vibe has no --model flag, so the gateway injects the resolved alias via VIBE_ACTIVE_MODEL instead. That is the only material divergence from the Claude / Codex / Gemini / Grok pattern, and it is documented inline at the call site.
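
A minimal sketch of that divergence, assuming the alias table is a simple name-to-name map. Only the VIBE_ACTIVE_MODEL variable comes from the post; resolveModel, buildVibeEnv and the Env type are hypothetical, and the gateway's real model registry is not reproduced here.

```typescript
type Env = Record<string, string | undefined>;

// Resolve an alias if one is registered, otherwise pass the name through.
function resolveModel(aliases: Map<string, string>, requested: string): string {
  return aliases.get(requested) ?? requested;
}

// Vibe has no --model flag, so the choice travels in the child's environment.
// The base env is copied, never mutated.
function buildVibeEnv(
  baseEnv: Env,
  aliases: Map<string, string>,
  requested?: string
): Env {
  if (requested === undefined) return { ...baseEnv };
  return { ...baseEnv, VIBE_ACTIVE_MODEL: resolveModel(aliases, requested) };
}
```

The returned object would be passed as the env option when spawning the child, so the model choice never touches argv at all.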

That makes five providers, five model families, five vendor lineages (Anthropic, OpenAI, Google, xAI, Mistral). What I noticed running parallel reviews these past few weeks is that the OpenAI / Anthropic / Google triangle agreeing on something is not as informative as it looks, because the three model lineages share a lot of training data and a lot of post-training tendencies. I am not pretending this is statistics, it is just how I use these tools in review work, but adding an xAI voice and a Mistral voice means a five-way agreement is sampled from a meaningfully wider distribution than a three-way agreement, and a one-out-of-five dissent (especially from the vendor-outside-the-triangle) is a data point I read rather than a vote I discard.

promptParts: structured prompts, prefix discipline, no API contortions

The change that took most of the engineering is promptParts. The shape is small:

{
  "promptParts": {
    "system": "You are a careful reviewer of TypeScript diffs.",
    "tools":  "<long, stable description of the tools you can call>",
    "context": "<long, stable file dump or repo summary>",
    "task":    "What did the last patch change?"
  }
}

prompt and promptParts are mutually exclusive: you pass exactly one. The runtime check at the top of every handler returns the exact error message provide exactly one of `prompt` or `promptParts` if you pass both (the backticks belong to the error string itself; the messages are part of the public contract and the tests assert them verbatim). The gateway then concatenates the parts in canonical order, system → tools → context → task, with a stable separator, and hands the resulting string to the CLI's positional -p (or equivalent) argument. The stable prefix bytes precede the volatile task tail unchanged across calls, which is enough for each provider's automatic prompt-caching to land on the same content hash each time.
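
The composition step can be sketched as follows. Several details are assumptions and not the gateway's code: the separator is a guess (the post only says "a stable separator"), every part is treated as optional, and the same error is raised when neither field is supplied. composePrompt and RequestInput are hypothetical names.

```typescript
interface PromptParts {
  system?: string;
  tools?: string;
  context?: string;
  task?: string;
}

interface RequestInput {
  prompt?: string;
  promptParts?: PromptParts;
}

// Stable parts first, volatile task last.
const CANONICAL_ORDER = ["system", "tools", "context", "task"] as const;
const SEPARATOR = "\n\n"; // assumed value
const EXACTLY_ONE = "provide exactly one of `prompt` or `promptParts`";

function composePrompt(input: RequestInput): string {
  const { prompt, promptParts } = input;
  // Both present or both absent: reject. (Applying the same message to the
  // "neither" case is an assumption.)
  if ((prompt === undefined) === (promptParts === undefined)) {
    throw new Error(EXACTLY_ONE);
  }
  if (prompt !== undefined) return prompt;

  const parts: PromptParts = promptParts ?? {};
  const pieces: string[] = [];
  for (const key of CANONICAL_ORDER) {
    const value = parts[key];
    if (value !== undefined && value !== "") pieces.push(value);
  }
  return pieces.join(SEPARATOR);
}
```

Because the order is fixed by the gateway and not by the caller's key order, two requests that differ only in task produce strings that share every byte up to the final separator, which is the property the provider-side caches key on.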

Two specific points worth naming.

First, this is not a request-body translation layer. The gateway does not construct Anthropic / OpenAI / Mistral JSON request bodies; it spawns the CLI binary the same way it always has. The "cache awareness" sits one layer above, in how the input string is composed before the CLI sees it. That keeps the architectural thesis intact (CLI wrapping, not API proxying) while still giving you cache hygiene for free.

Second, for Claude specifically, the gateway does not yet emit explicit cache_control JSON breakpoints. The Claude Code CLI documents --exclude-dynamic-system-prompt-sections and several ENABLE_PROMPT_CACHING_* / DISABLE_PROMPT_CACHING_* environment variables (all listed in PROVIDER_CACHE_SURFACES.md with citations to the upstream env-vars page), but the path for injecting per-block cache_control markers via stream-json input is probable rather than verified. The [cache_awareness].emit_anthropic_cache_control flag is reserved in config for the follow-up slice that lands a live smoke test, so the present 1.6.0 release ships "Branch B" (prefix discipline only). That is honest about what works and what is gated on verification.

Third (because I said two and meant three), per-model minimum cacheable token thresholds matter. Anthropic Sonnet 3.5–4.6 caches at 1024 tokens minimum; Opus 4.5+ and Haiku 4.5 require 4096; Haiku 3.5 on Vertex needs 2048. The gateway has a [cache_awareness.min_stable_tokens_for_cache_control] per-family table populated from the Anthropic prompt-caching docs and surfaces the lookup via a minStableTokensForModel(config, modelName) helper. The in-code alias table is conservative (it collapses all Haiku variants to 4096 rather than exposing the Vertex-only 2048 distinction); a single-family override can be added when a workload needs it. Slice 1 does not yet act on this (we are not emitting cache_control), but the data is in place for the slice that will.
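
A sketch of how such a per-family lookup might work. The helper name and argument order come from the post; the config shape, the substring-matching rule and the body are guesses for illustration. The thresholds are the conservative values quoted above, with Opus and Haiku each collapsed to a single family entry.

```typescript
interface CacheAwarenessConfig {
  // Per-family minimum stable-prefix size, in tokens.
  minStableTokensForCacheControl: Record<string, number>;
}

const DEFAULT_CONFIG: CacheAwarenessConfig = {
  minStableTokensForCacheControl: {
    sonnet: 1024,
    opus: 4096,  // the post quotes 4096 for Opus 4.5+; one entry per family here
    haiku: 4096, // conservative: collapses the Vertex-only 2048 case
  },
};

// Return the family threshold for a model name, or undefined if the
// family is unknown. Matching by substring is an assumption.
function minStableTokensForModel(
  config: CacheAwarenessConfig,
  modelName: string
): number | undefined {
  const name = modelName.toLowerCase();
  const table = config.minStableTokensForCacheControl;
  for (const family of Object.keys(table)) {
    if (name.includes(family)) return table[family];
  }
  return undefined;
}

// Would a stable prefix of this size be worth marking for caching?
function meetsCacheThreshold(
  config: CacheAwarenessConfig,
  modelName: string,
  stableTokens: number
): boolean {
  const minimum = minStableTokensForModel(config, modelName);
  return minimum !== undefined && stableTokens >= minimum;
}
```

A workload that needs the Vertex-only Haiku threshold would supply its own config with haiku set to 2048, which is the single-family override the paragraph above describes.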

cache_state://: observability without bleeding prompt text

The supporting piece, and frankly the one that makes the rest defensible, is the observability surface. Three new MCP resources sit alongside the existing sessions:// and models:// resources:

cache_state://global - aggregates across the last 24h, with total_requests, total_hits, hit_rate, total_cache_read_tokens, total_cache_creation_tokens, estimated_savings_usd (best-effort, using a per-model pricing table dated 2026-05-26), and a per-CLI breakdown.
cache_state://session/{sessionId} - per-session aggregates, plus distinct prefix count and (for Claude only) the ttlRemainingMs derived from the configured Anthropic TTL policy.
cache_state://prefix/{hash} - per-stable-prefix-hash aggregates, with a CLI x model breakdown so you can see which providers / models hashed to the same stable prefix.

The structural guarantee: none of these shapes have a prompt / response / system / task field. The session-storage invariant from the project's CLAUDE.md ("no conversation content in session storage") holds, and the new bits add only hash + token-count metadata to the existing flight recorder (which already stored prompts and responses for audit, separate from the session manager). I would not have shipped the observability surface without that constraint, frankly.
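
To illustrate how aggregates like these can be computed from metadata alone, here is a small sketch. The row shape and the definition of a "hit" (any request with a non-zero cache-read token count) are assumptions; the output field names follow the cache_state://global list above, minus the pricing-based estimate and the per-CLI breakdown.

```typescript
// Metadata-only row: a hash and token counts, no prompt or response text.
interface CacheRow {
  cli: string;
  prefixHash: string;
  cacheReadTokens: number | null;     // null when telemetry was not parsed
  cacheCreationTokens: number | null;
}

interface CacheAggregate {
  total_requests: number;
  total_hits: number;
  hit_rate: number;
  total_cache_read_tokens: number;
  total_cache_creation_tokens: number;
}

function aggregate(rows: CacheRow[]): CacheAggregate {
  let hits = 0;
  let read = 0;
  let created = 0;
  for (const row of rows) {
    const readTokens = row.cacheReadTokens ?? 0;
    if (readTokens > 0) hits += 1;
    read += readTokens;
    created += row.cacheCreationTokens ?? 0;
  }
  return {
    total_requests: rows.length,
    total_hits: hits,
    // Guard the empty case so the rate is 0 instead of NaN.
    hit_rate: rows.length === 0 ? 0 : hits / rows.length,
    total_cache_read_tokens: read,
    total_cache_creation_tokens: created,
  };
}
```

Rows whose token columns are null simply count as misses, so a provider without parsed telemetry lowers the hit rate but never breaks the arithmetic.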

The session_get tool now includes a compact cacheState block when the session has prior requests, with cli, prefixDistinct, totalCacheReadTokens, totalCacheCreationTokens, requestCount, hitCount, hitRate, estimatedSavingsUsd, and ttlRemainingMs. The field is omitted entirely for fresh sessions (not null, not empty object), keeping the payload compact when there is nothing to report.

cache_ttl_expiring_soon: warning, not error

Slice 3 is the bit that uses the observability data for actionable warnings. When claude_request (or claude_request_async) is invoked with a sessionId, and [cache_awareness].warn_on_ttl_expiry = true, and the prior session row's lastRequestAt is within 30 seconds of Anthropic's documented TTL (5 minutes by default, 1 hour when [cache_awareness].anthropic_ttl_seconds = 3600), the response payload carries a structured warning:

{
  "warnings": [{
    "code": "cache_ttl_expiring_soon",
    "ttlRemainingMs": 12000,
    "message": "Anthropic cache breakpoint for session ... expires in 12000ms (< 30000ms). Subsequent requests may miss the cache."
  }]
}

It is a warning, not a hard error. The request still runs. The flag defaults to false in 1.x; flip it on once you have observed your traffic for a few days. Two caveats. First, ttlRemainingMs is best-effort, computed locally from our flight recorder's lastRequestAt rather than from Anthropic's actual cache state, so a cache eviction inside Anthropic's window will not be visible to us and the warning may be optimistic. Second, it only fires for Claude. For the other four CLIs, we do not observe the provider's cache state (or, in some cases, the provider does not expose one at all), so the warning would be a guess.
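
The timing arithmetic behind the warning is simple enough to sketch. This assumes timestamps in epoch milliseconds and that nothing is emitted once the local estimate says the TTL has already lapsed; ttlWarning is a hypothetical name, and the message merely mirrors the example payload above.

```typescript
interface TtlWarning {
  code: "cache_ttl_expiring_soon";
  ttlRemainingMs: number;
  message: string;
}

const WARN_WINDOW_MS = 30000;

// Best-effort and local only: derived from the last recorded request time,
// not from the provider's actual cache state.
function ttlWarning(
  sessionId: string,
  lastRequestAtMs: number,
  nowMs: number,
  ttlSeconds: number = 300
): TtlWarning | undefined {
  const ttlRemainingMs = lastRequestAtMs + ttlSeconds * 1000 - nowMs;
  // Already lapsed, or still comfortably inside the TTL: nothing to report.
  if (ttlRemainingMs <= 0 || ttlRemainingMs >= WARN_WINDOW_MS) return undefined;
  return {
    code: "cache_ttl_expiring_soon",
    ttlRemainingMs,
    message:
      `Anthropic cache breakpoint for session ${sessionId} expires in ` +
      `${ttlRemainingMs}ms (< ${WARN_WINDOW_MS}ms). Subsequent requests may miss the cache.`,
  };
}
```

With the default 5-minute TTL, a resume arriving 288 seconds after the previous request leaves 12000 ms, which matches the example payload; with ttlSeconds set to 3600 the same check applies against the one-hour policy.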

The Codex CLI, however, deserves a specific note. As of 0.133.0, Codex emits cached_input_tokens in its turn.completed.usage payload, verified by a live smoke test on 2026-05-26 (the test invocation, the raw JSONL response, and the field-name divergence from the Anthropic-style cache_read_input_tokens are all captured in docs/personal-mcp/PROVIDER_CACHE_SURFACES.md under the "Codex field name divergence" section; the gateway's src/codex-json-parser.ts was originally written against the Anthropic-style name). The parser's cache_read_tokens column therefore stays null for Codex rows until a follow-up updates the parser to accept the actual field. The observability surface tolerates this without dividing by zero, and the limitation is also documented in the CHANGELOG entry for 1.6.0 so reviewers do not assume Codex telemetry exists when it does not.
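
One way the follow-up could look, sketched under assumptions: a reader that accepts either field name on a turn.completed line and returns null for anything else. cachedTokensFromLine is a hypothetical helper, not the code in src/codex-json-parser.ts, and the payload shape is inferred from the field path named above.

```typescript
// Read the cached-token count from one JSONL line, accepting either the
// Codex field name or the Anthropic-style one. Returns null when the line
// is not a usable turn.completed event.
function cachedTokensFromLine(line: string): number | null {
  let event: unknown;
  try {
    event = JSON.parse(line);
  } catch {
    return null; // malformed line: treat as "no telemetry"
  }
  if (typeof event !== "object" || event === null) return null;
  const record = event as Record<string, unknown>;
  if (record["type"] !== "turn.completed") return null;

  const usage = record["usage"];
  if (typeof usage !== "object" || usage === null) return null;
  const fields = usage as Record<string, unknown>;

  const value = fields["cached_input_tokens"] ?? fields["cache_read_input_tokens"];
  return typeof value === "number" && Number.isFinite(value) ? value : null;
}
```

Returning null instead of 0 for a missing field keeps "the provider reported no cache reads" distinct from "we could not read the telemetry", which is the distinction the current null column already preserves.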

The plumbing layer (which is not a feature, but is a habit change)

v1.6.0 also brings a much larger contributor-facing change that does not show up in any tool surface, but is worth naming. The gateway now ships with the same security and validation posture as our agent-assurance spec repository. A new .github/workflows/security.yml runs actionlint, zizmor, shellcheck, typos, osv-scanner, gitleaks, ruff, bandit, and lychee on every push and pull request; eslint-plugin-security is wired into the existing eslint config and runs as part of the standard CI lint step. All third-party actions are SHA-pinned; the Python and Go tools are version-pinned (zizmor==1.25.2, ruff==0.14.5, bandit==1.9.4, actionlint@v1.7.12); the gitleaks binary is downloaded and SHA256-verified before execution. Workflows now use least-privilege permissions, defaulting to contents: read and escalating only on the publish jobs that need OIDC for npm provenance / PyPI trusted publishing or gh release upload; every actions/checkout sets persist-credentials: false except the single job that needs the token for the release upload; the release-installer.yml top-level write was narrowed to that one job. Dependabot expanded from github-actions only to also cover npm and pip, with non-security npm bumps grouped so security updates never get delayed behind a batch.

In flight, osv-scanner flagged 26 Go stdlib CVEs in installer/go.mod (pinned to Go 1.22, when the fixes were in 1.23–1.25.x); that has been bumped to 1.25 in lock-step with the release-installer.yml setup-go pin, and re-verified clean. Two test fixtures and one npmjs.com URL needed allowlisting (a deliberate fake bearer token, an npmjs page that Cloudflare bot-protects, and a similar OpenAI help-centre page), each annotated with the specific reason. There are no real findings outstanding.

This is not the kind of work that ships in a marketing line. It is the work that means the next contributor (or me, six months from now) does not accidentally land a workflow with contents: write and a published-to-cache setup-node step on a release-triggered workflow, which is precisely the kind of supply-chain footgun the Solorigate, Codecov, and xz class of incidents has trained the industry to take seriously. It is the work that means a Dependabot PR with a real CVE fix gets reviewed against an automated gate, not a human's best guess. It is the work that makes claims about supply-chain hygiene auditable rather than aspirational.

Where you can call it from

The cache-awareness story above frames the gateway as something claude-code or codex spawns when an MCP request lands, but that is only one of three inbound surfaces, and it is worth being explicit about the other two because they are how a lot of people actually use the gateway day to day. The gateway is itself an MCP server, so anything that speaks MCP can reach it, and the cache-awareness, observability, and TTL warnings described above apply identically regardless of which surface called in.

stdio MCP from another CLI (the path most of the post has been describing). claude-code, codex, gemini, grok, and vibe each have their own MCP config (~/.claude.json, ~/.codex/config.toml, ~/.gemini/settings.json, and so on); the gateway gets a single entry that wires llm-cli-gateway as the command, and the inbound CLI then sees all of claude_request / codex_request / gemini_request / grok_request / mistral_request plus the session and cache_state:// resources as if they were its own tools.
Claude Desktop through either the local stdio MCP path (same shape as the CLI case, just installed via Claude Desktop's MCP configuration UI) or, where available, the remote MCP connector path against the gateway's HTTP transport. Per-platform setup snippets live in setup/providers/claude-desktop.md; the doctor's client_config.claude_desktop_config_present field tells the install agent which path applies.
ChatGPT custom connectors / developer mode against the gateway's HTTP transport behind a public HTTPS URL. The gateway ships llm-cli-gateway tunnel start and llm-cli-gateway chatgpt-url for the connector wiring; the doctor's endpoint_exposure.web_clients_supported field is the gating boolean. The wrinkle worth knowing about is that ChatGPT requires Authentication: No Authentication on the connector path, so the gateway's LLM_GATEWAY_NO_AUTH_PATHS env var carves out exactly that path while keeping /mcp bearer-token-gated. The walk-through is in setup/providers/chatgpt.md.

llm-cli-gateway doctor --json is the authoritative source for which of these surfaces are wired today, and the install-agent contract at setup/assistants/ASSISTANT_CONTRACT.md is the canonical walk-through, with per-target snippets under setup/providers/. If you want to try the cache-aware flow from inside ChatGPT's developer-mode connector or from Claude Desktop without first installing five upstream CLIs, the stdio MCP path needs only node + the gateway binary and an upstream CLI of your choice; the other four providers go in as and when you add them.

What this changes about the original argument

Nothing, again. The thesis from the original piece was that CLI wrapping gives you capabilities (real file access, real test execution, real session state) that API proxying cannot reach without re-implementing each provider's tool surface. Cache hygiene now joins that list. Each provider's CLI is the right surface to ask "what does this cost?", because each provider's CLI is the only surface that returns telemetry the same way the operator's billing console returns it. The gateway's job is to compose the stable bytes before the volatile bytes so the cache lands on the same content hash, then to read back the resulting cache_read_input_tokens (or cached_input_tokens, depending on the CLI version) from the flight recorder and surface it as an MCP resource the orchestrating agent can act on.
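To illustrate "stable bytes before volatile bytes", here is a small sketch in Python. The function name, the newline joining, and the choice of SHA-256 are assumptions for illustration, not the gateway's actual composition logic.

```python
import hashlib


def compose_prompt(stable_parts, volatile_parts):
    """Put the stable content first and hash only that prefix."""
    stable = "\n".join(stable_parts)
    prompt = stable + "\n" + "\n".join(volatile_parts)
    # The hash covers only the stable prefix, so a changed question
    # does not change the identity of the cached content.
    prefix_hash = hashlib.sha256(stable.encode("utf-8")).hexdigest()
    return prompt, prefix_hash


stable = ["SYSTEM: you are a reviewer", "REPO SUMMARY: (long, unchanged text)"]
first_prompt, first_hash = compose_prompt(stable, ["Question: is the parser safe?"])
second_prompt, second_hash = compose_prompt(stable, ["Question: is the cache keyed well?"])
print(first_hash == second_hash)
```

Two requests that share the same stable content produce the same prefix hash even though the full prompts differ, which is the property a prefix cache needs.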

What an API-proxy approach would have to do for the same outcome: construct provider-specific request bodies with per-block cache_control markers, then handle the per-provider divergence in cache field names (cache_read_input_tokens for Anthropic, prompt_tokens_details.cached_tokens for OpenAI, usageMetadata.cachedContentTokenCount for Gemini), then handle the per-provider divergence in TTL policy (5min/1h for Anthropic, implicit-only for OpenAI, separate cachedContents SDK for Gemini), and own the resulting compatibility surface forever. We instead let each CLI own its own provider integration and stand back, sampling the telemetry as it comes out.
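For a sense of what that divergence looks like in code, here is a hedged sketch of a normaliser over the field names listed above. The nesting of each payload and the lookup order are assumptions; real responses may wrap these fields differently.

```python
from typing import Any, Optional

# Field paths taken from the names in the text; the nesting is assumed.
CACHE_FIELD_PATHS = (
    ("cache_read_input_tokens",),
    ("cached_input_tokens",),
    ("prompt_tokens_details", "cached_tokens"),
    ("usageMetadata", "cachedContentTokenCount"),
)


def cached_tokens(usage: dict) -> Optional[int]:
    """Return the first cached-token count found, or None if there is none."""
    for path in CACHE_FIELD_PATHS:
        node: Any = usage
        for key in path:
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        if isinstance(node, int):
            return node
    return None


print(cached_tokens({"cache_read_input_tokens": 1200}))
print(cached_tokens({"prompt_tokens_details": {"cached_tokens": 900}}))
print(cached_tokens({"usageMetadata": {"cachedContentTokenCount": 450}}))
```

Every new provider, and every renamed field, adds a row to that table. That is the compatibility surface the CLI-wrapping approach avoids owning.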

If you are evaluating llm-cli-gateway against an API proxy and your workload is heavy on long stable context (file dumps, repo summaries, large system prompts), the question to ask now is not just "does this give me cache hits?"; it is "does this give me cache hits I can measure, without me having to re-implement per-provider cache APIs?". That seemed worth writing down.

What's next

The Branch A live smoke test for explicit Claude cache_control injection via --input-format stream-json. The Codex parser fix to accept cached_input_tokens. Async-path flight-recorder integration, so the v3 stable_prefix_hash column gets populated on async jobs too (it does not today, by design, because src/async-job-manager.ts has zero flight-recorder integration, and that is a separate concern). And, once we have 24h of dogfooding data from cache_state://global, the cache-aware multi-LLM routing slice, which is the actual end goal: route a request to the provider whose session has the warmest cache for the requested prefix, rather than the round-robin default.

v1.6.0 is the feature release described above; a docs-only follow-up v1.6.1 went out the same day with the install-agent guidance for Mistral and the post-release doc audit fixes (no source changes). The current published artefacts are at v1.6.1 on npm (with sigstore provenance via the OIDC publish path) and PyPI; the GitHub release at v1.6.1 carries SHA256-verifiable installer artefacts for macOS / Linux / Windows.

Thanks for reading this far. As always, MIT licensed.

llm-cli-gateway is MIT licensed. npm: llm-cli-gateway | GitHub: verivus-oss/llm-cli-gateway

What's new in llm-cli-gateway

Werner Kasselman — Tue, 19 May 2026 04:27:38 +0000

A few weeks ago I wrote Why CLI Wrapping Beats API Proxying for Multi-LLM Development, the case for spawning claude, codex, and gemini as child processes instead of proxying to their APIs. Three things have changed since I published that piece. Two of them fix real limitations I named at the time. The third is a new capability that I wish had been there from the start, and I think it is worth a follow-up.

Codex sessions are now real, not bookkeeping

In the original post I said llm-cli-gateway uses real CLI continuity flags, "--continue and --resume, not bookkeeping". That was true for Claude and Gemini. For Codex it was, frankly, not quite there.

Codex did not have a documented resume mechanism at the time. So when you opened a Codex session through the gateway, the session record was real (UUID, created/lastUsed timestamps, the active-session-per-CLI invariant) but the codex process itself started fresh on every request. The gateway tagged subsequent requests as belonging to a session, you could see the session in session_list, but Codex did not know that.

Codex has since shipped exec resume <session-id> and exec resume --last, and the gateway now wires both. If you pass a real Codex session UUID (the kind that lives in ~/.codex/sessions/), codex_request invokes exec resume and you get genuine continuity: the same tool-use history, file context, and partial work the CLI itself preserves. resumeLatest: true pins to the most recent session without you having to look the UUID up.

Two caveats are worth naming up front. First, only real Codex UUIDs are accepted; gateway-issued gw-* IDs are rejected on resume, because there is no Codex-side session for them to attach to. Second, --full-auto is dropped on resume, which is a Codex constraint and not something the gateway can paper over. The trade-off is reasonable: you keep the continuity but need to restate the approval policy.
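The resume rules can be sketched as an argument builder. This is an illustration in Python, not the gateway's TypeScript. The function name, the parameter names, and the position of the prompt argument are assumptions; only the resume subcommand, --last, --full-auto, and the gw-* rejection come from the text.

```python
import uuid
from typing import Optional


def build_codex_args(prompt: str, session_id: Optional[str] = None,
                     resume_latest: bool = False, full_auto: bool = False) -> list:
    """Build a codex argv that follows the two resume caveats."""
    args = ["codex", "exec"]
    if resume_latest:
        args += ["resume", "--last"]
    elif session_id is not None:
        if session_id.startswith("gw-"):
            raise ValueError("gateway-issued gw-* IDs have no Codex-side session")
        uuid.UUID(session_id)  # raises ValueError for anything that is not a UUID
        args += ["resume", session_id]
    elif full_auto:
        # Only a fresh run keeps --full-auto; it is dropped on resume.
        args.append("--full-auto")
    args.append(prompt)
    return args


print(build_codex_args("continue the refactor", resume_latest=True, full_auto=True))
print(build_codex_args("start the refactor", full_auto=True))
```

In the first call full_auto=True is silently ignored because the run is a resume, which is why the approval policy has to be restated by the caller.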

Codex now sits where Claude and Gemini sit. The bullet that said "Session continuity using real CLI flags, not bookkeeping" is now true for all of them.

Grok makes four, on purpose

xAI shipped an official Grok CLI (the grok-build TUI) and I added it as the fourth provider. The tools mirror the others one-for-one: grok_request and grok_request_async, sessions through --resume / --continue, model registry entries, self-update via grok update, the same circuit-breaker and approval-gate plumbing, the same flight recorder, the same metrics. Auth follows the same shape: a prior grok login (OAuth) or a GROK_CODE_XAI_API_KEY environment variable, with GROK_DEFAULT_MODEL, GROK_MODELS, and GROK_MODEL_ALIASES all honoured.

The interesting question is not whether to add Grok (the parity work is mechanical) but why. The case is consensus diversity.

Claude, Codex, and Gemini cover Anthropic, OpenAI, and Google. That lineup is well-suited for parallel review work, but it is three of the same kind of organisation, three model families that share a lot of training data lineage and a lot of post-training tendencies. When you ask all three to red-team the same change, the disagreements are real, but the agreements are sometimes less informative than they look, because you are sampling three points from a narrower distribution than the org names suggest.

Grok's training lineage sits outside the OpenAI/Anthropic/Google triangle. So when a four-way consensus check returns 4/4 agreement on a security finding, the signal is stronger than 3/3. And when Grok dissents alone, that is a data point worth reading, not a vote to discard. The value is not that Grok is better at reviews than the others (I do not believe that, and the workflows do not assume it). The value is independence.
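A small sketch of how a consensus check might treat that. The verdict strings, the function name, and the returned fields are hypothetical; the point is only that a lone dissent is surfaced by name instead of being averaged away.

```python
from collections import Counter


def tally(verdicts: dict) -> dict:
    """Summarise per-provider verdicts without discarding a lone dissenter."""
    counts = Counter(verdicts.values())
    majority, votes = counts.most_common(1)[0]
    dissenters = sorted(p for p, v in verdicts.items() if v != majority)
    return {
        "majority": majority,
        "agreement": f"{votes}/{len(verdicts)}",
        "unanimous": not dissenters,
        "lone_dissenter": dissenters[0] if len(dissenters) == 1 else None,
    }


unanimous = tally({"claude": "vulnerable", "codex": "vulnerable",
                   "gemini": "vulnerable", "grok": "vulnerable"})
split = tally({"claude": "safe", "codex": "safe",
               "gemini": "safe", "grok": "vulnerable"})
print(unanimous["agreement"], split["lone_dissenter"])
```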

Durable job results and auto-dedup

This is the change that came from running the gateway against real work for a few months and watching the same failure happen over and over.

The original architecture had a soft spot. Async jobs run long, sometimes longer than the orchestrating agent's polling window. The agent gives up, reissues the request, and the whole Codex or Claude invocation starts over. The CLI work you just paid 90 seconds for is thrown away and replaced with a second 90-second run that does exactly the same thing. I lost track of how much wall time this cost me before I sat down and fixed it properly.

The fix is two pieces, both wired into the existing flight recorder SQLite database at ~/.llm-cli-gateway/logs.db:

Every async job persists to a new jobs table on every state transition (start, throttled output flush, completion). llm_job_status and llm_job_result transparently fall back to the durable store when the in-memory job is gone, so a caller can collect a result regardless of how long ago the work finished. Retention defaults to 30 days, configurable via LLM_GATEWAY_JOB_RETENTION_DAYS. Jobs still "running" when the gateway stops are marked orphaned on next boot, and the partial output stays readable.
Identical requests within a dedup window short-circuit onto the existing running or completed job. The default window is 1 hour, configurable via LLM_GATEWAY_DEDUP_WINDOW_MS. The "polling timed out, reissue, run it all again" loop is structurally gone. For the case where the prior result is actually wrong and you want a fresh invocation rather than a re-attach, every request tool accepts forceRefresh: true.
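The two pieces can be sketched together with Python's sqlite3 module. This is an illustration only: the table layout, the column names, and the dedup key derivation are assumptions, not the gateway's schema. The one-hour window, the forceRefresh escape hatch, and the orphan marking follow the description above.

```python
import hashlib
import json
import sqlite3

DEDUP_WINDOW_MS = 60 * 60 * 1000  # the 1 hour default described above


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS jobs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        dedup_key TEXT NOT NULL,
        status TEXT NOT NULL,
        output TEXT NOT NULL DEFAULT '',
        created_ms INTEGER NOT NULL)""")
    return db


def dedup_key(tool, params):
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps({"tool": tool, "params": params}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def submit(db, tool, params, now_ms, force_refresh=False):
    """Return (job_id, reused). Reattach to a recent identical job if allowed."""
    key = dedup_key(tool, params)
    if not force_refresh:
        row = db.execute(
            "SELECT id FROM jobs WHERE dedup_key = ? "
            "AND status IN ('running', 'completed') AND created_ms >= ? "
            "ORDER BY created_ms DESC, id DESC LIMIT 1",
            (key, now_ms - DEDUP_WINDOW_MS)).fetchone()
        if row:
            return row[0], True
    cur = db.execute(
        "INSERT INTO jobs (dedup_key, status, created_ms) VALUES (?, 'running', ?)",
        (key, now_ms))
    db.commit()
    return cur.lastrowid, False


def mark_orphans(db):
    """On boot, anything still 'running' belonged to a dead process."""
    cur = db.execute("UPDATE jobs SET status = 'orphaned' WHERE status = 'running'")
    db.commit()
    return cur.rowcount


db = open_store()
request = {"prompt": "review the diff"}
first_id, first_reused = submit(db, "codex_request", request, now_ms=1_000)
second_id, second_reused = submit(db, "codex_request", request, now_ms=2_000)
print(first_id == second_id, second_reused)
```

The reissued request at 2,000 ms lands on the job created at 1,000 ms instead of starting a second run, which is the loop the dedup window removes.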

The change moves the gateway closer to what I wanted it to be from the start, a durable result-collection layer for CLI agents rather than a thin process spawner that hopes the caller is still listening when the CLI finishes. 20 new tests cover persistence, dedup, restart-orphan, retention, and Grok parity, and the full suite passes at 322 tests.

What this changes about the original argument

Nothing, actually. The thesis from the first post still stands, that CLI wrapping gives you capabilities (real file access, real test execution, real session state) that API proxying fundamentally cannot. These three updates strengthen the same case rather than contradict it.

What they fix is the gap between the thesis and the implementation. Codex sessions now carry the same real-CLI continuity as Claude and Gemini. The consensus pattern now has a fourth, vendor-independent voice. And the long-running-job failure mode that always threatened to undercut the whole CLI-spawning approach is gone, because the result lives on disk regardless of who is or is not still polling for it.

If you are evaluating llm-cli-gateway against an API proxy, the comparison is slightly different now than it was in March, on three specific axes. That seemed worth writing down.

What's next?

Mistral shipped Mistral Vibe, their official open-source CLI coding agent, powered by Devstral 2. I will be adding it next for even more diversity!

llm-cli-gateway is MIT licensed. npm: llm-cli-gateway | GitHub: verivus-oss/llm-cli-gateway

Here's what stopped breaking, when you make LLM agents author in two formats

Werner Kasselman — Wed, 06 May 2026 04:39:27 +0000

LLM agents will happily produce a thousand lines of plausible Markdown describing work that doesn't compile, isn't tested, and contradicts a decision the same agent wrote down two files earlier. If you want to review their output without re-reading every paragraph, some of the work product has to be machine-checkable.

You also can't push everything into a schema. Intent, tradeoffs, the alternative you rejected: that material dies in JSON. The interesting question is the boundary. What belongs in prose, what belongs in structure, and what falls out when you draw the line in the wrong place.

I landed on this after running it for real. I introduced the runtime layer later, when I expanded this to multiple repos and saw that the flat files had stopped scaling.

The split

Every unit of agent work produces three things:

Narrative. Markdown specs, designs, plans, notes. The human-readable record: intent, tradeoffs, what was rejected, context a future reader needs.
Structure. TOML files encoding the work itself: a dependency DAG, a traceability map (INT → FEAT → REQ → DEC → IMP → CODE → TEST → OUT), and a review-readiness bundle.
Evidence. Review artifacts that answer "is this actually reviewable, and does the claim match the proof?"

Markdown carries what structure can't. Intent and reasoning. Why the design has this shape. What was rejected. What the author worried about. Schema fields can't express ambivalence. Specs change during brainstorm and review, and prose is the right medium for that conversation; forcing every change through schema churn throttles thinking. Six months later, the reviewer needs narrative, not a graph.

TOML carries what prose can't reliably:

Machine-checkable invariants. blocks is the exact inverse of depends_on. Every ART: has exactly one producer. Every consumes matches a produces. These are enforced by validators, not by hoping a human noticed.
Graph queries. What's ready to start? What's the critical path? Which units conflict on files? Which REQ: has no downstream TEST:? These are queries over structure, not reading comprehension.
Stable identifiers. Prose drifts. U07a, REQ:auth-001, ART:schema-v2 don't.
Diff-readable state. A status transition is a one-line diff, not a paragraph to re-read.

Frame the split as narrative vs. structure, each in the medium that protects its own invariants. Calling it "docs vs. config" gets it wrong because both formats are doing real review-time work; one of them just gets to be checked by python -m.

Why TOML and not YAML or JSON

I picked TOML deliberately. YAML loses on parse ambiguity. The country: NO problem (Norway gets parsed as the boolean false under YAML 1.1) is real and gets worse when an LLM is generating the file under time pressure. JSON loses on the human-authoring axis: trailing commas explode, every string needs quotes, comments are forbidden. TOML parses unambiguously, reads cleanly enough to author and review by hand, and ships in the Python stdlib (tomllib since 3.11), so my validators stay dependency-light.

For agent-authored, human-reviewed structure, TOML is the boring choice. It wins because it's boring.

The three review pillars came from failure data

The review-readiness package didn't exist on day one. I added it after running an iteration-chain analysis across seven real review cycles and finding that almost every re-review came from one of three deficiencies, in the same order, over and over.

Missing prerequisite artifacts. Review blocked not on conceptual disagreement but on the absence of required planning docs, cross-links, prior diagrams, or test plans. The reviewer couldn't judge readiness because the artifact class wasn't actually complete.

Ambiguous contracts. Ordering rules, normalization, precedence, fallback, schema shape: reviewers had to infer semantics the author never wrote down. Every inference round added a re-review.

Overclaimed completeness. "Ready for implementation." "Production ready." "All findings resolved." Unbacked by proof, or backed by proof narrower than the claim. Each one cost another round.

Three failure modes, three artifacts. A readiness gate answers whether the artifact class is complete enough to review at all, and blocks opening a review until it passes. A contract declaration makes behavioral semantics explicit up front so reviewers never have to invent them. An evidence matrix binds every strong claim to a concrete proof artifact, a stated scope, and a list of known exclusions; a claim broader than its evidence fails validation.

The workflow is strict and intentionally rude. Fill the readiness gate first; if blocked, don't open review. Fill the contract second; vague statements get rejected. Fill the evidence matrix last; if a claim can't be backed by proof and bounded exclusions, downgrade the claim. Don't stretch the proof.

The validator's exit code is authoritative. No human override of a failed validation without updating the file to pass cleanly. I made this rule on purpose, because "it's close enough" was the phrase that caused most of the re-reviews I measured.
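A hedged sketch of the "claim broader than its evidence" check, with the exit code as the only verdict. The Claim shape, the scope labels, and the artifact paths are hypothetical; the real evidence matrix lives in TOML and is richer than this.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    statement: str
    scope: set                                      # what the claim covers
    exclusions: set = field(default_factory=set)    # stated known gaps
    evidence: dict = field(default_factory=dict)    # proof artifact -> scope it proves


def validate(claims) -> list:
    """Fail any claim whose scope exceeds its proof plus stated exclusions."""
    errors = []
    for claim in claims:
        proven = set().union(*claim.evidence.values())
        unbacked = claim.scope - claim.exclusions - proven
        if unbacked:
            errors.append(
                f"{claim.statement!r} is unbacked for: {', '.join(sorted(unbacked))}")
    return errors


def exit_code(claims) -> int:
    """0 is the only pass signal; anything else blocks the review."""
    return 1 if validate(claims) else 0


overclaimed = Claim("All findings resolved", {"parser", "auth", "cli"},
                    evidence={"tests/parser_report.txt": {"parser"}})
bounded = Claim("Parser findings resolved", {"parser", "auth"},
                exclusions={"auth"},
                evidence={"tests/parser_report.txt": {"parser"}})
print(exit_code([overclaimed]), exit_code([bounded]))
```

The second claim passes only because it was downgraded: the scope it cannot prove is declared as an exclusion.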

Where flat TOML stopped working

Flat TOML works great for authoring and validation. It stopped working the moment agents started mutating state during execution.

The hand-calculated [computed] sections were the first thing to rot. Critical path, conflict groups, progress percentages: all derived values, all authored by hand, all stale the moment a unit advanced. A human spots the inconsistency on re-read. An agent doesn't.

Editing status = "in_progress" in a text file leaves no record of when, by whom, from what prior state, against what evidence. For process control, "who moved this to done, and on what proof?" is not optional.

There was no programmatic query layer either. "Which tier-1 units are runnable right now?" required parsing TOML, walking the graph in Python, and rebuilding the same derivations every time.

And flat files don't compose across a fleet. Once more than one repo is under the same policy regime, per-repo TOML is the wrong shape for fleet-wide gating, policy packs, exception lifecycles, and release trains.

So I added a runtime layer, additively. The templates and validators didn't change.

A per-repository runtime imports a filled TOML file once. After that, an embedded SurrealDB is the source of truth. Status transitions go through a typed API with validation. Every change persists with timestamps and actor identity. Computed values become live queries instead of hand-edited fields. You can still export a TOML snapshot for human review, but it's a derived artifact, not the authority.
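The difference from editing status in a text file can be sketched in a few lines. This is an in-memory illustration only, not the SurrealDB-backed runtime. The status names, the allowed-transition table, and the rule that "done" needs evidence are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical transition table; the real runtime defines its own.
ALLOWED = {
    "pending": {"in_progress"},
    "in_progress": {"done", "blocked"},
    "blocked": {"in_progress"},
    "done": set(),
}


@dataclass(frozen=True)
class Transition:
    unit: str
    from_status: str
    to_status: str
    actor: str
    evidence: str
    at: datetime


class Runtime:
    def __init__(self, statuses):
        self.statuses = dict(statuses)
        self.log = []

    def transition(self, unit, to_status, actor, evidence=""):
        """Validate the move, then record who made it, when, and on what proof."""
        current = self.statuses[unit]
        if to_status not in ALLOWED[current]:
            raise ValueError(f"{unit}: {current} -> {to_status} is not allowed")
        if to_status == "done" and not evidence:
            raise ValueError(f"{unit}: moving to done requires evidence")
        record = Transition(unit, current, to_status, actor, evidence,
                            datetime.now(timezone.utc))
        self.statuses[unit] = to_status
        self.log.append(record)
        return record


runtime = Runtime({"U07a": "pending"})
runtime.transition("U07a", "in_progress", actor="agent-1")
runtime.transition("U07a", "done", actor="agent-1", evidence="tests/U07a.log")
print(runtime.statuses["U07a"], len(runtime.log))
```

Every entry in the log answers "who moved this, from what, and on what proof", which a one-line text edit cannot.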

A fleet-wide control plane (FastAPI + Postgres) handles policy packs, signed snapshot intake, exception lifecycles, and release-train readiness across many repos. There's no flat-file counterpart; the multi-repo problem just isn't expressible in per-repo files.

The practical rule: TOML is the authoring medium and the interchange format. The database is the runtime authority. The TOML file you imported is stale from the first state transition onward. Treat it like a git tag — a snapshot in time, not live state.

What you actually get

Four things, none of which prose-only or structure-only would deliver alone.

Parallel agent execution without stepping on each other, because the DAG encodes depends_on, blocks, and files_modify conflict groups explicitly. Agents pick runnable units from the same layer and the system knows who may run concurrently.
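A sketch of that selection over plain dictionaries. The field names depends_on and files_modify come from the text; the greedy, name-ordered batching is an assumption for illustration, and the unit names and file paths are made up.

```python
def runnable(units: dict, done: set) -> list:
    """Units not yet done whose dependencies are all done."""
    return sorted(
        name for name, unit in units.items()
        if name not in done and set(unit.get("depends_on", [])) <= done)


def schedule(units: dict, done: set) -> list:
    """Pick a batch of runnable units that touch disjoint files."""
    claimed = set()
    batch = []
    for name in runnable(units, done):
        files = set(units[name].get("files_modify", []))
        if files & claimed:
            continue  # conflicts with a unit already in this batch
        claimed |= files
        batch.append(name)
    return batch


units = {
    "U01": {"files_modify": ["schema.py"]},
    "U02": {"depends_on": ["U01"], "files_modify": ["api.py"]},
    "U03": {"depends_on": ["U01"], "files_modify": ["api.py"]},
    "U04": {"depends_on": ["U01"], "files_modify": ["cli.py"]},
    "U05": {"depends_on": ["U02"], "files_modify": ["docs.md"]},
}
print(runnable(units, {"U01"}))
print(schedule(units, {"U01"}))
```

U02 and U03 are both runnable but both modify api.py, so only one of them joins the batch alongside U04.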

Traceability from intent to test. Every requirement has a downstream realization path through implementation, code, and test. Unverified requirements and unmapped code surface as computed gaps in a query, not as gut feeling six weeks into review.
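The "computed gap" idea reduces to a reachability query. A minimal sketch, assuming the traceability map is an adjacency dictionary keyed by prefixed identifiers; apart from REQ:auth-001, the identifiers here are invented.

```python
def reachable(edges: dict, start: str) -> set:
    """All nodes downstream of start, by iterative depth-first search."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def unverified_requirements(edges: dict) -> list:
    """REQ: nodes with no TEST: node anywhere downstream."""
    nodes = set(edges) | {n for targets in edges.values() for n in targets}
    return sorted(
        n for n in nodes
        if n.startswith("REQ:")
        and not any(m.startswith("TEST:") for m in reachable(edges, n)))


edges = {
    "REQ:auth-001": ["IMP:login"],
    "IMP:login": ["CODE:auth.py"],
    "CODE:auth.py": ["TEST:test_login"],
    "REQ:auth-002": ["IMP:logout"],
    "IMP:logout": ["CODE:session.py"],
}
print(unverified_requirements(edges))
```

REQ:auth-002 has an implementation and code but no test downstream, so it is the one requirement the query reports.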

Reviews that fail at the right boundary. Readiness gates block un-reviewable work before a reviewer sees it. Explicit contracts stop the semantic-inference spiral. Evidence matrices stop overclaimed completeness from reaching review at all.

State that is queryable, auditable, versioned, and composable across repos. Single-repo: "what's ready now?" in one query. Fleet-wide: "is this release train green across every repo under policy?" — also one query, against the control plane.

Operating rules

Distilled from getting this wrong before:

Author narrative in Markdown. Author structure in TOML. Don't mix.
Validator exit code 0 is the only pass signal. No manual override.
Don't edit state fields by hand once they're in the runtime. Use the API.
Don't claim "complete," "production-ready," or "all findings resolved" without an evidence matrix. If the matrix is thin, the claim is wrong.
When behavior depends on ordering, fallback, normalization, precedence, or authority, write the contract before review, not during.
Computed fields belong to the runtime. Don't hand-calculate them.

Worked example: this article

I dogfood the same split. The Anti-AI-Tell style guide (mr-k-man/llm-tips on GitHub) is Markdown: rationale, evidence base, the prose rules humans read. The matching contract is TOML — 49 machine-checkable rules with regexes, density thresholds, and applicability tags. And the audit workflow is a 10-unit DAG, also in TOML, that orchestrates inventory, scan, triage, fix, and regression as discrete units that run in parallel where the dependency graph permits.

I ran the DAG on this article before publishing.

The pre-fix audit found two hits:

AIS:ST02 structural: tricolon-fraction 60% (3 of 5 single-token enumerations were three-item).
AIS:F03 formatting: inline-bold density 1.43 per 200 words (10 bolds in 1398 words; budget 7).

Weighted score: 1.0 + 0.25 = 1.25. The rewrite threshold is 3, so this routed to surgical-edit, not rewrite-from-scratch.

Three line-level edits:

Stripped four bullet-label ** markers in the "TOML carries..." list. The bullets already carry the structure; the bold was decoration.
Expanded a three-item prerequisite-artifacts list (docs, cross-links, test plans) to four by adding "prior diagrams".
Expanded a three-item adjective list (queryable, auditable, composable) to four by adding "versioned". The added word is true: the runtime persists history.

Regression scan: zero hits. Tricolon fraction 1 of 5 (20%, under the 30% threshold). Bold density 0.86 per 200 words (under 1.0). Linter exit 0.
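The arithmetic behind those numbers, as a sketch. The function names are mine, the post-fix count of six bolds is derived from the ten original bolds minus the four stripped markers, and treating the rewrite threshold as "greater than or equal to" is an assumption.

```python
def density_per_200(hits: int, words: int) -> float:
    """Occurrences per 200 words, rounded to two places."""
    return round(hits / words * 200, 2)


def tricolon_fraction(three_item: int, total: int) -> float:
    """Share of enumerations that have exactly three items."""
    return three_item / total


def route(score: float, rewrite_threshold: float = 3.0) -> str:
    return "rewrite" if score >= rewrite_threshold else "surgical-edit"


pre_bold = density_per_200(10, 1398)   # pre-fix: 10 bolds in 1398 words
post_bold = density_per_200(6, 1398)   # post-fix: four markers stripped
pre_tricolon = tricolon_fraction(3, 5)
post_tricolon = tricolon_fraction(1, 5)
score = 1.0 + 0.25                     # the two rule weights reported above

print(pre_bold, post_bold, pre_tricolon, post_tricolon, route(score))
```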

You're reading the post-fix version. Everything is in mr-k-man/llm-tips on GitHub: the source guide at style_guide.md, the contract at tools/style_policy.toml, the linter at tools/lint_writing_style.py, and the audit DAG at tools/audit_dag.toml. MIT-licensed.

Takeaway

If you put LLM agents on real work, decide which invariants you want a validator to enforce and which you want a human reviewer to negotiate. Draw that line on purpose. Then accept that flat files have a ceiling: the moment your agents start mutating state, something has to own the audit trail and the live derivations, and a text file isn't it.

Narrative carries judgement; structure carries invariants. Force either of them to carry live state and you'll lose the audit trail inside a week.