<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Werner Kasselman</title>
    <description>The latest articles on DEV Community by Werner Kasselman (@wernerk_au).</description>
    <link>https://dev.to/wernerk_au</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891657%2F74cf21db-2405-4ca2-a8a1-cd612b022882.png</url>
      <title>DEV Community: Werner Kasselman</title>
      <link>https://dev.to/wernerk_au</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wernerk_au"/>
    <language>en</language>
    <item>
      <title>llm-cli-gateway 2.5.0: OAuth for remote MCP connectors and safer workspaces</title>
      <dc:creator>Werner Kasselman</dc:creator>
      <pubDate>Mon, 08 Jun 2026 11:30:47 +0000</pubDate>
      <link>https://dev.to/wernerk_au/llm-cli-gateway-250-oauth-for-remote-mcp-connectors-and-safer-workspaces-4lk4</link>
      <guid>https://dev.to/wernerk_au/llm-cli-gateway-250-oauth-for-remote-mcp-connectors-and-safer-workspaces-4lk4</guid>
      <description>&lt;p&gt;llm-cli-gateway 2.0.0 was the quiet supply-chain release. It moved persistence to Node's built-in &lt;code&gt;node:sqlite&lt;/code&gt;, removed the production &lt;code&gt;better-sqlite3&lt;/code&gt; native install path, and made the package simpler to install and easier to audit.&lt;/p&gt;

&lt;p&gt;That was intentionally not a flashy release. It was about removing risk.&lt;/p&gt;

&lt;p&gt;The releases since then have been about the product surface: making the gateway easier for MCP clients to understand, keeping provider contracts current, adding a direct xAI API path alongside the existing Grok CLI provider, and now making remote MCP connector setup use OAuth instead of credential-shaped URL shortcuts.&lt;/p&gt;

&lt;p&gt;The short version: &lt;code&gt;llm-cli-gateway@2.5.0&lt;/code&gt; is now published on npm, the GitHub release has signed installer artifacts, and the gateway has a safer remote-connector story than it had at 2.0.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.5.0 adds OAuth for remote MCP connectors
&lt;/h2&gt;

&lt;p&gt;The biggest change in 2.5.0 is the remote connector auth model.&lt;/p&gt;

&lt;p&gt;The gateway now exposes public-ready MCP OAuth metadata and an authorization-code flow for remote MCP clients. That means clients such as ChatGPT custom connectors can discover the authorization server, request a code, exchange it for an opaque bearer token, and call the MCP endpoint without relying on a static bearer header pasted into a provider UI.&lt;/p&gt;

&lt;p&gt;The setup shape is deliberately conservative:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;static OAuth clients can be configured with hashed client secrets;&lt;/li&gt;
&lt;li&gt;dynamic client registration is not open by default;&lt;/li&gt;
&lt;li&gt;dynamic registration, when enabled, is gated by either explicit public-client policy or a shared registration secret;&lt;/li&gt;
&lt;li&gt;shared secrets and client secrets are stored only as hashes;&lt;/li&gt;
&lt;li&gt;secrets are never accepted in query strings;&lt;/li&gt;
&lt;li&gt;generated client secrets are copy-once local output;&lt;/li&gt;
&lt;li&gt;doctor, setup JSON, and default CLI output redact secret-bearing fields.&lt;/li&gt;
&lt;/ul&gt;
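
&lt;p&gt;As a rough illustration of the "stored only as hashes" and "copy-once" points above, here is a minimal Python sketch. It is not the gateway's implementation: the SHA-256 digest and the function names &lt;code&gt;new_client_secret&lt;/code&gt; and &lt;code&gt;verify_client_secret&lt;/code&gt; are assumptions made for the example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import hmac
import secrets


def new_client_secret():
    """Return (plaintext, stored_hash); the plaintext is shown once and never stored."""
    plaintext = secrets.token_urlsafe(32)
    stored_hash = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return plaintext, stored_hash


def verify_client_secret(presented, stored_hash):
    """Hash the presented secret and compare the digests in constant time."""
    presented_hash = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(presented_hash, stored_hash)


plaintext, stored = new_client_secret()
print(verify_client_secret(plaintext, stored))       # True
print(verify_client_secret("wrong-secret", stored))  # False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Only &lt;code&gt;stored&lt;/code&gt; would be persisted in this sketch, so a leaked config file does not reveal a usable secret.&lt;/p&gt;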

&lt;p&gt;The practical result is that the public &lt;code&gt;/mcp&lt;/code&gt; endpoint can support remote web connectors through OAuth while local bearer-token clients keep working.&lt;/p&gt;

&lt;h2&gt;
  
  
  The old ChatGPT no-auth URL path is deprecated
&lt;/h2&gt;

&lt;p&gt;Earlier HTTP setup work created a separate high-entropy ChatGPT connector URL because ChatGPT connector setup could not rely on arbitrary static Authorization headers.&lt;/p&gt;

&lt;p&gt;For new setups, 2.5.0 replaces that path with OAuth.&lt;/p&gt;

&lt;p&gt;The current ChatGPT setup flow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llm-cli-gateway tunnel start
llm-cli-gateway oauth client add chatgpt &lt;span class="nt"&gt;--redirect-uri&lt;/span&gt; &amp;lt;ChatGPT callback URL&amp;gt; &lt;span class="nt"&gt;--print-once&lt;/span&gt;
llm-cli-gateway print-client-config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In ChatGPT, use the verified public &lt;code&gt;/mcp&lt;/code&gt; URL with &lt;code&gt;Authentication: OAuth&lt;/code&gt;, plus the authorization and token URLs from &lt;code&gt;print-client-config&lt;/code&gt; or the setup UI.&lt;/p&gt;

&lt;p&gt;The old high-entropy no-auth URL is kept only as a deprecated compatibility surface. New setup docs, the setup UI, and assistant runbooks no longer recommend it. Doctor output also redacts old persisted no-auth connector URLs instead of reconstructing them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workspaces are now registered aliases, not arbitrary paths
&lt;/h2&gt;

&lt;p&gt;Remote MCP clients should not be able to browse or select arbitrary local filesystem paths. 2.5.0 adds a workspace registry so provider requests can target a named workspace alias instead.&lt;/p&gt;

&lt;p&gt;The registry supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;workspace aliases;&lt;/li&gt;
&lt;li&gt;configured allowed roots;&lt;/li&gt;
&lt;li&gt;default workspace selection;&lt;/li&gt;
&lt;li&gt;provider request &lt;code&gt;workspace&lt;/code&gt; input across sync and async request tools;&lt;/li&gt;
&lt;li&gt;session metadata so a selected workspace can carry through provider-owned sessions;&lt;/li&gt;
&lt;li&gt;workspace-aware async dedup keys, so the same argv in two different workspaces does not collide.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For local administration there are also workspace creation tools, but they are intentionally narrow. A workspace admin can create a new folder or initialize a new local Git repository under a configured allowed root. The gateway rejects absolute remote paths, traversal, denied directory names, symlink escapes, and existing non-empty targets. There is no network clone in this release.&lt;/p&gt;
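
&lt;p&gt;The rejection rules in that paragraph can be sketched in a few lines of Python. This is an illustration under assumptions, not the gateway's code: the function name &lt;code&gt;resolve_new_workspace&lt;/code&gt; and the contents of &lt;code&gt;DENIED_NAMES&lt;/code&gt; are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tempfile
from pathlib import Path

DENIED_NAMES = {".git", ".ssh", "node_modules"}  # hypothetical deny list


def resolve_new_workspace(allowed_root, relative_name):
    """Return a safe target path under allowed_root, or raise ValueError."""
    candidate = Path(relative_name)
    if candidate.is_absolute():
        raise ValueError("absolute paths are rejected")
    if ".." in candidate.parts:
        raise ValueError("path traversal is rejected")
    if DENIED_NAMES.intersection(candidate.parts):
        raise ValueError("denied directory name")
    root = Path(allowed_root).resolve()
    target = (root / candidate).resolve()  # resolve() follows symlinks
    if root not in target.parents:
        raise ValueError("target escapes the allowed root")
    if target.exists() and (not target.is_dir() or any(target.iterdir())):
        raise ValueError("target exists and is not empty")
    return target


with tempfile.TemporaryDirectory() as root:
    print(resolve_new_workspace(root, "demo-project").name)  # demo-project
    for bad in ("/etc", "../outside", ".git/hooks"):
        try:
            resolve_new_workspace(root, bad)
        except ValueError as err:
            print(bad, "rejected:", err)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Resolving the path before the containment check is what catches a symlink that points outside the allowed root.&lt;/p&gt;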

&lt;p&gt;That last point is important. This is not a remote filesystem browser and not a general "clone this URL into my machine" tool. It is a controlled local workspace registry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Remote provider requests fail closed before spawning
&lt;/h2&gt;

&lt;p&gt;The security invariant for 2.5.0 is simple: a remote OAuth-authenticated provider request must resolve to a registered workspace before any provider CLI is spawned.&lt;/p&gt;

&lt;p&gt;That applies to the normal provider tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;claude_request&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;codex_request&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gemini_request&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grok_request&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mistral_request&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;the async variants&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also applies to &lt;code&gt;codex_fork_session&lt;/code&gt;, which matters because forking a Codex session is still a provider spawn path.&lt;/p&gt;

&lt;p&gt;Local bearer/stdin callers keep the existing local behavior unless they explicitly ask for unsafe &lt;code&gt;workingDir&lt;/code&gt; or &lt;code&gt;addDir&lt;/code&gt; values. Remote OAuth callers, by contrast, need an explicit workspace, a session-associated workspace, or a configured default workspace. Otherwise the gateway fails before the child process starts.&lt;/p&gt;
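
&lt;p&gt;A minimal Python sketch of that fail-closed rule follows. The names are hypothetical, and the precedence shown (explicit workspace, then session workspace, then default) is an assumption; the release notes only say that one of the three must resolve.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class WorkspaceRequired(Exception):
    """Raised before any provider process is started."""


def resolve_workspace(registry, explicit=None, session_workspace=None, default=None):
    """Pick the first candidate alias, then require it to be registered."""
    alias = explicit or session_workspace or default
    if alias is None:
        raise WorkspaceRequired("remote request has no workspace")
    if alias not in registry:
        raise WorkspaceRequired("unknown workspace alias: " + alias)
    return registry[alias]


def handle_remote_request(registry, spawn, **selection):
    path = resolve_workspace(registry, **selection)  # raises if nothing resolves
    return spawn(path)  # only reached with a registered workspace


registry = {"docs": "/srv/workspaces/docs"}  # illustrative alias and path
spawned = []
handle_remote_request(registry, spawned.append, explicit="docs")
try:
    handle_remote_request(registry, spawned.append)
except WorkspaceRequired as err:
    print("failed closed:", err)
print(spawned)  # ['/srv/workspaces/docs']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The point of the shape is ordering: resolution happens first, so there is no code path in which &lt;code&gt;spawn&lt;/code&gt; runs with an inherited cwd.&lt;/p&gt;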

&lt;p&gt;That closes off the bad fallback where a remote request silently inherits the gateway process cwd or ends up running in &lt;code&gt;~/.llm-cli-gateway&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.4.0 still matters: direct Grok API and provider-owned sessions
&lt;/h2&gt;

&lt;p&gt;The 2.5.0 release builds on the 2.4.0 product work.&lt;/p&gt;

&lt;p&gt;2.4.0 added a separate direct API provider for xAI: &lt;code&gt;grok-api&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is not a transport flag on &lt;code&gt;grok_request&lt;/code&gt;. It is a distinct provider type and a distinct tool, &lt;code&gt;grok_api_request&lt;/code&gt;, because the API path has a different contract from an agentic CLI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no sandbox or approval-mode flags;&lt;/li&gt;
&lt;li&gt;no CLI process to spawn;&lt;/li&gt;
&lt;li&gt;no &lt;code&gt;grok&lt;/code&gt; local login requirement;&lt;/li&gt;
&lt;li&gt;session continuity through xAI Responses API metadata rather than CLI resume flags;&lt;/li&gt;
&lt;li&gt;API-only request parameters such as xAI Responses fields.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Configuration is isolated under &lt;code&gt;[providers.xai]&lt;/code&gt;. The gateway stores the name of the API-key environment variable, not the secret itself. The tool is only registered when &lt;code&gt;[providers.xai]&lt;/code&gt; is configured and the named environment variable is present.&lt;/p&gt;

&lt;p&gt;Adding &lt;code&gt;grok-api&lt;/code&gt; also forced a useful cleanup: stored gateway sessions are now owned by a provider, not treated as generic strings that any handler might try to resume.&lt;/p&gt;

&lt;p&gt;The wider provider set now includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;claude&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;codex&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gemini&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grok&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mistral&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grok-api&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wrong-provider session reuse is rejected across request handlers instead of failing later in a provider-specific way. A &lt;code&gt;grok-api&lt;/code&gt; session should not be passed to &lt;code&gt;grok_request&lt;/code&gt;, and a Codex session should not be passed to &lt;code&gt;claude_request&lt;/code&gt;.&lt;/p&gt;
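
&lt;p&gt;In Python terms, the ownership check could look something like the sketch below. The record shape and the names &lt;code&gt;StoredSession&lt;/code&gt; and &lt;code&gt;require_owned_session&lt;/code&gt; are assumptions for illustration, not the gateway's actual types.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass


@dataclass(frozen=True)
class StoredSession:
    session_id: str
    provider: str  # for example "codex" or "grok-api"


class WrongProviderSession(Exception):
    pass


def require_owned_session(session, handler_provider):
    """Reject a session owned by another provider before any resume is attempted."""
    if session.provider != handler_provider:
        raise WrongProviderSession(
            f"session {session.session_id} belongs to {session.provider}, "
            f"not {handler_provider}"
        )
    return session


codex_session = StoredSession("abc123", "codex")  # illustrative id
print(require_owned_session(codex_session, "codex").session_id)  # abc123
try:
    require_owned_session(codex_session, "claude")
except WrongProviderSession as err:
    print("rejected:", err)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;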

&lt;p&gt;This is a boring invariant until it saves you from debugging a bad resume id at the wrong layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP tools are clearer and safer for clients
&lt;/h2&gt;

&lt;p&gt;The 2.1.0, 2.2.0, and 2.3.0 releases were mostly about improving the MCP surface itself.&lt;/p&gt;

&lt;p&gt;2.1.0 added Grok Build 0.2.32 support, including the &lt;code&gt;leaderSocket&lt;/code&gt; parameter for &lt;code&gt;grok_request&lt;/code&gt; and &lt;code&gt;grok_request_async&lt;/code&gt;. It also improved upstream contract drift handling: the gateway can now distinguish hidden upstream flags from true missing flags, and it can acknowledge upstream-only flags that the gateway intentionally does not emit.&lt;/p&gt;

&lt;p&gt;2.2.0 made all tools self-describing. Before that, clients saw tool names and schemas, but not much action-level description. Now the tool descriptions explain what each tool does, when sync requests can defer, why &lt;code&gt;job_status&lt;/code&gt; differs from &lt;code&gt;llm_job_status&lt;/code&gt;, and which tools are local-only.&lt;/p&gt;

&lt;p&gt;2.3.0 added MCP tool annotations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;display titles;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;readOnlyHint&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;destructiveHint&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;idempotentHint&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;openWorldHint&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those annotations let MCP clients build better confirmation UX. A read-only local status tool can be treated differently from a provider-spawning request that may cause an agentic CLI to modify files.&lt;/p&gt;

&lt;p&gt;The important bit is not that the metadata exists. The important bit is that the metadata is tested as an invariant: exact read-only, destructive, and open-world sets are pinned, and contradictory read-only plus destructive annotations are rejected.&lt;/p&gt;
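
&lt;p&gt;A pinned-set test of that kind might be sketched as follows in Python. The two tool names come from this post, but the hint values assigned to them here are assumptions, and the real suite pins the full tool list.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical subset: the hint values below are assumed for illustration.
TOOL_ANNOTATIONS = {
    "job_status": {"readOnlyHint": True, "destructiveHint": False},
    "codex_request": {"readOnlyHint": False, "destructiveHint": True},
}

EXPECTED_READ_ONLY = {"job_status"}
EXPECTED_DESTRUCTIVE = {"codex_request"}


def check_annotations(annotations):
    """Return a list of invariant violations; an empty list means all checks hold."""
    problems = []
    read_only = {name for name, a in annotations.items() if a.get("readOnlyHint")}
    destructive = {name for name, a in annotations.items() if a.get("destructiveHint")}
    for name in sorted(read_only.intersection(destructive)):
        problems.append(name + ": read-only and destructive at once")
    if read_only != EXPECTED_READ_ONLY:
        problems.append("read-only set drifted")
    if destructive != EXPECTED_DESTRUCTIVE:
        problems.append("destructive set drifted")
    return problems


print(check_annotations(TOOL_ANNOTATIONS))  # []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pinning exact sets means that adding a tool without deciding its risk category fails the test, rather than silently shipping a default.&lt;/p&gt;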

&lt;h2&gt;
  
  
  Resource URIs now use valid schemes
&lt;/h2&gt;

&lt;p&gt;MCP Inspector caught a concrete interoperability bug in the resource surface.&lt;/p&gt;

&lt;p&gt;The gateway had advertised resource URIs like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cache_state://global
provider_subcommands://catalog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those look readable to a human, but underscores are not valid in URI schemes. Standard URL parsing rejected them.&lt;/p&gt;

&lt;p&gt;2.4.0 fixed the advertised resources to use hyphenated schemes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cache-state://global
cache-state://session/{sessionId}
cache-state://prefix/{hash}
provider-subcommands://catalog
provider-subcommands://{provider}/{commandPath}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
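
&lt;p&gt;The rule behind the fix is the scheme grammar in RFC 3986 section 3.1: a letter, followed by letters, digits, &lt;code&gt;+&lt;/code&gt;, &lt;code&gt;-&lt;/code&gt; or &lt;code&gt;.&lt;/code&gt;. A small Python check, written for this post rather than taken from the gateway, shows why the old forms fail and the new ones pass; the helper name &lt;code&gt;has_valid_scheme&lt;/code&gt; is hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# RFC 3986 section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def has_valid_scheme(uri):
    """True when the text before "://" matches the RFC 3986 scheme grammar."""
    scheme, sep, _rest = uri.partition("://")
    return bool(sep) and SCHEME_RE.fullmatch(scheme) is not None


for uri in (
    "cache_state://global",
    "cache-state://global",
    "provider_subcommands://catalog",
    "provider-subcommands://catalog",
):
    print(uri, has_valid_scheme(uri))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The underscore forms print &lt;code&gt;False&lt;/code&gt; and the hyphenated forms print &lt;code&gt;True&lt;/code&gt;.&lt;/p&gt;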



&lt;p&gt;Legacy direct &lt;code&gt;provider_subcommands://...&lt;/code&gt; reads are still accepted internally for compatibility tests and older direct callers, but standard MCP clients should use the advertised hyphenated forms.&lt;/p&gt;

&lt;p&gt;After the fix, MCP Inspector successfully read every advertised resource: skills, sessions, models, metrics, cache state, provider subcommand catalog, and process health.&lt;/p&gt;

&lt;h2&gt;
  
  
  Provider subcommand contracts are visible
&lt;/h2&gt;

&lt;p&gt;The gateway tracks upstream CLI contracts so it can reject unsupported flags before spawning a provider CLI. 2.4.0 extended the planning and resource side of that work.&lt;/p&gt;

&lt;p&gt;There are now provider subcommand catalog and detail resources, plus tools for listing provider subcommands, reading a subcommand contract, and checking drift.&lt;/p&gt;

&lt;p&gt;This is intentionally CLI-only. The direct &lt;code&gt;grok-api&lt;/code&gt; provider is not a spawnable CLI and does not belong in the same subcommand contract path. That split is explicit.&lt;/p&gt;

&lt;p&gt;The practical value: an MCP client can inspect the provider command surface instead of relying only on prose docs or hardcoded assumptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Host auto-upgrade operations landed
&lt;/h2&gt;

&lt;p&gt;2.4.0 also added an operational path for machines that run the gateway as a local appliance.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;scripts/host-upgrade.sh&lt;/code&gt; flow stages npm releases into versioned directories, verifies the staged binary, applies upgrades atomically, and supports rollback. There are also user systemd service and timer units for scheduled upgrade checks.&lt;/p&gt;

&lt;p&gt;This is not a replacement for the signed GitHub installer artifacts. It is for hosts where npm is the chosen install channel and you want a managed, reversible upgrade loop rather than an ad hoc global install command.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed from the 2.0.0 story
&lt;/h2&gt;

&lt;p&gt;2.0.0 made the package safer to install.&lt;/p&gt;

&lt;p&gt;2.1.0 through 2.5.0 made the gateway better to operate and easier for MCP clients to reason about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grok CLI support stayed current with upstream.&lt;/li&gt;
&lt;li&gt;Tool descriptions and annotations now describe the real behavior of every MCP tool.&lt;/li&gt;
&lt;li&gt;Direct xAI API access exists alongside the Grok CLI path.&lt;/li&gt;
&lt;li&gt;Sessions are provider-owned, so cross-provider resume mistakes fail early.&lt;/li&gt;
&lt;li&gt;Cache and provider-subcommand resources use valid URI schemes.&lt;/li&gt;
&lt;li&gt;Provider subcommand contracts are inspectable through MCP.&lt;/li&gt;
&lt;li&gt;Remote web connector setup now uses MCP OAuth instead of no-auth connector URLs.&lt;/li&gt;
&lt;li&gt;Workspace aliases give remote clients a bounded way to select where provider CLIs run.&lt;/li&gt;
&lt;li&gt;Local workspace creation is constrained to configured allowed roots and local &lt;code&gt;git init&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Host upgrade operations have a staged and rollback-capable path.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gateway is still what it has been from the start: one MCP endpoint that wraps provider CLIs and exposes durable jobs, sessions, validation, review, and provider orchestration.&lt;/p&gt;

&lt;p&gt;The difference is that the surface is now less ambiguous. Clients can see which tools exist, what they do, how risky they are, which resources can be read, which provider owns a session, and which workspace a remote request is allowed to use.&lt;/p&gt;

&lt;p&gt;That is the kind of functionality work that matters after the supply-chain story is handled. Fewer surprises at install time, fewer surprises at runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Release evidence
&lt;/h2&gt;

&lt;p&gt;2.5.0 shipped through the public mirror release path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;npm publishes with GitHub Actions provenance;&lt;/li&gt;
&lt;li&gt;release installer artifacts are signed and uploaded;&lt;/li&gt;
&lt;li&gt;public mirror CI, security, OpenSSF Scorecard, and CodeQL passed on the release commit;&lt;/li&gt;
&lt;li&gt;the local release gate passed &lt;code&gt;go test ./...&lt;/code&gt;, &lt;code&gt;npm run build&lt;/code&gt;, &lt;code&gt;npm run lint&lt;/code&gt;, &lt;code&gt;npm run format:check&lt;/code&gt;, &lt;code&gt;npm test&lt;/code&gt;, and &lt;code&gt;npm run upstream:contracts&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;the full test suite passed at 1,152 tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Release: &lt;a href="https://github.com/verivus-oss/llm-cli-gateway/releases/tag/v2.5.0" rel="noopener noreferrer"&gt;https://github.com/verivus-oss/llm-cli-gateway/releases/tag/v2.5.0&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;npm: &lt;a href="https://www.npmjs.com/package/llm-cli-gateway" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/llm-cli-gateway&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Site: &lt;a href="https://llm-cli-gateway.dev" rel="noopener noreferrer"&gt;https://llm-cli-gateway.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As always, MIT licensed.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>node</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Reviewing Patrick Collison's Ask for an LLM Workflow Tool</title>
      <dc:creator>Werner Kasselman</dc:creator>
      <pubDate>Sun, 07 Jun 2026 01:21:41 +0000</pubDate>
      <link>https://dev.to/wernerk_au/reviewing-patrick-collisons-ask-for-an-llm-workflow-tool-1odk</link>
      <guid>https://dev.to/wernerk_au/reviewing-patrick-collisons-ask-for-an-llm-workflow-tool-1odk</guid>
      <description>&lt;p&gt;Patrick Collison (&lt;a href="https://x.com/patrickc" rel="noopener noreferrer"&gt;https://x.com/patrickc&lt;/a&gt;) recently outlined the &lt;a href="https://x.com/patrickc/status/2063337800209179029?s=20" rel="noopener noreferrer"&gt;LLM workflow tool&lt;/a&gt; he actually wants. &lt;br&gt;
I know pointing at my own work can read as self-promotion. I'm actually trying to stress test the production model I've been running under the vap umbrella in verivus-oss. &lt;br&gt;
It lands right in that gap (and the evidence from real runs, including public X threads and the recent ledger distribution review, is there).&lt;/p&gt;

&lt;p&gt;Patrick wants:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ability to manage a set of input files (Markdown or similar), plus other general-purpose context.&lt;/li&gt;
&lt;li&gt;Real-time collaboration, with some concept of snapshots or VCS integration.&lt;/li&gt;
&lt;li&gt;The ability to create and manage inference workflows and a stored set of prompts.&lt;/li&gt;
&lt;li&gt;Access to general-purpose coding agents (not just chat models).&lt;/li&gt;
&lt;li&gt;Some concept of compiled outputs or inference results that can be shared externally.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;He summarised the desired feeling as "GNU Autotools × Notion": a system for a body of material that you want to process iteratively, where certain artifacts are important enough to preserve, version, govern, and reason about across time.&lt;/p&gt;

&lt;p&gt;The diagnosis is accurate. For many of us the generation bottleneck has moved. The dominant remaining problems are semantic state that survives many iterations and participants, coordination that doesn't collapse under mixed human and agent work, evidence that actually travels with the work, and governance that keeps intent explicit rather than dissolving into chat history or ad-hoc folders.&lt;/p&gt;

&lt;h2&gt;
  
  
  vap and the living studio surface
&lt;/h2&gt;

&lt;p&gt;vap is the Verivus Assurance Platform, the umbrella under which the open verivus-oss work sits (and under which the deeper substrate in verivusai-labs is being built). The part that directly answers Patrick's friction is the living theatrical production studio, implemented as the agentassurance component.&lt;br&gt;
Every body of work (a product, an initiative, even a single X reply series) becomes a zoomable Production inside the studio. The layout is the interface:&lt;/p&gt;

&lt;p&gt;Productions live in the left sidebar as the hierarchy. I can sit at the full Verivus portfolio level or zoom down to a 22-unit DAG-TOML remediation plan. The same rules and ijbCRUD pane apply at every zoom.&lt;/p&gt;

&lt;p&gt;Workspaces fill the centre: Storyboard for the typed DAGs that hold intent declarations, depends_on and blocks relations, acceptance criteria, and evidence requirements as first-class versionable artifacts; Scene for the current focused rehearsal; Explore for semantic cartography; Working On for the live messy iteration surface.&lt;/p&gt;

&lt;p&gt;Exhibition sits in the right sidebar: the compiled outputs worth preserving and sharing, carrying full chain of custody.&lt;br&gt;
Shared Resources run along the bottom (Props, Cast, Timeline), with Next in Line holding the queued pipeline.&lt;/p&gt;

&lt;p&gt;The central operating verb across every layer is ijbCRUD, provenance-aware and evidence-backed by construction. Closure roots travel with the artifacts. Assertions live in the canon. This is what makes the state survive iterations and participants instead of collapsing back into chat or untrusted folders.&lt;/p&gt;

&lt;p&gt;This is Autotools × Notion lifted into a full production process, grounded in DAG-TOML plus the Agent Assurance specification. Explicit intent, evidence via closure roots, cryptographic provenance, IJB assertions as substrate, runtime-neutral by design.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it maps to the requirements
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Input files plus general-purpose context: sqry (the semantic/living graph and memory layer), soon to be called scrub on integration.&lt;/li&gt;
&lt;li&gt;Real-time collaboration plus snapshots/VCS: weave (CRDT multi-actor rehearsal system with structural operations) plus ledger (evidence-rich semantic episodes that replace brittle file/branch/commit records).&lt;/li&gt;
&lt;li&gt;Stored prompts plus inference workflows: storyboard (Director’s Planning Board), using typed DAGs as first-class artifacts, with dependencies, acceptance criteria, evidence requirements, tiered ranking, and status all explicit.&lt;/li&gt;
&lt;li&gt;General-purpose coding agents: agentfederator (Casting Director), with deliberate multi-LLM routing: frontier models for high-intent planning, quantized open models on capable hardware for execution velocity and cost.&lt;/li&gt;
&lt;li&gt;Compiled outputs that can be shared: Exhibition layer (ledger episodes plus structural codec plus assurance substrate for provenance, signing, and attribution).&lt;/li&gt;
&lt;li&gt;
Supporting roles round it out: ingestor, ijb (the Master Script/Canon), arctos (Production Runtime), bulwark plus vault, and meter.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  What's live under verivus-oss today
&lt;/h3&gt;

&lt;p&gt;I've already run this evolving model in public: X replies and crossposts with multi-LLM consensus and evidence traces, and the ledger distribution review governed by a living 22-unit DAG-TOML plan.&lt;br&gt;
The plan and its evidence became the shareable Exhibition record. sqry itself has been used in real audits. Earlier articles and repo briefs on dag-toml and the production model are out there too.&lt;br&gt;
These are real, usable artifacts.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's still behind the curtain
&lt;/h3&gt;

&lt;p&gt;The fuller vision lives in the internal verivusai-labs work under vap: the complete substrate (ijb, vault, ledger, integrity, meter and related crates), deeper studio refinements, and day-to-day use on larger efforts. I'm surfacing pieces as they stabilise. The published verivus-oss artifacts are the current on-ramp. This is nights and weekends alongside the day job, completely disconnected, with learnings feeding one way only (#ihaveadayjob).&lt;/p&gt;

&lt;h2&gt;
  
  
  Invitation
&lt;/h2&gt;

&lt;p&gt;If Patrick's description matches the friction you feel doing serious long-running agentic work (context that survives iteration, workflows that are versioned and governed, agents deliberately cast, outputs that can be exhibited with real provenance), this is the direction under vap in verivus-oss.&lt;/p&gt;

&lt;p&gt;The published artifacts are the on-ramp.&lt;/p&gt;

&lt;p&gt;Concrete experiments (running sqry on a real stack, authoring a typed DAG, using the pipeline for output) and precise evidence-based feedback on what would make you want to direct or act in a real Production are especially welcome.&lt;/p&gt;

&lt;p&gt;Repo links and contact in profile. Early collaborators willing to engage the ontology and run real Productions are welcome.&lt;br&gt;
The underlying conviction is that tools of this kind function as cognitive co-processors, common grace that removes a significant portion of the grinding burden of semantic entropy and coordination so the remaining human work can be higher-order direction and faithful stewardship of Productions.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>verivus</category>
      <category>llm</category>
      <category>agents</category>
    </item>
    <item>
      <title>The City-State and the Federation: Two Governance Models for AI Coding Agents</title>
      <dc:creator>Werner Kasselman</dc:creator>
      <pubDate>Thu, 04 Jun 2026 11:04:56 +0000</pubDate>
      <link>https://dev.to/wernerk_au/the-city-state-and-the-federation-two-governance-models-for-ai-coding-agents-5117</link>
      <guid>https://dev.to/wernerk_au/the-city-state-and-the-federation-two-governance-models-for-ai-coding-agents-5117</guid>
      <description>&lt;h2&gt;
  
  
  Why I am writing this
&lt;/h2&gt;

&lt;p&gt;This is the third piece in an accidental series about convergent evolution in agent tooling, and I think it is the most useful one, because this time the two systems being compared are not merely neighbours in the same field, they are the same species of thing: governance systems for AI coding agents, built in the same quarter, by people who have never spoken, with overlapping mechanisms and almost perfectly complementary blind spots.  In &lt;a href="https://dev.to/wernerk_au/dag-toml-how-we-turned-four-months-of-code-review-pain-into-a-machine-checkable-planning-format-236j"&gt;the first article&lt;/a&gt; I described my DAG TOML stack, plans as machine-checkable claims with validators and a fleet control plane behind them, and in &lt;a href="https://dev.to/wernerk_au/the-machine-that-builds-the-machine-and-the-studio-that-runs-itself-two-ways-to-organise-an-agent-5aj1"&gt;the second&lt;/a&gt; I compared two orchestrators.  This one is about &lt;a href="https://github.com/jameshgrn/dgov" rel="noopener noreferrer"&gt;dgov&lt;/a&gt; by James H. Gearon, which describes itself as a "deterministic kernel for multi-agent orchestration via git worktrees".  I should be straight about my method: I did not read the source line by line myself.  I had my agents clone it and do the close reading (roughly 20,000 lines of Python across 70 modules, with 70 test files and a benchmarks document) and I worked from their structured analysis, the project's own documentation and the schema excerpts they pulled, which, given the subject of this article, feels less like a shortcut and more like a demonstration.&lt;/p&gt;

&lt;p&gt;The usual disclaimer applies, doubled: I built one of the two systems, I have neither run nor personally read the other end to end, and any misreadings of dgov are mine (or my agents', which contractually is still mine).  Take this as one practitioner reading a rival constitution with admiration, a highlighter and a research staff, nothing more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two metaphors, both load-bearing
&lt;/h2&gt;

&lt;p&gt;The first thing that struck me reading dgov is that it is built on a legal metaphor, and the metaphor is structural rather than decorative.  There is a governor charter (&lt;code&gt;governor.md&lt;/code&gt;, "Plan first. Respect file claims. Fail closed."), standard operating procedures as statute, an append-only ledger whose entries include a category literally called case law, prompt sections injected into workers under the heading of probation, an error type named &lt;code&gt;ConstitutionalViolation&lt;/code&gt;, and ten documented design pillars covering separation of powers and fail-closed defaults.  The probabilistic worker implements; the deterministic governor plans, validates, reviews and merges.  It is a constitution with an enforcement arm.&lt;/p&gt;

&lt;p&gt;My stack runs on a different metaphor, scientific audit: plans are claims, validators attempt to refute them, completion requires evidence, and a control plane above many repositories evaluates everything against policy.  Law versus science, enforcement versus refutation.  Both metaphors earn their keep, and the differences between the two systems fall out of the metaphors with surprising neatness.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a plan is
&lt;/h2&gt;

&lt;p&gt;In dgov, a plan is a TOML tree compiled to a DAG, and each task carries its own prompt, the actual work order, alongside file claims (&lt;code&gt;files.create&lt;/code&gt;, &lt;code&gt;files.edit&lt;/code&gt;, &lt;code&gt;files.read&lt;/code&gt; and so on), dependencies, a test command, a role (worker, researcher or reviewer), an iteration budget and a set of tag-matched SOPs that get prepended to the prompt.  The plan is directly dispatchable: compile it, and workers in isolated git worktrees start executing it.  Compilation is fail-closed: cycles, unreachable units and malformed sections are rejected before anything runs.&lt;/p&gt;

&lt;p&gt;In my stack the plan deliberately contains no prompt at all.  A unit carries contracts instead: acceptance criteria, constraints, failure modes, critical decisions, produced and consumed artefacts, and a &lt;code&gt;[computed]&lt;/code&gt; section in which the author must commit to derived claims (critical path, per-layer parallelism, totals) that a validator independently recomputes and diffs.  The plan is not a work order, it is a reviewable artefact that can be &lt;em&gt;refuted&lt;/em&gt; before anyone executes it.&lt;/p&gt;
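
&lt;p&gt;To make "recomputes and diffs" concrete, here is a toy Python version of that check.  It is a sketch only: the unit names, the &lt;code&gt;critical_path&lt;/code&gt; and &lt;code&gt;total_units&lt;/code&gt; keys and the helper functions are invented for illustration, and it measures the critical path in unit count on a graph assumed to be acyclic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from functools import lru_cache

# Hypothetical plan: unit id mapped to the ids it depends on (assumed acyclic).
UNITS = {"schema": [], "api": ["schema"], "ui": ["api"], "docs": ["schema"]}
DECLARED = {"critical_path": ["schema", "api", "ui"], "total_units": 4}


def recompute(units):
    """Derive the longest dependency chain and the unit total from the graph."""

    @lru_cache(maxsize=None)
    def longest_to(unit):
        chains = [longest_to(dep) for dep in units[unit]]
        best = max(chains, key=len, default=())
        return best + (unit,)

    path = max((longest_to(unit) for unit in units), key=len)
    return {"critical_path": list(path), "total_units": len(units)}


def diff_claims(declared, computed):
    """Return the keys where the author's claim disagrees with the recomputation."""
    return sorted(key for key in computed if declared.get(key) != computed[key])


print(diff_claims(DECLARED, recompute(UNITS)))  # []
wrong = {"critical_path": ["schema", "docs"], "total_units": 4}
print(diff_claims(wrong, recompute(UNITS)))     # ['critical_path']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A non-empty diff is the refutation: the author committed to a derived claim and the graph says otherwise.&lt;/p&gt;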

&lt;p&gt;So dgov closes the loop from plan to execution, and mine closes the loop from plan to review, and neither closes both.  That asymmetry runs through everything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing dgov does that I do not
&lt;/h2&gt;

&lt;p&gt;Credit first, because this is the part that made me sit up.  At settlement time, dgov diffs the worktree and compares the files an agent actually touched against the files the task claimed it would touch, and the comparison is merciless: unclaimed paths reject the merge, reserved paths fail closed, and even reading outside the declared read scope is caught and surfaced.  Git is the source of truth, and the claim is checked against reality mechanically, every time, with no human in the loop.&lt;/p&gt;
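
&lt;p&gt;Stripped to its core, that settlement check is a set comparison.  The Python below is my own sketch of the idea, not dgov's code: the function name &lt;code&gt;settle&lt;/code&gt; and its parameters are hypothetical, the touched set is assumed to come from something like &lt;code&gt;git diff --name-only&lt;/code&gt; in the worktree, and the read-scope check is left out.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def settle(claimed_write, reserved, touched):
    """Return (ok, reasons) for a merge candidate; any violation rejects it."""
    reasons = []
    for path in sorted(touched):
        if path in reserved:
            reasons.append("reserved path touched: " + path)
        elif path not in claimed_write:
            reasons.append("unclaimed path touched: " + path)
    return (not reasons, reasons)


ok, reasons = settle(
    claimed_write={"src/api.py", "tests/test_api.py"},
    reserved={"pyproject.toml"},
    touched={"src/api.py", "pyproject.toml", "README.md"},
)
print(ok)  # False
for reason in reasons:
    print(reason)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The default is rejection: a path has to be positively claimed to pass, which is the fail-closed posture the charter asks for.&lt;/p&gt;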

&lt;p&gt;I have to concede this carefully, because the first draft of this paragraph conceded it wrongly.  My plan runtime does not do that: my validators refute a plan's &lt;em&gt;self-consistency&lt;/em&gt; (a declared critical path that is not the longest path fails, an artefact with two producers fails), and my evidence matrices require completion claims to name a proof with declared scope and known exclusions, but when a unit is marked done, nothing mechanically diffs the declared file claims against what actually changed.  The honest complication is that the mechanism does exist elsewhere in my stack: my version-control layer, &lt;a href="https://dev.to/wernerk_au/the-next-software-stack-needs-more-than-code-generation-3aep"&gt;aivcs&lt;/a&gt;, records the symbols actually touched in each Episode and attaches evidence with a freshness lifecycle, which is claim-versus-reality binding at symbol granularity, finer than dgov's file granularity.  What I am missing is not the mechanism, it is the wiring: the plan runtime and the version-control layer do not yet check each other.  dgov verifies what happened against what was claimed in one continuous motion; I have both halves of that theorem proved in separate buildings.  Those are different failure modes, and his is the better one.&lt;/p&gt;

&lt;p&gt;dgov has two more mechanisms worth respecting.  Its semantic settlement layer does AST-level analysis of integration candidates before merging, with a failure taxonomy of its own (text conflicts, concurrent edits to the same symbol, duplicate definitions, signature drift, ordering conflicts, and a category called behavioural mismatch), which I found quietly delightful, because building a failure taxonomy and then mechanising it is exactly the move my whole stack came from, except he aimed it at merge integration whilst I aimed it at review iteration.  I will come back to that taxonomy below, because when I checked it against my own cupboard the comparison surprised me in both directions.  And the kernel itself is a pure function from state and event to new state and actions, no I/O, explicit dispatch table, everything event-sourced to SQLite and an append-only deploy log, which means a run is deterministically replayable in a way my live-database runtime is not.  There is even an autofix phase (mechanical lint fixes applied before the validation gates run), which saves the expensive kind of retry where an agent burns an iteration fixing a formatting complaint.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing I do that dgov does not
&lt;/h2&gt;

&lt;p&gt;The complementary gaps are just as clean.  dgov has no recomputable derived claims, so a plan whose declared structure is internally wrong in ways a topological check cannot see (an inflated parallelism story, a schedule that ignores the true critical path) executes anyway.  It has no artefact dataflow, no produces and consumes with single-producer ownership, so the failure class where two units quietly both own the canonical definition (the one that once cost me thirteen review iterations) has no mechanical guard.  Its reviewer role is explicitly bounded to the diffs of dependency tasks, one model provider, no multi-model adversarial review, where my process was born precisely from independent reviewers (Codex, Gemini and Claude) disagreeing productively.  Its acceptance story is a test command's exit code, and as I wrote in the first article, half of my December pain came from tests that existed but could not fail, which is exactly the weakness an exit-code gate cannot see and an evidence matrix with known exclusions is built to catch.&lt;/p&gt;

&lt;p&gt;And dgov is constitutionally a city-state.  One repository, one &lt;code&gt;.dgov/&lt;/code&gt; directory, one governor.  It governs its territory completely and stops at the border.  My control plane is the federation layer: policy packs and requirement profiles defined once, per-repository agents pushing signed snapshots, evaluation history, exception lifecycles, release trains across many repositories.  dgov has no analogue, and frankly does not claim to want one, but the moment you run agents across a fleet the federation question arrives whether you invited it or not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The convergence list grows
&lt;/h2&gt;

&lt;p&gt;With the previous article's comparison included, there are now three solo builders (wpank with Bardo, Gearon with dgov, and me) who independently arrived at: declarative task units with explicit dependencies, file claims per task as the precondition for safe parallelism, fail-closed validation before execution, topological ordering, per-task verification commands, an append-only event history, and failure memory carried forward into future attempts (his ledger case law, Bardo's do-not-retry lists, my deficiency taxonomy).  One small coincidence I cannot resist recording: the day dgov's git history was re-bootstrapped for worktree isolation is the same day I authored my first DAG TOML.  Nothing connects the two events except the season, which is rather the point.&lt;/p&gt;

&lt;p&gt;When isolated builders keep meeting at the same mechanisms, the mechanisms are telling you something about the problem, not about the builders.  File claims, fail-closed gates and forwarded failure memory now look to me like the arch and the keystone of this field, the parts every serious system will have because the load demands them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I am taking home
&lt;/h2&gt;

&lt;p&gt;I finished reading dgov with a shopping list, which is the highest compliment I know how to pay another person's codebase:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Claim-versus-reality settlement in the plan runtime.  My runtime should refuse to mark a unit done while the actual touched files disagree with the unit's declared file sets, exactly as dgov's review sandbox does, and since my version-control layer already records touched symbols and attached evidence, the work here is plumbing rather than invention.  Still the single highest-value import.&lt;/li&gt;
&lt;li&gt;The placement of merge analysis, not the taxonomy itself.  My first draft of this list said I should import his merge taxonomy, and then I went and audited my own shelves: my semantic merge engine already covers his categories and more (manifest-driven conflict policy per language, tiered degradation down to plain git merge when parsing fails, and a commutativity algebra that formalises what he calls ordering conflicts), and my code-graph layer detects signature drift and duplicate definitions independently.  What dgov actually taught me is &lt;em&gt;where to stand&lt;/em&gt;: he runs merge analysis as a settlement gate inside the plan runtime, every task, every time, whilst my deeper machinery sits in a separate layer that the plan runtime never consults.  The import is the wiring, his architecture carrying my components.&lt;/li&gt;
&lt;li&gt;Fail-closed policy parsing.  dgov rejects malformed SOPs at compile time, required front matter, required sections, no exceptions, and my template ecosystem should hold its own policy documents to the same standard it already holds plans.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And one observation rather than an import.  The most interesting entry in his failure taxonomy is behavioural mismatch, the case where two changes merge cleanly and disagree only at runtime, which is exactly the failure I wrote up &lt;a href="https://dev.to/wernerk_au/the-next-software-stack-needs-more-than-code-generation-3aep"&gt;in an earlier piece&lt;/a&gt; (a pricing path quietly depending on a field another agent had removed, both sides compiling, both passing their tests, git merging without a murmur).  dgov's taxonomy names that crime but cannot yet detect it, because detection needs a relationship graph (which callers depend on which symbols) rather than a diff, and that graph is precisely what the symbol-indexing and predicate layers of my stack exist to provide.  The city-state names the crime; the federation has the forensics.  Neither system has secured a conviction yet, and I suspect whoever gets there first gets there with both halves.&lt;/p&gt;

&lt;p&gt;If Gearon ever reads my side of this, the reciprocal list is above: refutable derived claims, artefact ownership, evidence with declared exclusions, and a story for the day dgov needs to govern more than one city.  And since the comparison should be checkable rather than taken on trust, my side of the format is a public draft specification at &lt;a href="https://agent-assurance.dev" rel="noopener noreferrer"&gt;agent-assurance.dev&lt;/a&gt;, with independent Rust, Go and Python validators, should anyone (including him) want to implement against it.&lt;/p&gt;

&lt;p&gt;Thanks for reading this far, I hope you find some value in the comparison.  If you are building agent governance of your own, whether it leans towards law or towards science, I would genuinely like to hear which theorems you chose to prove mechanically, and which ones you are still taking on trust.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>devops</category>
      <category>automation</category>
    </item>
    <item>
      <title>The Machine That Builds the Machine, and the Studio That Runs Itself: Two Ways to Organise an Agent Swarm</title>
      <dc:creator>Werner Kasselman</dc:creator>
      <pubDate>Thu, 04 Jun 2026 10:12:18 +0000</pubDate>
      <link>https://dev.to/wernerk_au/the-machine-that-builds-the-machine-and-the-studio-that-runs-itself-two-ways-to-organise-an-agent-5aj1</link>
      <guid>https://dev.to/wernerk_au/the-machine-that-builds-the-machine-and-the-studio-that-runs-itself-two-ways-to-organise-an-agent-5aj1</guid>
      <description>&lt;h2&gt;
  
  
  Why I am writing this
&lt;/h2&gt;

&lt;p&gt;I thought people might find this comparison useful, because it is rare to get two fully built agent-orchestration systems, designed in complete isolation from each other, solving the same class of problem with enough written detail on both sides to compare them honestly, and rarer still to catch the differences while both are still warm.  Shortly after publishing &lt;a href="https://dev.to/wernerk_au/dag-toml-how-we-turned-four-months-of-code-review-pain-into-a-machine-checkable-planning-format-236j"&gt;my DAG TOML article&lt;/a&gt; I went looking for neighbours and found wpank's write-up, &lt;a href="https://gist.github.com/wpank/e32bb295792a4ded6e52cf2f98d41797" rel="noopener noreferrer"&gt;Building the Machine That Builds the Machine&lt;/a&gt;, which describes Bardo: a meta-system that takes a 234,657-line specification across 343 files and turns it into 26 compiled Rust crates through coordinated agent swarms.  I have my own horse in this race, a system called atelier-studio (roughly 80,000 lines of Rust, built across about five months), and reading his post was the strange experience of recognising my own decisions in a stranger's codebase, and then, more usefully, recognising the places where he and I made opposite calls.&lt;/p&gt;

&lt;p&gt;I am not a neutral reviewer here, I built one of the two systems being compared, so please take this as nothing more than one practitioner reading another practitioner's work with respect and an honest ruler.  Where I describe Bardo I am working from the write-up alone, not the code, and any misreadings are mine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The factory: Bardo
&lt;/h2&gt;

&lt;p&gt;Bardo is project-shaped.  It exists to finish one enormous build: a 26-crate Rust workspace implementing autonomous agents with mortality, dreaming, emotion and economic incentives, specified down to the academic citations (467 of them, Hans Jonas on metabolic freedom and Damasio's somatic markers, to name a few).  The orchestrator, bardo-ctl, is 42,744 lines of Rust, and the part I admire most is around 2,000 lines of bash.&lt;/p&gt;

&lt;p&gt;The bash is a three-stage context engineering pipeline, and frankly it is the heart of the whole design.  Stage one extracts specification sections using a two-source weighted model (inline spec references get double weight over crate-mapped directories).  Stage two decomposes a plan into ordered steps under a 102.4KB context cap, with the rule that each step must compile when combined with all previous steps.  Stage three distils each step down to a 5 to 15KB context slice, carrying forward a one-line summary of what previous steps accomplished, so the agent implementing step 7 never sees the scaffolding from step 1.  The design came, in his words, from watching agents drown in 80KB payloads where maybe 12KB was relevant.&lt;/p&gt;

&lt;p&gt;Above that sits a genuinely complete orchestration layer: around 100 task TOML files declaring files, acceptance criteria, cross-plan dependencies (a task can depend on &lt;code&gt;"17:T1"&lt;/code&gt;, task T1 of plan 17, which lets the scheduler extract parallelism across plan boundaries) and exclusive file claims; a dual-layer DAG with wave scheduling via Kahn's algorithm; a &lt;code&gt;next_runnable()&lt;/code&gt; check that refuses to start any task whose files overlap an in-flight task; 25 agent roles routed to three backends by competence (Codex for refactoring and diagnosis, Cursor for review verdicts, Claude for orchestration and implementation); a gate gauntlet (compile, dependency-deny, test, spec compliance) with a three-failure halt; a parallel three-reviewer panel synthesised by a Critic; git worktrees per plan with a shared sccache so parallel builds cache-hit each other; and a Conductor that nudges silent agents at 300 seconds, restarts stalled ones at 600, and never lets itself starve an Implementer of a spawn slot.&lt;/p&gt;
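
&lt;p&gt;For readers who have not met wave scheduling before, here is a minimal Python sketch of the two ideas in that paragraph: Kahn-style waves over a dependency map, and a runnable check that refuses tasks whose files overlap in-flight work.  It is written from the description above, not from Bardo's code, so the signatures (including this &lt;code&gt;next_runnable&lt;/code&gt;) and the data shapes are assumptions, and it assumes every dependency also appears as a key in the map.&lt;/p&gt;

```python
def kahn_waves(depends_on):
    """Group tasks into waves using Kahn's algorithm.

    Every task in a wave has all of its dependencies in earlier waves,
    so the tasks inside one wave are candidates for parallel execution.
    Assumption: every dependency is itself a key of depends_on.
    """
    remaining = {task: set(deps) for task, deps in depends_on.items()}
    waves = []
    while remaining:
        ready = sorted(task for task, deps in remaining.items() if not deps)
        if not ready:
            # Nothing is free to run but tasks remain: the graph has a cycle.
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        for task in ready:
            del remaining[task]
        for deps in remaining.values():
            deps.difference_update(ready)
    return waves


def next_runnable(ready, files, in_flight):
    """Return ready tasks whose file claims do not overlap any in-flight task."""
    busy = set()
    for task in in_flight:
        busy.update(files[task])
    return [task for task in ready if busy.isdisjoint(files[task])]
```

&lt;p&gt;Cross-plan identifiers such as &lt;code&gt;"17:T1"&lt;/code&gt; work in this sketch as ordinary string keys, which is one plausible way a scheduler could extract parallelism across plan boundaries.&lt;/p&gt;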

&lt;p&gt;Two smaller mechanisms deserve a nod because they encode real scars.  The iteration memory builds cumulative DO NOT RETRY lists from compiler errors and review blockers, born from watching an agent hit the same type mismatch four iterations running, each time "fixing" it differently and wrongly.  And the golden-path index records plans that succeeded on the first attempt, categorised, so future decompositions are shown up to two worked examples of the same category.  Failure memory and success memory, both fed forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  The studio: atelier-studio
&lt;/h2&gt;

&lt;p&gt;Atelier-studio is institution-shaped.  Where Bardo exists to finish a build, atelier exists to keep running: a set of standing councils (research, engineering, QA, go-to-market, product and operations) that take a product idea through the whole lifecycle, from market analysis and competitive intelligence through work package decomposition, test planning, service level objectives and launch messaging, backed by a local knowledge graph of around 23,000 ingested items (papers, standards, bodies of knowledge, model registries).&lt;/p&gt;

&lt;p&gt;The design bet is different, and the difference matters.  Bardo diversifies its agents by skill, routing each role to the backend best at that job.  Atelier diversifies by perspective: each council runs multiple independent planner "flavours" against the same inputs, a Conservative Analyst worrying about risk and compliance, an Optimistic Explorer chasing emerging technology, a Pragmatic Synthesizer weighing cost against time to market (the engineering council has its own trio along minimalism, scalability and maintainability lines), and the outputs are merged through critique and ranking rather than simple voting.  Bardo never argues with itself.  Atelier is built to argue with itself, because in business strategy work the failure mode is not a type mismatch, it is a confident plan that nobody stress-tested from a hostile angle.&lt;/p&gt;

&lt;p&gt;The memory systems differ the same way.  Bardo's learning is textual and rule-shaped, DO NOT RETRY lists an agent must read.  Atelier's is statistical: an attempt tracker feeding a failure oracle that forecasts the probability the next attempt fails (Dirichlet modelling), and a calibration tracker (isotonic regression and Platt scaling) that keeps the system's confidence honest against its actual hit rate.  One remembers what failed, the other models how likely failure is.  Atelier also crosses a line Bardo never attempts: a self-improvement subsystem that proposes changes to atelier's own code, which is exactly why it carries a human-approval safety gate and adversarial review, because a system that rewrites itself needs governance in a way a build factory does not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where two strangers built the same parts
&lt;/h2&gt;

&lt;p&gt;The convergence list is long enough that I stopped finding it spooky and started finding it instructive.  Both systems independently arrived at: atomic work units carrying their own acceptance criteria and file sets; explicit dependency DAGs over those units; file-level conflict detection as the precondition for safe parallel agents (Bardo's exclusive-files check is functionally identical to the conflict groups in my DAG TOML runtime); a panel of reviewers with a synthesising verdict; a three-strikes failure budget; failure memory fed forward into the next attempt; success exemplars fed forward as worked examples (his golden paths are, almost word for word, the clean one-pass approvals I used as a negative class when mining my review archive); and isolation of parallel writers via separate working copies.&lt;/p&gt;

&lt;p&gt;None of this was copied.  I found his write-up after building mine, his post does not reference any of my work, and yet the load-bearing safety mechanisms match almost one for one.  When two builders who have never met converge on file-level conflict detection and cumulative do-not-retry memory, that is not fashion, that is the problem itself dictating the shape of the solution, the same way every culture that builds bridges discovers the arch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the philosophies split
&lt;/h2&gt;

&lt;p&gt;Three genuine divergences, and each one traces back to the shape of the work rather than to taste.&lt;/p&gt;

&lt;p&gt;First, static distillation versus living retrieval.  Bardo can precompute context slices because the specification is frozen; the spec is the territory and the pipeline is a map-making exercise done once.  Atelier cannot freeze anything, the knowledge graph keeps growing and the councils query it at run time through a librarian layer with per-council token budgets.  Bardo compiles context, atelier retrieves it.  His closing line, that context engineering is the whole game, the right 12KB delivered at the right time, is the frozen-world statement of the same conviction that made me build the knowledge graph for the unfrozen one.&lt;/p&gt;

&lt;p&gt;Second, skill diversity versus perspective diversity, which I described above and will not repeat, except to note the consequence: Bardo's review panel exists to catch defects, atelier's flavour consensus exists to catch blind spots, and a mature swarm probably needs both.&lt;/p&gt;

&lt;p&gt;Third, the cockpit versus the control plane.  His attempt at headless operation was, in his words, like driving blindfolded, an agent stuck in a compile-fix loop for 15 of 20 unobserved minutes, and his answer was a terminal dashboard with 26 widgets, pause and force-advance controls, and per-role colour coding.  My answer to the same pain was structured event streaming and, eventually, an external control plane that evaluates fleet state from data rather than from watching.  An interactive cockpit against a queryable instrument panel, and I suspect his converts stuck agents into intervention faster, whilst mine scales past the number of screens one person can watch.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I take from it
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The safety mechanisms converge, the strategy layers do not.  Conflict detection, acceptance criteria, failure budgets and iteration memory showed up in both systems unprompted, whilst context strategy, diversity strategy and observability strategy split cleanly along the grain of each system's purpose.  If you are building an orchestrator, copy the first list with confidence and choose the second list deliberately.&lt;/li&gt;
&lt;li&gt;Project-shaped and institution-shaped systems want different memory.  A factory can carry its lessons as text; an institution needs calibration, because the institution will still be making forecasts long after any individual lesson has gone stale.&lt;/li&gt;
&lt;li&gt;Context engineering keeps winning.  Two systems, opposite architectures, same conclusion: not better models, not longer windows, but the right small context at the right moment.&lt;/li&gt;
&lt;li&gt;Synchronicity is evidence.  When isolated builders keep meeting at the same mechanisms, those mechanisms are probably load-bearing for the whole field, and they are the parts I would now least want to be without.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Credit to wpank for a write-up generous enough with internals to make a real comparison possible, that generosity is rarer than the engineering.  Thanks for reading this far, I hope you find some value in my reading of the two machines.  If you have built your own orchestrator and recognise these mechanisms (or, better, if you made a third set of choices entirely), I would genuinely like to hear how the wall pushed back on you.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>automation</category>
      <category>devops</category>
    </item>
    <item>
      <title>DAG TOML: How I Turned Four Months of Code-Review Pain into a Machine-Checkable Planning Format</title>
      <dc:creator>Werner Kasselman</dc:creator>
      <pubDate>Thu, 04 Jun 2026 08:47:24 +0000</pubDate>
      <link>https://dev.to/wernerk_au/dag-toml-how-we-turned-four-months-of-code-review-pain-into-a-machine-checkable-planning-format-236j</link>
      <guid>https://dev.to/wernerk_au/dag-toml-how-we-turned-four-months-of-code-review-pain-into-a-machine-checkable-planning-format-236j</guid>
      <description>&lt;p&gt;&lt;em&gt;Everything below is date-anchored, because the dates matter to the story: I first put agent rules in TOML in October 2025, the failure data runs from December 2025 to March 2026, the first DAG TOML was authored on 2 April 2026, the archive analysis that justified it ran on 4 April 2026, and the database-backed runtime followed across April and May 2026.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I am sharing this
&lt;/h2&gt;

&lt;p&gt;I thought people might find this interesting, and hopefully it saves somebody else a few wasted review rounds, because the cost of the problem I am about to describe is mostly invisible until you sit down and add it up.  I run a multi-agent development process where LLM agents (Claude, Codex CLI and Gemini CLI, to name a few) plan, implement and cross-review each other's work on a Rust codebase, and every work product goes through independent review by at least two different model families before it merges.&lt;/p&gt;

&lt;p&gt;I am not a process-methodology researcher and I have no business publishing failure taxonomies, so please take this as nothing more than me sharing what I found in my own review archive, and what I changed because of it.&lt;/p&gt;

&lt;p&gt;The system works, frankly better than I expected when I started, but through late 2025 it had a churn problem: work kept bouncing back for rereview, and every bounce burned a full review round across multiple models.  So in April 2026 I did something slightly unusual, I treated my own review archive (roughly 2,400 review documents) as a dataset and asked the obvious question: why does work actually bounce?&lt;/p&gt;

&lt;p&gt;This article shows one real chain from that dataset (the December one), the taxonomy that fell out of the analysis, and the fix: implementation plans written as &lt;a href="https://toml.io" rel="noopener noreferrer"&gt;TOML&lt;/a&gt; DAGs with mechanical validators, so that an entire class of review findings became &lt;code&gt;exit 1&lt;/code&gt; instead of a week of iteration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exhibit A: the project-persistence chain (5 and 6 December 2025)
&lt;/h2&gt;

&lt;p&gt;The feature was unglamorous: persist a code-index project's in-memory state (repo index, file table, symbol index) to disk on teardown and reload it on startup, the kind of thing that should be a one-pass review.&lt;/p&gt;

&lt;p&gt;The paper trail, fully dated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5 December 2025 - Spec written and approved, with a full planning pack behind it: spec, design, implementation plan and test plan.  Concrete targets: warm restore after restart, persist in under 750 ms for a 50k-symbol index, at least 80% module coverage.&lt;/li&gt;
&lt;li&gt;5 December 2025 - Nine pre-implementation review iterations across three models (3 by Codex, 2 by Gemini, 4 by Claude) before a single line of code was written.&lt;/li&gt;
&lt;li&gt;6 December 2025 - Implementation done.  Two independent post-implementation reviews.  Both returned REQUEST CHANGES.&lt;/li&gt;
&lt;li&gt;6 December 2025 - Fix iteration, second review round, approved the same day.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What did two reviewers find on 6 December, after all that planning?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;Finding&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;The restore path overwrote every file's repo ID with &lt;code&gt;NONE&lt;/code&gt;, the persisted ID was simply ignored, so reloaded state was detached from its repositories.  The feature's entire purpose silently didn't work, and a &lt;code&gt;TODO&lt;/code&gt; in the code acknowledged it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;The cache directory from config was trusted verbatim, which meant absolute paths and &lt;code&gt;..&lt;/code&gt; segments could write state outside the project root.  Path traversal, despite the spec explicitly constraining writes to the project root.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MEDIUM&lt;/td&gt;
&lt;td&gt;The config fingerprint (used to invalidate stale persisted state) hashed only 4 of the 7 config fields that affect indexing, so changing the others silently reused stale state.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MEDIUM&lt;/td&gt;
&lt;td&gt;The "concurrency test" spawned four threads on four separate directories.  Same-root races: untested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MEDIUM&lt;/td&gt;
&lt;td&gt;No test ever persisted and restored an actual symbol index, so the headline requirement was unverified.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LOW&lt;/td&gt;
&lt;td&gt;The file was fsynced but the containing directory was not, so a crash after rename could lose the file after logging success.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both reviewers, independently and from different model families, converged on the same top finding.  The second round on 6 December fixed everything with a verification table mapping each finding to specific code and a named test, and it was approved same-day.&lt;/p&gt;
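
&lt;p&gt;For the path-traversal finding, the missing enforcement is small enough to sketch.  The real codebase is Rust and the actual fix is not shown in this article, so treat the following Python as an illustration only: the function name and parameters are hypothetical, and it needs Python 3.9 or later for &lt;code&gt;is_relative_to&lt;/code&gt;.&lt;/p&gt;

```python
from pathlib import Path


def resolve_cache_dir(project_root, configured):
    """Return the cache directory, refusing any path that escapes the root.

    Hypothetical helper: joins the configured value onto the project root,
    resolves ".." segments and symlinks, then checks containment.
    """
    root = Path(project_root).resolve()
    # Joining an absolute path discards the root, so it is caught below too.
    candidate = (root / configured).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"cache directory escapes project root: {configured}")
    return candidate
```

&lt;p&gt;The point of the sketch is that the spec sentence "writes are constrained to the project root" only binds the implementation once it exists as a check that can fail.&lt;/p&gt;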

&lt;p&gt;Here is the uncomfortable part: the planning was thorough, the planning reviews were thorough, and the implementation still shipped with its core feature non-functional and a path-traversal hole.  Plans written in prose don't bind implementations, and reviews of prose can't be rerun.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mining the archive (4 April 2026)
&lt;/h2&gt;

&lt;p&gt;I analysed seven full "iteration chains" (initial request, blocking reviews, rereviews, final approval) spanning December 2025 to March 2026, plus nine clean one-pass approvals as a control group:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;December 2025 - project persistence (above); plugin polish across 4 language plugins ("production-ready" claimed whilst the test matrix said otherwise); and a follow-up where the tests existed but couldn't fail, because non-strict assertions passed even with the feature absent&lt;/li&gt;
&lt;li&gt;December 2025 to January 2026 - a privacy-sensitive planning pack that took 13 iterations, mostly because no single canonical schema existed early and definitions drifted across documents&lt;/li&gt;
&lt;li&gt;10 February 2026 - a policy standard blocked on MUST/SHOULD conflicts and a precedence model that let task instructions override security controls&lt;/li&gt;
&lt;li&gt;February 2026 - a C++ language feature claiming "complete support" whilst its own status docs still described failing tests&lt;/li&gt;
&lt;li&gt;10 March 2026 - a planning pack that burned review rounds 6 and 7 on a missing artefact family and an "ordering is deterministic" claim with no stated ordering rule&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every rereview cause fit one of six categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Missing artefact completeness - required docs absent, found by the reviewer&lt;/li&gt;
&lt;li&gt;Unstated contracts - "deterministic", "compatible", "safe", with no rule written anywhere&lt;/li&gt;
&lt;li&gt;Drifted contracts - the same concept defined differently across documents&lt;/li&gt;
&lt;li&gt;Evidence gaps - claims broader than tests, and "resolved" without proof&lt;/li&gt;
&lt;li&gt;Boundary rules missing from the design - no privacy, security or filesystem constraints stated&lt;/li&gt;
&lt;li&gt;Boundary rules stated but not enforced - the December path-traversal case, exactly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And the clean one-pass approvals (all nine of them) shared four traits: bounded scope, already-explicit contracts, evidence matched to claims, and reviewer comments that were refinements rather than prerequisites.&lt;/p&gt;

&lt;p&gt;Notice what the six categories have in common: almost none of them are code bugs.  They are plan-shaped defects, and they are checkable before a reviewer ever looks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: plans as DAGs, in TOML, with a validator (2 April 2026)
&lt;/h2&gt;

&lt;p&gt;The first DAG TOML was authored on 2 April 2026, and the extracted templates and validators followed on 4 April, the same day as the archive analysis.  TOML itself was not new to me, I had been putting agent rules in TOML since 12 October 2025 (a &lt;code&gt;[rules]&lt;/code&gt; never/always prompt policy in one of my Rust projects, with trigger-activated context sections and token budgets), but all through the December-to-March churn the plans themselves stayed in prose, and April was when the plans became TOML too.  I know that a TOML schema for plans might sound like process for the sake of process, but the format makes every plan claim one of three things: a required field, a recomputable assertion, or a gated state transition.&lt;/p&gt;

&lt;p&gt;A plan is a set of units:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[units.U02]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"extract-initial-chain-set"&lt;/span&gt;
&lt;span class="py"&gt;layer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="py"&gt;tier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="py"&gt;status&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"done"&lt;/span&gt;             &lt;span class="c"&gt;# pending | in_progress | done | blocked | deferred&lt;/span&gt;
&lt;span class="py"&gt;depends_on&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"U01"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"U04"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;estimated_loc&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;160&lt;/span&gt;
&lt;span class="py"&gt;files_modify&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"research/ANALYSIS_FINDINGS.md"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;acceptance&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="s"&gt;"At least five completed chains are analysed with explicit rereview causes."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;produces&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"ART:initial-chain-findings"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;consumes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"ART:batch-scope"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;critical_decisions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Distinguish content defects from process defects."&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;constraints&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Only count deficiencies that materially forced another iteration."&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;failure_modes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"If extraction drifts into generic summaries, the taxonomy loses causal value."&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;acceptance&lt;/code&gt;, &lt;code&gt;constraints&lt;/code&gt;, &lt;code&gt;failure_modes&lt;/code&gt; and &lt;code&gt;critical_decisions&lt;/code&gt; are required, per unit.  Category 2 (unstated contracts) stops being something a reviewer must notice by absence; it becomes a missing required field.&lt;/p&gt;

&lt;p&gt;Then the plan must declare its own derived properties:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[computed]&lt;/span&gt;
&lt;span class="py"&gt;entry_points&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"U01"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;leaf_nodes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"U05"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;critical_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"U01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"U02"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"U04"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"U05"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;critical_path_loc&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;420&lt;/span&gt;
&lt;span class="nn"&gt;[computed.max_parallel]&lt;/span&gt;
&lt;span class="py"&gt;layer1&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here is the entire trick: a roughly 500-line Python validator (standard library only, &lt;a href="https://docs.python.org/3/library/tomllib.html" rel="noopener noreferrer"&gt;&lt;code&gt;tomllib&lt;/code&gt;&lt;/a&gt; does the parsing) recomputes every one of those claims from the units table and diffs them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;blocks&lt;/code&gt; must be the exact inverse of &lt;code&gt;depends_on&lt;/code&gt;, so editing one side of a dependency and forgetting the other fails validation with the exact mismatch&lt;/li&gt;
&lt;li&gt;cycles are detected and printed as the actual cycle path&lt;/li&gt;
&lt;li&gt;every &lt;code&gt;ART:&lt;/code&gt; artefact must have exactly one producer, so the "who owns the canonical definition" drift that cost 13 iterations in January becomes a one-line error&lt;/li&gt;
&lt;li&gt;every &lt;code&gt;consumes&lt;/code&gt; must match an existing &lt;code&gt;produces&lt;/code&gt;, so hidden dependencies surface as holes in the plan&lt;/li&gt;
&lt;li&gt;a depender must sit in a strictly higher layer than its dependencies, so overstated parallelism fails&lt;/li&gt;
&lt;li&gt;the declared critical path must be a chain of real edges, start at an entry point, end at a leaf, and match the true longest weighted path (recomputed via toposort), so schedule fantasy fails&lt;/li&gt;
&lt;li&gt;units sharing files must be declared in conflict groups, so two parallel agents about to edit the same file is caught at plan time&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;files_modify&lt;/code&gt; paths must exist in the repo, so plans written against an imagined codebase fail&lt;/li&gt;
&lt;li&gt;placeholders (&lt;code&gt;&amp;lt;fill-in-later&amp;gt;&lt;/code&gt;) are rejected outright&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A wrong plan claim is no longer a reviewer judgement call; it is a failed assertion with a one-line diff.&lt;/p&gt;
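
&lt;p&gt;The critical-path and cycle checks could look something like the sketch below, using &lt;code&gt;graphlib&lt;/code&gt; from the Python standard library.  The assumptions are mine: the path weight is taken to be the sum of &lt;code&gt;estimated_loc&lt;/code&gt; along the path, the per-unit numbers other than U02's 160 are invented so that the total lands on the declared 420, and ties between equally heavy paths are not handled.&lt;/p&gt;

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical plan: unit id -> (estimated_loc, depends_on).
UNITS = {
    "U01": (60, []),
    "U02": (160, ["U01"]),
    "U03": (40, ["U01"]),
    "U04": (120, ["U02", "U03"]),
    "U05": (80, ["U04"]),
}


def longest_weighted_path(units):
    """Return (path, total_loc) for the heaviest dependency chain."""
    graph = {uid: set(deps) for uid, (_, deps) in units.items()}
    # static_order() yields dependencies first and raises CycleError on a cycle.
    order = list(TopologicalSorter(graph).static_order())

    best = {}  # uid -> (loc of heaviest path ending at uid, predecessor on it)
    for uid in order:
        loc, deps = units[uid]
        prev = max(deps, key=lambda d: best[d][0], default=None)
        base = best[prev][0] if prev is not None else 0
        best[uid] = (base + loc, prev)

    end = max(best, key=lambda u: best[u][0])
    total = best[end][0]
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = best[node][1]
    path.reverse()
    return path, total


# Compare the recomputed values with what the plan declared in [computed].
declared_path = ["U01", "U02", "U04", "U05"]
declared_loc = 420
path, total = longest_weighted_path(UNITS)
mismatches = []
if path != declared_path:
    mismatches.append(f"critical_path: declared {declared_path}, computed {path}")
if total != declared_loc:
    mismatches.append(f"critical_path_loc: declared {declared_loc}, computed {total}")

# A cycle surfaces as the actual path, not just "cycle detected".
try:
    longest_weighted_path({"A": (1, ["B"]), "B": (1, ["A"])})
    cycle = None
except CycleError as exc:
    cycle = exc.args[1]
    print("cycle:", " then ".join(cycle))
```

&lt;p&gt;The design point is that the author never gets to assert the critical path; they only get to predict it, and the validator checks the prediction.&lt;/p&gt;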

&lt;h2&gt;
  
  
  What it changed in review (4 April 2026, first live use)
&lt;/h2&gt;

&lt;p&gt;Two days after the format existed, the first DAG-reviewed plan went through: a plugin cost-tiering feature.  The reviewer's scope line was the TOML file itself, and the verdict was APPROVED in one pass with zero blocking issues, where all four reviewer comments were genuine domain risks (legacy manifest fallback semantics and plugin ID stability, to name a few) rather than structural gaps.&lt;/p&gt;

&lt;p&gt;That is the mechanism working as intended.  The structural questions reviewers used to burn rounds on (is anything missing, do the dependencies make sense, what can actually run in parallel, does the timeline claim hold) are pre-answered by the validator before the review is even requested.  That leaves the reviewer's whole attention for the hard semantic findings, and frankly that is the only thing humans and frontier models should be spending review rounds on.&lt;/p&gt;

&lt;h2&gt;
  
  
  And the December bug class? Gates and evidence matrices
&lt;/h2&gt;

&lt;p&gt;To be clear, the DAG validator alone would not have caught the December path traversal.  The reviewers did that, and that finding is category 6 (boundary stated but not enforced in code).  Two companion formats target it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contract declarations - any plan touching filesystems, ordering, compatibility or fallback must declare the contract explicitly (path-root confinement, traversal handling, atomicity), and each contract names what verifies it.&lt;/li&gt;
&lt;li&gt;Evidence matrices - a "finding resolved" or "feature complete" claim must bind a claim ID to an evidence path plus declared scope plus known exclusions, and the validator checks the evidence file actually exists.  You mechanically cannot say "resolved" without naming a proof that could fail, and if you remember the December tests that couldn't fail, that is exactly the failure mode this kills.&lt;/li&gt;
&lt;/ul&gt;
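
&lt;p&gt;As a rough illustration of the evidence-matrix rule, the sketch below checks that each row binds a claim ID to an evidence path, a scope and exclusions, and that the evidence file exists inside the repository root.  The field names (&lt;code&gt;claim_id&lt;/code&gt;, &lt;code&gt;evidence_path&lt;/code&gt;, &lt;code&gt;scope&lt;/code&gt;, &lt;code&gt;exclusions&lt;/code&gt;) and the sample rows are assumptions made for the example, not the published schema.&lt;/p&gt;

```python
import tempfile
from pathlib import Path

# Hypothetical row shape for an evidence matrix.
REQUIRED = ("claim_id", "evidence_path", "scope", "exclusions")


def check_evidence(rows, repo_root):
    """Return one-line errors for rows that do not name checkable evidence."""
    errors = []
    root = Path(repo_root).resolve()
    for index, row in enumerate(rows):
        missing = [field for field in REQUIRED if field not in row]
        if missing:
            errors.append(f"row {index}: missing {', '.join(missing)}")
            continue
        target = (root / row["evidence_path"]).resolve()
        # Keep evidence inside the repo: the same path-root confinement
        # that contract declarations ask plans to state.
        if not target.is_relative_to(root):
            errors.append(f"{row['claim_id']}: evidence path escapes the repo root")
        elif not target.is_file():
            errors.append(f"{row['claim_id']}: evidence file not found")
    return errors


with tempfile.TemporaryDirectory() as repo:
    Path(repo, "tests").mkdir()
    Path(repo, "tests", "test_traversal.py").write_text("# placeholder evidence\n")
    rows = [
        {"claim_id": "F-01", "evidence_path": "tests/test_traversal.py",
         "scope": "path-root confinement", "exclusions": []},
        {"claim_id": "F-02", "evidence_path": "tests/test_missing.py",
         "scope": "atomic writes", "exclusions": []},
        {"claim_id": "F-03", "scope": "fallback semantics"},
    ]
    evidence_errors = check_evidence(rows, repo)

for line in evidence_errors:
    print(line)
```

&lt;p&gt;A file existing is a weak guarantee on its own, which is why the declared scope and known exclusions travel with the claim: they tell the reviewer what the named proof is supposed to be able to fail on.&lt;/p&gt;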

&lt;p&gt;The December chain's second review (the one that passed) was already an informal evidence matrix, every prior finding mapped to specific code lines and a named test.  The format just makes that table mandatory, machine-checked, and required before the review is requested instead of produced during round 2.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it went next (April and May 2026)
&lt;/h2&gt;

&lt;p&gt;Static validation only catches problems when someone runs it.  In April and May 2026 the same four invariants moved into a database-backed runtime, where agents import the TOML once and all state lives in the database:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a unit is only offered to an agent when every dependency is &lt;code&gt;done&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;status changes are guarded transitions with history, not string edits&lt;/li&gt;
&lt;li&gt;the inverse-edge, single-producer, consumes-has-producer and layer-ordering invariants are enforced at mutation time&lt;/li&gt;
&lt;li&gt;readiness gates are a query, "is this bundle reviewable?", answered from data before a review request is ever sent&lt;/li&gt;
&lt;/ul&gt;
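
&lt;p&gt;The runtime itself is internal, so the following is only a sketch of the idea, using an in-memory SQLite database from Python.  The table layout, the set of allowed transitions and the function names are all assumptions; only the status values come from the plan format shown earlier.&lt;/p&gt;

```python
import sqlite3

# Hypothetical transition table; the status names come from the plan format.
ALLOWED = {
    ("pending", "in_progress"),
    ("pending", "deferred"),
    ("in_progress", "done"),
    ("in_progress", "blocked"),
    ("blocked", "in_progress"),
}

# A unit is ready when it is pending and no dependency is in a non-done state.
READY_SQL = """
SELECT u.id FROM units u
WHERE u.status = 'pending'
  AND NOT EXISTS (
    SELECT 1 FROM deps d JOIN units p ON p.id = d.depends_on
    WHERE d.unit = u.id AND p.status != 'done')
ORDER BY u.id
"""


def open_runtime():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE units (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE deps (unit TEXT NOT NULL, depends_on TEXT NOT NULL);
        CREATE TABLE history (unit TEXT, from_status TEXT, to_status TEXT);
    """)
    return db


def ready_units(db):
    """Units that may be offered to an agent right now."""
    return [row[0] for row in db.execute(READY_SQL).fetchall()]


def transition(db, unit, new_status):
    """Change status only along an allowed edge, and record the change."""
    row = db.execute("SELECT status FROM units WHERE id = ?", (unit,)).fetchone()
    if row is None:
        raise KeyError(unit)
    old = row[0]
    if (old, new_status) not in ALLOWED:
        raise ValueError(f"{unit}: {old} to {new_status} is not allowed")
    if old == "pending" and new_status == "in_progress" and unit not in ready_units(db):
        raise ValueError(f"{unit}: dependencies are not all done")
    db.execute("UPDATE units SET status = ? WHERE id = ?", (new_status, unit))
    db.execute("INSERT INTO history VALUES (?, ?, ?)", (unit, old, new_status))


db = open_runtime()
db.executemany("INSERT INTO units VALUES (?, 'pending')", [("U01",), ("U02",)])
db.execute("INSERT INTO deps VALUES ('U02', 'U01')")

offered_first = ready_units(db)   # only U01 has no unfinished dependency
transition(db, "U01", "in_progress")
transition(db, "U01", "done")
offered_next = ready_units(db)    # U02 becomes available once U01 is done
```

&lt;p&gt;The difference from the static validator is where the check runs: an agent cannot skip it, because the only way to change a status is through the guarded function.&lt;/p&gt;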

&lt;h2&gt;
  
  
  A nod to the neighbours
&lt;/h2&gt;

&lt;p&gt;After publishing the first version of this piece I went looking for who else had walked this road, and the honest answer is that I was not alone, and in some respects I was not first either.  &lt;a href="https://gptme.org" rel="noopener noreferrer"&gt;gptme&lt;/a&gt; (Erik Bjäreholt's terminal agent) was putting agent context and workspace configuration into a project-level &lt;code&gt;gptme.toml&lt;/code&gt; long before I wrote my first agent rule, and its agent workspaces (tasks, journal, lessons, all git-tracked) are a thoughtful take on the same persistence problem my runtime addresses.  &lt;a href="https://github.com/ducks/lok" rel="noopener noreferrer"&gt;lok&lt;/a&gt; defines declarative multi-backend LLM workflows in TOML, &lt;code&gt;[[steps]]&lt;/code&gt; with &lt;code&gt;depends_on&lt;/code&gt;, retries and consensus thresholds, which is DAG-in-TOML for orchestration, done cleanly.  &lt;a href="https://github.com/jameshgrn/dgov" rel="noopener noreferrer"&gt;dgov&lt;/a&gt; (James H. Gearon) is the closest cousin of the lot: TOML plan trees with task dependencies, compiled to DAGs and dispatched to agents in isolated git worktrees with settlement gates on the way back in.  The &lt;a href="https://gist.github.com/wpank/e32bb295792a4ded6e52cf2f98d41797" rel="noopener noreferrer"&gt;Bardo write-up&lt;/a&gt; ("Building the Machine That Builds the Machine") describes 115 dependency-chained plans and around a hundred task TOMLs feeding agent swarms, the same shape at a scale that makes mine look modest.  And &lt;a href="https://github.com/mezmo/aura" rel="noopener noreferrer"&gt;aura&lt;/a&gt; from the Mezmo team composes whole agents from declarative TOML.&lt;/p&gt;

&lt;p&gt;What strikes me most is the synchronicity of it.  None of these projects reference each other, and I found them only after building mine, yet several teams independently reached for the same move within the same season: take the parts of agent work that used to live in prose and conversation, and push them into a declarative, diffable, machine-readable format.  I do not think that is coincidence, I think it is convergence, because anyone running agents at volume eventually collides with the same wall (plans and claims that read beautifully and bind nothing), and TOML happens to sit in the sweet spot of human-writable and machine-checkable.  Credit where it is due to all of these teams for getting there on their own paths.  If my contribution adds anything on top, it is the validator-first posture: not just expressing the DAG in TOML, but making the plan declare claims that a validator can independently recompute and refute.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Your review archive is a dataset.  Seven failure chains and nine clean approvals were enough to find six stable failure categories, and they were stable across different reviewer models, which was the signal that they were real.&lt;/li&gt;
&lt;li&gt;Most rereview causes are plan defects, not code defects.  Plans in prose can't be validated, plans as data can.&lt;/li&gt;
&lt;li&gt;Force derived claims, then recompute them.  The &lt;code&gt;[computed]&lt;/code&gt; section is the idea that pays for everything else here, because making the author commit to parallelism, critical path and totals turns optimism into a checkable assertion.&lt;/li&gt;
&lt;li&gt;"Resolved" must name a proof that could fail.  Half of December's pain was tests that existed but couldn't catch the bug they claimed to cover.&lt;/li&gt;
&lt;li&gt;Spend reviewer rounds only on what machines can't check.  After the switch, my first DAG-reviewed plan went through in one pass, with the reviewer's whole budget spent on real domain risk.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The format described here is no longer internal: DAG-TOML is now a public draft specification at &lt;a href="https://agent-assurance.dev" rel="noopener noreferrer"&gt;agent-assurance.dev&lt;/a&gt;, with independent Rust, Go and Python validators, worked examples, and profile extension points, released under the &lt;a href="https://github.com/verivus-oss/agent-assurance" rel="noopener noreferrer"&gt;verivus-oss/agent-assurance&lt;/a&gt; repository.  The database runtime and the fleet control plane remain internal for now, but the schema ideas (required contract fields, recomputed &lt;code&gt;[computed]&lt;/code&gt; sections, single-producer artefacts, evidence matrices, closure roots) are all in the spec, and you can validate a file against it today.&lt;/p&gt;

&lt;p&gt;Thanks for reading this far, I hope you find some value in my story.  If you have mined your own review archive (and specifically the rereview causes), I would genuinely like to hear what categories you found.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>codequality</category>
      <category>automation</category>
    </item>
    <item>
      <title>llm-cli-gateway 2.0.0: the quiet supply-chain release that matters</title>
      <dc:creator>Werner Kasselman</dc:creator>
      <pubDate>Thu, 04 Jun 2026 08:26:01 +0000</pubDate>
      <link>https://dev.to/wernerk_au/llm-cli-gateway-200-the-quiet-supply-chain-release-that-matters-4een</link>
      <guid>https://dev.to/wernerk_au/llm-cli-gateway-200-the-quiet-supply-chain-release-that-matters-4een</guid>
      <description>&lt;p&gt;llm-cli-gateway 2.0.0 went out on 4 June 2026.  npm now reports 2.0.0 as the latest version, and the public GitHub release carries the platform binaries, bundled installers, SHA256 checksums, release manifest, and Sigstore bundles.&lt;/p&gt;

&lt;p&gt;The headline change is simple: production persistence no longer depends on &lt;code&gt;better-sqlite3&lt;/code&gt;.  The gateway now uses Node's built-in &lt;code&gt;node:sqlite&lt;/code&gt;, behind a single adapter in &lt;code&gt;src/sqlite-driver.ts&lt;/code&gt;, and that one architectural change removes an entire class of install-time supply-chain risk from the consumer tree.&lt;/p&gt;

&lt;p&gt;That matters because the recent 1.17.x work was not really about SQLite as a database.  It was about the native-module install path around &lt;code&gt;better-sqlite3&lt;/code&gt;, specifically the &lt;code&gt;prebuild-install&lt;/code&gt;, &lt;code&gt;tar-fs&lt;/code&gt;, and &lt;code&gt;tar-stream&lt;/code&gt; chain.  In 2.0.0 that chain is not patched, worked around, or hidden behind an advisory.  It is absent from production installs.  The release verification now asserts that consumers get no &lt;code&gt;better-sqlite3&lt;/code&gt;, no &lt;code&gt;prebuild-install&lt;/code&gt;, and no &lt;code&gt;tar-stream&lt;/code&gt; in the installed tree.&lt;/p&gt;

&lt;p&gt;The cost is a real breaking change: Node &lt;code&gt;&amp;gt;=24.4.0&lt;/code&gt; is now required.  That is not arbitrary.  The gateway's persistence layer binds plain objects like &lt;code&gt;{ id: ... }&lt;/code&gt; to &lt;code&gt;@id&lt;/code&gt; SQL placeholders, and Node 24.4 is the point where &lt;code&gt;node:sqlite&lt;/code&gt; has the bare named parameter behaviour this code relies on.  The test suite pins that behaviour so future changes fail loudly rather than turning into quiet persistence bugs.&lt;/p&gt;

&lt;p&gt;The adapter itself is intentionally small.  &lt;code&gt;openDatabase&lt;/code&gt;, &lt;code&gt;openReadOnly&lt;/code&gt;, &lt;code&gt;GatewayDatabase&lt;/code&gt;, and &lt;code&gt;GatewayStatement&lt;/code&gt; are now the surface area, with &lt;code&gt;flight-recorder.ts&lt;/code&gt; and &lt;code&gt;job-store.ts&lt;/code&gt; using that surface instead of touching SQLite directly.  The release security audit enforces that &lt;code&gt;node:sqlite&lt;/code&gt; is referenced only by the adapter, which keeps the persistence boundary clear and reviewable.&lt;/p&gt;

&lt;p&gt;There is one security detail in the read-only path that I particularly like.  &lt;code&gt;queryRequests&lt;/code&gt; now opens a dedicated read-only SQLite connection, so row mutations fail at the SQLite engine level with &lt;code&gt;SQLITE_READONLY&lt;/code&gt;.  During review, one exception was found: &lt;code&gt;VACUUM INTO&lt;/code&gt; can create a new file even on a read-only connection.  The adapter now rejects &lt;code&gt;VACUUM&lt;/code&gt; and &lt;code&gt;VACUUM INTO&lt;/code&gt; on read-only connections, including comment-prefixed and multi-statement forms.  That is the sort of fix that looks small in code but matters in a release claim, because it keeps "read-only" from becoming mostly read-only.&lt;/p&gt;
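
&lt;p&gt;The adapter is TypeScript and its actual check is not reproduced here.  Purely to illustrate why comment-prefixed and multi-statement forms need handling, here is a small Python sketch of a guard that finds the leading keyword of every statement while skipping comments and quoted text.  The function names are hypothetical, and the tokenizer is deliberately simplified (it does not handle backtick or bracket quoting, for example).&lt;/p&gt;

```python
import re

# Simplified SQL token scanner. Comments match without a group and are skipped.
_TOKEN = re.compile(
    r"--[^\n]*"                 # line comment
    r"|/\*.*?(?:\*/|\Z)"        # block comment (an unterminated one runs to the end)
    r"|(;)"                     # group 1: statement separator
    r"|([A-Za-z_]\w*)"          # group 2: bare word
    r"|('(?:[^']|'')*'?"        # group 3: quoted text, or any other character
    r'|"(?:[^"]|"")*"?'
    r"|\S)",
    re.DOTALL,
)


def leading_keywords(sql):
    """Return the first token of each statement, upper-cased ('' if not a word)."""
    keywords = []
    first = None  # stays None until the current statement has seen a token
    for match in _TOKEN.finditer(sql):
        if match.group(1):
            if first is not None:
                keywords.append(first)
            first = None
        elif first is None:
            if match.group(2):
                first = match.group(2).upper()
            elif match.group(3):
                first = ""
    if first is not None:
        keywords.append(first)
    return keywords


def is_vacuum(sql):
    """True if any statement in `sql` starts with VACUUM (covers VACUUM INTO)."""
    return "VACUUM" in leading_keywords(sql)


samples = [
    "SELECT * FROM requests",
    "/* harmless */ VACUUM INTO 'copy.db'",
    "-- note\nvacuum",
    "SELECT 1; VACUUM",
    "SELECT 'VACUUM; VACUUM'",
]
for sql in samples:
    print(is_vacuum(sql), repr(sql))
```

&lt;p&gt;Checking only whether the raw string starts with &lt;code&gt;VACUUM&lt;/code&gt; would miss the second, third and fourth samples, and a naive comment stripper can be fooled by comment markers inside string literals, which is why the scan treats quoted text as a single token.&lt;/p&gt;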

&lt;p&gt;2.0.0 also raises the standard for migration confidence.  The repo now has cross-engine WAL crash-recovery fixtures in both directions: databases written by &lt;code&gt;better-sqlite3&lt;/code&gt; are opened through &lt;code&gt;node:sqlite&lt;/code&gt;, and the rollback direction is tested as well.  That is a better claim than "the schema did not change".  It proves the practical case users care about, namely that existing &lt;code&gt;logs.db&lt;/code&gt; and jobs databases survive the engine change.&lt;/p&gt;

&lt;p&gt;The rest of the current product surface is still there, and it is worth remembering what that surface has become.  llm-cli-gateway is now a single MCP endpoint for Claude Code, Codex, Gemini, Grok, and Mistral Vibe.  It supports sync requests, durable async jobs, restart-safe result collection, job deduplication, cancellation, real CLI session resume paths, cache-aware &lt;code&gt;promptParts&lt;/code&gt;, and gateway-managed git worktrees for isolated multi-agent workflows.&lt;/p&gt;

&lt;p&gt;The personal-appliance side has also filled out.  There is streamable HTTP transport with bearer-token auth, &lt;code&gt;doctor --json&lt;/code&gt;, provider setup snippets, Docker fallback, and release bundles for Windows, macOS, and Linux.  The GitHub release assets for 2.0.0 include platform binaries, platform bundles, &lt;code&gt;SHA256SUMS&lt;/code&gt;, &lt;code&gt;release-manifest.json&lt;/code&gt;, and Sigstore signature bundles for verification.&lt;/p&gt;

&lt;p&gt;The result is a cleaner distribution story.  npm publishes use provenance through GitHub Actions.  GitHub release installer artifacts are signed.  The production dependency graph is smaller.  Native SQLite is gone from consumer installs because SQLite is now supplied by Node itself.  The release is not flashy, but it is a serious hardening release: fewer moving parts, fewer install scripts, a narrower persistence boundary, and stronger evidence around upgrade and rollback behaviour.&lt;/p&gt;

&lt;p&gt;Links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Release: &lt;a href="https://github.com/verivus-oss/llm-cli-gateway/releases/tag/v2.0.0" rel="noopener noreferrer"&gt;https://github.com/verivus-oss/llm-cli-gateway/releases/tag/v2.0.0&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;npm: &lt;a href="https://www.npmjs.com/package/llm-cli-gateway" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/llm-cli-gateway&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Site: &lt;a href="https://llm-cli-gateway.dev" rel="noopener noreferrer"&gt;https://llm-cli-gateway.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>node</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Tracking Five Upstreams, Fuzzing the Parsers, and a Front Door: What Changed in llm-cli-gateway</title>
      <dc:creator>Werner Kasselman</dc:creator>
      <pubDate>Sat, 30 May 2026 04:23:44 +0000</pubDate>
      <link>https://dev.to/wernerk_au/tracking-five-upstreams-fuzzing-the-parsers-and-a-front-door-what-changed-in-llm-cli-gateway-3hik</link>
      <guid>https://dev.to/wernerk_au/tracking-five-upstreams-fuzzing-the-parsers-and-a-front-door-what-changed-in-llm-cli-gateway-3hik</guid>
      <description>&lt;p&gt;The last two posts were about features you can call: &lt;a href="https://dev.to/wernerk_au/cache-aware-spawning-what-changed-in-llm-cli-gateway-a-week-on-1dle"&gt;cache-aware spawning&lt;/a&gt; across five providers, and the round before that. This one is mostly about the parts that do not show up as a tool. When you wrap five vendor CLIs that each ship on their own cadence, the interesting failure mode is not a bug in your code, it is one of those five CLIs quietly changing a flag underneath you. So the work that landed this week is about keeping pace with upstreams that move, hardening the bits that parse untrusted output, and finally, giving the project a front door. v1.16.0 through v1.16.2 are tagged and out; the upstream-tracking and Socket-hardening work (changelogged as v1.17.0 and v1.17.1), plus a &lt;code&gt;fast-check&lt;/code&gt; fuzzing pass and a dependency-floor bump, have landed on &lt;code&gt;main&lt;/code&gt; and go out in the next cut; and the website is now live at &lt;a href="https://llm-cli-gateway.dev/" rel="noopener noreferrer"&gt;&lt;code&gt;llm-cli-gateway.dev&lt;/code&gt;&lt;/a&gt;, the project's new front door.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short version:&lt;/strong&gt; the gateway now tracks each provider CLI's upstream contract as a checked-in artefact. The contract table is pinned by tests that run in CI, an offline &lt;code&gt;npm run upstream:contracts&lt;/code&gt; gate re-validates it on demand, and an advisory &lt;code&gt;npm run upstream:scan -- --live&lt;/code&gt; reaches out to the upstream changelogs to flag where reality may have moved, so drift surfaces in a check I run rather than as a failed request on a user's machine. A &lt;code&gt;fast-check&lt;/code&gt; fuzzing pass now hammers the three parsers that touch untrusted bytes: provider JSON/JSONL, Linux &lt;code&gt;/proc&lt;/code&gt;, and the CLI argument sanitizer. Release tags can be Sigstore-signed through a dedicated workflow, the optional Redis layer is gone, and on &lt;code&gt;main&lt;/code&gt; the dependency floor has moved to Zod 4 / TypeScript 6 / ESLint 10. And there is now a real website at &lt;code&gt;llm-cli-gateway.dev&lt;/code&gt;, built agent-first: an MCP client can read one URL and configure itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long version&lt;/strong&gt; is below, in the same shape as last time: problem, what changed, what it now does, and caveats named up front rather than buried.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five upstreams that move (the contract-tracking slice)
&lt;/h2&gt;

&lt;p&gt;The motivating incident is worth naming because it is the whole argument. Mistral's Vibe CLI dropped &lt;code&gt;--output-format&lt;/code&gt; in favour of &lt;code&gt;--output text|json|streaming&lt;/code&gt;. Nothing in the gateway's own code was wrong; the flag it had been emitting for weeks simply stopped existing on the other side of the &lt;code&gt;spawn&lt;/code&gt;. v1.16.1 fixed the call (and kept the legacy MCP aliases mapping &lt;code&gt;plain&lt;/code&gt; → &lt;code&gt;text&lt;/code&gt; and &lt;code&gt;stream-json&lt;/code&gt; → &lt;code&gt;streaming&lt;/code&gt; so nobody's saved config broke), but a one-line flag rename that only surfaces as a runtime failure on a user's machine is exactly the class of problem I would rather catch in CI.&lt;/p&gt;

&lt;p&gt;So the upstream-tracking work (changelogged as v1.17.0, landed on &lt;code&gt;main&lt;/code&gt;) makes the contract a first-class, checked-in thing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each supported CLI (&lt;code&gt;claude&lt;/code&gt;, &lt;code&gt;codex&lt;/code&gt;, &lt;code&gt;gemini&lt;/code&gt;, &lt;code&gt;grok&lt;/code&gt;, &lt;code&gt;mistral&lt;/code&gt;) gets a &lt;strong&gt;maintenance skill&lt;/strong&gt; describing where its truth lives (Claude Code's markdown changelog, Codex's GitHub releases feed plus product changelog, the Gemini CLI changelog, the xAI markdown release notes, and so on).&lt;/li&gt;
&lt;li&gt;The single source of truth for each provider's argv/env behaviour (flags, output modes, session/resume rules, forbidden flags) is the contract table in &lt;code&gt;src/upstream-contracts.ts&lt;/code&gt;, exercised by the argument and env validators. Alongside it, &lt;code&gt;docs/upstream/provider-sources.dag.toml&lt;/code&gt; is the scanner's &lt;strong&gt;source map&lt;/strong&gt;: which changelog/release pages to watch, and how. The two are deliberately separate, and a test (&lt;code&gt;upstream-sources.test.ts&lt;/code&gt;) pins that separation. The source map stays byte-for-byte in sync with the contract table's metadata, &lt;em&gt;and&lt;/em&gt; the TOML is asserted &lt;strong&gt;not&lt;/strong&gt; to re-encode the mechanical contract surface. Drift in the source map is a red build; the TOML is never the thing a flag rename has to round-trip through.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scripts/upstream-scan.mjs&lt;/code&gt; backs two npm scripts. &lt;code&gt;npm run upstream:contracts&lt;/code&gt; is an &lt;strong&gt;offline&lt;/strong&gt; gate: it re-runs the bundled fixtures and the report/TOML-sync check, with no network access. &lt;code&gt;npm run upstream:scan&lt;/code&gt; is network-free by default too; pass &lt;code&gt;--live&lt;/code&gt; (&lt;code&gt;npm run upstream:scan -- --live&lt;/code&gt;) and it fetches the tracked upstream changelogs and flags, advisorily, where reality may have moved ahead of us. (Neither is wired into the CI gate today; they're tools I run. The TS-contract-vs-source-map sync, however, &lt;em&gt;is&lt;/em&gt; a CI test.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The honest caveat: the live scan is advisory, not authoritative. It tells me where to look; it does not auto-patch a renamed flag, and it never will, because a CLI changing its surface is a thing a human should read and reason about, not a thing a script should silently adapt to. What changed is that the looking is now systematic instead of "wait for a user to file an issue."&lt;/p&gt;

&lt;h2&gt;
  
  
  Fuzzing the three parsers that touch untrusted bytes
&lt;/h2&gt;

&lt;p&gt;A gateway that spawns five CLIs and reads back their output has a clear trust boundary: everything coming back over stdout/stderr is, from the gateway's point of view, untrusted. Most of it is well-formed. The interesting question is what happens when it is not. So &lt;code&gt;fast-check&lt;/code&gt; is now wired into the suite (&lt;code&gt;src/__tests__/fuzz.test.ts&lt;/code&gt;), and it targets the three places where malformed input would actually hurt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Provider JSON / JSONL parsers:&lt;/strong&gt; fuzzed with mixed valid-and-garbage JSONL streams, asserting the parser never throws and never leaks an invalid result shape. A provider emitting a half-written line during a crash should degrade, not propagate a malformed object upward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux &lt;code&gt;/proc&lt;/code&gt; parsers:&lt;/strong&gt; the process-health monitor reads &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/stat&lt;/code&gt; (state and CPU ticks) and &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/status&lt;/code&gt; (&lt;code&gt;VmRSS&lt;/code&gt;) to track a spawned child's health. The property here is that no garbage &lt;code&gt;/proc&lt;/code&gt; content ever produces a &lt;code&gt;NaN&lt;/code&gt; process metric.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLI argument sanitizer:&lt;/strong&gt; the property is blunt and important: a dash-prefixed value is &lt;em&gt;always&lt;/em&gt; rejected. That is the argument-injection guard. The gateway never invokes a CLI with &lt;code&gt;shell: true&lt;/code&gt;, but a caller-supplied value that starts with &lt;code&gt;-&lt;/code&gt; and slips into the argv array could still be read by the child as a flag rather than a value. The fuzzer's job is to make sure there is no input string that gets past that check.&lt;/li&gt;
&lt;/ul&gt;
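
&lt;p&gt;The real suite uses &lt;code&gt;fast-check&lt;/code&gt; in TypeScript.  To show the shape of two of these properties without it, here is a Python sketch that uses the standard library's seeded &lt;code&gt;random&lt;/code&gt; module as a stand-in generator, so it has no shrinking and far less input variety than a real property-testing library.  Both &lt;code&gt;sanitize_value&lt;/code&gt; and &lt;code&gt;parse_stat&lt;/code&gt; are hypothetical stand-ins, not the gateway's functions.&lt;/p&gt;

```python
import random


def sanitize_value(value):
    """Hypothetical guard: reject values a child CLI could read as a flag."""
    if value.startswith("-"):
        raise ValueError("dash-prefixed value rejected")
    return value


def parse_stat(text):
    """Parse state and CPU ticks from /proc/PID/stat content; None if malformed."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the last ")" rather than on whitespace.
    close = text.rfind(")")
    if close == -1:
        return None
    fields = text[close + 1:].split()
    try:
        state = fields[0]                           # field 3 of the full line
        ticks = int(fields[11]) + int(fields[12])   # utime and stime, fields 14 and 15
    except (IndexError, ValueError):
        return None
    return {"state": state, "cpu_ticks": ticks}


def random_text(rng, max_len=40):
    alphabet = "0123456789 ()-SRZ\n\t.eNa"
    return "".join(rng.choice(alphabet) for _ in range(rng.randrange(max_len)))


def run_properties(seed=0, rounds=2000):
    """Check both properties against seeded pseudo-random input."""
    rng = random.Random(seed)
    for _ in range(rounds):
        garbage = random_text(rng)

        # Property 1: garbage never yields a non-integer (and so never NaN) metric.
        parsed = parse_stat(garbage)
        assert parsed is None or isinstance(parsed["cpu_ticks"], int)

        # Property 2: a dash-prefixed value is always rejected.
        try:
            sanitize_value("-" + garbage)
        except ValueError:
            pass
        else:
            raise AssertionError("dash-prefixed value was accepted")
    return rounds


checked = run_properties()
sample = parse_stat(
    "1234 (my (odd) proc) S 1 1 1 0 -1 4194304 100 0 0 0 7 3 0 0 20 0 1 0 100 0 0"
)
```

&lt;p&gt;The assertions say nothing about specific inputs; they state what must hold for every input the generator can produce, which is what separates a property from an example.&lt;/p&gt;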

&lt;p&gt;These are properties, not examples: &lt;code&gt;fast-check&lt;/code&gt; generates the adversarial inputs rather than me guessing them, which is the point. I am not claiming the parsers are now proven correct; I am claiming the obvious classes of malformed input are exercised on every run instead of on the day a provider ships a bad build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Signed tags, a smaller surface, a newer floor
&lt;/h2&gt;

&lt;p&gt;A few things in the supply-chain and dependency layer, none of which is a feature, all of which is worth naming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sigstore tag signing.&lt;/strong&gt; The npm publishes already carry Sigstore provenance via the OIDC publish path. Since the 1.16.0 cycle the release &lt;em&gt;tags&lt;/em&gt; themselves can get the same treatment through a dedicated, manually-triggered &lt;code&gt;sigstore-tag.yml&lt;/code&gt; workflow (a &lt;code&gt;workflow_dispatch&lt;/code&gt;, run deliberately against a named tag rather than firing automatically on every release) that recreates the tag with a gitsign signature, pinned to the exact commit SHA it must continue to point at, and run in offline Rekor mode. The git history of a release can be made as verifiable as the published artefact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Socket &lt;code&gt;shellAccess&lt;/code&gt;, documented rather than waved away.&lt;/strong&gt; The gateway's entire reason to exist is launching child processes, so Socket flags it on every release. Rather than ignore the alert, v1.17.1 suppresses it &lt;em&gt;in &lt;code&gt;socket.yml&lt;/code&gt; with a written rationale&lt;/em&gt; and keeps the bounded shell-access explanation in the README, so a reviewer still sees the reasoning without seeing the same noisy alert on every version bump. The distinction matters: a suppressed alert with a checked-in justification is auditable; a suppressed alert with no paper trail is just hidden.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One fewer optional dependency.&lt;/strong&gt; v1.16.0 removed the optional Redis/ioredis layer from the PostgreSQL-backed session manager. It was a lever almost nobody pulled, and every optional dependency is a maintenance and supply-chain cost you pay whether or not you use it. The Postgres path is simpler and the dependency surface is smaller.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A newer floor.&lt;/strong&gt; On &lt;code&gt;main&lt;/code&gt;, ahead of the next release, the toolchain moved up in lock-step: Zod 4, TypeScript 6, ESLint 10 (with the lint-config migration that 10 forces), and &lt;code&gt;@types/node&lt;/code&gt; 25, plus a dead-code sweep that the new compiler and lint settings surfaced. (These are not in the v1.17.x packages yet; they go out in the next cut.) Unglamorous, and exactly the kind of thing that rots if you let it slide for two majors.&lt;/p&gt;

&lt;h2&gt;
  
  
  A front door (the website)
&lt;/h2&gt;

&lt;p&gt;Until this week the project's front door was a GitHub README and an npm page. Now there is &lt;a href="https://llm-cli-gateway.dev/" rel="noopener noreferrer"&gt;&lt;code&gt;llm-cli-gateway.dev&lt;/code&gt;&lt;/a&gt;, live as of this post, and the interesting design decision is that it is built &lt;strong&gt;agent-first&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The premise: increasingly the thing evaluating whether to install an MCP server is not a human reading marketing copy, it is an agent reading a URL. So the site treats that as the primary path, not an afterthought:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/install.md&lt;/code&gt; is agent-readable install instructions in plain markdown; the homepage's headline call to action is literally &lt;em&gt;"Read &lt;a href="https://llm-cli-gateway.dev/install.md" rel="noopener noreferrer"&gt;https://llm-cli-gateway.dev/install.md&lt;/a&gt; and configure yourself to use llm-cli-gateway as an MCP server."&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/llms.txt&lt;/code&gt; is the compact retrieval entry point, and &lt;code&gt;/.well-known/agent.json&lt;/code&gt; is structured metadata (registry name &lt;code&gt;io.github.verivus-oss/llm-cli-gateway&lt;/code&gt;, transport, launch command) that a tool can parse without scraping HTML.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;/sitemap.md&lt;/code&gt; ties the three together for anything doing retrieval.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The human-facing side is deliberately boring: it is a static Cloudflare Pages site (&lt;code&gt;wrangler.toml&lt;/code&gt;, output dir &lt;code&gt;site/&lt;/code&gt;), ships a strict Content-Security-Policy with &lt;code&gt;script-src 'self'&lt;/code&gt;, &lt;code&gt;frame-ancestors 'none'&lt;/code&gt; and friends in &lt;code&gt;_headers&lt;/code&gt;, and the JavaScript makes &lt;strong&gt;no external or network calls&lt;/strong&gt;: no analytics, no third-party fonts loaded at runtime, nothing phoning home. For a project whose whole pitch is "the CLIs keep their native credentials and run locally," a marketing site that quietly loaded a tracker would have undercut the argument. So it does not.&lt;/p&gt;

&lt;p&gt;The project also picked up its first proper mark this week: a gold gateway "G" drawn out of a terminal prompt (the &lt;code&gt;&amp;gt;_&lt;/code&gt; you spawn everything else from), wrapped in an &lt;code&gt;@&lt;/code&gt;-style ring. It is the site favicon, and it anchors the social card at the top of this post.&lt;/p&gt;

&lt;p&gt;Caveat, because there is always one: the site is new, and the agent-install path is only as good as the install spec behind it. &lt;code&gt;npx -y llm-cli-gateway&lt;/code&gt; over stdio is the whole launch surface, and the install doc is versioned in the repo alongside the code, so it moves when the code moves.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;More providers will drift, so the next iteration of the upstream scan is to make the advisory live check something a scheduled job runs and reports, rather than something I remember to run. And the fuzzing pass is deliberately narrow right now (three parsers); the session-store and config-loader paths are the obvious next targets once the current properties have a few weeks of green runs behind them.&lt;/p&gt;

&lt;p&gt;The bigger item on the board is an XState Store integration (&lt;code&gt;@xstate/store&lt;/code&gt;): a small, durable, inspectable piece of workflow state that an orchestrating agent can read and drive through declared events, sitting alongside the sessions and the flight recorder and surviving a restart the way the async jobs already do. It is a plan on disk right now (under &lt;code&gt;docs/plans/&lt;/code&gt;), not a shipped tool, and there are a couple of design questions I want to settle (how the state is stored, and how an agent is allowed to change it) before any of it lands.&lt;/p&gt;

&lt;p&gt;Thanks for reading this far. As always, MIT licensed.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;llm-cli-gateway is MIT licensed. Website: &lt;a href="https://llm-cli-gateway.dev/" rel="noopener noreferrer"&gt;llm-cli-gateway.dev&lt;/a&gt; | npm: &lt;code&gt;llm-cli-gateway&lt;/code&gt; | GitHub: &lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;verivus-oss/llm-cli-gateway&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>cli</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Cache-Aware Spawning: What Changed in llm-cli-gateway, a Week On</title>
      <dc:creator>Werner Kasselman</dc:creator>
      <pubDate>Tue, 26 May 2026 07:42:37 +0000</pubDate>
      <link>https://dev.to/wernerk_au/cache-aware-spawning-what-changed-in-llm-cli-gateway-a-week-on-1dle</link>
      <guid>https://dev.to/wernerk_au/cache-aware-spawning-what-changed-in-llm-cli-gateway-a-week-on-1dle</guid>
      <description>&lt;p&gt;If your multi-LLM workload sends the same long system prompt or file dump to Claude / Codex / Gemini ten times an hour, you are paying for the same input tokens ten times. Each provider has a cache for exactly this case, and each one expresses the cache differently. This post is about how llm-cli-gateway now uses those caches for you, across all five providers, without you having to re-implement the per-provider cache APIs yourself. I covered &lt;a href="https://dev.to/wernerk_au/whats-new-in-llm-cli-gateway-58b8"&gt;the previous round of changes&lt;/a&gt; last week, and I closed that piece with a teaser, that Mistral Vibe was next on the list. A week later, Mistral is in, and a much larger change has landed alongside it, which is what most of this follow-up is about.&lt;/p&gt;

&lt;p&gt;The new shape of the gateway: it now understands prompt caching as a first-class concern, across all five providers. That is &lt;code&gt;claude&lt;/code&gt;, &lt;code&gt;codex&lt;/code&gt;, &lt;code&gt;gemini&lt;/code&gt;, &lt;code&gt;grok&lt;/code&gt;, and &lt;code&gt;mistral&lt;/code&gt; (Vibe). v1.6.0 shipped today and contains the lot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short version:&lt;/strong&gt; every &lt;code&gt;*_request&lt;/code&gt; and &lt;code&gt;*_request_async&lt;/code&gt; tool now accepts a structured &lt;code&gt;promptParts&lt;/code&gt; shape, the gateway concatenates the parts in a canonical order so the stable bytes precede the volatile tail unchanged across calls, three new &lt;code&gt;cache_state://&lt;/code&gt; MCP resources expose hit-rate / hit-count / estimated-savings aggregates back to the orchestrating agent, &lt;code&gt;session_get&lt;/code&gt; projects a compact &lt;code&gt;cacheState&lt;/code&gt; view at read time, and a &lt;code&gt;cache_ttl_expiring_soon&lt;/code&gt; warning fires on Claude resumes when the Anthropic cache breakpoint is within 30 seconds of expiry. All of it is opt-in (every flag defaults off in 1.x), all of it observes the per-provider cache mechanism rather than fighting it, and none of it adds conversation content to gateway storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long version&lt;/strong&gt; is below, organised the same way as last week's post (problem, what changed, what it now does), with the caveats named up front rather than buried.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistral Vibe makes five (closing last week's loop)
&lt;/h2&gt;

&lt;p&gt;Mistral shipped &lt;a href="https://docs.mistral.ai/mistral-vibe/overview" rel="noopener noreferrer"&gt;Vibe&lt;/a&gt;, their open-source CLI coding agent powered by Devstral 2. The gateway now wires &lt;code&gt;mistral_request&lt;/code&gt; and &lt;code&gt;mistral_request_async&lt;/code&gt; alongside the other four providers. It has the same shape as the rest: sessions through &lt;code&gt;--resume&lt;/code&gt; / &lt;code&gt;--continue&lt;/code&gt; (which requires &lt;code&gt;[session_logging] enabled = true&lt;/code&gt; in &lt;code&gt;~/.vibe/config.toml&lt;/code&gt;; the doctor surfaces this so you do not get an opaque failure), model registry entries, self-update via the &lt;code&gt;vibe&lt;/code&gt; binary itself, and the same circuit-breaker, approval-gate, flight recorder, metrics, dedup, and durable-job-store plumbing as the others.&lt;/p&gt;

&lt;p&gt;The model alias resolution is slightly different. Vibe has no &lt;code&gt;--model&lt;/code&gt; flag, so the gateway injects the resolved alias via &lt;code&gt;VIBE_ACTIVE_MODEL&lt;/code&gt; instead. That is the only material divergence from the Claude / Codex / Gemini / Grok pattern, and it is documented inline at the call site.&lt;/p&gt;
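
&lt;p&gt;A minimal TypeScript sketch of that environment injection, assuming a hypothetical &lt;code&gt;vibeEnv&lt;/code&gt; helper and a placeholder model name. Only the &lt;code&gt;VIBE_ACTIVE_MODEL&lt;/code&gt; variable name comes from the description above; the gateway's actual call site may look different:&lt;/p&gt;

```typescript
// Hypothetical helper, not the gateway's actual call site.
// Only the VIBE_ACTIVE_MODEL variable name is taken from the post.
type Env = { [name: string]: string | undefined };

function vibeEnv(baseEnv: Env, resolvedModel?: string): Env {
  // Copy so the parent process environment is never mutated.
  const env: Env = { ...baseEnv };
  if (resolvedModel !== undefined) {
    env["VIBE_ACTIVE_MODEL"] = resolvedModel;
  }
  return env;
}

// "example-model" is a placeholder, not a real registry alias.
const childEnv = vibeEnv({ PATH: "/usr/bin" }, "example-model");
console.log(childEnv["VIBE_ACTIVE_MODEL"]);
```

&lt;p&gt;The returned object would then be passed as the &lt;code&gt;env&lt;/code&gt; option when the child process is spawned, leaving the argument list free of any model flag.&lt;/p&gt;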

&lt;p&gt;That makes five providers, five model families, and five vendor lineages (Anthropic, OpenAI, Google, xAI, Mistral). What I noticed running parallel reviews these past few weeks is that agreement within the OpenAI / Anthropic / Google triangle is not as informative as it looks, because the three model lineages share a lot of training data and a lot of post-training tendencies. I am not pretending this is statistics; it is just how I use these tools in review work. But adding an xAI voice and a Mistral voice means a five-way agreement is sampled from a meaningfully wider distribution than a three-way agreement, and a one-out-of-five dissent (especially from a vendor outside the triangle) is a data point I read rather than a vote I discard.&lt;/p&gt;

&lt;h2&gt;
  
  
  promptParts: structured prompts, prefix discipline, no API contortions
&lt;/h2&gt;

&lt;p&gt;The change that took most of the engineering is &lt;code&gt;promptParts&lt;/code&gt;. The shape is small:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"promptParts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are a careful reviewer of TypeScript diffs."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;long, stable description of the tools you can call&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;long, stable file dump or repo summary&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;"What did the last patch change?"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;prompt&lt;/code&gt; and &lt;code&gt;promptParts&lt;/code&gt; are mutually exclusive: you pass exactly one. If you pass both, the runtime check at the top of every handler returns the exact error message &lt;code&gt;provide exactly one of `prompt` or `promptParts`&lt;/code&gt; (the backticks belong to the error string itself; the messages are part of the public contract and the tests assert them verbatim). The gateway then concatenates the parts in canonical order, &lt;code&gt;system&lt;/code&gt; → &lt;code&gt;tools&lt;/code&gt; → &lt;code&gt;context&lt;/code&gt; → &lt;code&gt;task&lt;/code&gt;, with a stable separator, and hands the resulting string to the CLI's positional &lt;code&gt;-p&lt;/code&gt; (or equivalent) argument. The stable prefix bytes precede the volatile &lt;code&gt;task&lt;/code&gt; tail unchanged across calls, which is enough for each provider's automatic prompt-caching to land on the same content hash each time.&lt;/p&gt;
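
&lt;p&gt;The exclusivity check and the canonical ordering can be sketched in a few lines of TypeScript. This is an illustration under stated assumptions, not the gateway's source: the &lt;code&gt;composePrompt&lt;/code&gt; name, the type shapes, the choice to treat every part except &lt;code&gt;task&lt;/code&gt; as optional, and the blank-line separator are all hypothetical. Only the part order and the error string come from the description above.&lt;/p&gt;

```typescript
// Hypothetical sketch: names, types, and separator are assumptions.
type PromptParts = {
  system?: string;
  tools?: string;
  context?: string;
  task: string;
};

type PromptInput = { prompt?: string; promptParts?: PromptParts };

// Assumed separator; the post only says that it is stable.
const SEPARATOR = "\n\n";

// Canonical order from the post: stable parts first, volatile task last.
const ORDER = ["system", "tools", "context", "task"] as const;

function composePrompt(input: PromptInput): string {
  const hasPrompt = input.prompt !== undefined;
  const hasParts = input.promptParts !== undefined;
  // Both present or both missing: reject with the contract's error string.
  if (hasPrompt === hasParts) {
    throw new Error("provide exactly one of `prompt` or `promptParts`");
  }
  if (input.prompt !== undefined) return input.prompt;
  const parts = input.promptParts as PromptParts;
  // Skip parts that were not supplied, then join in canonical order.
  return ORDER.map((key) => parts[key])
    .filter((value) => typeof value === "string")
    .join(SEPARATOR);
}

const first = composePrompt({
  promptParts: { system: "S", context: "C", task: "one" },
});
const second = composePrompt({
  promptParts: { system: "S", context: "C", task: "two" },
});
// Both strings share the same leading bytes; only the task tail differs.
console.log(first.slice(0, 6) === second.slice(0, 6)); // true
```

&lt;p&gt;Because the volatile part always goes last, two calls that differ only in &lt;code&gt;task&lt;/code&gt; produce strings with an identical prefix, which is the property the provider-side caches key on.&lt;/p&gt;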

&lt;p&gt;Two specific points worth naming.&lt;/p&gt;

&lt;p&gt;First, this is &lt;strong&gt;not&lt;/strong&gt; a request-body translation layer. The gateway does not construct Anthropic / OpenAI / Mistral JSON request bodies; it spawns the CLI binary the same way it always has. The "cache awareness" sits one layer above, in how the input string is composed before the CLI sees it. That keeps the architectural thesis intact (CLI wrapping, not API proxying) while still giving you cache hygiene for free.&lt;/p&gt;

&lt;p&gt;Second, for Claude specifically, the gateway does not yet emit explicit &lt;code&gt;cache_control&lt;/code&gt; JSON breakpoints. The Claude Code CLI documents &lt;code&gt;--exclude-dynamic-system-prompt-sections&lt;/code&gt; and several &lt;code&gt;ENABLE_PROMPT_CACHING_*&lt;/code&gt; / &lt;code&gt;DISABLE_PROMPT_CACHING_*&lt;/code&gt; environment variables (all listed in &lt;a href="//../personal-mcp/PROVIDER_CACHE_SURFACES.md"&gt;PROVIDER_CACHE_SURFACES.md&lt;/a&gt; with citations to &lt;a href="https://code.claude.com/docs/en/env-vars" rel="noopener noreferrer"&gt;the upstream env-vars page&lt;/a&gt;), but the path for injecting per-block &lt;code&gt;cache_control&lt;/code&gt; markers via stream-json input is probable rather than verified. The &lt;code&gt;[cache_awareness].emit_anthropic_cache_control&lt;/code&gt; flag is reserved in config for the follow-up slice that lands a live smoke test, so the present 1.6.0 release ships "Branch B" (prefix discipline only). That is honest about what works and what is gated on verification.&lt;/p&gt;

&lt;p&gt;Third (because I said two and meant three), per-model minimum cacheable token thresholds matter. Anthropic Sonnet 3.5–4.6 caches at 1024 tokens minimum; Opus 4.5+ and Haiku 4.5 require 4096; Haiku 3.5 on Vertex needs 2048. The gateway has a &lt;code&gt;[cache_awareness.min_stable_tokens_for_cache_control]&lt;/code&gt; per-family table populated from the &lt;a href="https://platform.claude.com/docs/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Anthropic prompt-caching docs&lt;/a&gt; and surfaces the lookup via a &lt;code&gt;minStableTokensForModel(config, modelName)&lt;/code&gt; helper. The in-code alias table is conservative (it collapses all Haiku variants to 4096 rather than exposing the Vertex-only 2048 distinction); a single-family override can be added when a workload needs it. Slice 1 does not yet act on this (we are not emitting cache_control), but the data is in place for the slice that will.&lt;/p&gt;
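
&lt;p&gt;As a rough illustration of the lookup, here is a TypeScript sketch. The helper name and the token thresholds come from the paragraph above; the config shape, the substring matching rule, and the decision to return &lt;code&gt;undefined&lt;/code&gt; for an unknown family are assumptions made for the example.&lt;/p&gt;

```typescript
// Hypothetical config shape and matching rule; only the helper name and the
// token thresholds are taken from the post.
type CacheAwarenessConfig = {
  minStableTokensForCacheControl: { [family: string]: number };
};

// Conservative per-family table: every Haiku variant maps to 4096,
// and Opus uses the 4.5+ figure.
const exampleConfig: CacheAwarenessConfig = {
  minStableTokensForCacheControl: { sonnet: 1024, opus: 4096, haiku: 4096 },
};

function minStableTokensForModel(
  config: CacheAwarenessConfig,
  modelName: string,
): number | undefined {
  const table = config.minStableTokensForCacheControl;
  const name = modelName.toLowerCase();
  // Assumption: a model belongs to a family if its name contains the family key.
  for (const family of Object.keys(table)) {
    if (name.includes(family)) return table[family];
  }
  // Unknown family: the caller decides what to do.
  return undefined;
}

console.log(minStableTokensForModel(exampleConfig, "sonnet")); // 1024
```

&lt;p&gt;A workload that needs the Vertex-only 2048 figure would add its own family key to the table rather than change the lookup.&lt;/p&gt;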

&lt;h2&gt;
  
  
  cache_state://: observability without bleeding prompt text
&lt;/h2&gt;

&lt;p&gt;The supporting piece, and frankly the one that makes the rest defensible, is the observability surface. Three new MCP resources sit alongside the existing &lt;code&gt;sessions://&lt;/code&gt; and &lt;code&gt;models://&lt;/code&gt; resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cache_state://global&lt;/code&gt;&lt;/strong&gt; - aggregates across the last 24h, with &lt;code&gt;total_requests&lt;/code&gt;, &lt;code&gt;total_hits&lt;/code&gt;, &lt;code&gt;hit_rate&lt;/code&gt;, &lt;code&gt;total_cache_read_tokens&lt;/code&gt;, &lt;code&gt;total_cache_creation_tokens&lt;/code&gt;, &lt;code&gt;estimated_savings_usd&lt;/code&gt; (best-effort, using a per-model pricing table dated &lt;code&gt;2026-05-26&lt;/code&gt;), and a per-CLI breakdown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cache_state://session/{sessionId}&lt;/code&gt;&lt;/strong&gt; - per-session aggregates, plus distinct prefix count and (for Claude only) the &lt;code&gt;ttlRemainingMs&lt;/code&gt; derived from the configured Anthropic TTL policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cache_state://prefix/{hash}&lt;/code&gt;&lt;/strong&gt; - per-stable-prefix-hash aggregates, with a CLI x model breakdown so you can see which providers / models hashed to the same stable prefix.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The structural guarantee: none of these shapes have a &lt;code&gt;prompt&lt;/code&gt; / &lt;code&gt;response&lt;/code&gt; / &lt;code&gt;system&lt;/code&gt; / &lt;code&gt;task&lt;/code&gt; field. The session-storage invariant from the project's &lt;code&gt;CLAUDE.md&lt;/code&gt; ("no conversation content in session storage") holds, and the new bits add only hash + token-count metadata to the existing flight recorder (which already stored prompts and responses for audit, separate from the session manager). I would not have shipped the observability surface without that constraint, frankly.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;session_get&lt;/code&gt; tool now includes a compact &lt;code&gt;cacheState&lt;/code&gt; block when the session has prior requests, with &lt;code&gt;cli&lt;/code&gt;, &lt;code&gt;prefixDistinct&lt;/code&gt;, &lt;code&gt;totalCacheReadTokens&lt;/code&gt;, &lt;code&gt;totalCacheCreationTokens&lt;/code&gt;, &lt;code&gt;requestCount&lt;/code&gt;, &lt;code&gt;hitCount&lt;/code&gt;, &lt;code&gt;hitRate&lt;/code&gt;, &lt;code&gt;estimatedSavingsUsd&lt;/code&gt;, and &lt;code&gt;ttlRemainingMs&lt;/code&gt;. The field is &lt;strong&gt;omitted entirely&lt;/strong&gt; for fresh sessions (not null, not empty object), keeping the payload compact when there is nothing to report.&lt;/p&gt;
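
&lt;p&gt;The "omitted entirely" behaviour and the safe hit-rate arithmetic can be sketched as follows. This is a simplified TypeScript illustration covering only a subset of the fields listed above; the row shape, the helper names, and the rule that a request counts as a hit when it read any cached tokens are assumptions, not the gateway's implementation.&lt;/p&gt;

```typescript
// Hypothetical row shape and helper names; summary field names follow the post.
type RequestRow = {
  cacheReadTokens: number | null;
  cacheCreationTokens: number | null;
};

type CacheStateSummary = {
  requestCount: number;
  hitCount: number;
  hitRate: number;
  totalCacheReadTokens: number;
  totalCacheCreationTokens: number;
};

function projectCacheState(rows: RequestRow[]): CacheStateSummary | undefined {
  // Fresh session: report nothing instead of null or an empty object.
  if (rows.length === 0) return undefined;
  let hitCount = 0;
  let totalCacheReadTokens = 0;
  let totalCacheCreationTokens = 0;
  for (const row of rows) {
    // Null telemetry counts as zero tokens.
    const read = row.cacheReadTokens ?? 0;
    totalCacheReadTokens += read;
    totalCacheCreationTokens += row.cacheCreationTokens ?? 0;
    // Assumption: a request is a hit when it read any cached tokens.
    if (read > 0) hitCount += 1;
  }
  return {
    requestCount: rows.length,
    hitCount,
    // rows.length is at least 1 here, so this never divides by zero.
    hitRate: hitCount / rows.length,
    totalCacheReadTokens,
    totalCacheCreationTokens,
  };
}

function sessionPayload(sessionId: string, rows: RequestRow[]) {
  const cacheState = projectCacheState(rows);
  // Spread an empty object when there is no state, so the key is absent.
  return { sessionId, ...(cacheState ? { cacheState } : {}) };
}

console.log("cacheState" in sessionPayload("fresh-session", [])); // false
```

&lt;p&gt;Note that nothing in either shape carries prompt or response text; the projection works purely from token counts.&lt;/p&gt;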

&lt;h2&gt;
  
  
  cache_ttl_expiring_soon: warning, not error
&lt;/h2&gt;

&lt;p&gt;Slice 3 is the bit that uses the observability data for actionable warnings. When &lt;code&gt;claude_request&lt;/code&gt; (or &lt;code&gt;claude_request_async&lt;/code&gt;) is invoked with a &lt;code&gt;sessionId&lt;/code&gt;, and &lt;code&gt;[cache_awareness].warn_on_ttl_expiry = true&lt;/code&gt;, and the prior session row's &lt;code&gt;lastRequestAt&lt;/code&gt; is within 30 seconds of Anthropic's documented TTL (5 minutes by default, 1 hour when &lt;code&gt;[cache_awareness].anthropic_ttl_seconds = 3600&lt;/code&gt;), the response payload carries a structured warning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"warnings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cache_ttl_expiring_soon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ttlRemainingMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Anthropic cache breakpoint for session ... expires in 12000ms (&amp;lt; 30000ms). Subsequent requests may miss the cache."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is a warning, not a hard error. The request still runs. The flag defaults to false in 1.x; flip it on once you have observed your traffic for a few days. Two caveats. First, &lt;code&gt;ttlRemainingMs&lt;/code&gt; is best-effort, computed locally from our flight recorder's &lt;code&gt;lastRequestAt&lt;/code&gt; rather than from Anthropic's actual cache state, so a cache eviction inside Anthropic's window will not be visible to us; the warning may be optimistic. Second, it only fires for Claude. For the other four CLIs, we do not observe the provider's cache state (or, in some cases, the provider does not expose one at all), so the warning would be a guess.&lt;/p&gt;
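
&lt;p&gt;The local estimate behind the warning is simple arithmetic, sketched here in TypeScript. The warning code and the 30-second window come from the text above; the &lt;code&gt;ttlWarning&lt;/code&gt; helper, its parameters, the omission of the &lt;code&gt;message&lt;/code&gt; field, and the choice to stay silent once the estimate has already passed zero are assumptions for illustration.&lt;/p&gt;

```typescript
// Hypothetical helper; the warning code and 30-second window come from the post.
type TtlWarning = { code: "cache_ttl_expiring_soon"; ttlRemainingMs: number };

const WARN_WINDOW_MS = 30000;

function ttlWarning(
  lastRequestAtMs: number,
  nowMs: number,
  anthropicTtlSeconds: number,
): TtlWarning | undefined {
  // Local estimate only: the provider's real cache state is not observed.
  const ttlRemainingMs = anthropicTtlSeconds * 1000 - (nowMs - lastRequestAtMs);
  // Assumption: stay quiet once the estimate says the cache already expired.
  if (0 >= ttlRemainingMs) return undefined;
  // Plenty of time left: no warning.
  if (ttlRemainingMs >= WARN_WINDOW_MS) return undefined;
  return { code: "cache_ttl_expiring_soon", ttlRemainingMs };
}

// Last request 288 seconds ago with the default 5-minute TTL.
console.log(ttlWarning(0, 288000, 300));
```

&lt;p&gt;With a 300-second TTL and a last request 288 seconds ago, the estimate is 300000 - 288000 = 12000 ms, which falls inside the 30000 ms window and matches the example payload.&lt;/p&gt;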

&lt;p&gt;The Codex CLI, however, deserves a specific note. As of 0.133.0, Codex emits &lt;code&gt;cached_input_tokens&lt;/code&gt; in its &lt;code&gt;turn.completed.usage&lt;/code&gt; payload, verified by a live smoke test on 2026-05-26 (the test invocation, the raw JSONL response, and the field-name divergence from the Anthropic-style &lt;code&gt;cache_read_input_tokens&lt;/code&gt; are all captured in &lt;a href="//../personal-mcp/PROVIDER_CACHE_SURFACES.md"&gt;&lt;code&gt;docs/personal-mcp/PROVIDER_CACHE_SURFACES.md&lt;/code&gt;&lt;/a&gt; under the "Codex field name divergence" section; the gateway's &lt;code&gt;src/codex-json-parser.ts&lt;/code&gt; was originally written against the Anthropic-style name). The parser's &lt;code&gt;cache_read_tokens&lt;/code&gt; column therefore stays null for Codex rows until a follow-up updates the parser to accept the actual field. The observability surface tolerates this without dividing by zero, and the limitation is also documented in the &lt;a href="//../../CHANGELOG.md"&gt;CHANGELOG entry for 1.6.0&lt;/a&gt; so reviewers do not assume Codex telemetry exists when it does not.&lt;/p&gt;
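
&lt;p&gt;One plausible shape for that follow-up, sketched in TypeScript: read whichever of the two field names is present. The &lt;code&gt;readCachedTokens&lt;/code&gt; helper and the &lt;code&gt;Usage&lt;/code&gt; type are hypothetical and are not taken from &lt;code&gt;src/codex-json-parser.ts&lt;/code&gt;; only the two field names come from the paragraph above.&lt;/p&gt;

```typescript
// Hypothetical helper, not the gateway's src/codex-json-parser.ts.
type Usage = { [field: string]: unknown };

// Field names from the post: Codex's actual name first,
// then the Anthropic-style name the parser was written against.
const CACHE_READ_FIELDS = ["cached_input_tokens", "cache_read_input_tokens"];

function readCachedTokens(usage: Usage): number | null {
  for (const field of CACHE_READ_FIELDS) {
    const value = usage[field];
    if (typeof value === "number") return value;
  }
  // Neither field present: keep the column null rather than guessing zero.
  return null;
}

console.log(readCachedTokens({ input_tokens: 4096, cached_input_tokens: 2048 })); // 2048
```

&lt;p&gt;Returning &lt;code&gt;null&lt;/code&gt; instead of zero keeps "no telemetry" distinguishable from "telemetry says nothing was cached", which matters for the hit-rate aggregates.&lt;/p&gt;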

&lt;h2&gt;
  
  
  The plumbing layer (which is not a feature, but is a habit change)
&lt;/h2&gt;

&lt;p&gt;v1.6.0 also brings a much larger contributor-facing change that does not show up in any tool surface, but is worth naming. The gateway now ships with the same security and validation posture as our &lt;a href="https://github.com/verivus-oss/agent-assurance" rel="noopener noreferrer"&gt;agent-assurance&lt;/a&gt; spec repository. A new &lt;code&gt;.github/workflows/security.yml&lt;/code&gt; runs actionlint, zizmor, shellcheck, typos, osv-scanner, gitleaks, ruff, bandit, and lychee on every push and pull request; &lt;code&gt;eslint-plugin-security&lt;/code&gt; is wired into the existing eslint config and runs as part of the standard CI lint step. All third-party actions are SHA-pinned; the Python and Go tools are version-pinned (&lt;code&gt;zizmor==1.25.2&lt;/code&gt;, &lt;code&gt;ruff==0.14.5&lt;/code&gt;, &lt;code&gt;bandit==1.9.4&lt;/code&gt;, &lt;code&gt;actionlint@v1.7.12&lt;/code&gt;); the gitleaks binary is downloaded and SHA256-verified before execution. Workflows now use least-privilege permissions, defaulting to &lt;code&gt;contents: read&lt;/code&gt; and escalating only on the publish jobs that need OIDC for npm provenance / PyPI trusted publishing or &lt;code&gt;gh release upload&lt;/code&gt;; every &lt;code&gt;actions/checkout&lt;/code&gt; sets &lt;code&gt;persist-credentials: false&lt;/code&gt; except the single job that needs the token for the release upload; the &lt;code&gt;release-installer.yml&lt;/code&gt; top-level write was narrowed to that one job. Dependabot expanded from github-actions only to also cover npm and pip, with non-security npm bumps grouped so security updates never get delayed behind a batch.&lt;/p&gt;

&lt;p&gt;In flight, osv-scanner flagged 26 Go stdlib CVEs in &lt;code&gt;installer/go.mod&lt;/code&gt; (pinned to Go 1.22, when the fixes were in 1.23–1.25.x); that has been bumped to 1.25 in lock-step with the &lt;code&gt;release-installer.yml&lt;/code&gt; setup-go pin, and re-verified clean. Two test fixtures and one &lt;code&gt;npmjs.com&lt;/code&gt; URL needed allowlisting (a deliberate fake bearer token, an npmjs page that Cloudflare bot-protects, and a similar OpenAI help-centre page), each annotated with the specific reason. There are no real findings outstanding.&lt;/p&gt;

&lt;p&gt;This is not the kind of work that ships in a marketing line. It is the work that means the next contributor (or me, six months from now) does not accidentally land a workflow with &lt;code&gt;contents: write&lt;/code&gt; and a published-to-cache &lt;code&gt;setup-node&lt;/code&gt; step on a release-triggered workflow, which is precisely the kind of supply-chain footgun the &lt;a href="https://en.wikipedia.org/wiki/SolarWinds" rel="noopener noreferrer"&gt;Solorigate&lt;/a&gt;, &lt;a href="https://about.codecov.io/security-update/" rel="noopener noreferrer"&gt;Codecov&lt;/a&gt;, and &lt;a href="https://en.wikipedia.org/wiki/XZ_Utils_backdoor" rel="noopener noreferrer"&gt;xz&lt;/a&gt; class of incidents has trained the industry to take seriously. It is the work that means a Dependabot PR with a real CVE fix gets reviewed against an automated gate, not a human's best guess. It is the work that makes claims about supply-chain hygiene auditable rather than aspirational.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where you can call it from
&lt;/h2&gt;

&lt;p&gt;The cache-awareness story above frames the gateway as something &lt;code&gt;claude-code&lt;/code&gt; or &lt;code&gt;codex&lt;/code&gt; spawns when an MCP request lands, but that is only one of three inbound surfaces, and it is worth being explicit about the other two because they are how a lot of people actually use the gateway day to day. The gateway is itself an MCP server, so anything that speaks MCP can reach it, and the cache-awareness, observability, and TTL warnings described above apply identically regardless of which surface called in.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;stdio MCP from another CLI&lt;/strong&gt; (the path most of the post has been describing). &lt;code&gt;claude-code&lt;/code&gt;, &lt;code&gt;codex&lt;/code&gt;, &lt;code&gt;gemini&lt;/code&gt;, &lt;code&gt;grok&lt;/code&gt;, and &lt;code&gt;vibe&lt;/code&gt; each have their own MCP config (&lt;code&gt;~/.claude.json&lt;/code&gt;, &lt;code&gt;~/.codex/config.toml&lt;/code&gt;, &lt;code&gt;~/.gemini/settings.json&lt;/code&gt;, and so on); the gateway gets a single entry that wires &lt;code&gt;llm-cli-gateway&lt;/code&gt; as the command, and the inbound CLI then sees all of &lt;code&gt;claude_request&lt;/code&gt; / &lt;code&gt;codex_request&lt;/code&gt; / &lt;code&gt;gemini_request&lt;/code&gt; / &lt;code&gt;grok_request&lt;/code&gt; / &lt;code&gt;mistral_request&lt;/code&gt; plus the session and &lt;code&gt;cache_state://&lt;/code&gt; resources as if they were its own tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Desktop&lt;/strong&gt; through either the local stdio MCP path (same shape as the CLI case, just installed via Claude Desktop's MCP configuration UI) or, where available, the remote MCP connector path against the gateway's HTTP transport. Per-platform setup snippets live in &lt;a href="//../../setup/providers/claude-desktop.md"&gt;&lt;code&gt;setup/providers/claude-desktop.md&lt;/code&gt;&lt;/a&gt;; the doctor's &lt;code&gt;client_config.claude_desktop_config_present&lt;/code&gt; field tells the install agent which path applies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT custom connectors / developer mode&lt;/strong&gt; against the gateway's HTTP transport behind a public HTTPS URL. The gateway ships &lt;code&gt;llm-cli-gateway tunnel start&lt;/code&gt; and &lt;code&gt;llm-cli-gateway chatgpt-url&lt;/code&gt; for the connector wiring; the doctor's &lt;code&gt;endpoint_exposure.web_clients_supported&lt;/code&gt; field is the gating boolean. The wrinkle worth knowing about is that ChatGPT requires &lt;code&gt;Authentication: No Authentication&lt;/code&gt; on the connector path, so the gateway's &lt;code&gt;LLM_GATEWAY_NO_AUTH_PATHS&lt;/code&gt; env var carves out exactly that path while keeping &lt;code&gt;/mcp&lt;/code&gt; bearer-token-gated. The walk-through is in &lt;a href="//../../setup/providers/chatgpt.md"&gt;&lt;code&gt;setup/providers/chatgpt.md&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;llm-cli-gateway doctor --json&lt;/code&gt; is the authoritative source for which of these surfaces are wired today, and the install-agent contract at &lt;a href="//../../setup/assistants/ASSISTANT_CONTRACT.md"&gt;&lt;code&gt;setup/assistants/ASSISTANT_CONTRACT.md&lt;/code&gt;&lt;/a&gt; is the canonical walk-through, with per-target snippets under &lt;a href="//../../setup/providers/"&gt;&lt;code&gt;setup/providers/&lt;/code&gt;&lt;/a&gt;. If you want to try the cache-aware flow from inside ChatGPT's developer-mode connector or from Claude Desktop without first installing five upstream CLIs, the stdio MCP path needs only &lt;code&gt;node&lt;/code&gt; + the gateway binary and an upstream CLI of your choice; the other four providers go in as and when you add them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this changes about the original argument
&lt;/h2&gt;

&lt;p&gt;Nothing, again. The thesis from &lt;a href="//./blog-cli-vs-api.md"&gt;the original piece&lt;/a&gt; was that CLI wrapping gives you capabilities (real file access, real test execution, real session state) that API proxying cannot reach without re-implementing each provider's tool surface. Cache hygiene now joins that list. Each provider's CLI is the right surface to ask "what does this cost?", because each provider's CLI is the only surface that returns telemetry the same way the operator's billing console returns it. The gateway's job is to compose the stable bytes before the volatile bytes so the cache lands on the same content hash, then to read back the resulting &lt;code&gt;cache_read_input_tokens&lt;/code&gt; (or &lt;code&gt;cached_input_tokens&lt;/code&gt;, depending on the CLI version) from the flight recorder and surface it as an MCP resource the orchestrating agent can act on.&lt;/p&gt;

&lt;p&gt;What an API-proxy approach would have to do for the same outcome: construct provider-specific request bodies with per-block &lt;code&gt;cache_control&lt;/code&gt; markers, then handle the per-provider divergence in cache field names (&lt;code&gt;cache_read_input_tokens&lt;/code&gt; for Anthropic, &lt;code&gt;prompt_tokens_details.cached_tokens&lt;/code&gt; for OpenAI, &lt;code&gt;usageMetadata.cachedContentTokenCount&lt;/code&gt; for Gemini), then handle the per-provider divergence in TTL policy (5min/1h for Anthropic, implicit-only for OpenAI, separate &lt;code&gt;cachedContents&lt;/code&gt; SDK for Gemini), and own the resulting compatibility surface forever. We instead let each CLI own its own provider integration and stand back, sampling the telemetry as it comes out.&lt;/p&gt;

&lt;p&gt;If you are evaluating llm-cli-gateway against an API proxy and your workload is heavy on long stable context (file dumps, repo summaries, large system prompts), the question to ask now is not just "does this give me cache hits?", it is "does this give me cache hits I can measure, without me having to re-implement per-provider cache APIs?". That seemed worth writing down.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The Branch A live smoke test for explicit Claude &lt;code&gt;cache_control&lt;/code&gt; injection via &lt;code&gt;--input-format stream-json&lt;/code&gt;. The Codex parser fix to accept &lt;code&gt;cached_input_tokens&lt;/code&gt;. Async-path flight-recorder integration, so the v3 &lt;code&gt;stable_prefix_hash&lt;/code&gt; column gets populated on async jobs too (it does not today, by design, because &lt;code&gt;src/async-job-manager.ts&lt;/code&gt; has zero flight-recorder integration, and that is a separate concern). And, once we have 24h of dogfooding data from &lt;code&gt;cache_state://global&lt;/code&gt;, the cache-aware multi-LLM routing slice, which is the actual end goal: route a request to the provider whose session has the warmest cache for the requested prefix, rather than the round-robin default.&lt;/p&gt;

&lt;p&gt;v1.6.0 is the feature release described above; a docs-only follow-up v1.6.1 went out the same day with the install-agent guidance for Mistral and the post-release doc audit fixes (no source changes). The current published artefacts are at v1.6.1 on &lt;a href="https://npmjs.com/package/llm-cli-gateway" rel="noopener noreferrer"&gt;npm&lt;/a&gt; (with sigstore provenance via the OIDC publish path) and &lt;a href="https://pypi.org/project/llm-cli-gateway/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;; the &lt;a href="https://github.com/verivus-oss/llm-cli-gateway/releases/tag/v1.6.1" rel="noopener noreferrer"&gt;GitHub release at v1.6.1&lt;/a&gt; carries SHA256-verifiable installer artefacts for macOS / Linux / Windows.&lt;/p&gt;

&lt;p&gt;Thanks for reading this far. As always, MIT licensed.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;llm-cli-gateway is MIT licensed. npm: &lt;code&gt;llm-cli-gateway&lt;/code&gt; | GitHub: &lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;verivus-oss/llm-cli-gateway&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>cli</category>
      <category>opensource</category>
    </item>
    <item>
      <title>What's new in llm-cli-gateway</title>
      <dc:creator>Werner Kasselman</dc:creator>
      <pubDate>Tue, 19 May 2026 04:27:38 +0000</pubDate>
      <link>https://dev.to/wernerk_au/whats-new-in-llm-cli-gateway-58b8</link>
      <guid>https://dev.to/wernerk_au/whats-new-in-llm-cli-gateway-58b8</guid>
      <description>&lt;p&gt;A few weeks ago I wrote &lt;a href="https://medium.com/@wernerk/why-cli-wrapping-beats-api-proxying-for-multi-llm-development-1ddd492c7153" rel="noopener noreferrer"&gt;Why CLI Wrapping Beats API Proxying for Multi-LLM Development&lt;/a&gt;, the case for spawning &lt;code&gt;claude&lt;/code&gt;, &lt;code&gt;codex&lt;/code&gt;, and &lt;code&gt;gemini&lt;/code&gt; as child processes instead of proxying to their APIs. Three things have changed since I published that piece. Two of them fix real limitations I named at the time, and the third is a new capability that I wish had been there from the start. Together they seemed worth a follow-up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Codex sessions are now real, not bookkeeping
&lt;/h2&gt;

&lt;p&gt;In the original post I said llm-cli-gateway uses real CLI continuity flags, "&lt;code&gt;--continue&lt;/code&gt; and &lt;code&gt;--resume&lt;/code&gt;, not bookkeeping". That was true for Claude and Gemini. For Codex it was, frankly, not quite there.&lt;/p&gt;

&lt;p&gt;Codex did not have a documented resume mechanism at the time. So when you opened a Codex session through the gateway, the session record was real (UUID, created/lastUsed timestamps, the active-session-per-CLI invariant) but the &lt;code&gt;codex&lt;/code&gt; process itself started fresh on every request. The gateway tagged subsequent requests as belonging to a session, you could see the session in &lt;code&gt;session_list&lt;/code&gt;, but Codex did not know that.&lt;/p&gt;

&lt;p&gt;Codex shipped &lt;code&gt;exec resume &amp;lt;session-id&amp;gt;&lt;/code&gt; and &lt;code&gt;exec resume --last&lt;/code&gt;, and the gateway now wires both. If you pass a real Codex session UUID (the kind that lives in &lt;code&gt;~/.codex/sessions/&lt;/code&gt;), &lt;code&gt;codex_request&lt;/code&gt; invokes &lt;code&gt;exec resume&lt;/code&gt; and you get genuine continuity, the same tool-use history, file context, and partial work the CLI itself preserves. &lt;code&gt;resumeLatest: true&lt;/code&gt; pins to the most recent session without you having to look the UUID up.&lt;/p&gt;

&lt;p&gt;Two caveats worth naming up front. First, only real Codex UUIDs are accepted; gateway-issued &lt;code&gt;gw-*&lt;/code&gt; IDs are rejected on resume because there is no Codex-side session for them to attach to. Second, &lt;code&gt;--full-auto&lt;/code&gt; is dropped on resume, which is a Codex constraint and not something the gateway can paper over. The trade-off is reasonable: you keep the continuity but need to restate the approval policy.&lt;/p&gt;

&lt;p&gt;Codex now sits where Claude and Gemini sit. The bullet that said "Session continuity using real CLI flags, not bookkeeping" is now true for all of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grok makes four, on purpose
&lt;/h2&gt;

&lt;p&gt;xAI shipped an official Grok CLI (the &lt;code&gt;grok-build&lt;/code&gt; TUI) and I added it as the fourth provider. The tools mirror the others one-for-one, &lt;code&gt;grok_request&lt;/code&gt; and &lt;code&gt;grok_request_async&lt;/code&gt;, sessions through &lt;code&gt;--resume&lt;/code&gt; / &lt;code&gt;--continue&lt;/code&gt;, model registry entries, self-update via &lt;code&gt;grok update&lt;/code&gt;, the same circuit-breaker and approval-gate plumbing, the same flight recorder, the same metrics. Auth follows the same shape, a prior &lt;code&gt;grok login&lt;/code&gt; (OAuth) or a &lt;code&gt;GROK_CODE_XAI_API_KEY&lt;/code&gt; environment variable, with &lt;code&gt;GROK_DEFAULT_MODEL&lt;/code&gt;, &lt;code&gt;GROK_MODELS&lt;/code&gt;, and &lt;code&gt;GROK_MODEL_ALIASES&lt;/code&gt; all honoured.&lt;/p&gt;

&lt;p&gt;The interesting question is not whether to add Grok (the parity work is mechanical) but why. The case is consensus diversity.&lt;/p&gt;

&lt;p&gt;Claude, Codex, and Gemini cover Anthropic, OpenAI, and Google. That lineup is well-suited for parallel review work, but it is three of the same kind of organisation, three model families that share a lot of training data lineage and a lot of post-training tendencies. When you ask all three to red-team the same change, the disagreements are real, but the agreements are sometimes less informative than they look, because you are sampling three points from a narrower distribution than the org names suggest.&lt;/p&gt;

&lt;p&gt;Grok's training lineage sits outside the OpenAI/Anthropic/Google adjacent triangle. So when a four-way consensus check returns 4/4 agreement on a security finding, the signal is stronger than 3/3. And when Grok dissents alone, that is a data point worth reading, not a vote to discard. The value is not that Grok is better at reviews than the others (I do not believe that, and the workflows do not assume it). The value is independence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Durable job results and auto-dedup
&lt;/h2&gt;

&lt;p&gt;This is the change that came from running the gateway against real work for a few months and watching the same failure happen over and over.&lt;/p&gt;

&lt;p&gt;The original architecture had a soft spot. Async jobs run long, sometimes longer than the orchestrating agent's polling window. The agent gives up, reissues the request, and the whole Codex or Claude invocation starts over. The CLI work you just paid 90 seconds for is thrown away and replaced with a second 90-second run that does exactly the same thing. I lost track of how much wall time this cost me before I sat down and fixed it properly.&lt;/p&gt;

&lt;p&gt;The fix is two pieces, both wired into the existing flight recorder SQLite database at &lt;code&gt;~/.llm-cli-gateway/logs.db&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Every async job persists&lt;/strong&gt; to a new &lt;code&gt;jobs&lt;/code&gt; table on every state transition (start, throttled output flush, completion). &lt;code&gt;llm_job_status&lt;/code&gt; and &lt;code&gt;llm_job_result&lt;/code&gt; transparently fall back to the durable store when the in-memory job is gone, so a caller can collect a result regardless of how long ago the work finished. Retention defaults to 30 days, configurable via &lt;code&gt;LLM_GATEWAY_JOB_RETENTION_DAYS&lt;/code&gt;. Jobs still "running" when the gateway stops are marked &lt;code&gt;orphaned&lt;/code&gt; on next boot, and the partial output stays readable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identical requests within a dedup window short-circuit&lt;/strong&gt; onto the existing running or completed job. The default window is 1 hour, configurable via &lt;code&gt;LLM_GATEWAY_DEDUP_WINDOW_MS&lt;/code&gt;. The "polling timed out, reissue, run it all again" loop is structurally gone. For the case where the prior result is actually wrong and you want a fresh invocation rather than a re-attach, every request tool accepts &lt;code&gt;forceRefresh: true&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
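
&lt;p&gt;To make the dedup rule concrete, here is a minimal Python sketch of the idea, not the gateway's actual code. The &lt;code&gt;JobStore&lt;/code&gt; class, the &lt;code&gt;request_key&lt;/code&gt; helper, and the choice of fields that make two requests "identical" are all assumptions for illustration; only the 1-hour default window and the &lt;code&gt;forceRefresh&lt;/code&gt; escape hatch come from the description above.&lt;/p&gt;

```python
import hashlib
import json
import time

DEDUP_WINDOW_MS = 60 * 60 * 1000  # 1 hour, the default window named in the post


def request_key(cli: str, prompt: str, options: dict) -> str:
    """Hash the fields that make two requests 'identical' (an assumed set)."""
    payload = json.dumps({"cli": cli, "prompt": prompt, "options": options}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JobStore:
    """In-memory stand-in for the durable jobs table (hypothetical)."""

    def __init__(self, window_ms: int = DEDUP_WINDOW_MS):
        self.window_ms = window_ms
        self.jobs = {}  # request key -> {"id", "created_ms", "status"}
        self.next_id = 1

    def submit(self, cli, prompt, options=None, force_refresh=False, now_ms=None):
        """Return (job_id, reattached). Identical requests inside the window re-attach."""
        now_ms = int(time.time() * 1000) if now_ms is None else now_ms
        key = request_key(cli, prompt, options or {})
        existing = self.jobs.get(key)
        # The window check is written as window >= age.
        fresh_enough = existing is not None and self.window_ms >= now_ms - existing["created_ms"]
        reusable = fresh_enough and existing["status"] in ("running", "completed")
        if reusable and not force_refresh:
            return existing["id"], True
        job = {"id": f"job-{self.next_id}", "created_ms": now_ms, "status": "running"}
        self.next_id += 1
        self.jobs[key] = job
        return job["id"], False
```

&lt;p&gt;In this sketch a reissued request inside the window gets the original job ID back instead of spawning a second CLI run, and &lt;code&gt;force_refresh=True&lt;/code&gt; always starts a new job.&lt;/p&gt;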

&lt;p&gt;The change moves the gateway closer to what I wanted it to be from the start, a durable result-collection layer for CLI agents rather than a thin process spawner that hopes the caller is still listening when the CLI finishes. 20 new tests cover persistence, dedup, restart-orphan, retention, and Grok parity, and the full suite passes at 322 tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this changes about the original argument
&lt;/h2&gt;

&lt;p&gt;Nothing, actually. The thesis from the first post still stands, that CLI wrapping gives you capabilities (real file access, real test execution, real session state) that API proxying fundamentally cannot. These three updates strengthen the same case rather than contradict it.&lt;/p&gt;

&lt;p&gt;What they fix is the gap between the thesis and the implementation. Codex sessions now carry the same real-CLI continuity as Claude and Gemini. The consensus pattern now has a fourth, vendor-independent voice. And the long-running-job failure mode that always threatened to undercut the whole CLI-spawning approach is gone, because the result lives on disk regardless of who is or is not still polling for it.&lt;/p&gt;

&lt;p&gt;If you are evaluating llm-cli-gateway against an API proxy, the comparison is slightly different now than it was in March, on three specific axes. That seemed worth writing down.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;Mistral shipped Mistral Vibe, their official open-source CLI coding agent, powered by Devstral 2. I will be adding it next for even more diversity!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;llm-cli-gateway is MIT licensed. npm: &lt;a href="https://npmjs.com/package/llm-cli-gateway" rel="noopener noreferrer"&gt;llm-cli-gateway&lt;/a&gt; | GitHub: &lt;a href="https://github.com/verivus-oss/llm-cli-gateway" rel="noopener noreferrer"&gt;verivus-oss/llm-cli-gateway&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>cli</category>
    </item>
    <item>
      <title>Here's what stopped breaking, when you make LLM agents author in two formats</title>
      <dc:creator>Werner Kasselman</dc:creator>
      <pubDate>Wed, 06 May 2026 04:39:27 +0000</pubDate>
      <link>https://dev.to/wernerk_au/i-make-llm-agents-author-in-two-formats-heres-what-stopped-breaking-4i0j</link>
      <guid>https://dev.to/wernerk_au/i-make-llm-agents-author-in-two-formats-heres-what-stopped-breaking-4i0j</guid>
      <description>&lt;p&gt;LLM agents will happily produce a thousand lines of plausible Markdown describing work that doesn't compile, isn't tested, and contradicts a decision the same agent wrote down two files earlier. If you want to review their output without re-reading every paragraph, some of the work product has to be machine-checkable.&lt;/p&gt;

&lt;p&gt;You also can't push everything into a schema. Intent, tradeoffs, the alternative you rejected: that material dies in JSON. The interesting question is the boundary. What belongs in prose, what belongs in structure, and what falls out when you draw the line in the wrong place.&lt;/p&gt;

&lt;p&gt;I landed on this after running it for real. I introduced the runtime layer later, when I expanded this to multiple repos and saw that the flat files had stopped scaling.&lt;/p&gt;

&lt;h2&gt;
  
  
  The split
&lt;/h2&gt;

&lt;p&gt;Every unit of agent work produces three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Narrative.&lt;/strong&gt; Markdown specs, designs, plans, notes. The human-readable record: intent, tradeoffs, what was rejected, context a future reader needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure.&lt;/strong&gt; TOML files encoding the work itself: a dependency DAG, a traceability map (&lt;code&gt;INT → FEAT → REQ → DEC → IMP → CODE → TEST → OUT&lt;/code&gt;), and a review-readiness bundle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evidence.&lt;/strong&gt; Review artifacts that answer &lt;em&gt;"is this actually reviewable, and does the claim match the proof?"&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Markdown carries what structure can't. Intent and reasoning. Why the design has this shape. What was rejected. What the author worried about. Schema fields can't express ambivalence. Specs change during brainstorm and review, and prose is the right medium for that conversation; forcing every change through schema churn throttles thinking. Six months later, the reviewer needs narrative, not a graph.&lt;/p&gt;

&lt;p&gt;TOML carries what prose can't reliably:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Machine-checkable invariants. &lt;code&gt;blocks&lt;/code&gt; is the exact inverse of &lt;code&gt;depends_on&lt;/code&gt;. Every &lt;code&gt;ART:&lt;/code&gt; has exactly one producer. Every &lt;code&gt;consumes&lt;/code&gt; matches a &lt;code&gt;produces&lt;/code&gt;. These are enforced by validators, not by hoping a human noticed.&lt;/li&gt;
&lt;li&gt;Graph queries. &lt;em&gt;What's ready to start? What's the critical path? Which units conflict on files? Which &lt;code&gt;REQ:&lt;/code&gt; has no downstream &lt;code&gt;TEST:&lt;/code&gt;?&lt;/em&gt; These are queries over structure, not reading comprehension.&lt;/li&gt;
&lt;li&gt;Stable identifiers. Prose drifts. &lt;code&gt;U07a&lt;/code&gt;, &lt;code&gt;REQ:auth-001&lt;/code&gt;, &lt;code&gt;ART:schema-v2&lt;/code&gt; don't.&lt;/li&gt;
&lt;li&gt;Diff-readable state. A status transition is a one-line diff, not a paragraph to re-read.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Frame the split as narrative vs. structure, each in the medium that protects its own invariants. Calling it "docs vs. config" gets it wrong because both formats are doing real review-time work; one of them just gets to be checked by &lt;code&gt;python -m&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why TOML and not YAML or JSON
&lt;/h2&gt;

&lt;p&gt;I picked TOML deliberately. YAML loses on parse ambiguity. The &lt;code&gt;country: NO&lt;/code&gt; problem (Norway gets parsed as the boolean &lt;code&gt;false&lt;/code&gt; under YAML 1.1) is real and gets worse when an LLM is generating the file under time pressure. JSON loses on the human-authoring axis: trailing commas explode, every string needs quotes, comments are forbidden. TOML parses unambiguously, reads cleanly enough to author and review by hand, and ships in the Python stdlib (&lt;code&gt;tomllib&lt;/code&gt; since 3.11), so my validators stay dependency-light.&lt;/p&gt;

&lt;p&gt;For agent-authored, human-reviewed structure, TOML is the boring choice. It wins because it's boring.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three review pillars came from failure data
&lt;/h2&gt;

&lt;p&gt;The review-readiness package didn't exist on day one. I added it after running an iteration-chain analysis across seven real review cycles and finding that almost every re-review came from one of three deficiencies, in the same order, over and over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing prerequisite artifacts.&lt;/strong&gt; Review blocked not on conceptual disagreement but on the absence of required planning docs, cross-links, prior diagrams, or test plans. The reviewer couldn't judge readiness because the artifact class wasn't actually complete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ambiguous contracts.&lt;/strong&gt; Ordering rules, normalization, precedence, fallback, schema shape: reviewers had to infer semantics the author never wrote down. Every inference round added a re-review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overclaimed completeness.&lt;/strong&gt; "Ready for implementation." "Production ready." "All findings resolved." Unbacked by proof, or backed by proof narrower than the claim. Each one cost another round.&lt;/p&gt;

&lt;p&gt;Three failure modes, three artifacts. A &lt;em&gt;readiness gate&lt;/em&gt; answers whether the artifact class is complete enough to review at all, and blocks opening a review until it passes. A &lt;em&gt;contract declaration&lt;/em&gt; makes behavioral semantics explicit up front so reviewers never have to invent them. An &lt;em&gt;evidence matrix&lt;/em&gt; binds every strong claim to a concrete proof artifact, a stated scope, and a list of known exclusions; a claim broader than its evidence fails validation.&lt;/p&gt;
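
&lt;p&gt;The evidence-matrix rule can be sketched as a set comparison. Everything here is an assumption for illustration: the &lt;code&gt;EvidenceRow&lt;/code&gt; fields, the &lt;code&gt;overclaims&lt;/code&gt; function, and the reading that anything claimed must be either proven or listed as a known exclusion. The real schema may differ.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceRow:
    """One row of a hypothetical evidence matrix."""

    claim: str
    claimed_scope: set[str]
    proof_artifact: str  # ID or path of the proof; an empty string means unbacked
    proven_scope: set[str]
    exclusions: set[str] = field(default_factory=set)


def overclaims(row: EvidenceRow) -> set[str]:
    """Return the parts of the claim that are neither proven nor explicitly excluded."""
    if not row.proof_artifact:
        # No proof at all: the whole claim is unbacked.
        return set(row.claimed_scope)
    return row.claimed_scope - row.proven_scope - row.exclusions
```

&lt;p&gt;A non-empty result means the claim is broader than its evidence. Under the workflow described here, the fix is to narrow the claim or list the gap as an exclusion, not to stretch the proof.&lt;/p&gt;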

&lt;p&gt;The workflow is strict and intentionally rude. Fill the readiness gate first; if blocked, don't open review. Fill the contract second; vague statements get rejected. Fill the evidence matrix last; if a claim can't be backed by proof and bounded exclusions, downgrade the claim. Don't stretch the proof.&lt;/p&gt;

&lt;p&gt;The validator's exit code is authoritative. No human override of a failed validation without updating the file to pass cleanly. I made this rule on purpose, because &lt;em&gt;"it's close enough"&lt;/em&gt; was the phrase that caused most of the re-reviews I measured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where flat TOML stopped working
&lt;/h2&gt;

&lt;p&gt;Flat TOML works great for authoring and validation. It stopped working the moment agents started mutating state during execution.&lt;/p&gt;

&lt;p&gt;The hand-calculated &lt;code&gt;[computed]&lt;/code&gt; sections were the first thing to rot. Critical path, conflict groups, progress percentages: all derived values, all authored by hand, all stale the moment a unit advanced. A human spots the inconsistency on re-read. An agent doesn't.&lt;/p&gt;

&lt;p&gt;Editing &lt;code&gt;status = "in_progress"&lt;/code&gt; in a text file leaves no record of when, by whom, from what prior state, against what evidence. For process control, "who moved this to done, and on what proof?" is not optional.&lt;/p&gt;

&lt;p&gt;There was no programmatic query layer either. &lt;em&gt;"Which tier-1 units are runnable right now?"&lt;/em&gt; required parsing TOML, walking the graph in Python, and rebuilding the same derivations every time.&lt;/p&gt;

&lt;p&gt;And flat files don't compose across a fleet. Once more than one repo is under the same policy regime, per-repo TOML is the wrong shape for fleet-wide gating, policy packs, exception lifecycles, and release trains.&lt;/p&gt;

&lt;p&gt;So I added a runtime layer, additively. The templates and validators didn't change.&lt;/p&gt;

&lt;p&gt;A per-repository runtime imports a filled TOML file once. After that, an embedded SurrealDB is the source of truth. Status transitions go through a typed API with validation. Every change persists with timestamps and actor identity. Computed values become live queries instead of hand-edited fields. You can still export a TOML snapshot for human review, but it's a derived artifact, not the authority.&lt;/p&gt;

&lt;p&gt;A fleet-wide control plane (FastAPI + Postgres) handles policy packs, signed snapshot intake, exception lifecycles, and release-train readiness across many repos. There's no flat-file counterpart; the multi-repo problem just isn't expressible in per-repo files.&lt;/p&gt;

&lt;p&gt;The practical rule: TOML is the authoring medium and the interchange format. The database is the runtime authority. The TOML file you imported is stale from the first state transition onward. Treat it like a git tag — a snapshot in time, not live state.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you actually get
&lt;/h2&gt;

&lt;p&gt;Four things, none of which prose-only or structure-only would deliver alone.&lt;/p&gt;

&lt;p&gt;Parallel agent execution without stepping on each other, because the DAG encodes &lt;code&gt;depends_on&lt;/code&gt;, &lt;code&gt;blocks&lt;/code&gt;, and &lt;code&gt;files_modify&lt;/code&gt; conflict groups explicitly. Agents pick runnable units from the same layer and the system knows who may run concurrently.&lt;/p&gt;

&lt;p&gt;Traceability from intent to test. Every requirement has a downstream realization path through implementation, code, and test. Unverified requirements and unmapped code surface as computed gaps in a query, not as gut feeling six weeks into review.&lt;/p&gt;

&lt;p&gt;Reviews that fail at the right boundary. Readiness gates block un-reviewable work before a reviewer sees it. Explicit contracts stop the semantic-inference spiral. Evidence matrices stop overclaimed completeness from reaching review at all.&lt;/p&gt;

&lt;p&gt;State that is queryable, auditable, versioned, and composable across repos. Single-repo: &lt;em&gt;"what's ready now?"&lt;/em&gt; in one query. Fleet-wide: &lt;em&gt;"is this release train green across every repo under policy?"&lt;/em&gt; — also one query, against the control plane.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operating rules
&lt;/h2&gt;

&lt;p&gt;Distilled from getting this wrong before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Author narrative in Markdown. Author structure in TOML. Don't mix.&lt;/li&gt;
&lt;li&gt;Validator exit code 0 is the only pass signal. No manual override.&lt;/li&gt;
&lt;li&gt;Don't edit state fields by hand once they're in the runtime. Use the API.&lt;/li&gt;
&lt;li&gt;Don't claim "complete," "production-ready," or "all findings resolved" without an evidence matrix. If the matrix is thin, the claim is wrong.&lt;/li&gt;
&lt;li&gt;When behavior depends on ordering, fallback, normalization, precedence, or authority, write the contract before review, not during.&lt;/li&gt;
&lt;li&gt;Computed fields belong to the runtime. Don't hand-calculate them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Worked example: this article
&lt;/h2&gt;

&lt;p&gt;I dogfood the same split. The Anti-AI-Tell style guide (&lt;code&gt;mr-k-man/llm-tips&lt;/code&gt; on GitHub) is Markdown: rationale, evidence base, the prose rules humans read. The matching contract is TOML — 49 machine-checkable rules with regexes, density thresholds, and applicability tags. And the audit workflow is a 10-unit DAG, also in TOML, that orchestrates inventory, scan, triage, fix, and regression as discrete units that run in parallel where the dependency graph permits.&lt;/p&gt;

&lt;p&gt;I ran the DAG on this article before publishing.&lt;/p&gt;

&lt;p&gt;The pre-fix audit found two hits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;AIS:ST02&lt;/code&gt; structural: tricolon-fraction 60% (3 of 5 single-token enumerations were three-item).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AIS:F03&lt;/code&gt; formatting: inline-bold density 1.43 per 200 words (10 bolds in 1398 words; budget 7).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weighted score: 1.0 + 0.25 = 1.25. The rewrite threshold is 3, so this routed to surgical-edit, not rewrite-from-scratch.&lt;/p&gt;

&lt;p&gt;Three line-level edits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stripped four bullet-label &lt;code&gt;**&lt;/code&gt; markers in the "TOML carries..." list. The bullets already carry the structure; the bold was decoration.&lt;/li&gt;
&lt;li&gt;Expanded a three-item prerequisite-artifacts list (docs, cross-links, test plans) to four by adding "prior diagrams".&lt;/li&gt;
&lt;li&gt;Expanded a three-item adjective list (queryable, auditable, composable) to four by adding "versioned". The added word is true: the runtime persists history.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Regression scan: zero hits. Tricolon fraction 1 of 5 (20%, under the 30% threshold). Bold density 0.86 per 200 words (under 1.0). Linter exit 0.&lt;/p&gt;
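
&lt;p&gt;The numbers above can be re-derived in a few lines. This is a check of the arithmetic only, not the linter's code. The helper names are made up, and it assumes the word count stayed close to 1398 after the edits and that six bolds remain once four are stripped.&lt;/p&gt;

```python
def per_200_words(count: int, words: int) -> float:
    """Density of a marker normalised to a 200-word window, rounded to 2 places."""
    return round(count / words * 200, 2)


def fraction_percent(hits: int, total: int) -> int:
    """Share of hits as a whole-number percentage."""
    return round(hits / total * 100)


before_bold = per_200_words(10, 1398)      # 10 bolds in 1398 words
after_bold = per_200_words(6, 1398)        # assumes 10 minus the 4 stripped bolds
before_tricolon = fraction_percent(3, 5)   # 3 of 5 enumerations were three-item
after_tricolon = fraction_percent(1, 5)    # two lists expanded to four items
weighted_score = 1.0 + 0.25                # weights as reported in the audit
```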

&lt;p&gt;You're reading the post-fix version. Everything is in &lt;a href="https://github.com/mr-k-man/llm-tips" rel="noopener noreferrer"&gt;&lt;code&gt;mr-k-man/llm-tips&lt;/code&gt;&lt;/a&gt; on GitHub: the source guide at &lt;a href="https://github.com/mr-k-man/llm-tips/blob/main/style_guide.md" rel="noopener noreferrer"&gt;&lt;code&gt;style_guide.md&lt;/code&gt;&lt;/a&gt;, the contract at &lt;a href="https://github.com/mr-k-man/llm-tips/blob/main/tools/style_policy.toml" rel="noopener noreferrer"&gt;&lt;code&gt;tools/style_policy.toml&lt;/code&gt;&lt;/a&gt;, the linter at &lt;a href="https://github.com/mr-k-man/llm-tips/blob/main/tools/lint_writing_style.py" rel="noopener noreferrer"&gt;&lt;code&gt;tools/lint_writing_style.py&lt;/code&gt;&lt;/a&gt;, and the audit DAG at &lt;a href="https://github.com/mr-k-man/llm-tips/blob/main/tools/audit_dag.toml" rel="noopener noreferrer"&gt;&lt;code&gt;tools/audit_dag.toml&lt;/code&gt;&lt;/a&gt;. MIT-licensed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;If you put LLM agents on real work, decide which invariants you want a validator to enforce and which you want a human reviewer to negotiate. Draw that line on purpose. Then accept that flat files have a ceiling: the moment your agents start mutating state, something has to own the audit trail and the live derivations, and a text file isn't it.&lt;/p&gt;

&lt;p&gt;Narrative carries judgement; structure carries invariants. Force either of them to carry live state and you'll lose the audit trail inside a week.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>architecture</category>
    </item>
    <item>
      <title>the next software stack needs more than code generation</title>
      <dc:creator>Werner Kasselman</dc:creator>
      <pubDate>Wed, 22 Apr 2026 04:55:17 +0000</pubDate>
      <link>https://dev.to/wernerk_au/the-next-software-stack-needs-more-than-code-generation-3aep</link>
      <guid>https://dev.to/wernerk_au/the-next-software-stack-needs-more-than-code-generation-3aep</guid>
      <description>&lt;p&gt;Most people in software are staring at the wrong milestone. Models write API handlers, unit tests, and migrations fast enough that typing isn't the limiting factor anymore. In a world of high-concurrency agents, the act of writing code is no longer the bottleneck. That part of the problem is finished.&lt;/p&gt;

&lt;p&gt;The real trouble starts the moment that code lands. Why was this change made? Which requirement forced it? And who actually checked the risky paths in the auth flow? You can still answer those questions today, but it takes a kind of technical archaeology—digging through PR threads, Slack messages, and documentation that was out of date the day it was written. That workflow held up while humans set the pace. It breaks the moment you stop being the bottleneck.&lt;/p&gt;

&lt;h3&gt;
  
  
  the velocity trap
&lt;/h3&gt;

&lt;p&gt;Most teams run AI-assisted development through a loop of prompt, branch, code, review, and merge. At low volume, it holds up. Then usage increases. You start seeing changes that look fine but carry no clear origin story. A feature flag shows up in production with a name nobody recognizes. An environment variable gets added "just to make something work" and stays there for six months because nobody is sure what it’s gating.&lt;/p&gt;

&lt;p&gt;Then we have a growing crowd of "psychosis coders" who think they are shipping masterpieces because they saw an agent move a cursor. They hit approve the second the diff looks plausible, never noticing the trail of empty TODO comments, shallow mocks, and tests that don't actually assert anything meaningful. They are shipping "passable" trash masquerading as velocity.&lt;/p&gt;

&lt;p&gt;Maintaining real quality at agentic speeds requires a gauntlet. In my own work, I have to run Model B against Model A like a caffeine-fueled nitpicker for ten rounds just to reach consensus. Then Model C does the same dance. This cross-model review is mandatory to maintain velocity without the system collapsing into a pile of actual slop.&lt;/p&gt;

&lt;p&gt;But even this gauntlet is a patch, not a solution. We are burning a mountain of tokens to force quality through a pipe that was never meant to handle it. This is "Approval Theater" as a survival strategy. And no, neither your carefully crafted Markdown files, nor prompt engineering, nor harness stacking solves this.&lt;/p&gt;

&lt;h3&gt;
  
  
  why clean merges still fail
&lt;/h3&gt;

&lt;p&gt;Agent A updates &lt;code&gt;PricingEngine::price()&lt;/code&gt; to apply a discount based on &lt;code&gt;User::join_date&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Agent B removes &lt;code&gt;join_date&lt;/code&gt; from &lt;code&gt;User&lt;/code&gt; and introduces a &lt;code&gt;UserMetadata&lt;/code&gt; lookup that returns &lt;code&gt;Option&amp;lt;NaiveDate&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The pricing path now depends on a value that may not exist. In the failure case, the lookup returns &lt;code&gt;None&lt;/code&gt;, and a later fallback resolves that missing value to &lt;code&gt;Money::default()&lt;/code&gt;, producing &lt;code&gt;0.00&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Both changes compile. Both pass their unit tests. Because they don't touch the same lines of code, Git merges them without a single conflict.&lt;/p&gt;

&lt;p&gt;In production, the pricing logic fails. Revenue doesn't drop to zero. That would be obvious. It becomes inconsistent instead. Some users are charged correctly. Others hit the missing metadata path and get a zero price. Support tickets appear first. Finance notices the reconciliation mismatch three weeks later.&lt;/p&gt;

&lt;p&gt;You're left trying to unwind two changes that were never evaluated together. Each was correct in isolation; the failure only existed in the interaction. A human developer might have caught that by holding the context in their head, but that assumption doesn't scale when dozens of agents are moving at once.&lt;/p&gt;

&lt;h3&gt;
  
  
  the idempotency crisis
&lt;/h3&gt;

&lt;p&gt;There is a deeper, uglier problem with agents and Git: retries. When a prompt fails or a network timeout hits, an agent often tries again. In a standard Git flow, this leads to double-commits, "dirty" working directories, or a messed-up HEAD state that requires a human to untangle. Then come additional worktrees, with agents not checking whether they are on the right branch in the right tree, or ignoring the documentation paths you've specified and polluting the root with Markdown files instead.&lt;/p&gt;

&lt;p&gt;Git wasn't built for idempotent operations from a thousand concurrent workers. It was built for a human at a terminal who can see when a command failed. If the next stack doesn't have request-level idempotency built into the storage layer, you aren't building a system; you're building a race condition.&lt;/p&gt;

&lt;h3&gt;
  
  
  files are the wrong primitive now
&lt;/h3&gt;

&lt;p&gt;Git shows you what changed in the text, but it doesn't show you why. You see two files modified, but you can’t see the requirement that triggered the edit. We review diffs and guess at intent. &lt;/p&gt;

&lt;p&gt;Agents don't operate on files; they operate on relationships. A discount rule depends on a user attribute; a billing flow depends on an auth decision. When we take that rich graph of intent and flatten it into files, we lose the fidelity of the work. This mismatch leads to "clean" merges that are semantically murky, repeated edits to the same symbols, and retries that converge on something other than what we actually meant to build.&lt;/p&gt;

&lt;h3&gt;
  
  
  building the floor
&lt;/h3&gt;

&lt;p&gt;I'm building a stack that treats intent as the primary object, not the diff. It's not one tool. It's a set of components doing work Git was never designed for.&lt;/p&gt;

&lt;p&gt;aivcs is the version control core: a 9-crate Rust workspace. It uses blake3 for content-addressed hashing and groups changes around intent as an Episode instead of scattering them across commits. An Episode carries the requirement that triggered the change, the symbols actually touched, and the evidence (tests, benchmarks, profiles) attached when the work lands. It can import Git history as a baseline and export structured Episodes back into a branch, so teams don’t have to migrate all at once.&lt;/p&gt;

&lt;p&gt;trstr is the parsing layer. It’s spec-grounded, not grammar-by-example. When an agent edits a symbol, the system knows what that symbol is, not just which bytes moved. Tree-sitter is built for editor features. This needs stricter guarantees.&lt;/p&gt;

&lt;p&gt;sqry handles symbol-level indexing. It builds the graph from a rule like “apply a legacy discount” to every call site, call chain, and dependent type that touches it. That’s what lets an Episode carry semantic scope instead of a file list. It’s also how you catch the &lt;code&gt;PricingEngine&lt;/code&gt; / &lt;code&gt;UserMetadata&lt;/code&gt; class of failure before merge.&lt;/p&gt;

&lt;p&gt;wsmux is the concurrency layer: a CRDT over the code graph. When dozens of agents edit the same repository, the merge surface isn’t text. It’s operations on symbols and relationships. wsmux makes those edits converge instead of producing two clean merges that disagree at runtime.&lt;/p&gt;

&lt;p&gt;The storage layer is idempotent by construction. The same operation with the same content and intent resolves to the same Episode. Retries don’t duplicate work. A thousand workers hitting a flaky network stop being a race condition.&lt;/p&gt;
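
&lt;p&gt;The idea behind "idempotent by construction" can be sketched with content addressing. This is not aivcs code: &lt;code&gt;episode_id&lt;/code&gt; and &lt;code&gt;EpisodeStore&lt;/code&gt; are hypothetical names, and the sketch swaps in &lt;code&gt;blake2b&lt;/code&gt; from Python's standard library where the real system uses blake3, purely to stay dependency-free.&lt;/p&gt;

```python
import hashlib
import json


def episode_id(intent: str, operation: str, content: bytes) -> str:
    """Derive a deterministic ID from intent, operation, and content.

    blake2b from the standard library stands in for blake3 here.
    """
    header = json.dumps({"intent": intent, "operation": operation}, sort_keys=True)
    digest = hashlib.blake2b(digest_size=32)
    digest.update(header.encode("utf-8"))
    digest.update(b"\x00")  # separator between the header and the content bytes
    digest.update(content)
    return digest.hexdigest()


class EpisodeStore:
    """Hypothetical store where a retry of the same write is a no-op."""

    def __init__(self):
        self.episodes = {}

    def put(self, intent: str, operation: str, content: bytes) -> tuple[str, bool]:
        """Return (episode key, created). A repeated write returns created=False."""
        key = episode_id(intent, operation, content)
        created = key not in self.episodes
        if created:
            self.episodes[key] = {"intent": intent, "operation": operation, "content": content}
        return key, created
```

&lt;p&gt;Because the key is a pure function of what the write means and contains, an agent that retries after a timeout lands on the same key, and the store has nothing new to record.&lt;/p&gt;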

&lt;p&gt;This doesn’t replace Git. It sits alongside it.&lt;/p&gt;

&lt;p&gt;The goal is simple: when something changes, you can answer why without digging through history. Decisions travel with the change. Evidence is attached when the change is made, not reconstructed later.&lt;/p&gt;

&lt;p&gt;The system remembers what changed. It should also remember why.&lt;/p&gt;

&lt;p&gt;The bottleneck moved. The stack didn’t. That gap is where the risk lives.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>architecture</category>
      <category>startup</category>
    </item>
  </channel>
</rss>
