The City-State and the Federation: Two Governance Models for AI Coding Agents

#ai #automation #devops #architecture

Why I am writing this

This is the third piece in an accidental series about convergent evolution in agent tooling, and I think it is the most useful one, because this time the two systems being compared are not merely neighbours in the same field. They are the same species of thing: governance systems for AI coding agents, built in the same quarter, by people who have never spoken, with overlapping mechanisms and almost perfectly complementary blind spots. In the first article I described my DAG TOML stack (plans as machine-checkable claims, with validators and a fleet control plane behind them), and in the second I compared two orchestrators. This one is about dgov by James H. Gearon, which describes itself as a "deterministic kernel for multi-agent orchestration via git worktrees". I should be straight about my method: I did not read the source line by line myself. I had my agents clone it and do the close reading (roughly 20,000 lines of Python across 70 modules, with 70 test files and a benchmarks document), and I worked from their structured analysis, the project's own documentation and the schema excerpts they pulled. Given the subject of this article, that feels less like a shortcut and more like a demonstration.

The usual disclaimer applies, doubled: I built one of the two systems, I have neither run nor personally read the other end to end, and any misreadings of dgov are mine (or my agents', which contractually is still mine). Take this as one practitioner reading a rival constitution with admiration, a highlighter and a research staff, nothing more.

Two metaphors, both load-bearing

The first thing that struck me reading dgov is that it is built on a legal metaphor, and the metaphor is structural rather than decorative. There is a governor charter (governor.md, "Plan first. Respect file claims. Fail closed."), standard operating procedures as statute, an append-only ledger whose entries include a category literally called case law, prompt sections injected into workers under the heading of probation, an error type named ConstitutionalViolation, and ten documented design pillars covering separation of powers and fail-closed defaults. The probabilistic worker implements; the deterministic governor plans, validates, reviews and merges. It is a constitution with an enforcement arm.

My stack runs on a different metaphor, scientific audit: plans are claims, validators attempt to refute them, completion requires evidence, and a control plane above many repositories evaluates everything against policy. Law versus science, enforcement versus refutation. Both metaphors earn their keep, and the differences between the two systems fall out of the metaphors with surprising neatness.

What a plan is

In dgov, a plan is a TOML tree compiled to a DAG, and each task carries its own prompt, the actual work order, alongside file claims (files.create, files.edit, files.read and so on), dependencies, a test command, a role (worker, researcher or reviewer), an iteration budget and a set of tag-matched SOPs that get prepended to the prompt. The plan is directly dispatchable: compile it, and workers in isolated git worktrees start executing it. Compilation is fail-closed: cycles, unreachable units and malformed sections are rejected before anything runs.

In my stack the plan deliberately contains no prompt at all. A unit carries contracts instead: acceptance criteria, constraints, failure modes, critical decisions, produced and consumed artefacts, and a [computed] section in which the author must commit to derived claims (critical path, per-layer parallelism, totals) that a validator independently recomputes and diffs. The plan is not a work order, it is a reviewable artefact that can be refuted before anyone executes it.
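The recompute-and-diff step can be sketched in a few lines. This is an illustration under simplifying assumptions rather than my validators' real code: the critical path is measured in units rather than weighted by estimates, and the key names (`critical_path_length`, `parallelism_per_layer`, `total_units`) and function names are hypothetical.

```python
from graphlib import TopologicalSorter


def recompute(deps: dict[str, set[str]]) -> dict:
    """Derive the claims from the dependency graph alone."""
    depth: dict[str, int] = {}
    for unit in TopologicalSorter(deps).static_order():
        # A unit sits one layer below its deepest dependency.
        depth[unit] = 1 + max((depth[d] for d in deps.get(unit, ())), default=0)
    layers: dict[int, int] = {}
    for d in depth.values():
        layers[d] = layers.get(d, 0) + 1
    return {
        "critical_path_length": max(depth.values(), default=0),
        "parallelism_per_layer": [layers[d] for d in sorted(layers)],
        "total_units": len(depth),
    }


def refute(declared: dict, deps: dict[str, set[str]]) -> list[str]:
    """Return one finding per declared claim that disagrees with the graph."""
    actual = recompute(deps)
    return [
        f"{key}: declared {declared.get(key)!r}, recomputed {value!r}"
        for key, value in actual.items()
        if declared.get(key) != value
    ]


deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
declared = {
    "critical_path_length": 2,  # wrong on purpose: a -> b -> d is 3 units
    "parallelism_per_layer": [1, 2, 1],
    "total_units": 4,
}
for finding in refute(declared, deps):
    print(finding)
```

An empty list means the author's derived claims survived the attempt to refute them; anything else fails the plan before execution is even discussed.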

So dgov closes the loop from plan to execution, and mine closes the loop from plan to review, and neither closes both. That asymmetry runs through everything else.

The thing dgov does that I do not

Credit first, because this is the part that made me sit up. At settlement time, dgov diffs the worktree and compares the files an agent actually touched against the files the task claimed it would touch, and the comparison is merciless: unclaimed paths reject the merge, reserved paths fail closed, and even reading outside the declared read scope is caught and surfaced. Git is the source of truth, and the claim is checked against reality mechanically, every time, with no human in the loop.
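The core of that comparison is small enough to sketch. What follows is my own illustration of the idea, not dgov's implementation: the `Claims` type, the `settle` function and the reserved patterns are hypothetical, and I assume the touched set has already been collected from git (for example with `git diff --name-only` against the task's base commit).

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Claims:
    """Files a task declared it would create or edit (hypothetical shape)."""
    create: frozenset[str] = frozenset()
    edit: frozenset[str] = frozenset()


# Hypothetical reserved patterns; a real system would define its own.
RESERVED = (".git/*", ".dgov/*")


def settle(claims: Claims, touched: set[str], reserved=RESERVED) -> list[str]:
    """Return violations; an empty list is the only result that permits a merge."""
    allowed = claims.create | claims.edit
    violations = []
    for path in sorted(touched):
        if any(fnmatch(path, pattern) for pattern in reserved):
            violations.append(f"reserved path touched: {path}")
        elif path not in allowed:
            violations.append(f"unclaimed path touched: {path}")
    return violations


claims = Claims(edit=frozenset({"app/api.py"}))
print(settle(claims, {"app/api.py", "app/db.py"}))
```

Reserved paths are checked before the allow-list on purpose, so a task cannot claim its way into protected territory.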

I have to concede this carefully, because the first draft of this paragraph conceded it wrongly. My plan runtime does not do that: my validators refute a plan's self-consistency (a declared critical path that is not the longest path fails, an artefact with two producers fails), and my evidence matrices require completion claims to name a proof with declared scope and known exclusions, but when a unit is marked done, nothing mechanically diffs the declared file claims against what actually changed. The honest complication is that the mechanism does exist elsewhere in my stack: my version-control layer, aivcs, records the symbols actually touched in each Episode and attaches evidence with a freshness lifecycle, which is claim-versus-reality binding at symbol granularity, finer than dgov's file granularity. What I am missing is not the mechanism, it is the wiring: the plan runtime and the version-control layer do not yet check each other. dgov verifies what happened against what was claimed in one continuous motion; I have both halves of that theorem proved in separate buildings. Those are different failure modes, and his is the better one.

dgov has two more mechanisms worth respecting. Its semantic settlement layer does AST-level analysis of integration candidates before merging, with a failure taxonomy of its own (text conflicts, concurrent edits to the same symbol, duplicate definitions, signature drift, ordering conflicts, and a category called behavioural mismatch), which I found quietly delightful, because building a failure taxonomy and then mechanising it is exactly the move my whole stack came from, except he aimed it at merge integration whilst I aimed it at review iteration. I will come back to that taxonomy below, because when I checked it against my own cupboard the comparison surprised me in both directions. And the kernel itself is a pure function from state and event to new state and actions, no I/O, explicit dispatch table, everything event-sourced to SQLite and an append-only deploy log, which means a run is deterministically replayable in a way my live-database runtime is not. There is even an autofix phase (mechanical lint fixes applied before the validation gates run), which saves the expensive kind of retry where an agent burns an iteration fixing a formatting complaint.
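The replay property follows directly from the kernel's shape, which a toy version shows well. The event kinds, statuses and action names below are invented for illustration and are not dgov's; the only thing taken from the description is the structure: a pure step function, an explicit dispatch table, and state rebuilt by folding over the event log.

```python
State = dict[str, str]    # task name -> status
Event = tuple[str, str]   # (event kind, task name)
Action = tuple[str, str]  # (action kind, task name)


def on_dispatched(state: State, task: str) -> tuple[State, list[Action]]:
    return {**state, task: "running"}, [("start_worker", task)]


def on_tests_passed(state: State, task: str) -> tuple[State, list[Action]]:
    return {**state, task: "reviewing"}, [("request_review", task)]


def on_review_approved(state: State, task: str) -> tuple[State, list[Action]]:
    return {**state, task: "merged"}, [("merge", task)]


def on_failed(state: State, task: str) -> tuple[State, list[Action]]:
    return {**state, task: "failed"}, []


# Explicit dispatch table: an event kind that is not listed is an error.
DISPATCH = {
    "dispatched": on_dispatched,
    "tests_passed": on_tests_passed,
    "review_approved": on_review_approved,
    "failed": on_failed,
}


def step(state: State, event: Event) -> tuple[State, list[Action]]:
    """Pure transition: no I/O, no mutation of the input state."""
    kind, task = event
    handler = DISPATCH.get(kind)
    if handler is None:
        raise ValueError(f"unknown event kind: {kind!r}")  # fail closed
    return handler(state, task)


def replay(events: list[Event]) -> State:
    """Rebuild state from the log; actions are ignored because they already ran."""
    state: State = {}
    for event in events:
        state, _ = step(state, event)
    return state


log = [("dispatched", "api"), ("tests_passed", "api"), ("review_approved", "api")]
print(replay(log))
```

Because `step` performs no I/O, the shell around it is the only place side effects live, and the same log always folds to the same state.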

The thing I do that dgov does not

The complementary gaps are just as clean. dgov has no recomputable derived claims, so a plan whose declared structure is internally wrong in ways a topological check cannot see (an inflated parallelism story, a schedule that ignores the true critical path) executes anyway. It has no artefact dataflow, no produces and consumes with single-producer ownership, so the failure class where two units quietly both own the canonical definition (the one that once cost me thirteen review iterations) has no mechanical guard. Its reviewer role is explicitly bounded to the diffs of dependency tasks and to a single model provider, with no multi-model adversarial review, whereas my process was born precisely from independent reviewers (Codex, Gemini and Claude) disagreeing productively. Its acceptance story is a test command's exit code, and as I wrote in the first article, half of my December pain came from tests that existed but could not fail, which is exactly the weakness an exit-code gate cannot see and an evidence matrix with known exclusions is built to catch.
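The single-producer guard is the easiest of these to show. This is a minimal sketch assuming each unit lists its artefacts under `produces` and `consumes`; the function name, the data shape and the artefact names are hypothetical, not a copy of my validators.

```python
from collections import defaultdict


def check_dataflow(units: dict[str, dict[str, list[str]]]) -> list[str]:
    """Flag artefacts with more than one producer and consumed artefacts with none."""
    producers: dict[str, list[str]] = defaultdict(list)
    for name, unit in units.items():
        for artefact in unit.get("produces", []):
            producers[artefact].append(name)

    problems = []
    for artefact, owners in sorted(producers.items()):
        if len(owners) > 1:
            problems.append(
                f"artefact {artefact!r} has {len(owners)} producers: "
                + ", ".join(sorted(owners))
            )
    for name, unit in sorted(units.items()):
        for artefact in unit.get("consumes", []):
            if artefact not in producers:
                problems.append(
                    f"unit {name!r} consumes {artefact!r}, which nothing produces"
                )
    return problems


units = {
    "schema": {"produces": ["OrderModel"]},
    "api": {"produces": ["OrderModel"], "consumes": ["OrderModel", "AuthToken"]},
}
for problem in check_dataflow(units):
    print(problem)
```

Neither finding requires running anything: both are visible in the plan text alone, which is why this class of defect belongs to review rather than to execution.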

And dgov is constitutionally a city-state. One repository, one .dgov/ directory, one governor. It governs its territory completely and stops at the border. My control plane is the federation layer: policy packs and requirement profiles defined once, per-repository agents pushing signed snapshots, evaluation history, exception lifecycles, release trains across many repositories. dgov has no analogue, and frankly does not claim to want one, but the moment you run agents across a fleet the federation question arrives whether you invited it or not.

The convergence list grows

With the previous article's comparison included, there are now three solo builders (wpank with Bardo, Gearon with dgov, and me) who independently arrived at: declarative task units with explicit dependencies, file claims per task as the precondition for safe parallelism, fail-closed validation before execution, topological ordering, per-task verification commands, an append-only event history, and failure memory carried forward into future attempts (his ledger case law, Bardo's do-not-retry lists, my deficiency taxonomy). One small coincidence I cannot resist recording: the day dgov's git history was re-bootstrapped for worktree isolation is the same day I authored my first DAG TOML. Nothing connects the two events except the season, which is rather the point.

When isolated builders keep meeting at the same mechanisms, the mechanisms are telling you something about the problem, not about the builders. File claims, fail-closed gates and forwarded failure memory now look to me like the arch and the keystone of this field, the parts every serious system will have because the load demands them.

What I am taking home

I finished reading dgov with a shopping list, which is the highest compliment I know how to pay another person's codebase:

Claim-versus-reality settlement in the plan runtime. My runtime should refuse to mark a unit done while the actual touched files disagree with the unit's declared file sets, exactly as dgov's review sandbox does, and since my version-control layer already records touched symbols and attached evidence, the work here is plumbing rather than invention. Still the single highest-value import.
The placement of merge analysis, not the taxonomy itself. My first draft of this list said I should import his merge taxonomy, and then I went and audited my own shelves: my semantic merge engine already covers his categories and more (manifest-driven conflict policy per language, tiered degradation down to plain git merge when parsing fails, and a commutativity algebra that formalises what he calls ordering conflicts), and my code-graph layer detects signature drift and duplicate definitions independently. What dgov actually taught me is where to stand: he runs merge analysis as a settlement gate inside the plan runtime, every task, every time, whilst my deeper machinery sits in a separate layer that the plan runtime never consults. The import is the wiring, his architecture carrying my components.
Fail-closed policy parsing. dgov rejects malformed SOPs at compile time, required front matter, required sections, no exceptions, and my template ecosystem should hold its own policy documents to the same standard it already holds plans.
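For the third item, a fail-closed policy parser might look roughly like this. The document format is an assumption of mine (a front-matter block between `---` lines followed by `## ` sections), as are the required keys, the required section names and the `parse_sop` and `PolicyError` names; none of it is dgov's real SOP format.

```python
# Hypothetical requirements, chosen only for illustration.
REQUIRED_KEYS = {"name", "tags"}
REQUIRED_SECTIONS = {"When", "Do", "Verify"}


class PolicyError(ValueError):
    """Raised when a policy document must not be loaded."""


def parse_sop(text: str) -> dict[str, str]:
    """Return the front matter of a well-formed SOP, or raise PolicyError."""
    lines = [line.rstrip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "---":
        raise PolicyError("missing front matter")
    try:
        end = lines.index("---", 1)
    except ValueError:
        raise PolicyError("unterminated front matter") from None

    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if not sep or not key.strip() or not value.strip():
            raise PolicyError(f"malformed front matter line: {line!r}")
        meta[key.strip()] = value.strip()

    missing_keys = REQUIRED_KEYS - meta.keys()
    if missing_keys:
        raise PolicyError(f"missing front matter keys: {sorted(missing_keys)}")

    sections = {line[3:].strip() for line in lines[end + 1:] if line.startswith("## ")}
    missing_sections = REQUIRED_SECTIONS - sections
    if missing_sections:
        raise PolicyError(f"missing sections: {sorted(missing_sections)}")
    return meta


SOP = """
---
name: migrations
tags: sql, schema
---
## When
A task edits files under migrations/.
## Do
Write a reversible migration.
## Verify
Run the migration tests.
"""
print(parse_sop(SOP))
```

The design choice worth copying is that there is no warning level: a document either parses completely or never reaches a prompt.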

And one observation rather than an import. The most interesting entry in his failure taxonomy is behavioural mismatch, the case where two changes merge cleanly and disagree only at runtime, which is exactly the failure I wrote up in an earlier piece (a pricing path quietly depending on a field another agent had removed, both sides compiling, both passing their tests, git merging without a murmur). dgov's taxonomy names that crime but cannot yet detect it, because detection needs a relationship graph (which callers depend on which symbols) rather than a diff, and that graph is precisely what the symbol-indexing and predicate layers of my stack exist to provide. The city-state names the crime; the federation has the forensics. Neither system has secured a conviction yet, and I suspect whoever gets there first gets there with both halves.
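To show why the graph matters more than the diff, here is a deliberately tiny sketch of the query involved. It is a toy under strong assumptions, not a detector that either system has: it presumes a caller-to-symbol dependency map already exists, and the function name and every symbol name in the example are hypothetical.

```python
def stranded_callers(
    depends_on: dict[str, set[str]],
    removed: set[str],
    updated_callers: set[str],
) -> list[tuple[str, str]]:
    """List (caller, symbol) pairs where a caller the change did not update
    still depends on a symbol the change removed."""
    return sorted(
        (caller, symbol)
        for caller, symbols in depends_on.items()
        if caller not in updated_callers
        for symbol in symbols & removed
    )


# Hypothetical graph: which callers depend on which symbols.
depends_on = {
    "pricing.quote": {"Order.discount", "Order.total"},
    "api.get_order": {"Order.total"},
}
# One agent removes a field and updates only the caller it knows about.
print(stranded_callers(depends_on, {"Order.discount"}, {"api.get_order"}))
```

A textual merge sees nothing wrong in this scenario because the two changes never touch the same lines; the stranded caller only appears once the question is asked of the dependency map.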

If Gearon ever reads my side of this, the reciprocal list is above: refutable derived claims, artefact ownership, evidence with declared exclusions, and a story for the day dgov needs to govern more than one city. And since the comparison should be checkable rather than taken on trust, my side of the format is a public draft specification at agent-assurance.dev, with independent Rust, Go and Python validators, should anyone (including him) want to implement against it.

Thanks for reading this far. I hope you find some value in the comparison. If you are building agent governance of your own, whether it leans towards law or towards science, I would genuinely like to hear which theorems you chose to prove mechanically, and which ones you are still taking on trust.