Werner Kasselman

Posted on Jun 4

DAG TOML: How I Turned Four Months of Code-Review Pain into a Machine-Checkable Planning Format

#ai #devops #codequality #automation

Everything below is date-anchored, because the dates matter to the story: I first put agent rules in TOML in October 2025, the failure data runs from December 2025 to March 2026, the first DAG TOML was authored on 2 April 2026, the archive analysis that justified it ran on 4 April 2026, and the database-backed runtime followed across April and May 2026.

Why I am sharing this

I thought people might find this interesting, and hopefully it saves somebody else a few wasted review rounds, because the cost of the problem I am about to describe is mostly invisible until you sit down and add it up. I run a multi-agent development process where LLM agents (Claude, Codex CLI and Gemini CLI, to name a few) plan, implement and cross-review each other's work on a Rust codebase, and every work product goes through independent review by at least two different model families before it merges.

I am not a process-methodology researcher and I have no business publishing failure taxonomies, so please take this as nothing more than me sharing what I found in my own review archive, and what I changed because of it.

The system works, frankly better than I expected when I started, but through late 2025 it had a churn problem: work kept bouncing back for rereview, and every bounce burned a full review round across multiple models. So in April 2026 I did something slightly unusual: I treated my own review archive (roughly 2,400 review documents) as a dataset and asked the obvious question. Why does work actually bounce?

This article shows one real chain from that dataset (the December one), the taxonomy that fell out of the analysis, and the fix: implementation plans written as TOML DAGs with mechanical validators, so that an entire class of review findings became exit 1 instead of a week of iteration.

Exhibit A: the project-persistence chain (5 and 6 December 2025)

The feature was unglamorous: persist a code-index project's in-memory state (repo index, file table, symbol index) to disk on teardown and reload it on startup, the kind of thing that should be a one-pass review.

The paper trail, fully dated:

5 December 2025 - Spec written and approved, with a full planning pack behind it: spec, design, implementation plan and test plan. Concrete targets: warm restore after restart, persist in under 750 ms for a 50k-symbol index, at least 80% module coverage.
5 December 2025 - Nine pre-implementation review iterations across three models (3 by Codex, 2 by Gemini, 4 by Claude) before a single line of code was written.
6 December 2025 - Implementation done. Two independent post-implementation reviews. Both returned REQUEST CHANGES.
6 December 2025 - Fix iteration, second review round, approved the same day.

What did two reviewers find on 6 December, after all that planning?

Severity	Finding
HIGH	The restore path overwrote every file's repo ID with `NONE`, the persisted ID was simply ignored, so reloaded state was detached from its repositories. The feature's entire purpose silently didn't work, and a `TODO` in the code acknowledged it.
HIGH	The cache directory from config was trusted verbatim, which meant absolute paths and `..` segments could write state outside the project root. Path traversal, despite the spec explicitly constraining writes to the project root.
MEDIUM	The config fingerprint (used to invalidate stale persisted state) hashed only 4 of the 7 config fields that affect indexing, so changing the others silently reused stale state.
MEDIUM	The "concurrency test" spawned four threads on four separate directories. Same-root races: untested.
MEDIUM	No test ever persisted and restored an actual symbol index, so the headline requirement was unverified.
LOW	The file was fsynced but the containing directory was not, so a crash after rename could lose the file after logging success.

Both reviewers, independently and from different model families, converged on the same top finding. The second round on 6 December fixed everything with a verification table mapping each finding to specific code and a named test, and it was approved same-day.
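
To make the second HIGH finding concrete, here is a minimal Python sketch of path-root confinement. It is not the project's Rust code; the function name `resolve_cache_dir` and its parameters are hypothetical, and it only illustrates the rule the spec stated (writes stay under the project root) as something a test can exercise.

```python
from pathlib import Path


def resolve_cache_dir(project_root: Path, configured: str) -> Path:
    """Return the cache directory, refusing anything outside project_root.

    Hypothetical helper for illustration only.
    """
    root = project_root.resolve()
    # Joining with an absolute `configured` value discards `root` entirely,
    # and `..` segments climb out of it; resolve() normalises both cases so
    # the containment check below sees the real target.
    candidate = (root / configured).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"cache dir escapes project root: {configured!r}")
    return candidate
```

A relative value such as ".cache/index" resolves inside the root, while "../outside" or an absolute path raises ValueError, which is the behaviour a boundary test would pin down.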

Here is the uncomfortable part: the planning was thorough, the planning reviews were thorough, and the implementation still shipped with its core feature non-functional and a path-traversal hole. Plans written in prose don't bind implementations, and reviews of prose can't be rerun.

Mining the archive (4 April 2026)

I analysed seven full "iteration chains" (initial request, blocking reviews, rereviews, final approval) spanning December 2025 to March 2026, plus nine clean one-pass approvals as a control group:

December 2025 - project persistence (above); plugin polish across 4 language plugins ("production-ready" claimed whilst the test matrix said otherwise); and a follow-up where the tests existed but couldn't fail, because non-strict assertions passed even with the feature absent
December 2025 to January 2026 - a privacy-sensitive planning pack that took 13 iterations, mostly because no single canonical schema existed early and definitions drifted across documents
10 February 2026 - a policy standard blocked on MUST/SHOULD conflicts and a precedence model that let task instructions override security controls
February 2026 - a C++ language feature claiming "complete support" whilst its own status docs still described failing tests
10 March 2026 - a planning pack that burned review rounds 6 and 7 on a missing artefact family and an "ordering is deterministic" claim with no stated ordering rule

Every rereview cause fit one of six categories:

Missing artefact completeness - required docs absent, found by the reviewer
Unstated contracts - "deterministic", "compatible", "safe", with no rule written anywhere
Drifted contracts - the same concept defined differently across documents
Evidence gaps - claims broader than tests, and "resolved" without proof
Boundary rules missing from the design - no privacy, security or filesystem constraints stated
Boundary rules stated but not enforced - the December path-traversal case, exactly

And the clean one-pass approvals (all nine of them) shared four traits: bounded scope, already-explicit contracts, evidence matched to claims, and reviewer comments that were refinements rather than prerequisites.

Notice what the six categories have in common: almost none of them are code bugs. They are plan-shaped defects, and they are checkable before a reviewer ever looks.

The fix: plans as DAGs, in TOML, with a validator (2 April 2026)

The first DAG TOML was authored on 2 April 2026, and the extracted templates and validators followed on 4 April, the same day as the archive analysis. TOML itself was not new to me: I had been putting agent rules in TOML since 12 October 2025 (a [rules] never/always prompt policy in one of my Rust projects, with trigger-activated context sections and token budgets). But all through the December-to-March churn the plans themselves stayed in prose, and April was when the plans became TOML too. I know that a TOML schema for plans might sound like process for the sake of process, but the format makes every plan claim one of three things: a required field, a recomputable assertion, or a gated state transition.

A plan is a set of units:

[units.U02]
name = "extract-initial-chain-set"
layer = 1
tier = 1
status = "done"             # pending | in_progress | done | blocked | deferred
depends_on = ["U01"]
blocks = ["U04"]
estimated_loc = 160
files_modify = ["research/ANALYSIS_FINDINGS.md"]
acceptance = [
  "At least five completed chains are analysed with explicit rereview causes.",
]
produces = ["ART:initial-chain-findings"]
consumes = ["ART:batch-scope"]
critical_decisions = ["Distinguish content defects from process defects."]
constraints = ["Only count deficiencies that materially forced another iteration."]
failure_modes = ["If extraction drifts into generic summaries, the taxonomy loses causal value."]

acceptance, constraints, failure_modes and critical_decisions are required, per unit. Category 2 (unstated contracts) stops being something a reviewer must notice by absence; it becomes a missing required field.

Then the plan must declare its own derived properties:

[computed]
entry_points = ["U01"]
leaf_nodes = ["U05"]
critical_path = ["U01", "U02", "U04", "U05"]
critical_path_loc = 420
[computed.max_parallel]
layer1 = 2

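And here is the entire trick: a roughly 500-line Python validator (standard library only, with tomllib doing the parsing) recomputes every one of those claims from the units table and diffs the result against what the plan declared. It also enforces the structural rules those claims depend on: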

blocks must be the exact inverse of depends_on, so editing one side of a dependency and forgetting the other fails validation with the exact mismatch
cycles are detected and printed as the actual cycle path
every ART: artefact must have exactly one producer, so the "who owns the canonical definition" drift that cost 13 iterations in January becomes a one-line error
every consumes must match an existing produces, so hidden dependencies surface as holes in the plan
a depender must sit in a strictly higher layer than its dependencies, so overstated parallelism fails
the declared critical path must be a chain of real edges, start at an entry point, end at a leaf, and match the true longest weighted path (recomputed via toposort), so schedule fantasy fails
units sharing files must be declared in conflict groups, so two parallel agents about to edit the same file is caught at plan time
files_modify paths must exist in the repo, so plans written against an imagined codebase fail
placeholders (<fill-in-later>) are rejected outright
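
Four of the checks above (inverse edges, single producer, consumes-has-producer and layer ordering) can be sketched in a few dozen lines. This is an illustrative Python sketch, not the real validator: `check_structure` is a hypothetical name, the error wording is invented, and it assumes the units table has already been parsed into a dict keyed by unit ID.

```python
from collections import defaultdict


def check_structure(units: dict[str, dict]) -> list[str]:
    """Return error lines for four structural invariants of a plan."""
    errors = []

    # Index every artefact by the units that claim to produce it.
    producers = defaultdict(list)
    for uid, unit in units.items():
        for art in unit.get("produces", []):
            producers[art].append(uid)

    for uid, unit in units.items():
        for dep in unit.get("depends_on", []):
            if dep not in units:
                errors.append(f"{uid}: depends_on names unknown unit {dep}")
                continue
            # blocks must be the exact inverse of depends_on.
            if uid not in units[dep].get("blocks", []):
                errors.append(f"{dep}: blocks is missing {uid}")
            # A depender must sit in a strictly higher layer.
            if units[dep]["layer"] >= unit["layer"]:
                errors.append(f"{uid}: layer {unit['layer']} is not above dependency {dep}")
        for blocked in unit.get("blocks", []):
            if uid not in units.get(blocked, {}).get("depends_on", []):
                errors.append(f"{blocked}: depends_on is missing {uid}")
        # Every consumed artefact needs a producer somewhere in the plan.
        for art in unit.get("consumes", []):
            if art not in producers:
                errors.append(f"{uid}: consumes {art}, which nothing produces")

    # Every artefact must have exactly one owner.
    for art, owners in producers.items():
        if len(owners) > 1:
            errors.append(f"{art}: produced by {sorted(owners)}, expected exactly one")
    return errors
```

Each rule is a plain lookup over the parsed table, which is why a violation can be reported as one line naming the exact unit and edge rather than as a reviewer's impression.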

A wrong plan claim is no longer a reviewer judgement call; it is a failed assertion with a one-line diff.
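
The critical-path recomputation is the least obvious of these checks, so here is a hedged Python sketch of one way to do it with the standard library's graphlib (Python 3.9+). The function name is hypothetical, and it assumes every depends_on entry names a real unit (checked separately) and that the weight of a path is the sum of estimated_loc along it.

```python
from graphlib import CycleError, TopologicalSorter


def longest_weighted_path(units: dict[str, dict]) -> tuple[list[str], int]:
    """Recompute the critical path and its total estimated_loc."""
    # graphlib expects node -> set of predecessors, which is depends_on.
    graph = {uid: set(u.get("depends_on", [])) for uid, u in units.items()}
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # args[1] holds the nodes of the detected cycle.
        raise ValueError("cycle: " + " -> ".join(exc.args[1])) from exc

    best = {}  # uid -> (total_loc, path ending at uid)
    for uid in order:  # dependencies always come before their dependers
        total, path = max((best[d] for d in graph[uid]), default=(0, []))
        best[uid] = (total + units[uid]["estimated_loc"], path + [uid])

    total, path = max(best.values())
    return path, total
```

A validator built this way would compare the returned path and total with the declared critical_path and critical_path_loc and fail on any difference. With hypothetical weights of 60, 160, 90, 120 and 80 for U01 to U05 (only U02's 160 comes from the example above), and U03 as a second layer-1 unit feeding U04, the function returns the path U01, U02, U04, U05 with a total of 420.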

What it changed in review (4 April 2026, first live use)

Two days after the format existed, the first DAG-reviewed plan went through: a plugin cost-tiering feature. The reviewer's scope line was the TOML file itself, and the verdict was APPROVED in one pass with zero blocking issues, where all four reviewer comments were genuine domain risks (legacy manifest fallback semantics and plugin ID stability, to name a few) rather than structural gaps.

That is the mechanism working as intended. The structural questions reviewers used to burn rounds on (is anything missing, do the dependencies make sense, what can actually run in parallel, does the timeline claim hold) are pre-answered by the validator before the review is even requested. That leaves the reviewer's whole attention for the hard semantic findings, and frankly that is the only thing humans and frontier models should be spending review rounds on.

And the December bug class? Gates and evidence matrices

To be clear, the DAG validator alone would not have caught the December path traversal. The reviewers did that, and the finding is category 6 (boundary stated but not enforced in code). Two companion formats target it:

Contract declarations - any plan touching filesystems, ordering, compatibility or fallback must declare the contract explicitly (path-root confinement, traversal handling, atomicity), and each contract names what verifies it.
Evidence matrices - a "finding resolved" or "feature complete" claim must bind a claim ID to an evidence path plus declared scope plus known exclusions, and the validator checks the evidence file actually exists. You mechanically cannot say "resolved" without naming a proof that could fail, and if you remember the December tests that couldn't fail, that is exactly the failure mode this kills.
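
As a rough illustration of the evidence-matrix rule, the Python sketch below checks that each row carries the four parts named above and that the evidence file exists. The field names (`claim_id`, `evidence_path`, `scope`, `exclusions`) and the function name are my assumptions for the example; the actual schema is not shown in this article.

```python
from pathlib import Path

# Hypothetical field names for one evidence-matrix row.
REQUIRED_ROW_FIELDS = ("claim_id", "evidence_path", "scope", "exclusions")


def check_evidence(rows: list[dict], repo_root: Path) -> list[str]:
    """Return error lines for rows that are incomplete or point at nothing."""
    errors = []
    for row in rows:
        claim = row.get("claim_id", "<no claim_id>")
        # Exclusions may be an empty list, but they must be declared.
        for field in REQUIRED_ROW_FIELDS:
            if field not in row:
                errors.append(f"{claim}: missing '{field}'")
        path = row.get("evidence_path")
        if path and not (Path(repo_root) / path).is_file():
            errors.append(f"{claim}: evidence file not found: {path}")
    return errors
```

Note the limit of a check like this: it proves the named evidence exists, not that the test behind it could fail. That second property still needs the declared scope and exclusions to be read by a reviewer.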

The December chain's second review (the one that passed) was already an informal evidence matrix: every prior finding mapped to specific code lines and a named test. The format just makes that table mandatory, machine-checked, and required before the review is requested instead of produced during round 2.

Where it went next (April and May 2026)

Static validation only catches problems when someone runs it. In April and May 2026 the validator's four core graph invariants moved into a database-backed runtime, where agents import the TOML once and all state lives in the database:

a unit is only offered to an agent when every dependency is done
status changes are guarded transitions with history, not string edits
the inverse-edge, single-producer, consumes-has-producer and layer-ordering invariants are enforced at mutation time
readiness gates are a query, "is this bundle reviewable?", answered from data before a review request is ever sent
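
The runtime itself is internal, so the following is only a guess at the shape of the first two rules, sketched in Python with an in-memory SQLite database. The table layout, the transition table in `ALLOWED` and all function names are hypothetical.

```python
import sqlite3

# Hypothetical transition table; the real runtime's rules are not published.
ALLOWED = {
    "pending": {"in_progress", "blocked", "deferred"},
    "in_progress": {"done", "blocked"},
    "blocked": {"pending"},
}


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE units (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE deps (unit TEXT NOT NULL, dep TEXT NOT NULL);
        CREATE TABLE history (unit TEXT, old_status TEXT, new_status TEXT);
    """)
    return db


def ready_units(db: sqlite3.Connection) -> list[str]:
    """Pending units whose every dependency is already done."""
    rows = db.execute("""
        SELECT u.id FROM units u
        WHERE u.status = 'pending'
          AND NOT EXISTS (
              SELECT 1 FROM deps d JOIN units p ON p.id = d.dep
              WHERE d.unit = u.id AND p.status <> 'done')
        ORDER BY u.id
    """)
    return [row[0] for row in rows]


def transition(db: sqlite3.Connection, unit: str, new: str) -> None:
    """Apply a guarded status change and record it in history."""
    (old,) = db.execute("SELECT status FROM units WHERE id = ?", (unit,)).fetchone()
    if new not in ALLOWED.get(old, set()):
        raise ValueError(f"{unit}: {old} -> {new} is not allowed")
    if new == "in_progress" and unit not in ready_units(db):
        raise ValueError(f"{unit}: dependencies are not all done")
    db.execute("UPDATE units SET status = ? WHERE id = ?", (new, unit))
    db.execute("INSERT INTO history VALUES (?, ?, ?)", (unit, old, new))
```

The design point is that readiness is derived from data on every call rather than stored as a flag, so a unit cannot be offered early just because someone edited a status string.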

A nod to the neighbours

After publishing the first version of this piece I went looking for who else had walked this road, and the honest answer is that I was not alone, and in some respects I was not first either. gptme (Erik Bjäreholt's terminal agent) was putting agent context and workspace configuration into a project-level gptme.toml long before I wrote my first agent rule, and its agent workspaces (tasks, journal, lessons, all git-tracked) are a thoughtful take on the same persistence problem my runtime addresses. lok defines declarative multi-backend LLM workflows in TOML, [[steps]] with depends_on, retries and consensus thresholds, which is DAG-in-TOML for orchestration, done cleanly. dgov (James H. Gearon) is the closest cousin of the lot: TOML plan trees with task dependencies, compiled to DAGs and dispatched to agents in isolated git worktrees with settlement gates on the way back in. The Bardo write-up ("Building the Machine That Builds the Machine") describes 115 dependency-chained plans and around a hundred task TOMLs feeding agent swarms, the same shape at a scale that makes mine look modest. And aura from the Mezmo team composes whole agents from declarative TOML.

What strikes me most is the synchronicity of it. None of these projects reference each other, and I found them only after building mine, yet several teams independently reached for the same move within the same season: take the parts of agent work that used to live in prose and conversation, and push them into a declarative, diffable, machine-readable format. I do not think that is coincidence, I think it is convergence, because anyone running agents at volume eventually collides with the same wall (plans and claims that read beautifully and bind nothing), and TOML happens to sit in the sweet spot of human-writable and machine-checkable. Credit where it is due to all of these teams for getting there on their own paths. If my contribution adds anything on top, it is the validator-first posture: not just expressing the DAG in TOML, but making the plan declare claims that a validator can independently recompute and refute.

Takeaways

Your review archive is a dataset. Seven failure chains and nine clean approvals were enough to find six stable failure categories, and they were stable across different reviewer models, which was the signal that they were real.
Most rereview causes are plan defects, not code defects. Plans in prose can't be validated, plans as data can.
Force derived claims, then recompute them. The [computed] section is the idea that pays for everything else here, because making the author commit to parallelism, critical path and totals turns optimism into a checkable assertion.
"Resolved" must name a proof that could fail. Half of December's pain was tests that existed but couldn't catch the bug they claimed to cover.
Spend reviewer rounds only on what machines can't check. After the switch, my first DAG-reviewed plan went through in one pass, with the reviewer's whole budget spent on real domain risk.

The format described here is no longer internal: DAG-TOML is now a public draft specification at agent-assurance.dev, with independent Rust, Go and Python validators, worked examples, and profile extension points, released under the verivus-oss/agent-assurance repository. The database runtime and the fleet control plane remain internal for now, but the schema ideas (required contract fields, recomputed [computed] sections, single-producer artefacts, evidence matrices, closure roots) are all in the spec, and you can validate a file against it today.

Thanks for reading this far, I hope you find some value in my story. If you have mined your own review archive (and specifically the rereview causes), I would genuinely like to hear what categories you found.

DEV Community