LLM agents will happily produce a thousand lines of plausible Markdown describing work that doesn't compile, isn't tested, and contradicts a decision the same agent wrote down two files earlier. If you want to review their output without re-reading every paragraph, some of the work product has to be machine-checkable.
You also can't push everything into a schema. Intent, tradeoffs, the alternative you rejected: that material dies in JSON. The interesting question is the boundary. What belongs in prose, what belongs in structure, and what falls out when you draw the line in the wrong place.
I landed on this split after running it for real. I introduced the runtime layer later, when I expanded the system to multiple repos and saw that flat files stopped scaling.
## The split
Every unit of agent work produces three things:
- Narrative. Markdown specs, designs, plans, notes. The human-readable record: intent, tradeoffs, what was rejected, context a future reader needs.
- Structure. TOML files encoding the work itself: a dependency DAG, a traceability map (`INT → FEAT → REQ → DEC → IMP → CODE → TEST → OUT`), and a review-readiness bundle.
- Evidence. Review artifacts that answer "is this actually reviewable, and does the claim match the proof?"
Markdown carries what structure can't. Intent and reasoning. Why the design has this shape. What was rejected. What the author worried about. Schema fields can't express ambivalence. Specs change during brainstorm and review, and prose is the right medium for that conversation; forcing every change through schema churn throttles thinking. Six months later, the reviewer needs narrative, not a graph.
TOML carries what prose can't reliably (a sketch of one unit follows the list):
- Machine-checkable invariants. `blocks` is the exact inverse of `depends_on`. Every `ART:` has exactly one producer. Every `consumes` matches a `produces`. These are enforced by validators, not by hoping a human noticed.
- Graph queries. What's ready to start? What's the critical path? Which units conflict on files? Which `REQ:` has no downstream `TEST:`? These are queries over structure, not reading comprehension.
- Stable identifiers. Prose drifts. `U07a`, `REQ:auth-001`, `ART:schema-v2` don't.
- Diff-readable state. A status transition is a one-line diff, not a paragraph to re-read.
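As a minimal sketch, one unit in the flat file could look like this. The field names come from this article (`depends_on`, `blocks`, `produces`, `consumes`, `files_modify`, `status`); the table layout and the trace table are my illustration, not the project's real schema.

```toml
[unit.U07a]
status       = "in_progress"        # a transition is a one-line diff
depends_on   = ["U05", "U06"]
blocks       = ["U09"]              # must be the exact inverse of depends_on
produces     = ["ART:schema-v2"]    # every ART: has exactly one producer
consumes     = ["ART:schema-v1"]    # every consumes must match a produces
files_modify = ["db/schema.sql"]    # feeds the conflict-group computation

[trace.REQ.auth-001]
realized_by = ["IMP:auth-session"]
verified_by = ["TEST:auth-session-roundtrip"]   # a REQ: with no TEST: is a queryable gap
```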
Frame the split as narrative vs. structure, each in the medium that protects its own invariants. Calling it "docs vs. config" gets it wrong because both formats are doing real review-time work; one of them just gets to be checked by `python -m`.
## Why TOML and not YAML or JSON
I picked TOML deliberately. YAML loses on parse ambiguity. The `country: NO` problem (the ISO code for Norway parses as the boolean false under YAML 1.1) is real and gets worse when an LLM is generating the file under time pressure. JSON loses on the human-authoring axis: trailing commas explode, every string needs quotes, comments are forbidden. TOML parses unambiguously, reads cleanly enough to author and review by hand, and ships in the Python stdlib (`tomllib` since 3.11), so my validators stay dependency-light.
For agent-authored, human-reviewed structure, TOML is the boring choice. It wins because it's boring.
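A minimal demonstration of both halves of that claim, assuming PyYAML (which implements YAML 1.1 scalar resolution) is available for the comparison:

```python
import tomllib  # stdlib since Python 3.11; no third-party dependency

# TOML: the value survives exactly as written.
assert tomllib.loads('country = "NO"') == {"country": "NO"}

# YAML 1.1: the bare scalar NO resolves to a boolean.
import yaml
assert yaml.safe_load("country: NO") == {"country": False}
```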
## The three review pillars came from failure data
The review-readiness package didn't exist on day one. I added it after running an iteration-chain analysis across seven real review cycles and finding that almost every re-review came from one of three deficiencies, in the same order, over and over.
Missing prerequisite artifacts. Review blocked not on conceptual disagreement but on the absence of required planning docs, cross-links, prior diagrams, or test plans. The reviewer couldn't judge readiness because the artifact class wasn't actually complete.
Ambiguous contracts. Ordering rules, normalization, precedence, fallback, schema shape: reviewers had to infer semantics the author never wrote down. Every inference round added a re-review.
Overclaimed completeness. "Ready for implementation." "Production ready." "All findings resolved." Unbacked by proof, or backed by proof narrower than the claim. Each one cost another round.
Three failure modes, three artifacts. A readiness gate answers whether the artifact class is complete enough to review at all, and blocks opening a review until it passes. A contract declaration makes behavioral semantics explicit up front so reviewers never have to invent them. An evidence matrix binds every strong claim to a concrete proof artifact, a stated scope, and a list of known exclusions; a claim broader than its evidence fails validation.
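For concreteness, an evidence-matrix entry in the same TOML idiom could look like the sketch below. The field names (`claim`, `proof`, `scope`, `exclusions`) are my illustration of the claim-proof-scope binding, and the artifact IDs are invented; this is not the actual schema:

```toml
[[evidence]]
claim      = "All auth requirements verified"
proof      = ["TEST:auth-session-roundtrip", "OUT:ci-run-1842"]   # invented IDs
scope      = "REQ:auth-* under the v2 schema only"
exclusions = ["REQ:auth-007 deferred to the next release train"]
# A claim broader than proof + scope fails validation: downgrade the claim,
# don't stretch the proof.
```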
The workflow is strict and intentionally rude. Fill the readiness gate first; if blocked, don't open review. Fill the contract second; vague statements get rejected. Fill the evidence matrix last; if a claim can't be backed by proof and bounded exclusions, downgrade the claim. Don't stretch the proof.
The validator's exit code is authoritative. No human override of a failed validation without updating the file to pass cleanly. I made this rule on purpose, because "it's close enough" was the phrase that caused most of the re-reviews I measured.
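A validator in this style stays small. The sketch below checks one invariant from earlier, `blocks` as the exact inverse of `depends_on`, against the unit layout sketched above; it's illustrative, not the project's actual validator:

```python
import sys
import tomllib

def check_inverse(units: dict) -> list[str]:
    """Every depends_on edge must have a matching blocks edge, and vice versa."""
    errors = []
    for uid, unit in units.items():
        for dep in unit.get("depends_on", []):
            if uid not in units.get(dep, {}).get("blocks", []):
                errors.append(f"{dep}.blocks is missing {uid}")
        for blocked in unit.get("blocks", []):
            if uid not in units.get(blocked, {}).get("depends_on", []):
                errors.append(f"{blocked}.depends_on is missing {uid}")
    return errors

def main() -> int:
    with open(sys.argv[1], "rb") as f:        # tomllib wants binary mode
        units = tomllib.load(f).get("unit", {})
    for e in (errors := check_inverse(units)):
        print(f"FAIL: {e}", file=sys.stderr)
    return 1 if errors else 0                 # the exit code is the verdict

if __name__ == "__main__":
    sys.exit(main())
```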
## Where flat TOML stopped working
Flat TOML works great for authoring and validation. It stopped working the moment agents started mutating state during execution.
The hand-calculated `[computed]` sections were the first thing to rot. Critical path, conflict groups, progress percentages: all derived values, all authored by hand, all stale the moment a unit advanced. A human spots the inconsistency on re-read. An agent doesn't.
Editing `status = "in_progress"` in a text file leaves no record of when, by whom, from what prior state, against what evidence. For process control, "who moved this to done, and on what proof?" is not optional.
There was no programmatic query layer either. "Which tier-1 units are runnable right now?" required parsing TOML, walking the graph in Python, and rebuilding the same derivations every time.
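The derivation itself is short; the cost was rebuilding it by hand on every question. Something like this, assuming a status vocabulary of `pending` / `in_progress` / `done`:

```python
def runnable(units: dict) -> list[str]:
    """Units not yet started whose dependencies are all done."""
    done = {uid for uid, u in units.items() if u.get("status") == "done"}
    return [
        uid for uid, u in units.items()
        if u.get("status") == "pending" and set(u.get("depends_on", [])) <= done
    ]
```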
And flat files don't compose across a fleet. Once more than one repo is under the same policy regime, per-repo TOML is the wrong shape for fleet-wide gating, policy packs, exception lifecycles, and release trains.
So I added a runtime layer, additively. The templates and validators didn't change.
A per-repository runtime imports a filled TOML file once. After that, an embedded SurrealDB is the source of truth. Status transitions go through a typed API with validation. Every change persists with timestamps and actor identity. Computed values become live queries instead of hand-edited fields. You can still export a TOML snapshot for human review, but it's a derived artifact, not the authority.
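The typed transition API is the piece that replaces hand-editing. A hypothetical sketch of its shape, with the SurrealDB persistence stubbed behind two invented methods (`get_status`, `append_audit`) so the validation and audit-record logic stay visible:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

ALLOWED = {                      # legal transitions; everything else is rejected
    "pending": {"in_progress"},
    "in_progress": {"done", "blocked"},
    "blocked": {"in_progress"},
}

@dataclass(frozen=True)
class Transition:
    unit: str
    old: str
    new: str
    actor: str
    evidence: str                # required for "done"; e.g. a TEST: or OUT: id
    at: datetime

def transition(db, unit_id: str, new: str, actor: str, evidence: str = "") -> Transition:
    old = db.get_status(unit_id)                     # invented method, see lead-in
    if new not in ALLOWED.get(old, set()):
        raise ValueError(f"{unit_id}: {old} -> {new} is not a legal transition")
    if new == "done" and not evidence:
        raise ValueError(f"{unit_id}: 'done' requires an evidence artifact")
    record = Transition(unit_id, old, new, actor, evidence,
                        datetime.now(timezone.utc))
    db.append_audit(record)                          # persisted with timestamp + actor
    return record
```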
A fleet-wide control plane (FastAPI + Postgres) handles policy packs, signed snapshot intake, exception lifecycles, and release-train readiness across many repos. There's no flat-file counterpart; the multi-repo problem just isn't expressible in per-repo files.
The practical rule: TOML is the authoring medium and the interchange format. The database is the runtime authority. The TOML file you imported is stale from the first state transition onward. Treat it like a git tag: a snapshot in time, not live state.
## What you actually get
Four things, none of which prose-only or structure-only would deliver alone.
Parallel agent execution without stepping on each other, because the DAG encodes `depends_on`, `blocks`, and `files_modify` conflict groups explicitly. Agents pick runnable units from the same layer, and the system knows who may run concurrently.
Traceability from intent to test. Every requirement has a downstream realization path through implementation, code, and test. Unverified requirements and unmapped code surface as computed gaps in a query, not as gut feeling six weeks into review.
Reviews that fail at the right boundary. Readiness gates block un-reviewable work before a reviewer sees it. Explicit contracts stop the semantic-inference spiral. Evidence matrices stop overclaimed completeness from reaching review at all.
State that is queryable, auditable, versioned, and composable across repos. Single-repo: "what's ready now?" in one query. Fleet-wide: "is this release train green across every repo under policy?" Also one query, against the control plane.
## Operating rules
Distilled from getting this wrong before:
- Author narrative in Markdown. Author structure in TOML. Don't mix.
- Validator exit code 0 is the only pass signal. No manual override.
- Don't edit state fields by hand once they're in the runtime. Use the API.
- Don't claim "complete," "production-ready," or "all findings resolved" without an evidence matrix. If the matrix is thin, the claim is wrong.
- When behavior depends on ordering, fallback, normalization, precedence, or authority, write the contract before review, not during.
- Computed fields belong to the runtime. Don't hand-calculate them.
## Worked example: this article
I dogfood the same split. The Anti-AI-Tell style guide (mr-k-man/llm-tips on GitHub) is Markdown: rationale, evidence base, the prose rules humans read. The matching contract is TOML: 49 machine-checkable rules with regexes, density thresholds, and applicability tags. And the audit workflow is a 10-unit DAG, also in TOML, that orchestrates inventory, scan, triage, fix, and regression as discrete units that run in parallel where the dependency graph permits.
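To make "machine-checkable rule" concrete: an entry in that contract might look something like the sketch below. The shape and the regex are my illustration, calibrated to the numbers reported in this section, not the actual `tools/style_policy.toml` schema:

```toml
[[rule]]
id        = "AIS:F03"
category  = "formatting"
check     = "inline_bold_density"
threshold = 1.0                  # max bolds per 200 words
weight    = 0.25                 # contribution to the weighted score

[[rule]]
id        = "AIS:ST02"
category  = "structural"
check     = "tricolon_fraction"
pattern   = '(\w+), (\w+), (?:and )?(\w+)'   # naive three-item enumeration matcher
threshold = 0.30                 # max fraction of enumerations that are three-item
weight    = 1.0
```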
I ran the DAG on this article before publishing.
The pre-fix audit found two hits:
- `AIS:ST02` structural: tricolon fraction 60% (3 of 5 single-token enumerations were three-item).
- `AIS:F03` formatting: inline-bold density 1.43 per 200 words (10 bolds in 1398 words; budget 7).
Weighted score: 1.0 + 0.25 = 1.25. The rewrite threshold is 3, so this routed to surgical-edit, not rewrite-from-scratch.
Three line-level edits:
- Stripped four bullet-label `**` markers in the "TOML carries..." list. The bullets already carry the structure; the bold was decoration.
- Expanded a three-item prerequisite-artifacts list (docs, cross-links, test plans) to four by adding "prior diagrams".
- Expanded a three-item adjective list (queryable, auditable, composable) to four by adding "versioned". The added word is true: the runtime persists history.
Regression scan: zero hits. Tricolon fraction 1 of 5 (20%, under the 30% threshold). Bold density 0.86 per 200 words (under 1.0). Linter exit 0.
You're reading the post-fix version. Everything is in mr-k-man/llm-tips on GitHub: the source guide at `style_guide.md`, the contract at `tools/style_policy.toml`, the linter at `tools/lint_writing_style.py`, and the audit DAG at `tools/audit_dag.toml`. MIT-licensed.
## Takeaway
If you put LLM agents on real work, decide which invariants you want a validator to enforce and which you want a human reviewer to negotiate. Draw that line on purpose. Then accept that flat files have a ceiling: the moment your agents start mutating state, something has to own the audit trail and the live derivations, and a text file isn't it.
Narrative carries judgement; structure carries invariants. Force either of them to carry live state and you'll lose the audit trail inside a week.