DEV Community: Vasyl Tretiakov

Gate the Cheap Axis: when to add a deterministic check before anything breaks

Vasyl Tretiakov — Wed, 01 Jul 2026 07:51:25 +0000

My last essay said every guardrail should be earned from a failure. For cheap, mechanical checks, that's the wrong default: build them before anything breaks.

One commit to a governance document, dated 2026-06-22, deleted this rule:

Add a gate ahead of the first failure only when the class is well-attested and the gate is cheap and low-false-positive, or when a first occurrence is catastrophic. Otherwise the convention rides on discipline until a real miss attests it — then the miss is the test fixture and the gate is added reactively.

and replaced it with this one:

The deciding axis is cost and false-positive rate, not attestation. When a consistency property on a driftable governed surface has a cheap, low-false-positive, mechanically-decidable check, build that gate proactively — before any attested miss. Discipline is not an acceptable substitute for a check this class can make.

That is a default-flip, in a single diff. The old rule waited for a failure before adding most checks. The new one says that for a whole class of check, waiting is the mistake.

I should own the awkwardness up front, because I am the one who wrote the old rule. The project here is an agent-built codebase whose development is governed by more than a hundred deterministic check scripts (gates, in the local vocabulary) that run on commit and refuse work that has drifted out of agreement with the spec. A few weeks ago I published Gates Earned From Failure: the argument that every one of those gates was added in response to a concrete miss, never designed up front, with a cost test for when a guardrail is worth its standing weight. A follow-up, Make It a Check, sharpened the same spine: a lesson worth keeping belongs in a check, because prose drifts and a gate doesn't. So a commit that says "build the gate proactively — before any attested miss" reads, on its face, like a reversal.

It isn't, and the reconciliation is the whole point.

The cost test always had two axes

The cost test was never "wait for a failure." It weighed a candidate gate on two axes: how expensive the gate is to build and maintain, and its false-positive rate (FP) — how often it flags something that is actually fine. "Earned from a real failure" was the rule for gates that score badly on those axes. An expensive, finicky, high-FP gate has a real standing cost: maintenance, the reading load of one more rule, the friction of false alarms. That cost has to be paid for, and an attested miss is what pays for it. The failure proves the gate's class is real, and the failing case becomes the gate's first test fixture.

What the reactive framing under-served was the other corner of the matrix: the gate that is cheap, mechanically decidable, and almost never wrong. For that gate, waiting for a failure buys nothing. You already know the check is correct. You already know it costs almost nothing to run. All the failure does is let some preventable drift through first, so you can point at it and say "see, now it's earned." That is a tax with no return.

So the flip is narrow and precise. The deciding axis is cost and FP. Sort the candidate gate into a class first, then apply the matching rule:

Cheap, low-FP, mechanically decidable, on a governed surface → build it proactively, before any miss. Discipline is not an acceptable substitute, because this project has repeatedly proven that by-eye vigilance lets drift through that a trivial script would have caught.
Expensive or higher-FP → the reactive rule still holds. The convention rides on discipline plus a one-line written rule until a real miss attests it. Then the miss is the fixture and the gate goes in.
Catastrophic on first occurrence → a separate escape hatch. A costly must-have can go in ahead of any miss when the first occurrence is unacceptable.

The earlier essay collapsed the first two classes into one default. This one splits them back out and flips the default for the cheap class only.

What it looks like when you mean it

A doc edit that lowers a bar is cheap to write and easy to ignore. The test of whether the flip was real is what happened next.

An audit came first. It took every existing gate's coupling manifest — the declaration, carried in each check, of which files that check reads — and inverted the whole set into a map from governed surface to drift axis: for each file the project governs, which ways can it silently fall out of agreement with the rest, and which of those ways is anything watching? Then it walked the ungated axes and asked one question per axis: is this cheap and mechanical enough to gate now, before it has ever gone wrong?

The next iteration shipped four new checks off that worklist. One couples a spec section's enum-membership claims to the actual enumeration they reference. One bounds how much cross-component topology a single component's spec may draw before it has to defer to the architecture doc. One binds each workflow stage's entry to the prior stage's completion. The honest fourth is a markdown-fence balance check, which caught a live instance on its first prototype run — so call that one borderline, half-attested rather than purely proactive. The other three guarded drift that had never once occurred in the project's history. Under the old rule, none of the three would exist yet. We'd be waiting for the bug that proves they're allowed.

That is the flip cashed out: an audit that enumerates the cheap mechanical axes, and gates them on the strength of cheap and mechanical, not on the strength of a scar.

The clause that stops it from sprawling

"Gate every cheap mechanical axis" is a dangerous sentence on its own, because the cheapest checks to write are usually the most worthless. Cheap and low-FP is necessary, not sufficient. The gate has to check a real drift axis, never a trivially-true proxy for one.

The canonical bad case is checking that a document has a section heading. It's trivial to write and it almost never fails, so it's cheap and low-FP by both measures. It's also nearly useless, because the thing that actually drifts is the section's content, not the presence of its title. A heading-presence gate goes green while the prose under the heading rots, and now you have something worse than no gate: a check that broadcasts "covered" when nothing is covered. That is false confidence, manufactured by a green checkmark, and a gate that manufactures false confidence is worse than no gate at all. The same governance doc states the fail-closed principle directly — a gate that can silently stop enforcing while still reporting success is a defect masquerading as coverage.

So the rule has a built-in brake, and it's an old warning in new clothes: Goodhart's law, the observation Marilyn Strathern phrased as "when a measure becomes a target, it ceases to be a good measure." A heading-presence check makes has the section the target, and that target is trivially hit while the property you actually cared about, whether the section is still right, drifts free. Cheap earns a gate only when cheap is buying you a genuine consistency property. The moment "cheap" is bought by checking something that can't really fail, the gate is off the table, not on it.

The goal is not zero human review

The flip is aggressive about machines and humble about what it leaves to people. Gate every mechanical axis, and the human residue shrinks — but the target it shrinks toward is not zero. It's the irreducibly semantic judgment: is this prose still true? Is this rationale still the right call? Is this section well-pitched for its reader? No low-FP mechanical check can answer those, and pretending one can is exactly the trivially-true-proxy trap from the last section.

So the division of labor is clean. Everything a machine can decide mechanically and cheaply, a machine decides, on commit, every time. What's left for the human is the layer beneath the mechanism: the meaning. In this project, the judgments about whether a doc has quietly drifted from describing the system to restating it are punted to a human checkpoint at the end of each iteration, on purpose, because they're semantic and high-FP. The point of gating the mechanical axis hard is to clear the human's plate of everything that isn't that judgment, so the scarce attention lands where only attention works.

Where the rule says stop

A rule earns trust by saying where it doesn't apply. The same audit that filed the new gates also produced a register of axes ruled not worth gating, recorded so they don't get re-litigated every iteration. Five of them:

Heading presence beyond the small set of genuinely load-bearing required sections. The trivially-true proxy from above, named as the canonical example to avoid.
A gate's manifest matching its true read-set. Deriving what files a check actually reads means parsing arbitrary shell: globs, greps, hardcoded paths. Not cheap, not low-FP. Left ungated.
Free-prose cross-references between docs (the "see the routing section over in the other file" kind). Higher-FP than a structured markdown link, because the reference is fuzzy text, so the cheap version over-flags.
Whether a high-level doc has drifted into restating a low-level mechanism, and whether a workflow skill is still faithful to its lifecycle table. Both irreducibly semantic. Human checkpoints, not gates.
Whether an architecture diagram still reflects the system's real trust boundaries. The diagram is deliberately simplified, with dozens of drawn edges standing in for roughly a hundred actual permission grants, so any mechanical bijection over-fires. Semantic and high-FP at once. Stays human.

Each of those is cheap to attempt. Every one was ruled out, because cheap-to-attempt isn't the bar. The bar is cheap and low-FP and checking a real axis, and these miss on the second or third. The register is the rule proving it has edges.

Prior art, and the honest size of the claim

Moving deterministic quality checks earlier, before the bug that would have justified them, is not a new idea. The instinct predates software: Shigeo Shingo's poka-yoke, the mistake-proofing he built into the Toyota Production System in the 1960s, redesigns a process so the error cannot be made in the first place, instead of catching the defective unit after it ships. The testing world calls the software version "shift left," after Larry Smith's shift-left testing in Dr. Dobb's Journal in 2001: pull quality assurance (QA) toward the start of the lifecycle instead of bolting it on at the end. Linters reach back further than that, to Stephen Johnson's lint at Bell Labs in 1978; type systems and continuous-integration gates are all configured proactively, before any specific defect, in every serious codebase. None of that is mine.

What this contributes is the decision rule for which checks earn the proactive treatment, and the two clauses that keep it honest. Cost and false-positive rate are the axis that sorts proactive from reactive — not the presence of a scar. The proxy brake stops "gate the cheap thing" from collapsing into "gate the worthless thing because it's cheap." And the semantic-residue clause sets the floor: gate beneath the meaning, never pretend to gate the meaning itself. Shift-left says move checks earlier; this says which ones, and where the line is.

Two honest limits. The evidence is one project's governance layer — a decision rule that has held there, across more than a hundred gates, not a measured result I can hand you a number for. And the whole rule presumes a governed surface: artifacts with a settled, written notion of what "correct" means, so there's something stable to couple a check against. On an exploratory surface where the spec is still moving under you, there's nothing yet to gate toward, and the reactive default is right again — you genuinely don't know what the invariant is until something violates it.

The slogan from the earlier essay still holds: most guardrails are earned from failure. This is the carve-out. When a check is cheap, mechanical, and almost never wrong, on something you've already decided is correct, you don't owe it a failure first. Gate the cheap axis, and spend the failures you'd have eaten on the problems no script can see.

I build and write about agent-governed codebases like this one. If you're working out where the rails belong in an AI-assisted project, I'm reachable on LinkedIn.

References

Vasyl Tretiakov, "Gates Earned From Failure: a cost test for agent guardrails," vasyltretiakov.dev, 14 Jun 2026 — companion essay (the cost test this one carves a class out of).
Vasyl Tretiakov, "Make It a Check," vasyltretiakov.dev, 22 Jun 2026 — companion essay (a durable lesson belongs in a check, not in prose).
Larry Smith, "Shift-Left Testing," Dr. Dobb's Journal, Vol. 26, Issue 9, Sept 2001 — origin of the term; see "Shift-left testing," Wikipedia (accessed 24 Jun 2026). The decades-old practice of moving deterministic quality checks earlier.
Shigeo Shingo, poka-yoke (mistake-proofing), Toyota Production System, 1960s — see "Poka-yoke," Wikipedia (accessed 24 Jun 2026). Designing the process so the error cannot occur, rather than catching it after the defect.
Stephen C. Johnson, "Lint, a C Program Checker," Bell Labs, 1978 — see "Lint (software)," Wikipedia (accessed 24 Jun 2026). The foundational static analyzer; deterministic checks run proactively, before any specific bug.
Marilyn Strathern, "'Improving ratings': audit in the British University system," European Review, 1997 — the popular phrasing of Goodhart's law (Wikipedia, accessed 24 Jun 2026): "when a measure becomes a target, it ceases to be a good measure" — the trap the proxy brake guards against.

Verify the Work, Not the Report: a coding agent's success claim is just a claim

Vasyl Tretiakov — Wed, 24 Jun 2026 11:33:15 +0000

A sub-agent's success report is generator output, not ground truth. Verify the work yourself, and reward the agent that refuses a false premise.

In one session this spring I sliced a workspace-wide rename across a handful of sub-agents, dispatched them one at a time, and validated each commit myself before sending the next. That one habit caught three confident, wrong reports in a single afternoon.

The first was a pass that wasn't a compile. A slice came back green, and the build check it cited had finished in 0.34 seconds. A third of a second is not a compilation of a large workspace; it is a cached echo of an earlier one. A real build, run by hand, took 2.4 seconds and genuinely passed — but the report had no way of knowing that, because it was reading the cache, not the code.

The second was a test result that could not exist. A slice reported 121 passing tests from a build target that, run as the report described it, errors out with did not match any packages. A command that fails to resolve its target cannot also count 121 green tests. The number was real; the command attached to it was not. Run the correct way — against the right manifest — the tests did pass. The work was sound. The account of how it was checked was fiction.

The third was a half-applied change reported as done. A rename swept cleanly through every surface the compiler can see and stopped there, leaving the identifiers and comments it cannot see untouched, and reported the sweep complete. The residue surfaced only because a separate check flagged the old term still living in the code.

My own note from that session, verbatim:

I'm re-validating every sub-agent commit myself rather than trusting the reports. This session alone caught a stale-cache "pass," a mislabelled build target, and a half-applied tree — all recovered without damage.

Here is the part that matters, and that the cynical reading misses. In all three cases the work was fine. The compile passed, the tests passed, the rename was finishable in one clean recovery commit. The unreliable artifact was never the code. It was the report.

A report is weaker evidence than the output

I have argued before, in Compiling the Process, that the cheap way to make a high-throughput, low-consistency generator reliable is to put a deterministic verifier in front of it and exploit the asymmetry: checking an output is far cheaper than producing a correct one. A compiler is one instance of that pattern; a test suite is another.

A self-report sits at the wrong end of the same asymmetry. It is the generator narrating its own output — a lossy summary produced by the same process that produced the work, with no independent signal behind it. That makes it strictly weaker evidence than the output it describes. You can re-run the output. You cannot re-run a sentence.

This is not a quirk of one toolchain. Huang and colleagues, in Large Language Models Cannot Self-Correct Reasoning Yet (ICLR 2024), found that "LLMs struggle to self-correct their responses without external feedback, and at times, their performance even degrades after self-correction." A model asked to grade its own reasoning, with nothing external to check against, does not reliably improve and sometimes gets worse. Birgitta Böckeler makes the practitioner's version of the same point about tests: "the agent also generated the tests. So maybe then I have to think about feedback for the test." A verifier the generator authored is not independent of it. A self-report is the limit case — the generator grading itself with no separate signal at all.

So the discipline that everyone already accepts for code applies just as hard one level up. We do not trust agent-written code without a check. We should not trust an agent-written status claim without one either.

Verify the real signal, not the claim

"Verify the report" is easy to nod along to and easy to do wrong. Re-reading the agent's summary more carefully is not verification; it is reading the same fiction twice. Verification means re-running the genuine signal yourself and reading what the system actually returns.

The stale-cache pass is the sharp lesson. The tell was not in the words of the report — those said green, confidently. The tell was the 0.34-second timing, which a real build of that workspace cannot achieve. The fix was to force a genuine compile and read its true duration and exit. The mislabelled target is the same shape: the giveaway was that the cited command could not have produced the cited number, so the number had to come from somewhere the report wasn't naming. In both, the resolution was the same boring move — run the actual command, against the actual target, and read the actual output, rather than the prose about it.

Worth keeping honest about: the verifier can be fooled too. In the stale-cache case my own tooling briefly misled me — the editor's diagnostics were surfacing phantom errors from a mid-edit snapshot, exactly as stale as the cache that produced the false pass. A verifier is only as trustworthy as the freshness of the signal it reads. The defense is to anchor on something the generator could not have produced by accident: a real build time, a real exit code, a real test count from a command that actually resolved.

The inverse virtue: reward the refusal

The same skepticism that distrusts a false success has to cut the other way too, or it curdles into "assume the agent is always wrong." It is not a trust problem. It is an evidence problem. And the cleanest evidence an agent can produce is sometimes the report you least want: your premise is false, and I will not proceed.

In a later build session, two of six delegated tasks came back not as code but as refusals — each because the live codebase contradicted the task's stated premise. One task fixed a count at eight; the agent computed it against current code and found twenty-four, then stopped rather than implement against a number it could see was wrong. Another assumed an existing enforcement check was coarse; the agent traced it and found it had been precise since the day it shipped, reverted its half-built change to a clean tree, and bounced the task back to the design stage. Neither guessed. Both surfaced the contradiction and asked.

That reframes the boundary between deciding what to build and building it. The build stage is not only where code gets written; it is the first place a plan meets live code, which makes it a premise-falsification checkpoint. Its most valuable output, some fraction of the time, is a falsified premise rather than a commit. The usual trigger is an operator's skeptical question. In one case a name had been bothering me, I pushed on it, and the reply came back "You're right on both halves — verified," followed by the evidence and a decision to route the change back to design rather than improvise it.

So the orchestrator's job is symmetric. Distrust the confident success until you have re-run it, and protect the honest refusal instead of overriding it. Both are the same verifier doing the same work: separating a claim from the ground truth it is supposed to describe.

The rule, and why it travels

The operating rule is small. Independently verify after every sub-agent commit. Re-run the genuine signal, not the agent's summary of it. Size each verification so it stays cheap relative to the work — the asymmetry is the whole point, and it collapses the moment checking costs as much as redoing. Treat DONE and PASS as claims awaiting a check, never as facts. And when an agent comes back with a falsified premise instead of a commit, count that as the system working.

None of this is specific to one stack. Anthropic's 2026 Agentic Coding Trends Report frames the shift bluntly: "Software development is shifting from writing code to orchestrating agents that write code." Any orchestrator delegating to a probabilistic generator inherits the same problem. The reports come back; the reports are generator output; and the only thing that converts a report into knowledge is a verifier the reporter did not author.

Honest limitations

This is "trust, but verify," which is old enough to be a proverb. I am not claiming the principle. The only thing new here is where it points — at the agent's status claims, not just its code — and the observation that the verification is cheap enough, against a deterministic signal, to run after every single commit rather than as an occasional audit.

I also want to be precise about what failed, because the easy reading overclaims. In every case above, the work's substance was sound. The model was not a bad worker and was not being willful; it produced a lossy summary of its own output, which is what a low-consistency generator does. The unreliable thing was the report as a piece of evidence, not the code it described. An essay that concluded "agents can't be trusted to do the work" would be drawing the wrong lesson from these receipts.

The discipline only pays while verification stays cheaper than the work — the cost test from Gates Earned From Failure applies here too. If re-validating a result costs as much as producing it, the asymmetry is gone and you are just doing the work twice. The cases that pay are the ones with a fast, deterministic signal to re-run: a compile, a test count, a gate's exit code.

And the usual scope caveat: this is one solo-directed codebase of roughly 150,000 lines, with sub-agents drawn from a single model family, observed across hundreds of development sessions. The receipts are many; the weak point is elsewhere. It is one codebase, one model family, and there is no controlled study against which to measure how often the failure fires versus how often the check catches it.

A claim is not a check

The thread running through all of it: a report is a claim, and a claim earns trust only from a verifier the claimant did not author. That is the rule for the agent's code. It is the same rule for the agent's account of its code. Re-run the work, read the real signal, and treat the honest refusal as worth more than the confident pass.

I write about the intersection of Contact Center systems and AI-assisted engineering, and I take on a small amount of consulting and delivery work in that space — find me on LinkedIn.

References

Jie Huang, Xinyun Chen, Swaroop Mishra, et al., "Large Language Models Cannot Self-Correct Reasoning Yet," ICLR 2024 (accessed 24 Jun 2026).
Birgitta Böckeler, "MCP, Vibe Coding, and Harness Engineering," InfoQ podcast, 14 Jun 2026 (accessed 24 Jun 2026).
"2026 Agentic Coding Trends Report," Anthropic, 2026 (accessed 24 Jun 2026). Cited for framing only; vendor source.
Vasyl Tretiakov, "Compiling the Process," vasyltretiakov.dev, 17 Jun 2026.
Vasyl Tretiakov, "Gates Earned From Failure," vasyltretiakov.dev, 14 Jun 2026.

Published at vasyltretiakov.dev.

Make It a Check: compounding a coding agent's lessons as gates, not CLAUDE.md prose

Vasyl Tretiakov — Mon, 22 Jun 2026 06:14:53 +0000

Compound engineering writes each lesson into the agent's prose. The ones that matter should be checks instead: prose drifts, a gate doesn't.

The canonical guide to compound engineering describes its final step, the one it names compound, in plain instructions: once a task is done, you "Add new patterns into CLAUDE.md, the file the agent reads at the start of every session." I followed that advice for months. The patterns piled up, the agent dutifully re-read them at the start of every session, and a convention I genuinely cared about kept drifting anyway. The rule was right there in the always-loaded file. The agent's output drifted from it under context pressure, fluently and confidently, reaching for whichever term it had seen most recently in the file it was editing rather than the one the rule named. That was never defiance. The model doesn't decide to break a rule; under a crowded prompt it simply weights the tokens in front of it above a line it read far earlier.

The fix that finally held was not a better paragraph. It was a twelve-line script that reads the staged diff and blocks the commit when the wrong word appears. The first time I ran it across the project, it flagged 737 violations of a rule I had already written down and trusted the agent to follow. (I told that terminology-drift story in full in Rails, Not Rules.) That gap between "written down" and "enforced" is the whole essay. The lesson worth keeping should not be added to the prose the agent must remember. It should become the thing that catches the mistake automatically the next time.

The loop is right. The last step isn't.

Compound engineering gets the loop right, and I want to concede that before I disagree with any of it. The premise, that "each unit of engineering work should make subsequent units easier, not harder," is exactly the right ambition for directing an agent. You plan, you work, you review, and then you compound: you feed what you learned back so the next run starts smarter. I run a version of that loop every day, over a codebase of roughly 150,000 lines that a coding agent and I built together, and four published essays came out of watching it run.

My disagreement is narrow and it lives entirely in the last step. Compound engineering's compound step terminates in remembered prose, and remembered prose is the one thing this loop has taught me an agent cannot be trusted to honor. Here is the honest part, though: the same guide, in the very last sub-step of compound, asks a sharper question than its own advice answers. It says to check whether "the system [would] catch this automatically next time." That sentence is the argument I want to make, left sitting as an aspiration next to the instruction that undercuts it. Don't add the lesson to the file the agent reads every session and still applies unevenly. Build the thing that catches the mistake automatically next time.

Why prose loses and a check wins

The reason is an asymmetry I spent a whole companion essay on, so I will state it once and move on. An LLM is a high-throughput, low-consistency generator: prolific, genuinely creative, and unreliable about rule forty-seven when that rule is one line among hundreds of always-loaded instructions competing for its attention. A paragraph in an instruction file leans on that generator to remember and apply the rule every single time. A check does the opposite. It is a cheap, deterministic verifier sitting in front of the generator, and verifying that a constraint holds is far cheaper and far more trustworthy than generating output that satisfies it on the first try, every time. Putting the rule in prose bets on the unreliable half. Putting it in a check bets on the reliable half.

There is a cost asymmetry stacked on top of the reliability one, and it is the part that surprised me. When I worked through whether a slightly redundant check was worth keeping, the accounting came out lopsided: "a doc paragraph costs tokens every session; a redundant gate costs only when it runs." The part of that cost the agent actually pays, tokens in its context, is narrower still. A check that passes is silent; the agent never sees it and pays nothing. Only a failure spends tokens, and only to feed back the one message that steers the next attempt. Prose in the always-loaded file has no such off switch. It is a tax the agent pays on every task, forever, whether or not the rule is relevant to what it is doing, and the prose you accumulate to make the agent smarter slowly crowds out everything else it needs to hold in mind. Compounding lessons as text does not just drift. It bloats the exact surface whose scarcity caused the drift.

The loop, named as a loop

If a check beats a paragraph, the practice is no longer "compound the lessons" but something more specific. Observe the failure as it surfaces in the work, grade it against a cost test, and mechanize the half you can make false-positive-free. The rest stays discipline, on purpose. That loop is not a proposal. It is the thing that, run for a few months, produced four essays, each one a different output of the same step:

A vocabulary that kept drifting became an enforced terminology check (Rails, Not Rules).
Two artifacts that were supposed to agree, and silently didn't, became a bidirectional coupling check (Couple Both Ways).
The question of which lessons earn a check became a cost test (Gates Earned From Failure).
And the realization that a large fraction of these checks govern the process rather than the code became the compiler essay I just cited.

Read that way, the four are not separate ideas. They are four turns of one crank. Compound engineering and I agree on the crank. We disagree on what comes out the end: its version produces more text for the agent to read, mine produces a script that makes the text unnecessary.

Read the transcripts, not just the diffs

The "observe" step has a wrinkle worth its own paragraph, because it is where my version of the loop gets most of its material. Git history tells you what shipped. It is a poor record of why, and it can make deliberate work look accidental and accidental work look deliberate. The session transcripts record the second thing: the reasoning, the dead ends, the moment a convention first cracked. I mine them later like a dated archive, which only works because a transcript is immutable. An idea you didn't act on is dormant in there, not lost.

That habit also keeps me honest, which I learned the embarrassing way. While drafting an earlier essay, a research note of mine attributed a tidy phrase to a named expert. It was load-bearing for a whole section. When I went to verify it against her actual work before publishing, the phrase appeared nowhere she had ever written. I had fabricated it, or my notes had, and the prose felt completely authoritative while being false. The live check caught what my own confident memory did not. That is the same lesson as the terminology gate, turned on myself: the durable, re-runnable verification holds; the remembered version drifts, even when the rememberer is me.

Encode the rule, not the state

Granting that a lesson should become a check, there is a craft to making one that survives. The sharpest framing I have for it came out of a session where I asked the agent, half-seriously, what software could borrow from how DNA encodes a body: "DNA manages to express these body-build instructions in [a] very succinct yet effective way, e.g. each cell lies within 5 cells from a blood vessel." The useful answer was the opposite of compression. The rule it landed on was to "store the generative rule plus a gate that re-derives the invariant, never the enumerated current state."

DNA never stores "every cell sits within five cells of a vessel." That is an emergent property. It encodes a local rule, roughly "a starved cell releases a signal that grows a vessel toward it," and the global property falls out. The software analog is direct. Every place an instruction file enumerates the current state of the world, a count, a list of components, the set of valid destinations, is a place that will rot the moment the world changes and nobody updates the prose. Encode the generating rule and let a check regenerate the fact on demand, and there is nothing to keep in sync. A related move is what I think of as the promoter pattern: "don't express the genome, transcribe the gene when its signal arrives," which in practice means an index that loads the one relevant section only when the task needs it, rather than paying for the whole document every session. I am not alone in converging here. Birgitta Böckeler describes the same shift away from always-on context toward skills where "the agent only loads everything that's in that skill folder when the LLM thinks that this is relevant," and Thoughtworks' latest Technology Radar files "Agent Skills" under feedforward controls for the same reason.

One borrowed instinct is worth refusing. DNA is robust to point mutations because it has no per-cell repair supervisor, so it tolerates redundant, degenerate encodings. A governed agent system wants the opposite. You have a repair agent, so you want fail-closed brittleness: a near-miss should break the build loudly, not quietly still work. Borrowing DNA's fault tolerance would just hide drift behind a green checkmark. Keep the checks brittle.

The half I deliberately didn't automate

A synthesis about mechanizing lessons owes you the place it refused to. The step where a failure gets noticed and a few get promoted into checks is neither automated nor a tidy review pass. There is no batch job that reads a session at close and proposes gates; the judgment happens in real time, in the back-and-forth while the work is underway. Usually I am the one who has to ask "should this become a check?", because the agent does not reliably surface that question on its own; in my experience it is simply not what the model is tuned to reach for. I have looked at automating the front of it, a scanner that reads every transcript and proposes every candidate check, and declined. It would be judgment-heavy and noisy, and a noisy suggestion stream is one you learn to ignore. The cost test that governs the code governs the tooling too: a miss here is cheap and recoverable, so the honest call is to leave the noticing to the live debate and not build the tool. Saying "this is where I chose not to automate" is the anti-hype move the rest of the argument has to earn.

There is a quieter version of the same discipline that I had to be taught mid-session. Having built a wrapper to settle a question once, I caught myself about to re-audit its internals by hand "just to be sure." The correction was crisp: "don't re-orchestrate its steps and don't pre-audit its internals. Verifying loudly fails at runtime if the runner's actually broken; pre-emptive grepping just re-derives what the script exists to settle." If you trust the check, trust it. Re-checking it by hand pays back the exact cost the check was built to eliminate.

What I'm standing on, and what I'm not claiming

None of the pieces here is mine alone, and pretending otherwise would be the tell this audience is right to distrust. The generator/verifier asymmetry is the oldest idea in our tooling: type checkers, linters, continuous integration, and the plain test suite all exist because checking a property is cheaper than producing code guaranteed to have it. The loop is compound engineering's. The front half of it, starting from a written specification, is now a named and tracked technique; spec-driven development has tools like GitHub's Spec Kit and Amazon's Kiro, and Spec Kit even ships a "constitution" of "immutable principles that must always be followed." I would only note that a constitution is still prose the agent must honor, which is precisely the surface I am arguing drifts unless something enforces it. And the cost Thoughtworks names "cognitive debt," the widening gap between humans and the systems agents generate, is the thing this whole practice is trying to pay down. My one narrow claim is the join: that the compound step is cheaper and more durable as a check than as a remembered paragraph, wherever the lesson is regular enough to verify mechanically.

Honest limitations

The deepest limit is that this is not the agent learning anything. It is me externalizing my own attention. The deficit a check repairs is salience, not memory, which is also why I think the practice transfers to human teams who never had a memory problem to begin with. That argument deserves its own essay and is getting one, so I will leave it as a claim here rather than develop it.

Three smaller limits. This is one person on one codebase, where I both own the domain model and write the checks that judge it, so the verifier is never adversarially independent of the thing it verifies. Böckeler puts the same worry one layer up when she notes that "the agent also generated the tests," and a verifier the generator authored is exactly the weak feedback I am warning about. Second, the un-mechanizable half really does stay prose; the argument is "graduate the lessons you can verify cleanly," never "automate everything," and the judgment of which is which is the skill that doesn't reduce to a script. Third, the genre I am positioning against is moving fast, and some of what reads as disagreement today may just be where the practice lands next quarter. Its own "would the system catch this next time?" suggests the distance between us is shorter than the framing implies.

Why this generalizes

Strip away my particular project and the shape is plain. Anthropic's 2026 Agentic Coding Trends Report frames the industry move as a shift "from writing code to orchestrating agents that write code," and orchestration is exactly the context where this arithmetic starts to bite. Any agent working on a governed system accumulates drift faster than a human would, because it has no continuity between sessions, and it will author the check that catches that drift in a few minutes, because writing the check is the kind of small, bounded task it is good at. High drift and nearly-free enforcement is a combination no human team has ever faced, and it changes the arithmetic of the compound step. When a lesson is regular enough to verify, writing it into prose the agent must remember is the more expensive option and the one that fails. So when the loop comes back around to compound, and you are about to add another paragraph to the file the agent reads every morning, ask the better question the guide already buried in its own checklist: would the system catch this automatically next time? If it can, make it. If it can't, that is exactly the prose worth keeping.

This essay was written by directing a coding agent over the project and the four prior essays it synthesizes; I direct and judge, the agent drafts and argues back. The DNA framing and the "encode the rule, not the state" rule came out of one such argument, quoted here close to how it happened.

I build governed agent systems at the intersection of Contact Center software and AI. If that's a problem you're working on, I'm reachable on LinkedIn.

References

Kieran Klaassen and Dan Shipper, "Compound Engineering," Every (accessed 17 Jun 2026). The loop this essay concedes and the compound step it argues against, including "Add new patterns into CLAUDE.md" and "Would the system catch this automatically next time?"
Vasyl Tretiakov, "Rails, Not Rules: Enforcing a Coding Agent's Domain Vocabulary with Checks," vasyltretiakov.dev, 1 Jun 2026 — the terminology check and the 737-violation receipt.
Vasyl Tretiakov, "Couple Both Ways," vasyltretiakov.dev, 9 Jun 2026 — the bidirectional coupling check.
Vasyl Tretiakov, "Gates Earned From Failure," vasyltretiakov.dev, 14 Jun 2026 — the cost test for grading a lesson.
Vasyl Tretiakov, "Compiling the Process," vasyltretiakov.dev, 17 Jun 2026 — the generator/verifier asymmetry, stated in full.
Birgitta Böckeler, "From MCP and Vibe Coding to Harness Engineering," InfoQ podcast, 8 Jun 2026 (accessed 14 Jun 2026). Sources "the agent also generated the tests" and the lazy-loaded-skills observation.
Thoughtworks, "Combat AI cognitive debt — Technology Radar Vol. 34," 15 Apr 2026 (accessed 17 Jun 2026). Source of "cognitive debt" and the feedforward/feedback control framing.
Thoughtworks, "Spec-driven development," Technology Radar (accessed 17 Jun 2026); GitHub Spec Kit's "constitution" of immutable principles. The named prior art for the loop's front half.
Anthropic, "2026 Agentic Coding Trends Report" (accessed 17 Jun 2026). Source of "from writing code to orchestrating agents that write code"; cited for the framing of the shift, not as evidence of results.
Eric Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley, 2003. ISBN 978-0321125217. The ubiquitous-language lineage the terminology check enforces.

Published at vasyltretiakov.dev.

Compiling the Process, Not the Code: a machine-checked workflow for coding agents

Vasyl Tretiakov — Wed, 17 Jun 2026 07:43:53 +0000

The deterministic checks guarding an agent-built project are a compiler for the workflow — and the load-bearing half of them compile the process, not the code.

Halfway through a Tuesday in early June, after I'd spent the morning turning a handful of written-down conventions into scripts that block a commit when they're violated, I typed a half-formed thought at my coding agent instead of another task:

it feels like with script gates we're building a compiler for our coding-agent-enabled workflow

There were around three dozen of these checks in the project by then. None of them compiles anything in the usual sense. They grep, they parse, they diff one artifact against another and exit nonzero when two things that must agree don't. But the feeling was specific and it wouldn't go away, so I want to take it seriously: what is true about it, and exactly where it stops being true. The place it breaks is the useful part.

The asymmetry under the feeling

The agent's first move was to tell me "compiler" was the wrong word for the right intuition. The precise version is a generator/verifier split, and a compiler is one instance of it.

An LLM is a high-throughput, low-consistency generator. It is prolific and genuinely creative, and it is unreliable about rule forty-seven under context pressure. It will use whichever term it saw most recently, in whichever file it's editing, and be fluent and confident while doing it. The cheapest way to make an unreliable generator reliable is not to lecture the generator. It's to put a deterministic, cheap verifier in front of it and exploit an asymmetry: checking that a constraint holds is far cheaper, and far more trustworthy, than generating output that satisfies it on the first try every time. A compiler is that pattern frozen around a language. What I was building was the verifier sized to my generator's specific failure modes.

I should concede immediately that none of that asymmetry is new. It is the oldest idea in our tooling. Type checkers, linters, continuous integration, and the humble test suite all exist because verifying a property is cheaper than producing code guaranteed to have it. The compiler metaphor isn't new either; people have reached for it about prompts and pipelines for a while. So if the essay were just "checks make agents more reliable," it would be a familiar and slightly boring claim. The interesting part is narrower, and it lives in the gap between what a compiler promises and what these checks actually deliver.

Why it feels like a compiler and not just a pile of linters

Most lint suites are a pile. You accumulate rules, you run them, none of them knows about the others. The tell that this had become something more compiler-shaped was that I had started writing checks whose job was to police the other checks: a script that reads every gate and verifies each one is registered, named to convention, and wired into the commit hook the same way. Rules about the rules, mechanically enforced.

That meta-layer is the line between a lint pile and what a compiler-minded person would call static semantics. A pile says "these specific tripwires didn't fire today." A semantics says "the rules themselves are well-formed, so a whole class of malformed rule can't exist." Crossing that line is what makes the workflow feel like it has a grammar rather than a checklist. It is also, as it turns out, exactly the feeling you should be most suspicious of.

Three places the metaphor breaks

A careful reader should distrust a flattering metaphor, so here is where this one falls apart. There are three breaks, and each one is load-bearing.

These are linters, not a type system. A compiler makes a guarantee with a specific shape, the one Robin Milner named soundness: if a program type-checks, an entire class of error is impossible — "well-typed programs cannot go wrong." My checks make a guarantee with a much weaker shape: these particular tripwires did not fire on this particular commit. A spelling denylist catches the words on the list and is blind to the synonym nobody listed. A check that greps a response shape proves the shape is present, not that the behavior behind it is right. The space between the invariant I care about and the grep that approximates it is precisely where the real bugs continue to live, and a green suite hides that space rather than illuminating it. I spent a whole companion essay, Gates Earned From Failure, on the hazard that follows from this: a passing check converts active vigilance into misplaced trust. The compiler-specific version of the point is just that the word "compiler" smuggles in a soundness guarantee these checks have not earned.

It's accreted, not designed — there's no source grammar. A real compiler starts from a language definition. Someone wrote down what a well-formed program is, and the checker enforces that document. My "well-formed workflow" was never written down. It is defined, entirely and only, by the union of whatever checks happen to exist on a given day, and each of those checks was born reactively from one incident that had already bitten me. So the de-facto language the agent is being held to is "whatever passes today," and the gaps in it stay invisible until the next incident reveals one. The always-loaded instruction file is the closest thing to a grammar I have, and it is a prose approximation, not a specification.

The programmer and the language designer are the same agent. This is the one that should worry you most. A real compiler is adversarially independent of the code it judges; the people who designed the language had never seen your program. In my setup the same agent writes the artifact and writes the gate that will judge the artifact's future versions. A gate can therefore faithfully encode the author's blind spot and then certify work as clean against it, forever. Enforcing over discipline is a good rule with a hard ceiling: a gate only ever catches what someone already thought to mechanize. Birgitta Böckeler makes the same observation one layer up, about tests, when she notes that the feedback is weaker than it looks once "the agent also generated the tests." A verifier the generator authored is not independent of the thing it verifies.

The half that compiles the process

Here is the part worth the essay. When I actually looked at my three dozen checks and asked what each one verifies, a large fraction of them were not checking the code at all. They were checking the process.

One enforces that a task needing a written specification can't be picked up in an implementation phase. One flags a task whose declared dependency has gone stale. One enforces the lifecycle of an amendment: proposed, reviewed, accepted, in that order, with no skipping. None of these reads a line of program logic. Each one encodes a piece of software-development-life-cycle (SDLC) discipline that, on a human team, lives in nobody's compiler and everybody's head. We call it professionalism, or being senior, or knowing how things are done here.

Human teams almost never mechanize that layer, and the reason is economic, not cultural. People hold context across days and weeks, judgment tells them when a rule applies, and their name on the commit supplies accountability. Those three things are exactly what let a team write "we follow the amendment process" in an onboarding doc and then assume compliance. Building a script to enforce it would cost more than the drift it prevents on a small team that mostly remembers anyway.

With an agent, that economic calculation inverts, hard. The generator has no continuity between sessions, so the drift rate is high. And the same agent will author the enforcing check for you in a few minutes, so the cost of mechanizing is near zero. When drift is frequent and gates are nearly free, you are rationally pulled to formalize methodology that was never worth formalizing in the entire prior history of software teams. That is the actual discovery hiding under the compiler feeling. Not that I added a spell-checker for my domain vocabulary, but that the economics of agent-directed work quietly made it worth compiling the development process itself into machine-checked constraints.

The instruction file stops being a manual

The same hour I sent that "compiler" message, I followed it with a request that only makes sense if the metaphor is real:

can we retire or compress or move (to HANDBOOK) some instructions… Errors and warnings already explain the error briefly and point to a HANDBOOK section with more details

Once a check owns a convention, the paragraph in the always-loaded file that used to beg the model to remember that convention is dead weight. The gate does the remembering. So you delete the prose and let the error message carry a pointer to a longer explanation that gets read only when the check actually fires. By the end of that session the project had settled into three clean layers: an always-loaded file that states directives, a handbook that explains mechanism, and the gates that enforce. The prose had stopped duplicating the type-checker.

That is the compiler architecture made literal, and it reframes what the instruction file is for. It is not a manual the agent is trusted to follow. It is closer to a grammar: the small, stable statement of intent that the enforcement layer makes binding. Correctness moved out of the prose and into the checks, and the prose got shorter and more honest about its job.

Which invariants earn a type rule

The skill this leaves you with is not "write more gates." A check family only ever grows, and a verifier that accepts everything is as useless as one that rejects everything. The real judgment is which invariants are cheap enough, sound enough, and quiet enough about false positives to deserve promotion into a type rule, and which should stay discipline. That judgment is a cost test, and I worked it out in detail in Gates Earned From Failure rather than repeat it here: build the rule when the evidence the drift is real, times the cost of that drift, outweighs the standing cost of the check.

The compiler framing adds one sharp reminder to that test. Over-strict typing is a real failure mode, not a safe default. The same morning all of this crystallized, I dropped a gate I'd built, because it fired on too many legitimate cases to be worth its noise. In compiler terms that gate was refusing to compile sound programs, and a type system that rejects valid code is exactly as broken as one that admits invalid code. A rejected-but-correct commit teaches the same lesson a missed bug does: the rule was wrong. Calibration runs in both directions.

Honest limitations

This is one person's experience with one codebase of roughly 150,000 lines, where I own the domain model and the workflow both, so the language designer and the programmer being the same agent is my daily reality rather than a thought experiment. That collapse is the project's sharpest limit, and I can't tell you I've escaped it. I can only tell you I try to audit what isn't gated as deliberately as I build what is, because the suite measures the absence of known tripwires, never soundness.

There is also no source grammar, and I'm not sure a healthy version of this ever gets one. The honest description of the system is not "a compiler for my workflow" but "an accreting pile of verifiers that has grown a thin layer of self-governance and started, in places, to behave like one." Whether that hardens into something with a real specification, or stays a well-tended pile, is a question I haven't answered. And the whole economic argument turns on a solo author hitting his own drift the same day he builds the gate. On a team the context cost compounds and the false-positive triage lands on people who never felt the original failure. I have evidence for the solo case and a suspicion the shape survives the move, which is not the same thing.

Why this generalizes

Strip away my particular domain and the pattern is plain. Any agent working on a governed system accumulates drift faster than a human would, and will author the check that catches it for almost nothing. The first checks you reach for are the obvious ones, over the code and the vocabulary. The coupling-check pattern that catches a term renamed in three surfaces and missed in the fourth is one I've written about on its own, in Couple Both Ways. But the ones that surprised me, and the ones I'd point a skeptical reader toward, are the checks that compile the process — the stage gates and lifecycle rules and staleness tripwires that encode how the work is supposed to flow. Human teams carry that in their heads because mechanizing it never paid. Directing an agent is the first context I've worked in where it does.

This essay was written by directing a coding agent over the project it describes; I direct and judge, the agent drafts and argues back. The reframing from "compiler" to "generator/verifier," and the three ways the metaphor breaks, came out of that argument — quoted here roughly as it happened.

I build governed agent systems at the intersection of Contact Center software and AI. If that's a problem you're chewing on, I'm reachable on LinkedIn.

References

Vasyl Tretiakov, "Rails, Not Rules: Enforcing a Coding Agent's Domain Vocabulary with Checks," vasyltretiakov.dev, 1 Jun 2026 — companion essay; the enforce-over-discipline origin.
Vasyl Tretiakov, "Couple Both Ways: bidirectional checks against silent drift," vasyltretiakov.dev, 9 Jun 2026 — companion essay; the coupling-check pattern in depth.
Vasyl Tretiakov, "Gates Earned From Failure: a cost test for agent guardrails," vasyltretiakov.dev, 14 Jun 2026 — companion essay; the cost test and the green-checkmark hazard.
Birgitta Böckeler, "From MCP and Vibe Coding to Harness Engineering: How Did AI Native Engineering Evolve in One Year," InfoQ podcast, 8 Jun 2026 (accessed 14 Jun 2026). Source of "the agent also generated the tests."
Robin Milner, "A Theory of Type Polymorphism in Programming," Journal of Computer and System Sciences 17(3):348–375, 1978 (accessed 17 Jun 2026). Origin of the type-soundness guarantee ("well-typed programs cannot go wrong") that breakage #1 says these checks have not earned.
Eric Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley, 2003. ISBN 978-0321125217. (The ubiquitous-language lineage the terminology checks enforce.)

Published at vasyltretiakov.dev.

Gates Earned From Failure: a cost test for agent guardrails

Vasyl Tretiakov — Sun, 14 Jun 2026 16:00:43 +0000

Every guardrail in my agent-built project was earned from a real failure, not designed up front. A cost test for when to build one, when to wait, and when to retire it.

On a Wednesday in late May I caught a bug by reading. The project's glossary — the canonical list of the domain terms my coding agent is required to use — had drifted from the domain model I actually carried in my head. Nothing flagged it. No test failed, no check fired, no compiler complained. I noticed because I happened to be reading the file and the words were wrong.

What I typed next is the whole argument in one line. Not "fix the glossary," not "I'll be more careful," but a question aimed at the toolchain instead of the error:

the GLOSSARY already drifted away from my domain model vision and I would like to prevent this in future refactors

That reframes the task. The drift stops being an error to correct once and becomes a signal about the toolchain: if I caught a class of drift by reading, an enforcement axis is missing. The fix isn't to read harder next time. Reading harder is just minting another rule in your head, another wish. The fix is to add the check that couples the two things that drifted, and then never spend that attention again.

I did not sit down and design a governance system, and that's the most transferable thing here. Every check in this project — a couple dozen now, wired as pre-commit hooks — was born from a particular drift I'd already been bitten by. The ordering is the point. The failure came first; the gate came second, shaped to the exact failure.

Renames stopped being discipline

The clearest place this showed up was renaming. (I told the measurement side of this story in a companion piece, Rails, Not Rules: the first time I scripted a terminology check it found 737 violations in code I'd thought I was keeping clean. This is the other half — not the count, but the habit that grew from it.) Early on, retiring a domain term meant what it means everywhere: search, replace, move on. Then a retired term came back. Not because anyone re-typed it, but because a rename had swept the code and missed the surfaces a compiler never reads — a spec document, a generated config, a comment that mattered. Weeks later the old word resurfaced through a path I hadn't thought to look at.

The detail that stung was where the residue hid. In one rename, the leftover instances of the old term weren't in some forgotten corner of the codebase. They were in the specification documents, the very files being rewritten to drive the rename in the first place. The agent, under my direction, was editing the spec to declare the new vocabulary while leaving the old vocabulary scattered through the same spec. The document asserting the rename was itself out of compliance with it. You cannot out-discipline that, and watching the diff doesn't save you either: no amount of review reliably catches the contradiction being introduced in the very pass that's concentrating on the change. Only a check that reads the document back to you after the fact will, and that's what eventually went in.

It helps to be clear about why renames recur at all, because it isn't sloppiness. The glossary is the written home of what Domain-Driven Design (DDD) calls the ubiquitous language (UL): the single vocabulary that runs from a domain expert's sentence to a class name. Eric Evans is emphatic that this language is not written once and frozen; it is continuously distilled, sharpened every time the team's understanding improves. Each sharpening is a rename. So the renames are the system working as designed, not a mess to stamp out — which is exactly why the residue problem is permanent and worth a gate rather than a resolution to be tidier.

After that, a rename was never just "use the new word now." It became three things that ship together: rename the term, add a glossary entry recording the old word as retired, and add a check keyed to that retired word so it can't silently return. The rename and the rail land in the same commit. Skip the rail and you're trusting that every future sweep will be perfect, which is the assumption that just failed.

So far this is a tidy story: let each gate be earned by a real, observed drift, and your instruction file never bloats into the 200-line document nobody, human or model, can hold in their head. A rule you add from imagination is a guess about a failure that may never come. A rule you add from a failure you actually hit is paid for. It has a body behind it. The failure log is the design document.

I believed that cleanly for about a month. Then I had to argue with my own agent about it, and the rule came out more interesting than it went in.

The exceptions that earn a gate early

The trouble with "wait for the drift" is that some drifts you cannot afford to observe even once. If the first occurrence is a deleted production table, or customer data in a log line, or a fabricated citation in something already published, "let it go wrong cheaply once" is a contradiction. There is no cheap once.

I put this to the agent as a proposed amendment: build a gate proactively when the failure mode is already well-attested and the check is cheap and low-false-positive, or when the first occurrence is expensive or irreversible. Stay reactive when the rule is judgment-heavy, the convention is still in flux, or the drift is cheap to catch and fix once. The reasoning is mundane risk math, not ideology: you pre-pay for a gate exactly when the expected cost of waiting exceeds the standing cost of the gate.

This repo runs on that amendment, and I can point at two receipts. The writing project, the one these essays live in, has a check that blocks any cited URL not logged in an evidence file. I built it before it had earned itself in the usual reactive way, because the failure it guards is the expensive kind and it had already happened once: a fabricated citation reached a draft. A published fabrication doesn't get a cheap first occurrence. So the gate went in proactively, licensed by the cost test: is the failure high enough cost, and is the check free of false positives?

The second receipt is the other half of the amendment, the well-attested-and-cheap case rather than the irreversible one. The same writing project runs a prose check on these very drafts, flagging the small set of mechanical tells that mark machine-generated text. That failure mode wasn't hypothetical and it wasn't expensive-once; it was a known, recurring, deterministic pattern I'd have to correct on every essay forever. Well-attested, cheap to detect, low false-positive: that's the second carve-out, and it's the reason this paragraph can't lean on more than two em-dashes without the gate stopping the commit. The check is the essay's own argument applied to the essay. The rule "gates are earned" survives both times, with named carve-outs for failures you can't afford to rehearse and failures you've already seen enough times to name precisely.

When a rule won't gate, find its shadow

Between "earn it from a failure" and "stay reactive" sits the move that does most of the real work, and it's the one I'd most want a reader to steal. A lot of rules look un-gateable because the thing they protect is a meaning, and a script can't read meaning. "Don't restate the protocol prose in two places" isn't grep-able. Neither is "don't leave a dangling reference." The reflex is to give up and write the rule into an instructions file as a wish. The better move is to stop checking the rule and check a structural proxy for it: a syntactic shadow the real rule casts, one a script can see and that almost never flags an innocent.

It works more often than it has any right to. "Don't duplicate the protocol in prose" became "no narrative text in column zero of these files," because the protocol lived in indented blocks and unindented prose was the tell. "No dangling references" became "every TODO( carries a live task slug." The em-dash rule guarding these very drafts is the same trick: I can't gate "don't write like a language model," but I can gate "no paragraph carries more than two em-dashes," which is a measurable shadow of the thing I actually want. Each time, a rule I'd have sworn was judgment-only turned out to cast a shape a grep could catch.

The proxy is never the rule, though. A structural shadow is an approximation by construction, so every proxy gate ships with a built-in gap between what it checks and what you wish it proved. That gap is the next failure.

The failure that hides inside a green checkmark

Here is the part I got wrong, and the agent caught.

I had been treating "cheap to write" as most of what makes a gate worth building. The agent pushed back on two fronts. First, cheap has to mean cheap to maintain, not cheap to write. A gate is a standing liability, not a one-time cost. Every legitimate evolution of the convention now has to route through the check, and you pay an update tax and a false-positive-triage tax for the life of the gate. The authoring cost is the small number.

The second point is the one that changed how I think. The agent put it more sharply than I had:

Once a green gate exists, people stop eyeballing the thing the gate appears to cover. So a cheap-but-approximate gate over a real invariant can be worse than no gate — it converts active vigilance into misplaced trust.

The dominant hidden failure of a cheap gate isn't noise. It's false confidence. A cheap check usually only approximates the invariant you care about. A text search proves a token is absent; it does not prove a concept was actually renamed everywhere it lives. A fast compile passes against a stale cache. A sub-agent reports "tests passed" and you don't re-run them. In every case the green checkmark covers less than it appears to.

An outside read lands in the same place. Birgitta Böckeler, whose harness vocabulary I lean on below, notes that test feedback is weaker than it looks once "the agent also generated the tests." A verifier the generator authored is not independent of the thing it checks, so a passing suite certifies less than it seems to — the same gap, one layer up.

The rename gate from earlier is a perfect specimen — a structural proxy of exactly the kind I just praised, which makes it the more humbling, because it's one I'm proud of. It greps for the retired word and passes when the word is gone. But "the old word is absent" and "the concept was correctly migrated" are not the same claim. You can delete every instance of the old term and still have left the idea it named half-translated, split across two new words that should have been one, or attached to the wrong entity. The gate is green and the migration is wrong, and now nobody's reading the diff with the old suspicion, because the check says it's handled. The check is real and worth having. It just proves a narrower thing than its green checkmark advertises, and the discipline is to keep knowing the difference instead of letting the green absolve you.

That's the trap in one move. You were watching the thing carefully when nothing claimed to watch it for you. Now something claims to, so you look away, and the gap between what the check proves and what it looks like it proves is exactly where the next bug lives. When those two diverge and the invariant matters, that argues against the cheap gate, not for it.

This isn't an argument against gates. It's an argument for knowing which of your green checkmarks are load-bearing and which are decorative.

It's a ladder, not a switch

The other thing I had been collapsing was the choice itself. I'd been treating it as binary: gate the thing, or stay reactive and fix it by hand when it breaks. There are at least four rungs, and most failures want one of the middle two.

The distinction isn't mine — manufacturing got there decades ago. Shigeo Shingo's poka-yoke, the mistake-proofing of the Toyota Production System, already splits a control that makes the error impossible from a warning that only signals it. The two middle rungs below are those two ideas wearing software clothes. Birgitta Böckeler's harness engineering gives the orthogonal axis: guides that steer the agent before it acts, sensors that observe after. A blocking gate is a sensor with teeth; a default path is a guide that leaves nothing to sense.

The strongest rung isn't a gate at all. It's the default path — Shingo's control type: make the wrong artifact hard to author in the first place. A template, a generator, a structured section that only has slots for the right shape. There's nothing to false-positive on, because you're not checking after the fact, you're removing the way to get it wrong. When a failure is common but a precise gate would be noisy, this beats the gate.

Below the blocking gate sits the tripwire — Shingo's warning type: warn without blocking, or ask for confirmation before an irreversible step. This is the right reach for failures that are catastrophic on first occurrence but genuinely hard to gate cleanly — the data deletion, the leak, the fabrication. You don't need a perfect detector. A loud "are you sure, here's what you're about to overwrite" buys most of the protection at a fraction of the false-positive cost. The mistake is letting "we can't build a clean gate for this" collapse into "so we stay fully reactive" on a failure you can't take twice.

The rung that doesn't work is the one that looks like discipline: a script everyone is supposed to run but nothing enforces. It's neither a gate nor a default path, so under any real deadline it just gets skipped. A rule that lives only in prose is a preference, however firmly phrased.

So the ladder, top to bottom: blocking gate, default path, tripwire, stay reactive. "Should I gate this" was always the wrong question. The question is which rung the failure earns.

The cost test cuts both ways

Because a gate is a standing liability, "earned" is a real filter, and a filter that only ever says yes isn't filtering. The same cost test that licenses a proactive gate also tells you when to leave something ungated, and I find I reach for it in that direction about as often.

A concrete one: I recently changed a documentation convention across the project. A pure convention change, no behavior, no domain term retired. The reflex in a repo that leans this hard on enforcement is to add a check for the new convention. What I wrote instead was an instruction to not:

Don't add a gate or script — this is a convention change only; the enforcement hook stays deferred per the cost test.

The convention isn't load-bearing, a check on it would generate churn every time the convention itself evolves, and the failure if someone gets it wrong is cosmetic. No gate. Not yet, maybe not ever.

The same logic runs in reverse, against gates that already exist. A check family only grows: one iteration of mine added six at once, and nothing in the workflow ever pushes the other way. The tidy instinct is a recurring "prune the gates" ritual at every close. That instinct is right about the pressure and wrong about the mechanism, for two reasons that separate a gate from a stale paragraph of prose. A doc paragraph taxes every session, so doc-bloat is pure, constant rent; a redundant gate taxes only when it runs, so its bloat is mostly latent. And a stale doc is just noise, while a gate is a load-bearing invariant whose removal risks silently dropping a correctness check no one notices is gone. So pruning earns more care and less frequency than tidying prose, not the same recurring slot. A full portfolio review every close would usually find nothing, and a ritual that usually finds nothing is the skipped-discipline trap again, dressed as housekeeping.

Two corollaries fall out. "Merge these two similar gates" defaults to no: two precise checks beat one fuzzy merged one, and a merge is a design act with its own failure mode, not janitorial cleanup. And the counter-pressure that is warranted should itself be mechanized rather than left to willpower — a redundancy meta-check that reads the coupling each gate already declares in its header and flags real overlap, fired on a threshold or a phase boundary, never once per commit. Even pruning gates wants a rung, not a resolution to be tidier.

A good gate even pays a rent rebate. Every rule you load into the agent's always-present instructions is standing context it has to carry, and that context isn't free. A deterministic check lets you delete the paragraph of prose that used to beg the model to remember a rule and replace it with a script plus a reference read only when it fires. The gate does the remembering, so the context window doesn't have to — fewer gates can mean a heavier prompt, not a lighter one.

The rule about gates is itself a gate

The recursion is the part I like best: the system diagnosed the bias on itself.

Everything in this project leans toward enforcement over discipline. The agent named the hazard better than I had:

the whole check-* script family leans hard toward enforce-over-discipline, so a fresh session inherits a "when in doubt, add a check" texture with no stated boundary. That absence is a latent gap.

A bias toward gating with nothing written down about when not to. The honest response, by this essay's own logic, is not to immediately write the boundary rule into the always-loaded instructions. That would be exactly the proactive over-gating the rule warns against. The boundary rule hasn't failed yet.

So we left it ungated, with a concrete trigger: the first time any session ships a gate that turns out high-false-positive or merely approximate, that's the observed failure, and the boundary rule gets promoted then — and into a read-on-demand reference, not the always-loaded file, because it's judgment-grade material rather than a per-session directive. The meta-rule about earning gates is itself being made to earn its place. The failure log is the design document, all the way up.

Honest limitations

This worked for one person, on one roughly 150,000-line codebase, where I own the glossary and the domain model is a single coherent thing in a single head. "Earned from failure" is cheap when the failure is yours and you hit it the same day you build the gate. At team scale the picture is harder: the standing context cost compounds, the false-positive triage falls on people who didn't feel the original pain, and "let it go wrong cheaply once" is a different proposition when the once is someone else's afternoon. I don't have evidence for the team case. I have evidence for the solo case and a suspicion the principles survive the move, which is not the same thing.

One tempting claim I'll resist. I had a hunch, watching the sessions, that the agent resists proposing gates proactively — that it fixes the immediate issue and only designs the enforcement when pushed. There's a clean-looking receipt for it. Asked to prevent a recurrence after I'd approved a fix, the agent replied:

Approved. Let me make the fix first, then design the enforcement check.

Fix first, gate second, only once prompted. But when I went looking for the pattern, the support was mixed. In the flow of a task it does tend to fix first and gate when asked. Hand it the same question as a matter of policy, though, and it argues for more proaction than I'd proposed, not less. So I won't dress that up as a clean behavioral finding. It's a real tension I don't yet understand well enough to assert.

Why this generalizes

None of this is specific to my domain. Any agent operating on a governed system — a customer-service platform, a billing pipeline, anything where words have to mean the same thing across many surfaces — accumulates the same class of drift, and the same four rungs apply.

Picture the Contact Center case concretely, since it's the domain I know best. A single concept gets named in a routing rule, a reporting dashboard, an agent-facing script, and the schema of the system that ties them together. Rename it in three of those four and the fourth keeps quietly emitting the old word, so the dashboard and the router disagree about what they're counting, and nothing crashes. That's the drift, and it's invisible to every test that checks behavior rather than vocabulary. The gate that couples the surfaces is the only thing that catches a disagreement no single component is wrong about. That coupling pattern, applied in both directions, is the subject of a companion essay, Couple Both Ways. An agent makes this worse, not better, because it will cheerfully and fluently use whichever term it last saw, in whichever surface it's editing.

The transferable move is to stop treating your instruction file as the place where correctness lives. Prose is where preferences live. Correctness lives in the gates, default paths, and tripwires you build, each one shaped to a failure specific enough to point at. You don't govern an agent by predicting how it'll go wrong. You let it go wrong cheaply where you can, and you convert each real miss into the cheapest rung that makes it impossible to miss that way again.

This essay was written by directing a coding agent over the project it describes; I direct and judge, the agent drafts and argues back. The argument-back, in this case, is most of beats four and five.

I build governed agent systems at the intersection of Contact Center software and AI. If that's a problem you're chewing on, I'm reachable on LinkedIn.

References

Vasyl Tretiakov, "Rails, Not Rules: Enforcing a Coding Agent's Domain Vocabulary with Checks," vasyltretiakov.dev, 1 Jun 2026 — companion essay.
Vasyl Tretiakov, "Couple Both Ways: bidirectional checks against silent drift," vasyltretiakov.dev, 9 Jun 2026 — companion essay (the coupling pattern in depth).
Eric Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley, 2003. ISBN 978-0321125217. (Ubiquitous language and its continuous distillation.)
Birgitta Böckeler, "Harness engineering for coding agent users," martinfowler.com, 2 Apr 2026 (accessed 9 Jun 2026).
Birgitta Böckeler, "From MCP and Vibe Coding to Harness Engineering: How Did AI Native Engineering Evolve in One Year," InfoQ podcast, 8 Jun 2026 (accessed 14 Jun 2026). Source of the "the agent also generated the tests" point.
"Poka-yoke," Wikipedia (accessed 9 Jun 2026). Shigeo Shingo's mistake-proofing; the control-vs-warning split.

Published at vasyltretiakov.dev.

Couple Both Ways: bidirectional checks against silent drift

Vasyl Tretiakov — Tue, 09 Jun 2026 16:50:23 +0000

The coupling-check pattern: catching the drift between any two artifacts that are supposed to agree.

A class diagram in my project claimed that two repository finders were each backed by a remote procedure call (RPC). Neither call existed. The diagram had been generated, reviewed, and committed, and it sailed through every check I had aimed at my diagrams, of which there were already three. It described a system slightly richer than the one I had actually built, and nothing flagged the difference.

It passed because each of those three checks looked at a different axis. One validated that every type named in the diagram resolved to a defined class. One read the architecture diagram's sequence operations. One forced a specific glyph to carry a citation. Each was a perfectly good check. But none of them coupled this claim, a finder asserting a backing call, to the place that claim was supposed to resolve: the actual service definition in the component's wire contract. The drift wasn't inside any one check. It lived in the seam between them, in the one direction nobody had wired up. The fix was a fourth check that closes the seam: every finder in the diagram must name a backing RPC that resolves to a real declaration in that component's protobuf, qualified by component so a pointer to the wrong service still fails, or else be tagged explicitly as proposed-but-unbuilt work. Run against the existing diagram, it found the two finders pointing at nothing. They had been asserting un-built calls, quietly, through every prior gate.

I've come to think that small episode is the most transferable thing in the project, more so than the terminology linter it grew out of, which I wrote up in a companion piece, Rails, Not Rules. If the story stopped at "I wrote a checker for my diagrams," it would be a tip. What made it generalize was noticing that the check was an instance of a shape, and the shape was reusable almost everywhere.

The shape

Call it bidirectional manifest coupling. By manifest I mean any artifact that declares an intent some other artifact is supposed to honor: a glossary entry, a spec section, an edge in an architecture diagram, a service in a wire contract, an item in a work queue, even a document's own table of contents. The defining trait is that a manifest makes a claim about something else. The glossary claims the code uses these words. The diagram claims these components talk over these protocols. The finder claims a call backs it.

A coupling check picks two artifacts that are supposed to stay in agreement, and that drift apart silently because nothing connects them, and it fails when they disagree, in both directions. Often one endpoint is the manifest and the other is plain reality: the on-disk code, the filesystem, the set of services that actually exist. Just as often both endpoints are documents, and the check couples one declaration to another. The glossary and the code are one pair. The diagram and the spec are another. Once you have the move in hand, the pairs are everywhere.

Why both directions

Bidirectionality is the load-bearing word, and it's exactly what the opening story was missing. A one-way check lets the drift relocate to the direction you didn't cover. "Every component named in the diagram exists on disk" is a fine rule, and it would have caught a diagram that referenced a deleted module. It would have done nothing about my two phantom finders, because that drift ran the other way: the diagram asserted a call the code didn't provide. Coverage in one direction quietly grants a license to drift in the other.

So the cleanest checks in the project assert closure both ways at once. The architecture-coverage check is the tidy example. The diagram carries an explicit manifest of its components and edges, and the script asserts that the component set matches the on-disk components exactly, every listed component has a spec and every spec'd component is listed, and that every edge names endpoints the manifest knows about, with the wire-protocol edges further required to resolve to a real service declaration. When I built it, twenty-two components and forty-two edges agreed; a fault-injection test confirmed it goes red on a bogus edge. There is nowhere for a disagreement to hide, because both the presence and the absence of a thing are checked from both sides. That is the actual goal of the whole exercise. Not preventing the agent from ever being wrong, which is hopeless, but guaranteeing that when it is wrong, something turns red before I have to notice by eye.

The first run is a free audit

Building the check is also the first time you find out whether the two artifacts agreed in the first place. Authoring a coupling gate quietly assumes the surfaces it joins are already consistent, and they usually aren't; that latent disagreement is most of why the gate was worth writing. I added a membership check between two restated state-transition tables only after clearing them by eye, reading both and judging them consistent. Its first run failed anyway: three other surfaces agreed that one state could transition to a terminal one, and the table I'd eyeballed as canonical was the single place that had silently dropped the edge. The by-eye clearance slid right over a missing row; the mechanical set-difference did not. So I treat a new coupling check's first-run failures as findings, not as setup noise to quiet down, and I budget the time to chase them. The check earns its keep twice: once as the backfill audit that runs the day you write it, and again as the guard that holds the line afterward.

The pairs are everywhere

Once the shape is in hand it stops being a clever trick and becomes the default move. The commit where I mechanized a completely different pair names it outright: coupling a document's section index to its sections, it describes itself as "the same manifest-coupling pattern as the other check scripts." By then I wasn't inventing anything. I was reaching for a stamp I already owned.

The pairs accumulated fast. The glossary and the code, the original pair, where a retired term must not reappear anywhere the compiler can't see. An architecture diagram and the spec it claims to depict. A class diagram's types and their definitions. The wire contract and the spec that documents its fields, checked for field-level congruence so a renamed field can't drift between the two. A TODO left in the code and the entry that's supposed to track it in the work queue. The tool surface an agent is allowed to call and the underlying RPCs, checked for symmetry so neither grows an orphan. A document's table of contents and the document. Each pair gets one check that says: a claim on this side must resolve to something on that side, and the reverse. None of these is clever in isolation. The leverage is in recognizing that they are all the same check wearing different nouns.

The discipline that keeps it honest

Three things stop this from degrading into a wall of red noise, and all three are worth more than the pattern itself.

The first is that a blocking check has to be false-positive-free, and the reliable way to get there is a single unambiguous syntactic trigger plus opt-in for everything else. In the behavioral-coupling check, the trigger is one diagram glyph that marks a claim as "derived." A claim wearing that glyph must carry a backing citation to a spec section or a proposed task; any other claim may opt in to the same discipline but isn't forced to. Because the glyph is the one thing that fires the rule, the check cannot misfire on prose it was never meant to read. This is the same constraint I'd put behind every gate I trust: if you can't define the trigger crisply enough to be false-positive-free, the check will get bypassed, and a bypassed gate is worse than none. A coupling check that flags honest work teaches everyone to ignore it.

The second is an escape hatch for work that is legitimately ahead of itself. A check that demands two artifacts agree exactly will go red the moment you deliberately let one lead the other, and design-ahead-of-code is a normal, healthy state: a diagram or a spec describing where the system is going before the code catches up. A strict bijection with no relief for that would fire on every honest in-progress diff, which is the red-noise failure arriving by a second route. So each coupling carries a release valve. A claim that hasn't landed yet gets tagged as proposed, with the tag naming the open work item that will close the gap, and the check accepts the divergence as long as the tag is there. That turns a hard "these must match" into "these must match, or the gap is tracked and intentional." The work queue becomes the ledger of what is allowed to be out of sync and why; when the tagged work lands and the tag comes off, the check snaps back to demanding exact agreement and catches anything that merged still mismatched. A bijection you can't ever postpone isn't enforceable on a moving project. A bijection with a tracked exception is.

The third is knowing precisely what the check does and doesn't prove, and the honest answer is narrower than it looks. A coupling check verifies resolution, not correctness. It confirms that a claim points at something real, and it breaks the instant that something is renamed or deleted. It does not, and cannot, tell you the claim points at the right thing. The check's own documentation says this in as many words: the forced citation breaks on rename or removal but does not detect semantic contradiction. My finder check proves each finder names a call that exists; it has no opinion on whether that's the call the finder ought to be using. That sounds like a weakness, and it is a real limit, but it's also most of the value. The overwhelming majority of drift in a fast-moving, agent-built codebase is exactly this: a referent that quietly stopped existing, a name that moved, a manifest describing a system one refactor out of date. Catching all of that mechanically leaves your actual judgment for the question a script was never going to answer.

Prior art, and where this sits

Coupling code to a declared rule is old and well-trodden. ArchUnit lets you write tests that assert layering and dependency direction against compiled Java; dependency-cruiser does the equivalent for JavaScript and TypeScript; the whole tradition of fitness functions, from Ford, Parsons, and Kua's Building Evolutionary Architectures (O'Reilly, 2017), is about automated checks that protect an architectural property as a system evolves. In Birgitta Böckeler's harness-engineering vocabulary, every one of these, mine included, is a sensor: something that observes after the agent acts and lets the work self-correct. I'm not claiming the mechanism.

What I'd add is a direction and a generalization. The established tools aim almost entirely at architecture, couple code to a single rule set, and mostly run one way: the code must obey the manifest. The shape turns out to be more general than that. The second endpoint is frequently not code at all but another document, and the artifacts most prone to silent drift in an agent-built project are precisely the ones the compiler never opens: the diagrams, the glossary that carries the Domain-Driven Design (DDD) ubiquitous language, the wire contract, the work queue, a handbook's index. Those are where an agent's output and its own description of that output come apart, because nothing downstream forces them back together. Pointing the same bidirectional coupling at those pairs, and insisting it run in both directions, is the part I had to work out for myself.

Honest limits

This is one solid personal project, roughly 150,000 lines of Rust across thirty-odd crates and twenty-two components, with one person owning every artifact end to end. Treat it as a worked example, not a benchmark. The economics that make these checks cheap, a single mind deciding what each manifest means, are exactly what I haven't tested under shared ownership.

The pattern also has real edges. The annotations that make a check resolvable, the citation on a diagram glyph, the backing tag on a finder, are themselves a surface that can rot, and they cost something to keep current. False-positive-freeness depends on a clean syntactic trigger existing; where no such trigger is available, you fall back to opt-in or to plain discipline, with all the looseness that implies. And the resolution-not-correctness limit is permanent, not a bug to be fixed later: these checks guarantee your manifests describe a system that exists, never that it's the system you should have built. What they buy back is the attention you'd otherwise spend chasing stale references by eye, and in a codebase that an agent is reshaping several times a week, that turns out to be most of the attention you have.

There is a subtler failure than red noise, and it's the one I'd warn a reader about hardest: a coupling check is only as good as the channel it actually joins. I once wrote a check coupling alerting rules to the code that emits the metrics they name, and it passed clean: every rule named a real metric. But those values traveled two pipes that happened to share one namespace, and five rules named metrics emitted only on the pipe the alert engine never reads. The strings matched perfectly while the feature underneath was dead. The check was green over a coupling that didn't physically connect. Picking the hub, the specific producer the consumer truly reads from rather than every place the string appears, is its own design step, and getting it wrong buys you false confidence, which is worse than the false noise the discipline above is built to avoid.

I build domain-dense systems by directing coding agents, where the vocabulary and the contracts have to stay honest as the code moves under them. If that's a seam you're working too, I'm reachable on LinkedIn.

Written by directing an AI agent, the same way the platform it describes was built. The editing and the judgment are mine.

References

Vasyl Tretiakov, "Rails, Not Rules: Enforcing a Coding Agent's Domain Vocabulary with Checks," vasyltretiakov.dev, 1 Jun 2026 — companion essay.
Neal Ford, Rebecca Parsons, and Patrick Kua, Building Evolutionary Architectures: Support Constant Change. O'Reilly, 2017. ISBN 978-1491986356.
Birgitta Böckeler, "Harness engineering for coding agent users," martinfowler.com, 2 Apr 2026 (accessed 9 Jun 2026).
ArchUnit — architecture unit-testing for Java (accessed 9 Jun 2026).
dependency-cruiser — validate and visualize dependencies for JavaScript and TypeScript (accessed 9 Jun 2026).
Eric Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley, 2003. ISBN 978-0321125217.

Published at vasyltretiakov.dev.

Rails, Not Rules: enforcing a coding agent's domain vocabulary with checks

Vasyl Tretiakov — Mon, 01 Jun 2026 12:28:44 +0000

Why I stopped telling my coding agent the domain language and started enforcing it.

I wrote my coding agent's rules for the project's terminology as prose on April 24. I wrote a script to check those rules on May 2. The first run found 737 violations, in a codebase of roughly 150,000 lines that the agent had written under my direction and that I had, ostensibly, been keeping consistent the whole way.

It's worth being precise about what those 737 were, because the easy reading (the agent ignored my rules) is wrong, and the real story is more useful. I'd adopted a spec-driven way of working. When the domain language needed sharpening, a rename began in the glossary, with the old term retired and the new one defined; from there it flowed into spec amendments, and only then reached the code. Each rename was supposed to be total. None of them was. A rename propagates cleanly through the parts a compiler can see, and leaves a residue everywhere it can't: a directory still named after the old concept, a string in the frontend, a term in a comment, and often enough in the specs themselves, even though the specs are far smaller than the code and were being actively rewritten as part of the rename. I kept catching that residue by eye, one piece at a time, with the uneasy sense that I was only seeing a fraction of it. The 737 was the first time I built something that could count the rest. The rules in the file weren't wrong, and weren't being defied. They just couldn't see the artifacts they governed: not the code, and not even the specs being rewritten right alongside it. A sentence in a markdown file is a wish. The check was the first thing in the project that could tell me how far the wish was from true.

This essay is about the move that number forced: from writing the domain language down to enforcing it mechanically. Not as a productivity hack, but as the thing that turned out to actually keep a fast-moving, agent-built codebase consistent with its own vocabulary. If you've written a careful instructions file for an AI coding tool and watched the code drift out from under it anyway, this is the same story with a receipt attached, and a more useful diagnosis than "the model won't listen."

Prose doesn't bind

There's a familiar version of this. I wrote 200 lines of rules and the model ignored them all. The usual explanations are context-window pressure, instruction-following limits, the rules competing with everything else in the prompt. Probably all true. But framing it as the model misbehaving misses the more general mechanism, which has nothing to do with whether the agent is trying to comply. Even a perfectly diligent collaborator, human or model, can't apply a rule to a surface they don't know has changed. The rule's presence in the prompt and the rule's effect on the codebase are two different facts, and the gap between them stays invisible exactly where nothing measures it.

The most striking version of this I have on record came from the agent itself. Mid-project, after one more written reminder had failed to stick, I asked it point-blank whether putting the rule into its persistent instructions would be enough. Its answer argued against its own convenience:

Honest answer: no. … it's just discipline with extra steps … it only works if I read it and then choose correctly in the moment.

That is the whole thesis, volunteered by the system the thesis is about. A written instruction it is free to not act on is not a constraint, however firmly phrased. It's a preference. The fix can't be a better-worded preference; it has to be something that doesn't depend on the agent choosing right in the moment.

The retrospective receipt arrived the same week. I had the agent split its own instructions file in two and mechanize the rules that had been pure prose. The commit message (written, like nearly all of them in this project, by the agent reflecting on its own behavior) says why in five words I keep coming back to:

misses clustered on un-gated prose rules.

When it mapped where the codebase had actually fallen out of step with its own rules, the misses weren't randomly distributed. They clustered on exactly the rules that had nothing behind them but having been written down. The rules that did have a check, a script or a hook or anything that could fail a run, were not where the drift lived. The pattern has nothing to do with defiance. It's about consequence. A rule with no mechanical consequence is indistinguishable, from inside the work, from no rule at all.

A single episode made this concrete. The instructions stated, in plain language, that a rename must sweep every surface in lockstep: old name retired everywhere, no stragglers (the agent's own phrasing). It's hard to write a clearer rule. And a rename still went out leaving a batch of sites unswept, not only in the stringy corners of the code a compiler never reads, but in the specs themselves, the very documents being rewritten to drive the rename. What caught it wasn't anyone re-reading the rule. It was the terminology check flagging the leftover occurrences of the old term, two work-slices after the rename had supposedly finished. The durable fix wasn't to restate the rule more firmly. It was to add a rename-completeness gate that sweeps every tracked surface before a rename can be called done. Prose discipline failed; the gate held. Once you've watched that happen to a rule you'd have sworn was unambiguous, "just write it down more clearly" stops being a credible plan.

That's what you'd expect, too, if you think about what's actually steering the generation. The model optimizes toward plausible, working-looking code. Just as important, it optimizes toward a manageable unit of it: it scopes each task to a meaningful chunk that fits its context budget, aiming for a sensible increment rather than an exhaustive pass it never promised, and it does not read every spec line and every file before acting. Completeness across every surface is precisely the guarantee it doesn't offer. Against that, a glossary entry that says call it disposition, not outcome_code is a soft constraint with no gradient. Nothing pushes back when it's missed. The compiler doesn't care. The tests stay green either way. So the constraint quietly loses to every harder signal around it, and the gap surfaces much later, by eye, one stray term at a time.

The conclusion is uncomfortable and, I think, correct: a non-negotiable rule cannot live in a prompt. It has to be a check that fails the moment the rule is broken. In this project the checks run as pre-commit hooks, so the gate trips before a change can be committed. For code changes the compiler has usually already run and passed; that was never the question. The gate checks the thing the compiler can't. But the principle is the point: if a rule isn't enforced by something that can fail, it isn't a rule. It's a preference you're paying an attention-tax to maintain.

Standing on others' shoulders

I'm not the first to land on "enforce, don't suggest," and I want to credit the people who got there first, because the idea is genuinely theirs and the piece I'm adding sits on top of it.

Factory.ai's Alvin Sng made the case for using linters to direct agents: encode the conventions as lint rules the agent runs against, rather than prose it has to remember. InfoQ has been developing the "architectural governance at AI speed" argument, with fitness functions and machine-enforceable statements of architectural intent so oversight can keep pace with generation. And Birgitta Böckeler's "harness engineering" gives the cleanest vocabulary I've found: a harness has guides, which steer the agent before it acts, and sensors, which observe after it acts and let it self-correct. In those terms, a prose rule is a guide with no sensor behind it. None of these authors would be surprised by my 737; the framing was waiting for me when I went looking.

There's a deeper reason this works, and it's the thing I'd add to the conversation. A coding agent is a non-deterministic instrument with a real, non-zero cost per call: wonderful at creative, ambiguous work, but variable and dear when you lean on it for things that have a single right answer. A neighboring line of work, LLM-as-judge, leans into that by having one model grade another. It's a complementary track, and not the one this essay takes, but it has landed independently on the same split I did. (I arrived at it from the work, not from their write-ups; I found the literature afterward and was glad to see the agreement.) As Braintrust puts it, deterministic checks should handle "everything that can be measured directly" because they're "inexpensive and predictable," while an LLM judge is reserved for the subjective dimensions that genuinely need language understanding. "Did this rename reach every surface?" is not a subjective question. So rather than reach for a second model to watch the first, I push it onto a formal, deterministic tool: a script. That isn't just cheaper and more reliable than asking the model to remember. It frees the model's headroom for the work only it can do. Every rule you move from the probabilistic plane to the formal one buys back creative capacity you were spending on bookkeeping.

So where does that leave the contribution? The prior art aims this enforcement mostly at architecture: layering, dependency direction, the shape of the system. That's the natural first target, since a forbidden import is a one-line grep and an architectural violation is loud. What I found is that the same enforcement belongs one layer in, at the thing Domain-Driven Design (DDD) says actually carries the meaning: the ubiquitous language (UL), the shared vocabulary that runs from a conversation with a domain expert all the way to a class name. And Eric Evans, in Domain-Driven Design (Addison-Wesley, 2003), is emphatic that this language is not a glossary you write once. It is crunched and distilled continuously; a ubiquitous language, he writes, must evolve as the team's understanding sharpens. That evolution was the engine of all my renames.

Here's the twist that's specific to doing DDD with an agent. An LLM is, in one sense, a gift to Evans' program: it makes evolving the language dramatically faster and cheaper, so you can carry a rename through specs and code in an afternoon that would once have been a week's careful refactor. But the same speed multiplies the failure mode. Every fast, agent-driven rename sheds residue on the surfaces the agent didn't think to sweep, and it sheds it faster than you can track by eye. The tool that finally makes continuous distillation practical is the same tool that makes its incompleteness dangerous. That is the edge I hit, and it's why a static instructions file is a liability here in particular: the words it pins down are exactly the words you're now changing several times a week. The obvious move, the one I reached for first, is to feed the glossary to the model as context and trust it to comply. That's UL-as-context, and it's precisely the prose-rule that produced my 737. The wedge isn't "enforce the UL instead of the architecture." It's that enforcement belongs at the UL layer too, and, once you have the habit, at every layer where two artifacts are supposed to agree.

The case study: a linter for meaning

The first gate was the terminology check that produced the 737. I built it because terminology inconsistency was already a known, nagging problem, one I'd written down as something we were struggling with. The same concept kept showing up under two names across code and docs, accumulating faster than I could chase it. So the 737 wasn't a gotcha. It was the first honest measurement of a debt I already knew I was carrying. Mechanically the check is unglamorous: it scans the source (Rust, markdown, config, protos, the TypeScript frontend) for retired names and for terms that don't match the glossary. It's a grep, effectively instant, and it runs automatically, on a hook when the agent reads a spec and again before any commit can land.

The question worth dwelling on is why it has to exist at all, given everything already guarding the code. I wasn't relying on the agent's memory alone. I had a compiler, a test suite, a glossary (the written home of the ubiquitous language) that named every old and new term, spec amendments derived from that glossary, and separate implementation sessions that turned those specs into code. That is a lot of discipline. And the residue still got through, because every one of those guardrails operates either above the code, like the glossary and the specs, or below the level of meaning, like the compiler, which type-checks structure and not vocabulary. A struct called OutcomeCode and a struct called Disposition are equally valid Rust; only one of them is the agreed word. The terminology check occupies the band nothing else watched: string forms, enum variants, names in comments and specs, the directory called after a concept that no longer exists, the retired term that reads fine and so sails through review, including the agent's review of its own work. It is a linter for meaning, and meaning is where domain bugs are born.

One concrete instance taught me the mechanism better than any amount of theory. I renamed a module, and weeks later noticed, by eye, that a directory was still named after the old one. The rename had reached every place the compiler and the IDE's rename tooling could follow, and stopped at the edge of what they understand: a folder name is just a string. My first reaction is the part that matters. I didn't think I should be more careful. I thought: did I forget to register the old name as retired, so the check skipped right over it? That was exactly it. The check hadn't failed. It had faithfully looked for what it had been told to look for, and the residue was sitting where no rule yet pointed. The lesson isn't "even one word doesn't bind." It's sharper than that: a rename is only as complete as the mechanical check that can see the surfaces the compiler can't, and that check sees only what the glossary has taught it to see.

Which is also why the honest limits matter, and why I keep them on the record. At one point I needed to retire a term whose name collided with an unrelated field in an audit schema, and no mechanical rule could tell the two uses apart. The commit, again in the agent's own words, recorded the decision to deliberately not add that rule, on the grounds that you skip a rule rather than ship one that over-flags. That cuts in an inconvenient direction for a tidy thesis: defining a good formal check is not always easy, and sometimes it isn't possible. But the reasoning is right, and it's sharper when the check is a hard gate. Because these checks block commits, a false positive can't simply be tuned out the way a noisy warning is. It does something worse. It forces you to bypass the gate, or to water the rule down until it stops flagging, and either way the gate quietly loses its authority, the very thing that made it worth more than prose. An over-flagging rule isn't a minor annoyance to live with. It's a slow repeal of the mechanism. That's why skipping the rule was the disciplined call. The blind spots, the generic terms you must skip and the placements the scan can't reach, aren't a footnote to apologize for. They're the boundary of the technique, and naming them honestly is what keeps a green checkmark meaning something. The gates are never finished, either: each real miss teaches the glossary to see one more thing, and the boundary moves outward over time.

Scope and honest limits

A word on the scope, because the setting does a lot of the work. This is a personal, still-evolving project, built entirely by directing agents, with one person, me, owning the ubiquitous language end to end. Treat what follows as a worked example, not a benchmark.

Within that scope the result is clear, and it's the part I find interesting. Enforcing the language mechanically does what prose never managed. It holds the vocabulary consistent across a fast-moving, agent-written codebase, turns a class of silent regressions into loud ones, and gives me back the attention I'd otherwise spend on by-eye policing, attention that goes into design instead. The project is still under active development, and the gates earn their place daily. That is the contribution I stand behind.

The principle travels: unenforced rules don't bind, and a ubiquitous language is worth enforcing, not merely describing. What I haven't tested is what shared ownership does to it. My checks are cheap partly because one mind decides what a word means. The moment several people co-own the glossary, new questions open up: who arbitrates a contested term, what a growing body of checks costs to maintain, whether false-positive fatigue sets in and people start bypassing gates. I raise them because they're genuinely open, not because I think they sink the idea.

One point cuts against the obvious worry, that all this governance must be expensive in agent tokens. In my experience it ran the other way. Before the gates, keeping the language consistent meant re-running broad, non-deterministic refactor passes and still not trusting the result: high cost, uneven quality. Formal checks gave a deterministic, good-enough verdict, so I stopped paying for repeat LLM passes I didn't believe in. They also let me shrink the always-loaded instructions. The rules got short, and the edge-case detail moved into an on-demand handbook that a failing check's output points to, exactly when it's relevant. Cheap scripts standing in for repeated probabilistic passes moved the net cost down, not up, which fits the earlier point about shifting work off the probabilistic plane.

Where this came from

The starting goal was just a domain-dense Contact Center (CC) platform, built by directing coding agents. The vocabulary kept slipping, and the technique fell out of that collision. I care about methodology in general, which is why I invested in polishing what I found and adding a few pieces of my own, but I didn't go looking for one here. It came from the seam: deep domain language on one side, AI-assisted engineering on the other, and the discovery that the seam holds only when the rules can fail on their own rather than depend on anyone, human or model, remembering them.

That intersection, Contact Center domain rigor met with agent-driven delivery, is where I spend my time now. If it's a problem you're working on too, I'm reachable on LinkedIn.

Written by directing an AI agent, the same way the platform it describes was built. The editing and the judgment are mine.

References

Alvin Sng, "Using Linters to Direct Agents," Factory.ai, 5 Sep 2025. https://factory.ai/news/using-linters-to-direct-agents (accessed 1 Jun 2026).
"Architectural Governance at AI Speed," InfoQ, 26 Mar 2026. https://www.infoq.com/articles/architectural-governance-ai-speed/ (accessed 1 Jun 2026).
Birgitta Böckeler, "Harness engineering for coding agent users," martinfowler.com, 2 Apr 2026. https://martinfowler.com/articles/harness-engineering.html (accessed 1 Jun 2026).
"What is an LLM-as-a-judge?" Braintrust, 26 Feb 2026. https://www.braintrust.dev/articles/what-is-llm-as-a-judge (accessed 1 Jun 2026).
Eric Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley, 2003. ISBN 978-0321125217.