DEV Community

Vasyl Tretiakov
Vasyl Tretiakov

Posted on • Originally published at vasyltretiakov.dev

Gates Earned From Failure: a cost test for agent guardrails

Every guardrail in my agent-built project was earned from a real failure, not designed up front. A cost test for when to build one, when to wait, and when to retire it.

On a Wednesday in late May I caught a bug by reading. The project's glossary — the canonical list of the domain terms my coding agent is required to use — had drifted from the domain model I actually carried in my head. Nothing flagged it. No test failed, no check fired, no compiler complained. I noticed because I happened to be reading the file and the words were wrong.

What I typed next is the whole argument in one line. Not "fix the glossary," not "I'll be more careful," but a question aimed at the toolchain instead of the error:

the GLOSSARY already drifted away from my domain model vision and I would like to prevent this in future refactors

That reframes the task. The drift stops being an error to correct once and becomes a signal about the toolchain: if I caught a class of drift by reading, an enforcement axis is missing. The fix isn't to read harder next time. Reading harder is just minting another rule in your head, another wish. The fix is to add the check that couples the two things that drifted, and then never spend that attention again.

I did not sit down and design a governance system, and that's the most transferable thing here. Every check in this project — a couple dozen now, wired as pre-commit hooks — was born from a particular drift I'd already been bitten by. The ordering is the point. The failure came first; the gate came second, shaped to the exact failure.

Renames stopped being discipline

The clearest place this showed up was renaming. (I told the measurement side of this story in a companion piece, Rails, Not Rules: the first time I scripted a terminology check it found 737 violations in code I'd thought I was keeping clean. This is the other half — not the count, but the habit that grew from it.) Early on, retiring a domain term meant what it means everywhere: search, replace, move on. Then a retired term came back. Not because anyone re-typed it, but because a rename had swept the code and missed the surfaces a compiler never reads — a spec document, a generated config, a comment that mattered. Weeks later the old word resurfaced through a path I hadn't thought to look at.

The detail that stung was where the residue hid. In one rename, the leftover instances of the old term weren't in some forgotten corner of the codebase. They were in the specification documents, the very files being rewritten to drive the rename in the first place. The agent, under my direction, was editing the spec to declare the new vocabulary while leaving the old vocabulary scattered through the same spec. The document asserting the rename was itself out of compliance with it. You cannot out-discipline that, and watching the diff doesn't save you either: no amount of review reliably catches the contradiction being introduced in the very pass that's concentrating on the change. Only a check that reads the document back to you after the fact will, and that's what eventually went in.

It helps to be clear about why renames recur at all, because it isn't sloppiness. The glossary is the written home of what Domain-Driven Design (DDD) calls the ubiquitous language (UL): the single vocabulary that runs from a domain expert's sentence to a class name. Eric Evans is emphatic that this language is not written once and frozen; it is continuously distilled, sharpened every time the team's understanding improves. Each sharpening is a rename. So the renames are the system working as designed, not a mess to stamp out — which is exactly why the residue problem is permanent and worth a gate rather than a resolution to be tidier.

After that, a rename was never just "use the new word now." It became three things that ship together: rename the term, add a glossary entry recording the old word as retired, and add a check keyed to that retired word so it can't silently return. The rename and the rail land in the same commit. Skip the rail and you're trusting that every future sweep will be perfect, which is the assumption that just failed.

So far this is a tidy story: let each gate be earned by a real, observed drift, and your instruction file never bloats into the 200-line document nobody, human or model, can hold in their head. A rule you add from imagination is a guess about a failure that may never come. A rule you add from a failure you actually hit is paid for. It has a body behind it. The failure log is the design document.

I believed that cleanly for about a month. Then I had to argue with my own agent about it, and the rule came out more interesting than it went in.

The exceptions that earn a gate early

The trouble with "wait for the drift" is that some drifts you cannot afford to observe even once. If the first occurrence is a deleted production table, or customer data in a log line, or a fabricated citation in something already published, "let it go wrong cheaply once" is a contradiction. There is no cheap once.

I put this to the agent as a proposed amendment: build a gate proactively when the failure mode is already well-attested and the check is cheap and low-false-positive, or when the first occurrence is expensive or irreversible. Stay reactive when the rule is judgment-heavy, the convention is still in flux, or the drift is cheap to catch and fix once. The reasoning is mundane risk math, not ideology: you pre-pay for a gate exactly when the expected cost of waiting exceeds the standing cost of the gate.

This repo runs on that amendment, and I can point at two receipts. The writing project, the one these essays live in, has a check that blocks any cited URL not logged in an evidence file. I built it before it had earned itself in the usual reactive way, because the failure it guards is the expensive kind and it had already happened once: a fabricated citation reached a draft. A published fabrication doesn't get a cheap first occurrence. So the gate went in proactively, licensed by the cost test: is the failure high enough cost, and is the check free of false positives?

The second receipt is the other half of the amendment, the well-attested-and-cheap case rather than the irreversible one. The same writing project runs a prose check on these very drafts, flagging the small set of mechanical tells that mark machine-generated text. That failure mode wasn't hypothetical and it wasn't expensive-once; it was a known, recurring, deterministic pattern I'd have to correct on every essay forever. Well-attested, cheap to detect, low false-positive: that's the second carve-out, and it's the reason this paragraph can't lean on more than two em-dashes without the gate stopping the commit. The check is the essay's own argument applied to the essay. The rule "gates are earned" survives both times, with named carve-outs for failures you can't afford to rehearse and failures you've already seen enough times to name precisely.

When a rule won't gate, find its shadow

Between "earn it from a failure" and "stay reactive" sits the move that does most of the real work, and it's the one I'd most want a reader to steal. A lot of rules look un-gateable because the thing they protect is a meaning, and a script can't read meaning. "Don't restate the protocol prose in two places" isn't grep-able. Neither is "don't leave a dangling reference." The reflex is to give up and write the rule into an instructions file as a wish. The better move is to stop checking the rule and check a structural proxy for it: a syntactic shadow the real rule casts, one a script can see and that almost never flags an innocent.

It works more often than it has any right to. "Don't duplicate the protocol in prose" became "no narrative text in column zero of these files," because the protocol lived in indented blocks and unindented prose was the tell. "No dangling references" became "every TODO( carries a live task slug." The em-dash rule guarding these very drafts is the same trick: I can't gate "don't write like a language model," but I can gate "no paragraph carries more than two em-dashes," which is a measurable shadow of the thing I actually want. Each time, a rule I'd have sworn was judgment-only turned out to cast a shape a grep could catch.

The proxy is never the rule, though. A structural shadow is an approximation by construction, so every proxy gate ships with a built-in gap between what it checks and what you wish it proved. That gap is the next failure.

The failure that hides inside a green checkmark

Here is the part I got wrong, and the agent caught.

I had been treating "cheap to write" as most of what makes a gate worth building. The agent pushed back on two fronts. First, cheap has to mean cheap to maintain, not cheap to write. A gate is a standing liability, not a one-time cost. Every legitimate evolution of the convention now has to route through the check, and you pay an update tax and a false-positive-triage tax for the life of the gate. The authoring cost is the small number.

The second point is the one that changed how I think. The agent put it more sharply than I had:

Once a green gate exists, people stop eyeballing the thing the gate appears to cover. So a cheap-but-approximate gate over a real invariant can be worse than no gate — it converts active vigilance into misplaced trust.

The dominant hidden failure of a cheap gate isn't noise. It's false confidence. A cheap check usually only approximates the invariant you care about. A text search proves a token is absent; it does not prove a concept was actually renamed everywhere it lives. A fast compile passes against a stale cache. A sub-agent reports "tests passed" and you don't re-run them. In every case the green checkmark covers less than it appears to.

An outside read lands in the same place. Birgitta Böckeler, whose harness vocabulary I lean on below, notes that test feedback is weaker than it looks once "the agent also generated the tests." A verifier the generator authored is not independent of the thing it checks, so a passing suite certifies less than it seems to — the same gap, one layer up.

The rename gate from earlier is a perfect specimen — a structural proxy of exactly the kind I just praised, which makes it the more humbling, because it's one I'm proud of. It greps for the retired word and passes when the word is gone. But "the old word is absent" and "the concept was correctly migrated" are not the same claim. You can delete every instance of the old term and still have left the idea it named half-translated, split across two new words that should have been one, or attached to the wrong entity. The gate is green and the migration is wrong, and now nobody's reading the diff with the old suspicion, because the check says it's handled. The check is real and worth having. It just proves a narrower thing than its green checkmark advertises, and the discipline is to keep knowing the difference instead of letting the green absolve you.

That's the trap in one move. You were watching the thing carefully when nothing claimed to watch it for you. Now something claims to, so you look away, and the gap between what the check proves and what it looks like it proves is exactly where the next bug lives. When those two diverge and the invariant matters, that argues against the cheap gate, not for it.

This isn't an argument against gates. It's an argument for knowing which of your green checkmarks are load-bearing and which are decorative.

It's a ladder, not a switch

The other thing I had been collapsing was the choice itself. I'd been treating it as binary: gate the thing, or stay reactive and fix it by hand when it breaks. There are at least four rungs, and most failures want one of the middle two.

The distinction isn't mine — manufacturing got there decades ago. Shigeo Shingo's poka-yoke, the mistake-proofing of the Toyota Production System, already splits a control that makes the error impossible from a warning that only signals it. The two middle rungs below are those two ideas wearing software clothes. Birgitta Böckeler's harness engineering gives the orthogonal axis: guides that steer the agent before it acts, sensors that observe after. A blocking gate is a sensor with teeth; a default path is a guide that leaves nothing to sense.

The strongest rung isn't a gate at all. It's the default path — Shingo's control type: make the wrong artifact hard to author in the first place. A template, a generator, a structured section that only has slots for the right shape. There's nothing to false-positive on, because you're not checking after the fact, you're removing the way to get it wrong. When a failure is common but a precise gate would be noisy, this beats the gate.

Below the blocking gate sits the tripwire — Shingo's warning type: warn without blocking, or ask for confirmation before an irreversible step. This is the right reach for failures that are catastrophic on first occurrence but genuinely hard to gate cleanly — the data deletion, the leak, the fabrication. You don't need a perfect detector. A loud "are you sure, here's what you're about to overwrite" buys most of the protection at a fraction of the false-positive cost. The mistake is letting "we can't build a clean gate for this" collapse into "so we stay fully reactive" on a failure you can't take twice.

The rung that doesn't work is the one that looks like discipline: a script everyone is supposed to run but nothing enforces. It's neither a gate nor a default path, so under any real deadline it just gets skipped. A rule that lives only in prose is a preference, however firmly phrased.

So the ladder, top to bottom: blocking gate, default path, tripwire, stay reactive. "Should I gate this" was always the wrong question. The question is which rung the failure earns.

The cost test cuts both ways

Because a gate is a standing liability, "earned" is a real filter, and a filter that only ever says yes isn't filtering. The same cost test that licenses a proactive gate also tells you when to leave something ungated, and I find I reach for it in that direction about as often.

A concrete one: I recently changed a documentation convention across the project. A pure convention change, no behavior, no domain term retired. The reflex in a repo that leans this hard on enforcement is to add a check for the new convention. What I wrote instead was an instruction to not:

Don't add a gate or script — this is a convention change only; the enforcement hook stays deferred per the cost test.

The convention isn't load-bearing, a check on it would generate churn every time the convention itself evolves, and the failure if someone gets it wrong is cosmetic. No gate. Not yet, maybe not ever.

The same logic runs in reverse, against gates that already exist. A check family only grows: one iteration of mine added six at once, and nothing in the workflow ever pushes the other way. The tidy instinct is a recurring "prune the gates" ritual at every close. That instinct is right about the pressure and wrong about the mechanism, for two reasons that separate a gate from a stale paragraph of prose. A doc paragraph taxes every session, so doc-bloat is pure, constant rent; a redundant gate taxes only when it runs, so its bloat is mostly latent. And a stale doc is just noise, while a gate is a load-bearing invariant whose removal risks silently dropping a correctness check no one notices is gone. So pruning earns more care and less frequency than tidying prose, not the same recurring slot. A full portfolio review every close would usually find nothing, and a ritual that usually finds nothing is the skipped-discipline trap again, dressed as housekeeping.

Two corollaries fall out. "Merge these two similar gates" defaults to no: two precise checks beat one fuzzy merged one, and a merge is a design act with its own failure mode, not janitorial cleanup. And the counter-pressure that is warranted should itself be mechanized rather than left to willpower — a redundancy meta-check that reads the coupling each gate already declares in its header and flags real overlap, fired on a threshold or a phase boundary, never once per commit. Even pruning gates wants a rung, not a resolution to be tidier.

A good gate even pays a rent rebate. Every rule you load into the agent's always-present instructions is standing context it has to carry, and that context isn't free. A deterministic check lets you delete the paragraph of prose that used to beg the model to remember a rule and replace it with a script plus a reference read only when it fires. The gate does the remembering, so the context window doesn't have to — fewer gates can mean a heavier prompt, not a lighter one.

The rule about gates is itself a gate

The recursion is the part I like best: the system diagnosed the bias on itself.

Everything in this project leans toward enforcement over discipline. The agent named the hazard better than I had:

the whole check-* script family leans hard toward enforce-over-discipline, so a fresh session inherits a "when in doubt, add a check" texture with no stated boundary. That absence is a latent gap.

A bias toward gating with nothing written down about when not to. The honest response, by this essay's own logic, is not to immediately write the boundary rule into the always-loaded instructions. That would be exactly the proactive over-gating the rule warns against. The boundary rule hasn't failed yet.

So we left it ungated, with a concrete trigger: the first time any session ships a gate that turns out high-false-positive or merely approximate, that's the observed failure, and the boundary rule gets promoted then — and into a read-on-demand reference, not the always-loaded file, because it's judgment-grade material rather than a per-session directive. The meta-rule about earning gates is itself being made to earn its place. The failure log is the design document, all the way up.

Honest limitations

This worked for one person, on one roughly 150,000-line codebase, where I own the glossary and the domain model is a single coherent thing in a single head. "Earned from failure" is cheap when the failure is yours and you hit it the same day you build the gate. At team scale the picture is harder: the standing context cost compounds, the false-positive triage falls on people who didn't feel the original pain, and "let it go wrong cheaply once" is a different proposition when the once is someone else's afternoon. I don't have evidence for the team case. I have evidence for the solo case and a suspicion the principles survive the move, which is not the same thing.

One tempting claim I'll resist. I had a hunch, watching the sessions, that the agent resists proposing gates proactively — that it fixes the immediate issue and only designs the enforcement when pushed. There's a clean-looking receipt for it. Asked to prevent a recurrence after I'd approved a fix, the agent replied:

Approved. Let me make the fix first, then design the enforcement check.

Fix first, gate second, only once prompted. But when I went looking for the pattern, the support was mixed. In the flow of a task it does tend to fix first and gate when asked. Hand it the same question as a matter of policy, though, and it argues for more proaction than I'd proposed, not less. So I won't dress that up as a clean behavioral finding. It's a real tension I don't yet understand well enough to assert.

Why this generalizes

None of this is specific to my domain. Any agent operating on a governed system — a customer-service platform, a billing pipeline, anything where words have to mean the same thing across many surfaces — accumulates the same class of drift, and the same four rungs apply.

Picture the Contact Center case concretely, since it's the domain I know best. A single concept gets named in a routing rule, a reporting dashboard, an agent-facing script, and the schema of the system that ties them together. Rename it in three of those four and the fourth keeps quietly emitting the old word, so the dashboard and the router disagree about what they're counting, and nothing crashes. That's the drift, and it's invisible to every test that checks behavior rather than vocabulary. The gate that couples the surfaces is the only thing that catches a disagreement no single component is wrong about. That coupling pattern, applied in both directions, is the subject of a companion essay, Couple Both Ways. An agent makes this worse, not better, because it will cheerfully and fluently use whichever term it last saw, in whichever surface it's editing.

The transferable move is to stop treating your instruction file as the place where correctness lives. Prose is where preferences live. Correctness lives in the gates, default paths, and tripwires you build, each one shaped to a failure specific enough to point at. You don't govern an agent by predicting how it'll go wrong. You let it go wrong cheaply where you can, and you convert each real miss into the cheapest rung that makes it impossible to miss that way again.


This essay was written by directing a coding agent over the project it describes; I direct and judge, the agent drafts and argues back. The argument-back, in this case, is most of beats four and five.

I build governed agent systems at the intersection of Contact Center software and AI. If that's a problem you're chewing on, I'm reachable on LinkedIn.


References


Published at vasyltretiakov.dev.

Top comments (0)