<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vasyl Tretiakov</title>
    <description>The latest articles on DEV Community by Vasyl Tretiakov (@vasyltretiakov).</description>
    <link>https://dev.to/vasyltretiakov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3962696%2Fbc2d96fa-fec7-4888-bef4-60cf52ebb67d.jpg</url>
      <title>DEV Community: Vasyl Tretiakov</title>
      <link>https://dev.to/vasyltretiakov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vasyltretiakov"/>
    <language>en</language>
    <item>
      <title>Compiling the Process, Not the Code: a machine-checked workflow for coding agents</title>
      <dc:creator>Vasyl Tretiakov</dc:creator>
      <pubDate>Wed, 17 Jun 2026 07:43:53 +0000</pubDate>
      <link>https://dev.to/vasyltretiakov/compiling-the-process-not-the-code-a-machine-checked-workflow-for-coding-agents-3agg</link>
      <guid>https://dev.to/vasyltretiakov/compiling-the-process-not-the-code-a-machine-checked-workflow-for-coding-agents-3agg</guid>
      <description>&lt;p&gt;&lt;em&gt;The deterministic checks guarding an agent-built project are a compiler for the workflow — and the load-bearing half of them compile the process, not the code.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Halfway through a Tuesday in early June, after I'd spent the morning turning a handful of written-down conventions into scripts that block a commit when they're violated, I typed a half-formed thought at my coding agent instead of another task:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;it feels like with script gates we're building a &lt;code&gt;compiler&lt;/code&gt; for our coding-agent-enabled workflow&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There were around three dozen of these checks in the project by then. None of them compiles anything in the usual sense. They grep, they parse, they diff one artifact against another and exit nonzero when two things that must agree don't. But the feeling was specific and it wouldn't go away, so I want to take it seriously: what is true about it, and exactly where it stops being true. The place it breaks is the useful part.&lt;/p&gt;

&lt;h2&gt;
  
  
  The asymmetry under the feeling
&lt;/h2&gt;

&lt;p&gt;The agent's first move was to tell me "compiler" was the wrong word for the right intuition. The precise version is a &lt;em&gt;generator/verifier&lt;/em&gt; split, and a compiler is one instance of it.&lt;/p&gt;

&lt;p&gt;An LLM is a high-throughput, low-consistency generator. It is prolific and genuinely creative, and it is unreliable about rule forty-seven under context pressure. It will use whichever term it saw most recently, in whichever file it's editing, and be fluent and confident while doing it. The cheapest way to make an unreliable generator reliable is not to lecture the generator. It's to put a deterministic, cheap &lt;em&gt;verifier&lt;/em&gt; in front of it and exploit an asymmetry: checking that a constraint holds is far cheaper, and far more trustworthy, than generating output that satisfies it on the first try every time. A compiler is that pattern frozen around a language. What I was building was the verifier sized to my generator's specific failure modes.&lt;/p&gt;

&lt;p&gt;I should concede immediately that none of that asymmetry is new. It is the oldest idea in our tooling. Type checkers, linters, continuous integration, and the humble test suite all exist because verifying a property is cheaper than producing code guaranteed to have it. The compiler metaphor isn't new either; people have reached for it about prompts and pipelines for a while. So if the essay were just "checks make agents more reliable," it would be a familiar and slightly boring claim. The interesting part is narrower, and it lives in the gap between what a compiler promises and what these checks actually deliver.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it feels like a compiler and not just a pile of linters
&lt;/h2&gt;

&lt;p&gt;Most lint suites are a pile. You accumulate rules, you run them, none of them knows about the others. The tell that this had become something more compiler-shaped was that I had started writing checks whose job was to police the &lt;em&gt;other&lt;/em&gt; checks: a script that reads every gate and verifies each one is registered, named to convention, and wired into the commit hook the same way. Rules about the rules, mechanically enforced.&lt;/p&gt;

&lt;p&gt;That meta-layer is the line between a lint pile and what a compiler-minded person would call static semantics. A pile says "these specific tripwires didn't fire today." A semantics says "the rules themselves are well-formed, so a whole class of malformed rule can't exist." Crossing that line is what makes the workflow &lt;em&gt;feel&lt;/em&gt; like it has a grammar rather than a checklist. It is also, as it turns out, exactly the feeling you should be most suspicious of.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three places the metaphor breaks
&lt;/h2&gt;

&lt;p&gt;A careful reader should distrust a flattering metaphor, so here is where this one falls apart. There are three breaks, and each one is load-bearing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;These are linters, not a type system.&lt;/strong&gt; A compiler makes a guarantee with a specific shape, the one Robin Milner named &lt;em&gt;soundness&lt;/em&gt;: if a program type-checks, an entire class of error is impossible — "well-typed programs cannot go wrong." My checks make a guarantee with a much weaker shape: these particular tripwires did not fire on this particular commit. A spelling denylist catches the words on the list and is blind to the synonym nobody listed. A check that greps a response shape proves the shape is present, not that the behavior behind it is right. The space between the invariant I care about and the grep that approximates it is precisely where the real bugs continue to live, and a green suite hides that space rather than illuminating it. I spent a whole companion essay, &lt;a href="https://vasyltretiakov.dev/p/gates-earned-from-failure" rel="noopener noreferrer"&gt;&lt;em&gt;Gates Earned From Failure&lt;/em&gt;&lt;/a&gt;, on the hazard that follows from this: a passing check converts active vigilance into misplaced trust. The compiler-specific version of the point is just that the word "compiler" smuggles in a soundness guarantee these checks have not earned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's accreted, not designed — there's no source grammar.&lt;/strong&gt; A real compiler starts from a language definition. Someone wrote down what a well-formed program is, and the checker enforces that document. My "well-formed workflow" was never written down. It is defined, entirely and only, by the union of whatever checks happen to exist on a given day, and each of those checks was born reactively from one incident that had already bitten me. So the de-facto language the agent is being held to is "whatever passes today," and the gaps in it stay invisible until the next incident reveals one. The always-loaded instruction file is the closest thing to a grammar I have, and it is a prose approximation, not a specification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The programmer and the language designer are the same agent.&lt;/strong&gt; This is the one that should worry you most. A real compiler is adversarially independent of the code it judges; the people who designed the language had never seen your program. In my setup the same agent writes the artifact and writes the gate that will judge the artifact's future versions. A gate can therefore faithfully encode the author's blind spot and then certify work as clean against it, forever. &lt;a href="https://vasyltretiakov.dev/p/rails-not-rules" rel="noopener noreferrer"&gt;Enforcing over discipline&lt;/a&gt; is a good rule with a hard ceiling: a gate only ever catches what someone already thought to mechanize. &lt;a href="https://www.infoq.com/podcasts/mcp-vibe-coding-harness-engineering/" rel="noopener noreferrer"&gt;Birgitta Böckeler&lt;/a&gt; makes the same observation one layer up, about tests, when she notes that the feedback is weaker than it looks once &lt;em&gt;"the agent also generated the tests."&lt;/em&gt; A verifier the generator authored is not independent of the thing it verifies.&lt;/p&gt;

&lt;h2&gt;
  
  
  The half that compiles the process
&lt;/h2&gt;

&lt;p&gt;Here is the part worth the essay. When I actually looked at my three dozen checks and asked what each one verifies, a large fraction of them were not checking the code at all. They were checking the &lt;em&gt;process&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;One enforces that a task needing a written specification can't be picked up in an implementation phase. One flags a task whose declared dependency has gone stale. One enforces the lifecycle of an amendment: proposed, reviewed, accepted, in that order, with no skipping. None of these reads a line of program logic. Each one encodes a piece of software-development-life-cycle (SDLC) discipline that, on a human team, lives in nobody's compiler and everybody's head. We call it professionalism, or being senior, or knowing how things are done here.&lt;/p&gt;

&lt;p&gt;Human teams almost never mechanize that layer, and the reason is economic, not cultural. People hold context across days and weeks, judgment tells them when a rule applies, and their name on the commit supplies accountability. Those three things are exactly what let a team write "we follow the amendment process" in an onboarding doc and then assume compliance. Building a script to enforce it would cost more than the drift it prevents on a small team that mostly remembers anyway.&lt;/p&gt;

&lt;p&gt;With an agent, that economic calculation inverts, hard. The generator has no continuity between sessions, so the drift rate is high. And the same agent will author the enforcing check for you in a few minutes, so the cost of mechanizing is near zero. When drift is frequent and gates are nearly free, you are rationally pulled to formalize methodology that was never worth formalizing in the entire prior history of software teams. That is the actual discovery hiding under the compiler feeling. Not that I added a spell-checker for my domain vocabulary, but that the economics of agent-directed work quietly made it worth compiling the development process itself into machine-checked constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  The instruction file stops being a manual
&lt;/h2&gt;

&lt;p&gt;The same hour I sent that "compiler" message, I followed it with a request that only makes sense if the metaphor is real:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;can we retire or compress or move (to HANDBOOK) some instructions… Errors and warnings already explain the error briefly and point to a HANDBOOK section with more details&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once a check owns a convention, the paragraph in the always-loaded file that used to beg the model to remember that convention is dead weight. The gate does the remembering. So you delete the prose and let the error message carry a pointer to a longer explanation that gets read only when the check actually fires. By the end of that session the project had settled into three clean layers: an always-loaded file that states directives, a handbook that explains mechanism, and the gates that enforce. The prose had stopped duplicating the type-checker.&lt;/p&gt;

&lt;p&gt;That is the compiler architecture made literal, and it reframes what the instruction file is for. It is not a manual the agent is trusted to follow. It is closer to a grammar: the small, stable statement of intent that the enforcement layer makes binding. Correctness moved out of the prose and into the checks, and the prose got shorter and more honest about its job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which invariants earn a type rule
&lt;/h2&gt;

&lt;p&gt;The skill this leaves you with is not "write more gates." A check family only ever grows, and a verifier that accepts everything is as useless as one that rejects everything. The real judgment is which invariants are cheap enough, sound enough, and quiet enough about false positives to deserve promotion into a type rule, and which should stay discipline. That judgment is a cost test, and I worked it out in detail in &lt;a href="https://vasyltretiakov.dev/p/gates-earned-from-failure" rel="noopener noreferrer"&gt;&lt;em&gt;Gates Earned From Failure&lt;/em&gt;&lt;/a&gt; rather than repeat it here: build the rule when the evidence the drift is real, times the cost of that drift, outweighs the standing cost of the check.&lt;/p&gt;

&lt;p&gt;The compiler framing adds one sharp reminder to that test. Over-strict typing is a real failure mode, not a safe default. The same morning all of this crystallized, I &lt;em&gt;dropped&lt;/em&gt; a gate I'd built, because it fired on too many legitimate cases to be worth its noise. In compiler terms that gate was refusing to compile sound programs, and a type system that rejects valid code is exactly as broken as one that admits invalid code. A rejected-but-correct commit teaches the same lesson a missed bug does: the rule was wrong. Calibration runs in both directions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;p&gt;This is one person's experience with one codebase of roughly 150,000 lines, where I own the domain model and the workflow both, so the language designer and the programmer being the same agent is my daily reality rather than a thought experiment. That collapse is the project's sharpest limit, and I can't tell you I've escaped it. I can only tell you I try to audit what &lt;em&gt;isn't&lt;/em&gt; gated as deliberately as I build what is, because the suite measures the absence of known tripwires, never soundness.&lt;/p&gt;

&lt;p&gt;There is also no source grammar, and I'm not sure a healthy version of this ever gets one. The honest description of the system is not "a compiler for my workflow" but "an accreting pile of verifiers that has grown a thin layer of self-governance and started, in places, to behave like one." Whether that hardens into something with a real specification, or stays a well-tended pile, is a question I haven't answered. And the whole economic argument turns on a solo author hitting his own drift the same day he builds the gate. On a team the context cost compounds and the false-positive triage lands on people who never felt the original failure. I have evidence for the solo case and a suspicion the shape survives the move, which is not the same thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this generalizes
&lt;/h2&gt;

&lt;p&gt;Strip away my particular domain and the pattern is plain. Any agent working on a governed system accumulates drift faster than a human would, and will author the check that catches it for almost nothing. The first checks you reach for are the obvious ones, over the code and the vocabulary. The coupling-check pattern that catches a term renamed in three surfaces and missed in the fourth is one I've written about on its own, in &lt;a href="https://vasyltretiakov.dev/p/couple-both-ways" rel="noopener noreferrer"&gt;&lt;em&gt;Couple Both Ways&lt;/em&gt;&lt;/a&gt;. But the ones that surprised me, and the ones I'd point a skeptical reader toward, are the checks that compile the &lt;em&gt;process&lt;/em&gt; — the stage gates and lifecycle rules and staleness tripwires that encode how the work is supposed to flow. Human teams carry that in their heads because mechanizing it never paid. Directing an agent is the first context I've worked in where it does.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This essay was written by directing a coding agent over the project it describes; I direct and judge, the agent drafts and argues back. The reframing from "compiler" to "generator/verifier," and the three ways the metaphor breaks, came out of that argument — quoted here roughly as it happened.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I build governed agent systems at the intersection of Contact Center software and AI. If that's a problem you're chewing on, I'm reachable on &lt;a href="https://www.linkedin.com/in/vasyl-tretiakov-b850231b/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Vasyl Tretiakov, "&lt;a href="https://vasyltretiakov.dev/p/rails-not-rules" rel="noopener noreferrer"&gt;Rails, Not Rules: Enforcing a Coding Agent's Domain Vocabulary with Checks&lt;/a&gt;," vasyltretiakov.dev, 1 Jun 2026 — companion essay; the enforce-over-discipline origin.&lt;/li&gt;
&lt;li&gt;Vasyl Tretiakov, "&lt;a href="https://vasyltretiakov.dev/p/couple-both-ways" rel="noopener noreferrer"&gt;Couple Both Ways: bidirectional checks against silent drift&lt;/a&gt;," vasyltretiakov.dev, 9 Jun 2026 — companion essay; the coupling-check pattern in depth.&lt;/li&gt;
&lt;li&gt;Vasyl Tretiakov, "&lt;a href="https://vasyltretiakov.dev/p/gates-earned-from-failure" rel="noopener noreferrer"&gt;Gates Earned From Failure: a cost test for agent guardrails&lt;/a&gt;," vasyltretiakov.dev, 14 Jun 2026 — companion essay; the cost test and the green-checkmark hazard.&lt;/li&gt;
&lt;li&gt;Birgitta Böckeler, "&lt;a href="https://www.infoq.com/podcasts/mcp-vibe-coding-harness-engineering/" rel="noopener noreferrer"&gt;From MCP and Vibe Coding to Harness Engineering: How Did AI Native Engineering Evolve in One Year&lt;/a&gt;," InfoQ podcast, 8 Jun 2026 (accessed 14 Jun 2026). Source of "the agent also generated the tests."&lt;/li&gt;
&lt;li&gt;Robin Milner, "&lt;a href="https://www.sciencedirect.com/science/article/pii/0022000078900144" rel="noopener noreferrer"&gt;A Theory of Type Polymorphism in Programming&lt;/a&gt;," Journal of Computer and System Sciences 17(3):348–375, 1978 (accessed 17 Jun 2026). Origin of the type-soundness guarantee ("well-typed programs cannot go wrong") that breakage #1 says these checks have not earned.&lt;/li&gt;
&lt;li&gt;Eric Evans, &lt;em&gt;Domain-Driven Design: Tackling Complexity in the Heart of Software&lt;/em&gt;. Addison-Wesley, 2003. ISBN 978-0321125217. (The ubiquitous-language lineage the terminology checks enforce.)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Published at &lt;a href="https://vasyltretiakov.dev" rel="noopener noreferrer"&gt;vasyltretiakov.dev&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>agents</category>
    </item>
    <item>
      <title>Gates Earned From Failure: a cost test for agent guardrails</title>
      <dc:creator>Vasyl Tretiakov</dc:creator>
      <pubDate>Sun, 14 Jun 2026 16:00:43 +0000</pubDate>
      <link>https://dev.to/vasyltretiakov/gates-earned-from-failure-a-cost-test-for-agent-guardrails-d74</link>
      <guid>https://dev.to/vasyltretiakov/gates-earned-from-failure-a-cost-test-for-agent-guardrails-d74</guid>
      <description>&lt;p&gt;&lt;em&gt;Every guardrail in my agent-built project was earned from a real failure, not designed up front. A cost test for when to build one, when to wait, and when to retire it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;On a Wednesday in late May I caught a bug by reading. The project's glossary — the canonical list of the domain terms my coding agent is required to use — had drifted from the domain model I actually carried in my head. Nothing flagged it. No test failed, no check fired, no compiler complained. I noticed because I happened to be reading the file and the words were wrong.&lt;/p&gt;

&lt;p&gt;What I typed next is the whole argument in one line. Not "fix the glossary," not "I'll be more careful," but a question aimed at the toolchain instead of the error:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the GLOSSARY already drifted away from my domain model vision and I would like to prevent this in future refactors&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That reframes the task. The drift stops being an error to correct once and becomes a signal about the toolchain: &lt;strong&gt;if I caught a class of drift by reading, an enforcement axis is missing.&lt;/strong&gt; The fix isn't to read harder next time. Reading harder is just minting another rule in your head, another wish. The fix is to add the check that couples the two things that drifted, and then never spend that attention again.&lt;/p&gt;

&lt;p&gt;I did not sit down and design a governance system, and that's the most transferable thing here. Every check in this project — a couple dozen now, wired as pre-commit hooks — was born from a particular drift I'd already been bitten by. The ordering is the point. The failure came first; the gate came second, shaped to the exact failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Renames stopped being discipline
&lt;/h2&gt;

&lt;p&gt;The clearest place this showed up was renaming. (I told the measurement side of this story in a companion piece, &lt;a href="https://vasyltretiakov.dev/p/rails-not-rules" rel="noopener noreferrer"&gt;&lt;em&gt;Rails, Not Rules&lt;/em&gt;&lt;/a&gt;: the first time I scripted a terminology check it found 737 violations in code I'd thought I was keeping clean. This is the other half — not the count, but the habit that grew from it.) Early on, retiring a domain term meant what it means everywhere: search, replace, move on. Then a retired term came back. Not because anyone re-typed it, but because a rename had swept the code and missed the surfaces a compiler never reads — a spec document, a generated config, a comment that mattered. Weeks later the old word resurfaced through a path I hadn't thought to look at.&lt;/p&gt;

&lt;p&gt;The detail that stung was where the residue hid. In one rename, the leftover instances of the old term weren't in some forgotten corner of the codebase. They were in the specification documents, the very files being rewritten to &lt;em&gt;drive&lt;/em&gt; the rename in the first place. The agent, under my direction, was editing the spec to declare the new vocabulary while leaving the old vocabulary scattered through the same spec. The document asserting the rename was itself out of compliance with it. You cannot out-discipline that, and watching the diff doesn't save you either: no amount of review reliably catches the contradiction being introduced in the very pass that's concentrating on the change. Only a check that reads the document back to you after the fact will, and that's what eventually went in.&lt;/p&gt;

&lt;p&gt;It helps to be clear about why renames recur at all, because it isn't sloppiness. The glossary is the written home of what Domain-Driven Design (DDD) calls the &lt;em&gt;ubiquitous language&lt;/em&gt; (UL): the single vocabulary that runs from a domain expert's sentence to a class name. Eric Evans is emphatic that this language is not written once and frozen; it is &lt;em&gt;continuously distilled&lt;/em&gt;, sharpened every time the team's understanding improves. Each sharpening is a rename. So the renames are the system working as designed, not a mess to stamp out — which is exactly why the residue problem is permanent and worth a gate rather than a resolution to be tidier.&lt;/p&gt;

&lt;p&gt;After that, a rename was never just "use the new word now." It became three things that ship together: rename the term, add a glossary entry recording the old word as &lt;em&gt;retired&lt;/em&gt;, and add a check keyed to that retired word so it can't silently return. The rename and the rail land in the same commit. Skip the rail and you're trusting that every future sweep will be perfect, which is the assumption that just failed.&lt;/p&gt;

&lt;p&gt;So far this is a tidy story: let each gate be earned by a real, observed drift, and your instruction file never bloats into the 200-line document nobody, human or model, can hold in their head. A rule you add from imagination is a guess about a failure that may never come. A rule you add from a failure you actually hit is paid for. It has a body behind it. The failure log is the design document.&lt;/p&gt;

&lt;p&gt;I believed that cleanly for about a month. Then I had to argue with my own agent about it, and the rule came out more interesting than it went in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The exceptions that earn a gate early
&lt;/h2&gt;

&lt;p&gt;The trouble with "wait for the drift" is that some drifts you cannot afford to observe even once. If the first occurrence is a deleted production table, or customer data in a log line, or a fabricated citation in something already published, "let it go wrong cheaply once" is a contradiction. There is no cheap once.&lt;/p&gt;

&lt;p&gt;I put this to the agent as a proposed amendment: build a gate proactively when the failure mode is already well-attested &lt;em&gt;and&lt;/em&gt; the check is cheap and low-false-positive, or when the first occurrence is expensive or irreversible. Stay reactive when the rule is judgment-heavy, the convention is still in flux, or the drift is cheap to catch and fix once. The reasoning is mundane risk math, not ideology: you pre-pay for a gate exactly when the expected cost of waiting exceeds the standing cost of the gate.&lt;/p&gt;

&lt;p&gt;This repo runs on that amendment, and I can point at two receipts. The writing project, the one these essays live in, has a check that blocks any cited URL not logged in an evidence file. I built it &lt;em&gt;before&lt;/em&gt; it had earned itself in the usual reactive way, because the failure it guards is the expensive kind and it had already happened once: a fabricated citation reached a draft. A published fabrication doesn't get a cheap first occurrence. So the gate went in proactively, licensed by the cost test: is the failure high enough cost, and is the check free of false positives?&lt;/p&gt;

&lt;p&gt;The second receipt is the other half of the amendment, the well-attested-and-cheap case rather than the irreversible one. The same writing project runs a prose check on these very drafts, flagging the small set of mechanical tells that mark machine-generated text. That failure mode wasn't hypothetical and it wasn't expensive-once; it was a known, recurring, deterministic pattern I'd have to correct on every essay forever. Well-attested, cheap to detect, low false-positive: that's the second carve-out, and it's the reason this paragraph can't lean on more than two em-dashes without the gate stopping the commit. The check is the essay's own argument applied to the essay. The rule "gates are earned" survives both times, with named carve-outs for failures you can't afford to rehearse and failures you've already seen enough times to name precisely.&lt;/p&gt;

&lt;h2&gt;
  
  
  When a rule won't gate, find its shadow
&lt;/h2&gt;

&lt;p&gt;Between "earn it from a failure" and "stay reactive" sits the move that does most of the real work, and it's the one I'd most want a reader to steal. A lot of rules look un-gateable because the thing they protect is a &lt;em&gt;meaning&lt;/em&gt;, and a script can't read meaning. "Don't restate the protocol prose in two places" isn't grep-able. Neither is "don't leave a dangling reference." The reflex is to give up and write the rule into an instructions file as a wish. The better move is to stop checking the rule and check a &lt;em&gt;structural proxy&lt;/em&gt; for it: a syntactic shadow the real rule casts, one a script can see and that almost never flags an innocent.&lt;/p&gt;

&lt;p&gt;It works more often than it has any right to. "Don't duplicate the protocol in prose" became "no narrative text in column zero of these files," because the protocol lived in indented blocks and unindented prose was the tell. "No dangling references" became "every &lt;code&gt;TODO(&lt;/code&gt; carries a live task slug." The em-dash rule guarding these very drafts is the same trick: I can't gate "don't write like a language model," but I can gate "no paragraph carries more than two em-dashes," which is a measurable shadow of the thing I actually want. Each time, a rule I'd have sworn was judgment-only turned out to cast a shape a grep could catch.&lt;/p&gt;

&lt;p&gt;The proxy is never the rule, though. A structural shadow is an approximation by construction, so every proxy gate ships with a built-in gap between what it checks and what you wish it proved. That gap is the next failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The failure that hides inside a green checkmark
&lt;/h2&gt;

&lt;p&gt;Here is the part I got wrong, and the agent caught.&lt;/p&gt;

&lt;p&gt;I had been treating "cheap to write" as most of what makes a gate worth building. The agent pushed back on two fronts. First, cheap has to mean cheap to &lt;em&gt;maintain&lt;/em&gt;, not cheap to write. A gate is a standing liability, not a one-time cost. Every legitimate evolution of the convention now has to route through the check, and you pay an update tax and a false-positive-triage tax for the life of the gate. The authoring cost is the small number.&lt;/p&gt;

&lt;p&gt;The second point is the one that changed how I think. The agent put it more sharply than I had:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Once a green gate exists, people stop eyeballing the thing the gate appears to cover. So a cheap-but-approximate gate over a real invariant can be worse than no gate — it converts active vigilance into misplaced trust.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The dominant hidden failure of a cheap gate isn't noise. It's false confidence. A cheap check usually only &lt;em&gt;approximates&lt;/em&gt; the invariant you care about. A text search proves a token is absent; it does not prove a concept was actually renamed everywhere it lives. A fast compile passes against a stale cache. A sub-agent reports "tests passed" and you don't re-run them. In every case the green checkmark covers less than it appears to.&lt;/p&gt;

&lt;p&gt;An outside read lands in the same place. Birgitta Böckeler, whose harness vocabulary I lean on below, notes that test feedback is weaker than it looks once &lt;em&gt;"the agent also generated the tests."&lt;/em&gt; A verifier the generator authored is not independent of the thing it checks, so a passing suite certifies less than it seems to — the same gap, one layer up.&lt;/p&gt;

&lt;p&gt;The rename gate from earlier is a perfect specimen — a structural proxy of exactly the kind I just praised, which makes it the more humbling, because it's one I'm proud of. It greps for the retired word and passes when the word is gone. But "the old word is absent" and "the concept was correctly migrated" are not the same claim. You can delete every instance of the old term and still have left the &lt;em&gt;idea&lt;/em&gt; it named half-translated, split across two new words that should have been one, or attached to the wrong entity. The gate is green and the migration is wrong, and now nobody's reading the diff with the old suspicion, because the check says it's handled. The check is real and worth having. It just proves a narrower thing than its green checkmark advertises, and the discipline is to keep knowing the difference instead of letting the green absolve you.&lt;/p&gt;

&lt;p&gt;That's the trap in one move. You were watching the thing carefully when nothing claimed to watch it for you. Now something claims to, so you look away, and the gap between &lt;em&gt;what the check proves&lt;/em&gt; and &lt;em&gt;what it looks like it proves&lt;/em&gt; is exactly where the next bug lives. When those two diverge and the invariant matters, that argues against the cheap gate, not for it.&lt;/p&gt;

&lt;p&gt;This isn't an argument against gates. It's an argument for knowing which of your green checkmarks are load-bearing and which are decorative.&lt;/p&gt;

&lt;h2&gt;
  
  
  It's a ladder, not a switch
&lt;/h2&gt;

&lt;p&gt;The other thing I had been collapsing was the choice itself. I'd been treating it as binary: gate the thing, or stay reactive and fix it by hand when it breaks. There are at least four rungs, and most failures want one of the middle two.&lt;/p&gt;

&lt;p&gt;The distinction isn't mine — manufacturing got there decades ago. Shigeo Shingo's &lt;a href="https://en.wikipedia.org/wiki/Poka-yoke" rel="noopener noreferrer"&gt;poka-yoke&lt;/a&gt;, the mistake-proofing of the Toyota Production System, already splits a &lt;em&gt;control&lt;/em&gt; that makes the error impossible from a &lt;em&gt;warning&lt;/em&gt; that only signals it. The two middle rungs below are those two ideas wearing software clothes. Birgitta Böckeler's &lt;a href="https://martinfowler.com/articles/harness-engineering.html" rel="noopener noreferrer"&gt;harness engineering&lt;/a&gt; gives the orthogonal axis: &lt;em&gt;guides&lt;/em&gt; that steer the agent before it acts, &lt;em&gt;sensors&lt;/em&gt; that observe after. A blocking gate is a sensor with teeth; a default path is a guide that leaves nothing to sense.&lt;/p&gt;

&lt;p&gt;The strongest rung isn't a gate at all. It's the &lt;strong&gt;default path&lt;/strong&gt; — Shingo's control type: make the wrong artifact hard to author in the first place. A template, a generator, a structured section that only has slots for the right shape. There's nothing to false-positive on, because you're not checking after the fact, you're removing the way to get it wrong. When a failure is common but a precise gate would be noisy, this beats the gate.&lt;/p&gt;

&lt;p&gt;Below the blocking gate sits the &lt;strong&gt;tripwire&lt;/strong&gt; — Shingo's warning type: warn without blocking, or ask for confirmation before an irreversible step. This is the right reach for failures that are catastrophic on first occurrence but genuinely hard to gate cleanly — the data deletion, the leak, the fabrication. You don't need a perfect detector. A loud "are you sure, here's what you're about to overwrite" buys most of the protection at a fraction of the false-positive cost. The mistake is letting "we can't build a clean gate for this" collapse into "so we stay fully reactive" on a failure you can't take twice.&lt;/p&gt;

&lt;p&gt;The rung that doesn't work is the one that looks like discipline: a script everyone is &lt;em&gt;supposed&lt;/em&gt; to run but nothing enforces. It's neither a gate nor a default path, so under any real deadline it just gets skipped. A rule that lives only in prose is a preference, however firmly phrased.&lt;/p&gt;

&lt;p&gt;So the ladder, top to bottom: blocking gate, default path, tripwire, stay reactive. "Should I gate this" was always the wrong question. The question is which rung the failure earns.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost test cuts both ways
&lt;/h2&gt;

&lt;p&gt;Because a gate is a standing liability, "earned" is a real filter, and a filter that only ever says yes isn't filtering. The same cost test that licenses a proactive gate also tells you when to leave something ungated, and I find I reach for it in that direction about as often.&lt;/p&gt;

&lt;p&gt;A concrete one: I recently changed a documentation convention across the project. A pure convention change, no behavior, no domain term retired. The reflex in a repo that leans this hard on enforcement is to add a check for the new convention. What I wrote instead was an instruction to &lt;em&gt;not&lt;/em&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Don't add a gate or script — this is a convention change only; the enforcement hook stays deferred per the cost test.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The convention isn't load-bearing, a check on it would generate churn every time the convention itself evolves, and the failure if someone gets it wrong is cosmetic. No gate. Not yet, maybe not ever.&lt;/p&gt;

&lt;p&gt;The same logic runs in reverse, against gates that already exist. A check family only grows: one iteration of mine added six at once, and nothing in the workflow ever pushes the other way. The tidy instinct is a recurring "prune the gates" ritual at every close. That instinct is right about the pressure and wrong about the mechanism, for two reasons that separate a gate from a stale paragraph of prose. A doc paragraph taxes every session, so doc-bloat is pure, constant rent; a redundant gate taxes only when it runs, so its bloat is mostly latent. And a stale doc is just noise, while a gate is a load-bearing invariant whose removal risks silently dropping a correctness check no one notices is gone. So pruning earns more care and less frequency than tidying prose, not the same recurring slot. A full portfolio review every close would usually find nothing, and a ritual that usually finds nothing is the skipped-discipline trap again, dressed as housekeeping.&lt;/p&gt;

&lt;p&gt;Two corollaries fall out. "Merge these two similar gates" defaults to &lt;em&gt;no&lt;/em&gt;: two precise checks beat one fuzzy merged one, and a merge is a design act with its own failure mode, not janitorial cleanup. And the counter-pressure that &lt;em&gt;is&lt;/em&gt; warranted should itself be mechanized rather than left to willpower — a redundancy meta-check that reads the coupling each gate already declares in its header and flags real overlap, fired on a threshold or a phase boundary, never once per commit. Even pruning gates wants a rung, not a resolution to be tidier.&lt;/p&gt;

&lt;p&gt;A good gate even pays a rent rebate. Every rule you load into the agent's always-present instructions is standing context it has to carry, and that context isn't free. A deterministic check lets you &lt;em&gt;delete&lt;/em&gt; the paragraph of prose that used to beg the model to remember a rule and replace it with a script plus a reference read only when it fires. The gate does the remembering, so the context window doesn't have to — fewer gates can mean a heavier prompt, not a lighter one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rule about gates is itself a gate
&lt;/h2&gt;

&lt;p&gt;The recursion is the part I like best: the system diagnosed the bias on itself.&lt;/p&gt;

&lt;p&gt;Everything in this project leans toward enforcement over discipline. The agent named the hazard better than I had:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the whole &lt;code&gt;check-*&lt;/code&gt; script family leans hard toward enforce-over-discipline, so a fresh session inherits a "when in doubt, add a check" texture with no stated boundary. That absence is a latent gap.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A bias toward gating with nothing written down about when &lt;em&gt;not&lt;/em&gt; to. The honest response, by this essay's own logic, is not to immediately write the boundary rule into the always-loaded instructions. That would be exactly the proactive over-gating the rule warns against. The boundary rule hasn't failed yet.&lt;/p&gt;

&lt;p&gt;So we left it ungated, with a concrete trigger: the first time any session ships a gate that turns out high-false-positive or merely approximate, &lt;em&gt;that's&lt;/em&gt; the observed failure, and the boundary rule gets promoted then — and into a read-on-demand reference, not the always-loaded file, because it's judgment-grade material rather than a per-session directive. The meta-rule about earning gates is itself being made to earn its place. The failure log is the design document, all the way up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;p&gt;This worked for one person, on one roughly 150,000-line codebase, where I own the glossary and the domain model is a single coherent thing in a single head. "Earned from failure" is cheap when the failure is yours and you hit it the same day you build the gate. At team scale the picture is harder: the standing context cost compounds, the false-positive triage falls on people who didn't feel the original pain, and "let it go wrong cheaply once" is a different proposition when the once is someone else's afternoon. I don't have evidence for the team case. I have evidence for the solo case and a suspicion the principles survive the move, which is not the same thing.&lt;/p&gt;

&lt;p&gt;One tempting claim I'll resist. I had a hunch, watching the sessions, that the agent &lt;em&gt;resists&lt;/em&gt; proposing gates proactively — that it fixes the immediate issue and only designs the enforcement when pushed. There's a clean-looking receipt for it. Asked to prevent a recurrence after I'd approved a fix, the agent replied:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Approved. Let me make the fix first, then design the enforcement check.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Fix first, gate second, only once prompted. But when I went looking for the pattern, the support was mixed. In the flow of a task it does tend to fix first and gate when asked. Hand it the same question as a matter of policy, though, and it argues for &lt;em&gt;more&lt;/em&gt; proaction than I'd proposed, not less. So I won't dress that up as a clean behavioral finding. It's a real tension I don't yet understand well enough to assert.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this generalizes
&lt;/h2&gt;

&lt;p&gt;None of this is specific to my domain. Any agent operating on a governed system — a customer-service platform, a billing pipeline, anything where words have to mean the same thing across many surfaces — accumulates the same class of drift, and the same four rungs apply.&lt;/p&gt;

&lt;p&gt;Picture the Contact Center case concretely, since it's the domain I know best. A single concept gets named in a routing rule, a reporting dashboard, an agent-facing script, and the schema of the system that ties them together. Rename it in three of those four and the fourth keeps quietly emitting the old word, so the dashboard and the router disagree about what they're counting, and nothing crashes. That's the drift, and it's invisible to every test that checks behavior rather than vocabulary. The gate that couples the surfaces is the only thing that catches a disagreement no single component is wrong about. That coupling pattern, applied in both directions, is the subject of a companion essay, &lt;a href="https://vasyltretiakov.dev/p/couple-both-ways" rel="noopener noreferrer"&gt;&lt;em&gt;Couple Both Ways&lt;/em&gt;&lt;/a&gt;. An agent makes this worse, not better, because it will cheerfully and fluently use whichever term it last saw, in whichever surface it's editing.&lt;/p&gt;

&lt;p&gt;The transferable move is to stop treating your instruction file as the place where correctness lives. Prose is where preferences live. Correctness lives in the gates, default paths, and tripwires you build, each one shaped to a failure specific enough to point at. You don't govern an agent by predicting how it'll go wrong. You let it go wrong cheaply where you can, and you convert each real miss into the cheapest rung that makes it impossible to miss that way again.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This essay was written by directing a coding agent over the project it describes; I direct and judge, the agent drafts and argues back. The argument-back, in this case, is most of beats four and five.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I build governed agent systems at the intersection of Contact Center software and AI. If that's a problem you're chewing on, I'm reachable on &lt;a href="https://www.linkedin.com/in/vasyl-tretiakov-b850231b/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Vasyl Tretiakov, "&lt;a href="https://vasyltretiakov.dev/p/rails-not-rules" rel="noopener noreferrer"&gt;Rails, Not Rules: Enforcing a Coding Agent's Domain Vocabulary with Checks&lt;/a&gt;," vasyltretiakov.dev, 1 Jun 2026 — companion essay.&lt;/li&gt;
&lt;li&gt;Vasyl Tretiakov, "&lt;a href="https://vasyltretiakov.dev/p/couple-both-ways" rel="noopener noreferrer"&gt;Couple Both Ways: bidirectional checks against silent drift&lt;/a&gt;," vasyltretiakov.dev, 9 Jun 2026 — companion essay (the coupling pattern in depth).&lt;/li&gt;
&lt;li&gt;Eric Evans, &lt;em&gt;Domain-Driven Design: Tackling Complexity in the Heart of Software&lt;/em&gt;. Addison-Wesley, 2003. ISBN 978-0321125217. (Ubiquitous language and its continuous distillation.)&lt;/li&gt;
&lt;li&gt;Birgitta Böckeler, "&lt;a href="https://martinfowler.com/articles/harness-engineering.html" rel="noopener noreferrer"&gt;Harness engineering for coding agent users&lt;/a&gt;," martinfowler.com, 2 Apr 2026 (accessed 9 Jun 2026).&lt;/li&gt;
&lt;li&gt;Birgitta Böckeler, "&lt;a href="https://www.infoq.com/podcasts/mcp-vibe-coding-harness-engineering/" rel="noopener noreferrer"&gt;From MCP and Vibe Coding to Harness Engineering: How Did AI Native Engineering Evolve in One Year&lt;/a&gt;," InfoQ podcast, 8 Jun 2026 (accessed 14 Jun 2026). Source of the "the agent also generated the tests" point.&lt;/li&gt;
&lt;li&gt;"&lt;a href="https://en.wikipedia.org/wiki/Poka-yoke" rel="noopener noreferrer"&gt;Poka-yoke&lt;/a&gt;," Wikipedia (accessed 9 Jun 2026). Shigeo Shingo's mistake-proofing; the control-vs-warning split.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Published at &lt;a href="https://vasyltretiakov.dev" rel="noopener noreferrer"&gt;vasyltretiakov.dev&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Couple Both Ways: bidirectional checks against silent drift</title>
      <dc:creator>Vasyl Tretiakov</dc:creator>
      <pubDate>Tue, 09 Jun 2026 16:50:23 +0000</pubDate>
      <link>https://dev.to/vasyltretiakov/couple-both-ways-bidirectional-checks-against-silent-drift-3onj</link>
      <guid>https://dev.to/vasyltretiakov/couple-both-ways-bidirectional-checks-against-silent-drift-3onj</guid>
      <description>&lt;p&gt;&lt;em&gt;The coupling-check pattern: catching the drift between any two artifacts that are supposed to agree.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A class diagram in my project claimed that two repository finders were each backed by a remote procedure call (RPC). Neither call existed. The diagram had been generated, reviewed, and committed, and it sailed through every check I had aimed at my diagrams, of which there were already three. It described a system slightly richer than the one I had actually built, and nothing flagged the difference.&lt;/p&gt;

&lt;p&gt;It passed because each of those three checks looked at a different axis. One validated that every type named in the diagram resolved to a defined class. One read the architecture diagram's sequence operations. One forced a specific glyph to carry a citation. Each was a perfectly good check. But none of them coupled &lt;em&gt;this&lt;/em&gt; claim, a finder asserting a backing call, to the place that claim was supposed to resolve: the actual service definition in the component's wire contract. The drift wasn't inside any one check. It lived in the seam between them, in the one direction nobody had wired up. The fix was a fourth check that closes the seam: every finder in the diagram must name a backing RPC that resolves to a real declaration in that component's protobuf, qualified by component so a pointer to the wrong service still fails, or else be tagged explicitly as proposed-but-unbuilt work. Run against the existing diagram, it found the two finders pointing at nothing. They had been asserting un-built calls, quietly, through every prior gate.&lt;/p&gt;

&lt;p&gt;I've come to think that small episode is the most transferable thing in the project, more so than the terminology linter it grew out of, which I wrote up in a companion piece, &lt;a href="https://vasyltretiakov.dev/p/rails-not-rules" rel="noopener noreferrer"&gt;&lt;em&gt;Rails, Not Rules&lt;/em&gt;&lt;/a&gt;. If the story stopped at "I wrote a checker for my diagrams," it would be a tip. What made it generalize was noticing that the check was an instance of a &lt;em&gt;shape&lt;/em&gt;, and the shape was reusable almost everywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape
&lt;/h2&gt;

&lt;p&gt;Call it &lt;strong&gt;bidirectional manifest coupling&lt;/strong&gt;. By &lt;em&gt;manifest&lt;/em&gt; I mean any artifact that declares an intent some other artifact is supposed to honor: a glossary entry, a spec section, an edge in an architecture diagram, a service in a wire contract, an item in a work queue, even a document's own table of contents. The defining trait is that a manifest makes a &lt;em&gt;claim about something else&lt;/em&gt;. The glossary claims the code uses these words. The diagram claims these components talk over these protocols. The finder claims a call backs it.&lt;/p&gt;

&lt;p&gt;A coupling check picks two artifacts that are supposed to stay in agreement, and that drift apart silently because nothing connects them, and it fails when they disagree, in &lt;em&gt;both&lt;/em&gt; directions. Often one endpoint is the manifest and the other is plain reality: the on-disk code, the filesystem, the set of services that actually exist. Just as often both endpoints are documents, and the check couples one declaration to another. The glossary and the code are one pair. The diagram and the spec are another. Once you have the move in hand, the pairs are everywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why both directions
&lt;/h2&gt;

&lt;p&gt;Bidirectionality is the load-bearing word, and it's exactly what the opening story was missing. A one-way check lets the drift relocate to the direction you didn't cover. "Every component named in the diagram exists on disk" is a fine rule, and it would have caught a diagram that referenced a deleted module. It would have done nothing about my two phantom finders, because that drift ran the other way: the diagram asserted a call the code didn't provide. Coverage in one direction quietly grants a license to drift in the other.&lt;/p&gt;

&lt;p&gt;So the cleanest checks in the project assert closure both ways at once. The architecture-coverage check is the tidy example. The diagram carries an explicit manifest of its components and edges, and the script asserts that the component set matches the on-disk components exactly, every listed component has a spec and every spec'd component is listed, and that every edge names endpoints the manifest knows about, with the wire-protocol edges further required to resolve to a real service declaration. When I built it, twenty-two components and forty-two edges agreed; a fault-injection test confirmed it goes red on a bogus edge. There is nowhere for a disagreement to hide, because both the presence and the absence of a thing are checked from both sides. That is the actual goal of the whole exercise. Not preventing the agent from ever being wrong, which is hopeless, but guaranteeing that when it is wrong, something turns red before I have to notice by eye.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first run is a free audit
&lt;/h2&gt;

&lt;p&gt;Building the check is also the first time you find out whether the two artifacts agreed in the first place. Authoring a coupling gate quietly assumes the surfaces it joins are already consistent, and they usually aren't; that latent disagreement is most of why the gate was worth writing. I added a membership check between two restated state-transition tables only after clearing them by eye, reading both and judging them consistent. Its first run failed anyway: three other surfaces agreed that one state could transition to a terminal one, and the table I'd eyeballed as canonical was the single place that had silently dropped the edge. The by-eye clearance slid right over a missing row; the mechanical set-difference did not. So I treat a new coupling check's first-run failures as findings, not as setup noise to quiet down, and I budget the time to chase them. The check earns its keep twice: once as the backfill audit that runs the day you write it, and again as the guard that holds the line afterward.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pairs are everywhere
&lt;/h2&gt;

&lt;p&gt;Once the shape is in hand it stops being a clever trick and becomes the default move. The commit where I mechanized a completely different pair names it outright: coupling a document's section index to its sections, it describes itself as "the same manifest-coupling pattern as the other check scripts." By then I wasn't inventing anything. I was reaching for a stamp I already owned.&lt;/p&gt;

&lt;p&gt;The pairs accumulated fast. The glossary and the code, the original pair, where a retired term must not reappear anywhere the compiler can't see. An architecture diagram and the spec it claims to depict. A class diagram's types and their definitions. The wire contract and the spec that documents its fields, checked for field-level congruence so a renamed field can't drift between the two. A &lt;code&gt;TODO&lt;/code&gt; left in the code and the entry that's supposed to track it in the work queue. The tool surface an agent is allowed to call and the underlying RPCs, checked for symmetry so neither grows an orphan. A document's table of contents and the document. Each pair gets one check that says: &lt;em&gt;a claim on this side must resolve to something on that side, and the reverse.&lt;/em&gt; None of these is clever in isolation. The leverage is in recognizing that they are all the same check wearing different nouns.&lt;/p&gt;

&lt;h2&gt;
  
  
  The discipline that keeps it honest
&lt;/h2&gt;

&lt;p&gt;Three things stop this from degrading into a wall of red noise, and all three are worth more than the pattern itself.&lt;/p&gt;

&lt;p&gt;The first is that a blocking check has to be false-positive-free, and the reliable way to get there is a single unambiguous syntactic trigger plus opt-in for everything else. In the behavioral-coupling check, the trigger is one diagram glyph that marks a claim as "derived." A claim wearing that glyph &lt;em&gt;must&lt;/em&gt; carry a backing citation to a spec section or a proposed task; any other claim may opt in to the same discipline but isn't forced to. Because the glyph is the one thing that fires the rule, the check cannot misfire on prose it was never meant to read. This is the same constraint I'd put behind every gate I trust: if you can't define the trigger crisply enough to be false-positive-free, the check will get bypassed, and a bypassed gate is worse than none. A coupling check that flags honest work teaches everyone to ignore it.&lt;/p&gt;

&lt;p&gt;The second is an escape hatch for work that is legitimately ahead of itself. A check that demands two artifacts agree exactly will go red the moment you deliberately let one lead the other, and design-ahead-of-code is a normal, healthy state: a diagram or a spec describing where the system is going before the code catches up. A strict bijection with no relief for that would fire on every honest in-progress diff, which is the red-noise failure arriving by a second route. So each coupling carries a release valve. A claim that hasn't landed yet gets tagged as proposed, with the tag naming the open work item that will close the gap, and the check accepts the divergence as long as the tag is there. That turns a hard "these must match" into "these must match, or the gap is tracked and intentional." The work queue becomes the ledger of what is allowed to be out of sync and why; when the tagged work lands and the tag comes off, the check snaps back to demanding exact agreement and catches anything that merged still mismatched. A bijection you can't ever postpone isn't enforceable on a moving project. A bijection with a tracked exception is.&lt;/p&gt;

&lt;p&gt;The third is knowing precisely what the check does and doesn't prove, and the honest answer is narrower than it looks. A coupling check verifies &lt;em&gt;resolution&lt;/em&gt;, not &lt;em&gt;correctness&lt;/em&gt;. It confirms that a claim points at something real, and it breaks the instant that something is renamed or deleted. It does not, and cannot, tell you the claim points at the &lt;em&gt;right&lt;/em&gt; thing. The check's own documentation says this in as many words: the forced citation breaks on rename or removal but does not detect semantic contradiction. My finder check proves each finder names a call that exists; it has no opinion on whether that's the call the finder ought to be using. That sounds like a weakness, and it is a real limit, but it's also most of the value. The overwhelming majority of drift in a fast-moving, agent-built codebase is exactly this: a referent that quietly stopped existing, a name that moved, a manifest describing a system one refactor out of date. Catching all of &lt;em&gt;that&lt;/em&gt; mechanically leaves your actual judgment for the question a script was never going to answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prior art, and where this sits
&lt;/h2&gt;

&lt;p&gt;Coupling code to a declared rule is old and well-trodden. ArchUnit lets you write tests that assert layering and dependency direction against compiled Java; dependency-cruiser does the equivalent for JavaScript and TypeScript; the whole tradition of &lt;em&gt;fitness functions&lt;/em&gt;, from Ford, Parsons, and Kua's &lt;em&gt;Building Evolutionary Architectures&lt;/em&gt; (O'Reilly, 2017), is about automated checks that protect an architectural property as a system evolves. In Birgitta Böckeler's harness-engineering vocabulary, every one of these, mine included, is a &lt;em&gt;sensor&lt;/em&gt;: something that observes after the agent acts and lets the work self-correct. I'm not claiming the mechanism.&lt;/p&gt;

&lt;p&gt;What I'd add is a direction and a generalization. The established tools aim almost entirely at architecture, couple &lt;em&gt;code&lt;/em&gt; to a &lt;em&gt;single&lt;/em&gt; rule set, and mostly run one way: the code must obey the manifest. The shape turns out to be more general than that. The second endpoint is frequently not code at all but another document, and the artifacts most prone to silent drift in an agent-built project are precisely the ones the compiler never opens: the diagrams, the glossary that carries the Domain-Driven Design (DDD) ubiquitous language, the wire contract, the work queue, a handbook's index. Those are where an agent's output and its own description of that output come apart, because nothing downstream forces them back together. Pointing the same bidirectional coupling at &lt;em&gt;those&lt;/em&gt; pairs, and insisting it run in both directions, is the part I had to work out for myself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limits
&lt;/h2&gt;

&lt;p&gt;This is one solid personal project, roughly 150,000 lines of Rust across thirty-odd crates and twenty-two components, with one person owning every artifact end to end. Treat it as a worked example, not a benchmark. The economics that make these checks cheap, a single mind deciding what each manifest means, are exactly what I haven't tested under shared ownership.&lt;/p&gt;

&lt;p&gt;The pattern also has real edges. The annotations that make a check resolvable, the citation on a diagram glyph, the backing tag on a finder, are themselves a surface that can rot, and they cost something to keep current. False-positive-freeness depends on a clean syntactic trigger existing; where no such trigger is available, you fall back to opt-in or to plain discipline, with all the looseness that implies. And the resolution-not-correctness limit is permanent, not a bug to be fixed later: these checks guarantee your manifests describe a system that exists, never that it's the system you should have built. What they buy back is the attention you'd otherwise spend chasing stale references by eye, and in a codebase that an agent is reshaping several times a week, that turns out to be most of the attention you have.&lt;/p&gt;

&lt;p&gt;There is a subtler failure than red noise, and it's the one I'd warn a reader about hardest: a coupling check is only as good as the channel it actually joins. I once wrote a check coupling alerting rules to the code that emits the metrics they name, and it passed clean: every rule named a real metric. But those values traveled two pipes that happened to share one namespace, and five rules named metrics emitted only on the pipe the alert engine never reads. The strings matched perfectly while the feature underneath was dead. The check was green over a coupling that didn't physically connect. Picking the hub, the specific producer the consumer truly reads from rather than every place the string appears, is its own design step, and getting it wrong buys you false confidence, which is worse than the false noise the discipline above is built to avoid.&lt;/p&gt;

&lt;p&gt;I build domain-dense systems by directing coding agents, where the vocabulary and the contracts have to stay honest as the code moves under them. If that's a seam you're working too, I'm reachable on &lt;a href="https://www.linkedin.com/in/vasyl-tretiakov-b850231b/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by directing an AI agent, the same way the platform it describes was built. The editing and the judgment are mine.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Vasyl Tretiakov, "&lt;a href="https://vasyltretiakov.dev/p/rails-not-rules" rel="noopener noreferrer"&gt;Rails, Not Rules: Enforcing a Coding Agent's Domain Vocabulary with Checks&lt;/a&gt;,"
vasyltretiakov.dev, 1 Jun 2026 — companion essay.&lt;/li&gt;
&lt;li&gt;Neal Ford, Rebecca Parsons, and Patrick Kua, &lt;em&gt;Building Evolutionary
Architectures: Support Constant Change&lt;/em&gt;. O'Reilly, 2017. ISBN 978-1491986356.&lt;/li&gt;
&lt;li&gt;Birgitta Böckeler, "&lt;a href="https://martinfowler.com/articles/harness-engineering.html" rel="noopener noreferrer"&gt;Harness engineering for coding agent users&lt;/a&gt;,"
martinfowler.com, 2 Apr 2026 (accessed 9 Jun 2026).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.archunit.org/" rel="noopener noreferrer"&gt;ArchUnit&lt;/a&gt; — architecture unit-testing for Java
(accessed 9 Jun 2026).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sverweij/dependency-cruiser" rel="noopener noreferrer"&gt;dependency-cruiser&lt;/a&gt; — validate
and visualize dependencies for JavaScript and TypeScript (accessed 9 Jun 2026).&lt;/li&gt;
&lt;li&gt;Eric Evans, &lt;em&gt;Domain-Driven Design: Tackling Complexity in the Heart of Software&lt;/em&gt;.
Addison-Wesley, 2003. ISBN 978-0321125217.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Published at &lt;a href="https://vasyltretiakov.dev" rel="noopener noreferrer"&gt;vasyltretiakov.dev&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Rails, Not Rules: enforcing a coding agent's domain vocabulary with checks</title>
      <dc:creator>Vasyl Tretiakov</dc:creator>
      <pubDate>Mon, 01 Jun 2026 12:28:44 +0000</pubDate>
      <link>https://dev.to/vasyltretiakov/rails-not-rules-enforcing-a-coding-agents-domain-vocabulary-with-checks-3ea4</link>
      <guid>https://dev.to/vasyltretiakov/rails-not-rules-enforcing-a-coding-agents-domain-vocabulary-with-checks-3ea4</guid>
      <description>&lt;p&gt;&lt;em&gt;Why I stopped telling my coding agent the domain language and started enforcing it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I wrote my coding agent's rules for the project's terminology as prose on April 24. I wrote a script to &lt;em&gt;check&lt;/em&gt; those rules on May 2. The first run found 737 violations, in a codebase of roughly 150,000 lines that the agent had written under my direction and that I had, ostensibly, been keeping consistent the whole way.&lt;/p&gt;

&lt;p&gt;It's worth being precise about what those 737 were, because the easy reading (&lt;em&gt;the agent ignored my rules&lt;/em&gt;) is wrong, and the real story is more useful. I'd adopted a spec-driven way of working. When the domain language needed sharpening, a rename began in the glossary, with the old term retired and the new one defined; from there it flowed into spec amendments, and only then reached the code. Each rename was supposed to be total. None of them was. A rename propagates cleanly through the parts a compiler can see, and leaves a residue everywhere it can't: a directory still named after the old concept, a string in the frontend, a term in a comment, and often enough in the specs themselves, even though the specs are far smaller than the code and were being actively rewritten as part of the rename. I kept catching that residue by eye, one piece at a time, with the uneasy sense that I was only seeing a fraction of it. The 737 was the first time I built something that could count the rest. The rules in the file weren't wrong, and weren't being defied. They just couldn't &lt;em&gt;see&lt;/em&gt; the artifacts they governed: not the code, and not even the specs being rewritten right alongside it. A sentence in a markdown file is a wish. The check was the first thing in the project that could tell me how far the wish was from true.&lt;/p&gt;

&lt;p&gt;This essay is about the move that number forced: from writing the domain language down to enforcing it mechanically. Not as a productivity hack, but as the thing that turned out to actually keep a fast-moving, agent-built codebase consistent with its own vocabulary. If you've written a careful instructions file for an AI coding tool and watched the code drift out from under it anyway, this is the same story with a receipt attached, and a more useful diagnosis than "the model won't listen."&lt;/p&gt;

&lt;h2&gt;
  
  
  Prose doesn't bind
&lt;/h2&gt;

&lt;p&gt;There's a familiar version of this. &lt;em&gt;I wrote 200 lines of rules and the model ignored them all.&lt;/em&gt; The usual explanations are context-window pressure, instruction-following limits, the rules competing with everything else in the prompt. Probably all true. But framing it as the model misbehaving misses the more general mechanism, which has nothing to do with whether the agent is trying to comply. Even a perfectly diligent collaborator, human or model, can't apply a rule to a surface they don't know has changed. The rule's presence in the prompt and the rule's effect on the codebase are two different facts, and the gap between them stays invisible exactly where nothing measures it.&lt;/p&gt;

&lt;p&gt;The most striking version of this I have on record came from the agent itself. Mid-project, after one more written reminder had failed to stick, I asked it point-blank whether putting the rule into its persistent instructions would be enough. Its answer argued against its own convenience:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Honest answer: no. … it's just discipline with extra steps … it only works if I read it and then &lt;em&gt;choose&lt;/em&gt; correctly in the moment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the whole thesis, volunteered by the system the thesis is about. A written instruction it is free to not act on is not a constraint, however firmly phrased. It's a preference. The fix can't be a better-worded preference; it has to be something that doesn't depend on the agent choosing right in the moment.&lt;/p&gt;

&lt;p&gt;The retrospective receipt arrived the same week. I had the agent split its own instructions file in two and mechanize the rules that had been pure prose. The commit message (written, like nearly all of them in this project, by the agent reflecting on its own behavior) says why in five words I keep coming back to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;misses clustered on un-gated prose rules.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When it mapped where the codebase had actually fallen out of step with its own rules, the misses weren't randomly distributed. They clustered on exactly the rules that had nothing behind them but having been written down. The rules that &lt;em&gt;did&lt;/em&gt; have a check, a script or a hook or anything that could fail a run, were not where the drift lived. The pattern has nothing to do with defiance. It's about consequence. A rule with no mechanical consequence is indistinguishable, from inside the work, from no rule at all.&lt;/p&gt;

&lt;p&gt;A single episode made this concrete. The instructions stated, in plain language, that a rename must sweep every surface in lockstep: old name retired everywhere, no stragglers (the agent's own phrasing). It's hard to write a clearer rule. And a rename still went out leaving a batch of sites unswept, not only in the stringy corners of the code a compiler never reads, but in the specs themselves, the very documents being rewritten to drive the rename. What caught it wasn't anyone re-reading the rule. It was the terminology check flagging the leftover occurrences of the old term, two work-slices after the rename had supposedly finished. The durable fix wasn't to restate the rule more firmly. It was to add a rename-completeness gate that sweeps every tracked surface before a rename can be called done. Prose discipline failed; the gate held. Once you've watched that happen to a rule you'd have sworn was unambiguous, "just write it down more clearly" stops being a credible plan.&lt;/p&gt;

&lt;p&gt;That's what you'd expect, too, if you think about what's actually steering the generation. The model optimizes toward plausible, working-looking code. Just as important, it optimizes toward a &lt;em&gt;manageable&lt;/em&gt; unit of it: it scopes each task to a meaningful chunk that fits its context budget, aiming for a sensible increment rather than an exhaustive pass it never promised, and it does not read every spec line and every file before acting. Completeness across every surface is precisely the guarantee it doesn't offer. Against that, a glossary entry that says &lt;em&gt;call it &lt;code&gt;disposition&lt;/code&gt;, not &lt;code&gt;outcome_code&lt;/code&gt;&lt;/em&gt; is a soft constraint with no gradient. Nothing pushes back when it's missed. The compiler doesn't care. The tests stay green either way. So the constraint quietly loses to every harder signal around it, and the gap surfaces much later, by eye, one stray term at a time.&lt;/p&gt;

&lt;p&gt;The conclusion is uncomfortable and, I think, correct: &lt;strong&gt;a non-negotiable rule cannot live in a prompt. It has to be a check that fails the moment the rule is broken.&lt;/strong&gt; In this project the checks run as pre-commit hooks, so the gate trips before a change can be committed. For code changes the compiler has usually already run and passed; that was never the question. The gate checks the thing the compiler can't. But the principle is the point: if a rule isn't enforced by something that can fail, it isn't a rule. It's a preference you're paying an attention-tax to maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standing on others' shoulders
&lt;/h2&gt;

&lt;p&gt;I'm not the first to land on "enforce, don't suggest," and I want to credit the people who got there first, because the idea is genuinely theirs and the piece I'm adding sits on top of it.&lt;/p&gt;

&lt;p&gt;Factory.ai's Alvin Sng made the case for &lt;a href="https://factory.ai/news/using-linters-to-direct-agents" rel="noopener noreferrer"&gt;using linters to direct agents&lt;/a&gt;: encode the conventions as lint rules the agent runs against, rather than prose it has to remember. InfoQ has been developing the &lt;a href="https://www.infoq.com/articles/architectural-governance-ai-speed/" rel="noopener noreferrer"&gt;"architectural governance at AI speed"&lt;/a&gt; argument, with fitness functions and machine-enforceable statements of architectural intent so oversight can keep pace with generation. And Birgitta Böckeler's &lt;a href="https://martinfowler.com/articles/harness-engineering.html" rel="noopener noreferrer"&gt;"harness engineering"&lt;/a&gt; gives the cleanest vocabulary I've found: a harness has &lt;em&gt;guides&lt;/em&gt;, which steer the agent before it acts, and &lt;em&gt;sensors&lt;/em&gt;, which observe after it acts and let it self-correct. In those terms, a prose rule is a guide with no sensor behind it. None of these authors would be surprised by my 737; the framing was waiting for me when I went looking.&lt;/p&gt;

&lt;p&gt;There's a deeper reason this works, and it's the thing I'd add to the conversation. A coding agent is a non-deterministic instrument with a real, non-zero cost per call: wonderful at creative, ambiguous work, but variable and dear when you lean on it for things that have a single right answer. A neighboring line of work, &lt;em&gt;LLM-as-judge&lt;/em&gt;, leans into that by having one model grade another. It's a complementary track, and not the one this essay takes, but it has landed independently on the same split I did. (I arrived at it from the work, not from their write-ups; I found the literature afterward and was glad to see the agreement.) As Braintrust &lt;a href="https://www.braintrust.dev/articles/what-is-llm-as-a-judge" rel="noopener noreferrer"&gt;puts it&lt;/a&gt;, deterministic checks should handle "everything that can be measured directly" because they're "inexpensive and predictable," while an LLM judge is reserved for the subjective dimensions that genuinely need language understanding. "Did this rename reach every surface?" is not a subjective question. So rather than reach for a second model to watch the first, I push it onto a formal, deterministic tool: a script. That isn't just cheaper and more reliable than asking the model to remember. It &lt;em&gt;frees the model's headroom&lt;/em&gt; for the work only it can do. Every rule you move from the probabilistic plane to the formal one buys back creative capacity you were spending on bookkeeping.&lt;/p&gt;

&lt;p&gt;So where does that leave the contribution? The prior art aims this enforcement mostly at &lt;em&gt;architecture&lt;/em&gt;: layering, dependency direction, the shape of the system. That's the natural first target, since a forbidden import is a one-line grep and an architectural violation is loud. What I found is that the same enforcement belongs one layer in, at the thing Domain-Driven Design (DDD) says actually carries the meaning: the &lt;strong&gt;ubiquitous language&lt;/strong&gt; (UL), the shared vocabulary that runs from a conversation with a domain expert all the way to a class name. And Eric Evans, in &lt;em&gt;Domain-Driven Design&lt;/em&gt; (Addison-Wesley, 2003), is emphatic that this language is not a glossary you write once. It is &lt;em&gt;crunched&lt;/em&gt; and distilled continuously; a ubiquitous language, he writes, must &lt;em&gt;evolve&lt;/em&gt; as the team's understanding sharpens. That evolution was the engine of all my renames.&lt;/p&gt;

&lt;p&gt;Here's the twist that's specific to doing DDD with an agent. An LLM is, in one sense, a gift to Evans' program: it makes evolving the language dramatically faster and cheaper, so you can carry a rename through specs and code in an afternoon that would once have been a week's careful refactor. But the same speed multiplies the failure mode. Every fast, agent-driven rename sheds residue on the surfaces the agent didn't think to sweep, and it sheds it faster than you can track by eye. The tool that finally makes continuous distillation practical is the same tool that makes its incompleteness dangerous. That is the edge I hit, and it's why a static instructions file is a liability here in particular: the words it pins down are exactly the words you're now changing several times a week. The obvious move, the one I reached for first, is to &lt;em&gt;feed the glossary to the model as context&lt;/em&gt; and trust it to comply. That's UL-as-context, and it's precisely the prose-rule that produced my 737. The wedge isn't "enforce the UL &lt;em&gt;instead of&lt;/em&gt; the architecture." It's that enforcement belongs at the UL layer too, and, once you have the habit, at every layer where two artifacts are supposed to agree.&lt;/p&gt;

&lt;h2&gt;
  
  
  The case study: a linter for meaning
&lt;/h2&gt;

&lt;p&gt;The first gate was the terminology check that produced the 737. I built it because terminology inconsistency was already a known, nagging problem, one I'd written down as something we were struggling with. The same concept kept showing up under two names across code and docs, accumulating faster than I could chase it. So the 737 wasn't a gotcha. It was the first honest measurement of a debt I already knew I was carrying. Mechanically the check is unglamorous: it scans the source (Rust, markdown, config, protos, the TypeScript frontend) for retired names and for terms that don't match the glossary. It's a grep, effectively instant, and it runs automatically, on a hook when the agent reads a spec and again before any commit can land.&lt;/p&gt;

&lt;p&gt;The question worth dwelling on is &lt;em&gt;why it has to exist at all&lt;/em&gt;, given everything already guarding the code. I wasn't relying on the agent's memory alone. I had a compiler, a test suite, a glossary (the written home of the ubiquitous language) that named every old and new term, spec amendments derived from that glossary, and separate implementation sessions that turned those specs into code. That is a lot of discipline. And the residue still got through, because every one of those guardrails operates either above the code, like the glossary and the specs, or below the level of &lt;em&gt;meaning&lt;/em&gt;, like the compiler, which type-checks structure and not vocabulary. A struct called &lt;code&gt;OutcomeCode&lt;/code&gt; and a struct called &lt;code&gt;Disposition&lt;/code&gt; are equally valid Rust; only one of them is the agreed word. The terminology check occupies the band nothing else watched: string forms, enum variants, names in comments and specs, the directory called after a concept that no longer exists, the retired term that &lt;em&gt;reads&lt;/em&gt; fine and so sails through review, including the agent's review of its own work. It is a linter for meaning, and meaning is where domain bugs are born.&lt;/p&gt;

&lt;p&gt;One concrete instance taught me the mechanism better than any amount of theory. I renamed a module, and weeks later noticed, by eye, that a directory was still named after the old one. The rename had reached every place the compiler and the IDE's rename tooling could follow, and stopped at the edge of what they understand: a folder name is just a string. My first reaction is the part that matters. I didn't think &lt;em&gt;I should be more careful.&lt;/em&gt; I thought: &lt;em&gt;did I forget to register the old name as retired, so the check skipped right over it?&lt;/em&gt; That was exactly it. The check hadn't failed. It had faithfully looked for what it had been told to look for, and the residue was sitting where no rule yet pointed. The lesson isn't "even one word doesn't bind." It's sharper than that: a rename is only as complete as the mechanical check that can see the surfaces the compiler can't, and that check sees only what the glossary has taught it to see.&lt;/p&gt;

&lt;p&gt;Which is also why the honest limits matter, and why I keep them on the record. At one point I needed to retire a term whose name collided with an unrelated field in an audit schema, and no mechanical rule could tell the two uses apart. The commit, again in the agent's own words, recorded the decision to &lt;em&gt;deliberately not add that rule&lt;/em&gt;, on the grounds that you skip a rule rather than ship one that over-flags. That cuts in an inconvenient direction for a tidy thesis: defining a good formal check is not always easy, and sometimes it isn't possible. But the reasoning is right, and it's sharper when the check is a hard gate. Because these checks &lt;em&gt;block commits&lt;/em&gt;, a false positive can't simply be tuned out the way a noisy warning is. It does something worse. It forces you to bypass the gate, or to water the rule down until it stops flagging, and either way the gate quietly loses its authority, the very thing that made it worth more than prose. An over-flagging rule isn't a minor annoyance to live with. It's a slow repeal of the mechanism. That's why skipping the rule was the disciplined call. The blind spots, the generic terms you must skip and the placements the scan can't reach, aren't a footnote to apologize for. They're the boundary of the technique, and naming them honestly is what keeps a green checkmark meaning something. The gates are never finished, either: each real miss teaches the glossary to see one more thing, and the boundary moves outward over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scope and honest limits
&lt;/h2&gt;

&lt;p&gt;A word on the scope, because the setting does a lot of the work. This is a personal, still-evolving project, built entirely by directing agents, with one person, me, owning the ubiquitous language end to end. Treat what follows as a worked example, not a benchmark.&lt;/p&gt;

&lt;p&gt;Within that scope the result is clear, and it's the part I find interesting. Enforcing the language mechanically does what prose never managed. It holds the vocabulary consistent across a fast-moving, agent-written codebase, turns a class of silent regressions into loud ones, and gives me back the attention I'd otherwise spend on by-eye policing, attention that goes into design instead. The project is still under active development, and the gates earn their place daily. That is the contribution I stand behind.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;principle&lt;/strong&gt; travels: unenforced rules don't bind, and a ubiquitous language is worth enforcing, not merely describing. What I haven't tested is what &lt;em&gt;shared ownership&lt;/em&gt; does to it. My checks are cheap partly because one mind decides what a word means. The moment several people co-own the glossary, new questions open up: who arbitrates a contested term, what a growing body of checks costs to maintain, whether false-positive fatigue sets in and people start bypassing gates. I raise them because they're genuinely open, not because I think they sink the idea.&lt;/p&gt;

&lt;p&gt;One point cuts against the obvious worry, that all this governance must be expensive in agent tokens. In my experience it ran the other way. Before the gates, keeping the language consistent meant re-running broad, non-deterministic refactor passes and &lt;em&gt;still&lt;/em&gt; not trusting the result: high cost, uneven quality. Formal checks gave a deterministic, good-enough verdict, so I stopped paying for repeat LLM passes I didn't believe in. They also let me &lt;em&gt;shrink&lt;/em&gt; the always-loaded instructions. The rules got short, and the edge-case detail moved into an on-demand handbook that a failing check's output points to, exactly when it's relevant. Cheap scripts standing in for repeated probabilistic passes moved the net cost down, not up, which fits the earlier point about shifting work off the probabilistic plane.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this came from
&lt;/h2&gt;

&lt;p&gt;The starting goal was just a domain-dense Contact Center (CC) platform, built by directing coding agents. The vocabulary kept slipping, and the technique fell out of that collision. I care about methodology in general, which is why I invested in polishing what I found and adding a few pieces of my own, but I didn't go looking for one here. It came from the seam: deep domain language on one side, AI-assisted engineering on the other, and the discovery that the seam holds only when the rules can fail on their own rather than depend on anyone, human or model, remembering them.&lt;/p&gt;

&lt;p&gt;That intersection, Contact Center domain rigor met with agent-driven delivery, is where I spend my time now. If it's a problem you're working on too, I'm reachable on &lt;a href="https://www.linkedin.com/in/vasyl-tretiakov-b850231b/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by directing an AI agent, the same way the platform it describes was built. The editing and the judgment are mine.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Alvin Sng, "Using Linters to Direct Agents," Factory.ai, 5 Sep 2025.
&lt;a href="https://factory.ai/news/using-linters-to-direct-agents" rel="noopener noreferrer"&gt;https://factory.ai/news/using-linters-to-direct-agents&lt;/a&gt; (accessed 1 Jun 2026).&lt;/li&gt;
&lt;li&gt;"Architectural Governance at AI Speed," InfoQ, 26 Mar 2026.
&lt;a href="https://www.infoq.com/articles/architectural-governance-ai-speed/" rel="noopener noreferrer"&gt;https://www.infoq.com/articles/architectural-governance-ai-speed/&lt;/a&gt; (accessed 1 Jun 2026).&lt;/li&gt;
&lt;li&gt;Birgitta Böckeler, "Harness engineering for coding agent users," martinfowler.com,
2 Apr 2026. &lt;a href="https://martinfowler.com/articles/harness-engineering.html" rel="noopener noreferrer"&gt;https://martinfowler.com/articles/harness-engineering.html&lt;/a&gt; (accessed 1 Jun 2026).&lt;/li&gt;
&lt;li&gt;"What is an LLM-as-a-judge?" Braintrust, 26 Feb 2026.
&lt;a href="https://www.braintrust.dev/articles/what-is-llm-as-a-judge" rel="noopener noreferrer"&gt;https://www.braintrust.dev/articles/what-is-llm-as-a-judge&lt;/a&gt; (accessed 1 Jun 2026).&lt;/li&gt;
&lt;li&gt;Eric Evans, &lt;em&gt;Domain-Driven Design: Tackling Complexity in the Heart of Software&lt;/em&gt;.
Addison-Wesley, 2003. ISBN 978-0321125217.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>domaindrivendesign</category>
    </item>
  </channel>
</rss>
