<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vasyl Tretiakov</title>
    <description>The latest articles on DEV Community by Vasyl Tretiakov (@vasyltretiakov).</description>
    <link>https://dev.to/vasyltretiakov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3962696%2Fbc2d96fa-fec7-4888-bef4-60cf52ebb67d.jpg</url>
      <title>DEV Community: Vasyl Tretiakov</title>
      <link>https://dev.to/vasyltretiakov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vasyltretiakov"/>
    <language>en</language>
    <item>
      <title>Rails, Not Rules: enforcing a coding agent's domain vocabulary with checks</title>
      <dc:creator>Vasyl Tretiakov</dc:creator>
      <pubDate>Mon, 01 Jun 2026 12:28:44 +0000</pubDate>
      <link>https://dev.to/vasyltretiakov/rails-not-rules-enforcing-a-coding-agents-domain-vocabulary-with-checks-3ea4</link>
      <guid>https://dev.to/vasyltretiakov/rails-not-rules-enforcing-a-coding-agents-domain-vocabulary-with-checks-3ea4</guid>
      <description>&lt;p&gt;&lt;em&gt;Why I stopped telling my coding agent the domain language and started enforcing it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I wrote my coding agent's rules for the project's terminology as prose on April 24. I wrote a script to &lt;em&gt;check&lt;/em&gt; those rules on May 2. The first run found 737 violations, in a codebase of roughly 150,000 lines that the agent had written under my direction and that I had, ostensibly, been keeping consistent the whole way.&lt;/p&gt;

&lt;p&gt;It's worth being precise about what those 737 were, because the easy reading (&lt;em&gt;the agent ignored my rules&lt;/em&gt;) is wrong, and the real story is more useful. I'd adopted a spec-driven way of working. When the domain language needed sharpening, a rename began in the glossary, with the old term retired and the new one defined; from there it flowed into spec amendments, and only then reached the code. Each rename was supposed to be total. None of them was. A rename propagates cleanly through the parts a compiler can see, and leaves a residue everywhere it can't: a directory still named after the old concept, a string in the frontend, a term in a comment, and often enough in the specs themselves, even though the specs are far smaller than the code and were being actively rewritten as part of the rename. I kept catching that residue by eye, one piece at a time, with the uneasy sense that I was only seeing a fraction of it. The 737 was the first time I built something that could count the rest. The rules in the file weren't wrong, and weren't being defied. They just couldn't &lt;em&gt;see&lt;/em&gt; the artifacts they governed: not the code, and not even the specs being rewritten right alongside it. A sentence in a markdown file is a wish. The check was the first thing in the project that could tell me how far the wish was from true.&lt;/p&gt;

&lt;p&gt;This essay is about the move that number forced: from writing the domain language down to enforcing it mechanically. Not as a productivity hack, but as the thing that turned out to actually keep a fast-moving, agent-built codebase consistent with its own vocabulary. If you've written a careful instructions file for an AI coding tool and watched the code drift out from under it anyway, this is the same story with a receipt attached, and a more useful diagnosis than "the model won't listen."&lt;/p&gt;

&lt;h2&gt;
  
  
  Prose doesn't bind
&lt;/h2&gt;

&lt;p&gt;There's a familiar version of this. &lt;em&gt;I wrote 200 lines of rules and the model ignored them all.&lt;/em&gt; The usual explanations are context-window pressure, instruction-following limits, the rules competing with everything else in the prompt. Probably all true. But framing it as the model misbehaving misses the more general mechanism, which has nothing to do with whether the agent is trying to comply. Even a perfectly diligent collaborator, human or model, can't apply a rule to a surface they don't know has changed. The rule's presence in the prompt and the rule's effect on the codebase are two different facts, and the gap between them stays invisible exactly where nothing measures it.&lt;/p&gt;

&lt;p&gt;The most striking version of this I have on record came from the agent itself. Mid-project, after one more written reminder had failed to stick, I asked it point-blank whether putting the rule into its persistent instructions would be enough. Its answer argued against its own convenience:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Honest answer: no. … it's just discipline with extra steps … it only works if I read it and then &lt;em&gt;choose&lt;/em&gt; correctly in the moment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the whole thesis, volunteered by the system the thesis is about. A written instruction it is free to not act on is not a constraint, however firmly phrased. It's a preference. The fix can't be a better-worded preference; it has to be something that doesn't depend on the agent choosing right in the moment.&lt;/p&gt;

&lt;p&gt;The retrospective receipt arrived the same week. I had the agent split its own instructions file in two and mechanize the rules that had been pure prose. The commit message (written, like nearly all of them in this project, by the agent reflecting on its own behavior) says why in five words I keep coming back to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;misses clustered on un-gated prose rules.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When it mapped where the codebase had actually fallen out of step with its own rules, the misses weren't randomly distributed. They clustered on exactly the rules that had nothing behind them but having been written down. The rules that &lt;em&gt;did&lt;/em&gt; have a check, a script or a hook or anything that could fail a run, were not where the drift lived. The pattern has nothing to do with defiance. It's about consequence. A rule with no mechanical consequence is indistinguishable, from inside the work, from no rule at all.&lt;/p&gt;

&lt;p&gt;A single episode made this concrete. The instructions stated, in plain language, that a rename must sweep every surface in lockstep: old name retired everywhere, no stragglers (the agent's own phrasing). It's hard to write a clearer rule. And a rename still went out leaving a batch of sites unswept, not only in the stringy corners of the code a compiler never reads, but in the specs themselves, the very documents being rewritten to drive the rename. What caught it wasn't anyone re-reading the rule. It was the terminology check flagging the leftover occurrences of the old term, two work-slices after the rename had supposedly finished. The durable fix wasn't to restate the rule more firmly. It was to add a rename-completeness gate that sweeps every tracked surface before a rename can be called done. Prose discipline failed; the gate held. Once you've watched that happen to a rule you'd have sworn was unambiguous, "just write it down more clearly" stops being a credible plan.&lt;/p&gt;

&lt;p&gt;That's what you'd expect, too, if you think about what's actually steering the generation. The model optimizes toward plausible, working-looking code. Just as important, it optimizes toward a &lt;em&gt;manageable&lt;/em&gt; unit of it: it scopes each task to a meaningful chunk that fits its context budget, aiming for a sensible increment rather than an exhaustive pass it never promised, and it does not read every spec line and every file before acting. Completeness across every surface is precisely the guarantee it doesn't offer. Against that, a glossary entry that says &lt;em&gt;call it &lt;code&gt;disposition&lt;/code&gt;, not &lt;code&gt;outcome_code&lt;/code&gt;&lt;/em&gt; is a soft constraint with no gradient. Nothing pushes back when it's missed. The compiler doesn't care. The tests stay green either way. So the constraint quietly loses to every harder signal around it, and the gap surfaces much later, by eye, one stray term at a time.&lt;/p&gt;

&lt;p&gt;The conclusion is uncomfortable and, I think, correct: &lt;strong&gt;a non-negotiable rule cannot live in a prompt. It has to be a check that fails the moment the rule is broken.&lt;/strong&gt; In this project the checks run as pre-commit hooks, so the gate trips before a change can be committed. For code changes the compiler has usually already run and passed; that was never the question. The gate checks the thing the compiler can't. But the principle is the point: if a rule isn't enforced by something that can fail, it isn't a rule. It's a preference you're paying an attention-tax to maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standing on others' shoulders
&lt;/h2&gt;

&lt;p&gt;I'm not the first to land on "enforce, don't suggest," and I want to credit the people who got there first, because the idea is genuinely theirs and the piece I'm adding sits on top of it.&lt;/p&gt;

&lt;p&gt;Factory.ai's Alvin Sng made the case for &lt;a href="https://factory.ai/news/using-linters-to-direct-agents" rel="noopener noreferrer"&gt;using linters to direct agents&lt;/a&gt;: encode the conventions as lint rules the agent runs against, rather than prose it has to remember. InfoQ has been developing the &lt;a href="https://www.infoq.com/articles/architectural-governance-ai-speed/" rel="noopener noreferrer"&gt;"architectural governance at AI speed"&lt;/a&gt; argument, with fitness functions and machine-enforceable statements of architectural intent so oversight can keep pace with generation. And Birgitta Böckeler's &lt;a href="https://martinfowler.com/articles/harness-engineering.html" rel="noopener noreferrer"&gt;"harness engineering"&lt;/a&gt; gives the cleanest vocabulary I've found: a harness has &lt;em&gt;guides&lt;/em&gt;, which steer the agent before it acts, and &lt;em&gt;sensors&lt;/em&gt;, which observe after it acts and let it self-correct. In those terms, a prose rule is a guide with no sensor behind it. None of these authors would be surprised by my 737; the framing was waiting for me when I went looking.&lt;/p&gt;

&lt;p&gt;There's a deeper reason this works, and it's the thing I'd add to the conversation. A coding agent is a non-deterministic instrument with a real, non-zero cost per call: wonderful at creative, ambiguous work, but variable and dear when you lean on it for things that have a single right answer. A neighboring line of work, &lt;em&gt;LLM-as-judge&lt;/em&gt;, leans into that by having one model grade another. It's a complementary track, and not the one this essay takes, but it has landed independently on the same split I did. (I arrived at it from the work, not from their write-ups; I found the literature afterward and was glad to see the agreement.) As Braintrust &lt;a href="https://www.braintrust.dev/articles/what-is-llm-as-a-judge" rel="noopener noreferrer"&gt;puts it&lt;/a&gt;, deterministic checks should handle "everything that can be measured directly" because they're "inexpensive and predictable," while an LLM judge is reserved for the subjective dimensions that genuinely need language understanding. "Did this rename reach every surface?" is not a subjective question. So rather than reach for a second model to watch the first, I push it onto a formal, deterministic tool: a script. That isn't just cheaper and more reliable than asking the model to remember. It &lt;em&gt;frees the model's headroom&lt;/em&gt; for the work only it can do. Every rule you move from the probabilistic plane to the formal one buys back creative capacity you were spending on bookkeeping.&lt;/p&gt;

&lt;p&gt;So where does that leave the contribution? The prior art aims this enforcement mostly at &lt;em&gt;architecture&lt;/em&gt;: layering, dependency direction, the shape of the system. That's the natural first target, since a forbidden import is a one-line grep and an architectural violation is loud. What I found is that the same enforcement belongs one layer in, at the thing Domain-Driven Design (DDD) says actually carries the meaning: the &lt;strong&gt;ubiquitous language&lt;/strong&gt; (UL), the shared vocabulary that runs from a conversation with a domain expert all the way to a class name. And Eric Evans, in &lt;em&gt;Domain-Driven Design&lt;/em&gt; (Addison-Wesley, 2003), is emphatic that this language is not a glossary you write once. It is &lt;em&gt;crunched&lt;/em&gt; and distilled continuously; a ubiquitous language, he writes, must &lt;em&gt;evolve&lt;/em&gt; as the team's understanding sharpens. That evolution was the engine of all my renames.&lt;/p&gt;

&lt;p&gt;Here's the twist that's specific to doing DDD with an agent. An LLM is, in one sense, a gift to Evans' program: it makes evolving the language dramatically faster and cheaper, so you can carry a rename through specs and code in an afternoon that would once have been a week's careful refactor. But the same speed multiplies the failure mode. Every fast, agent-driven rename sheds residue on the surfaces the agent didn't think to sweep, and it sheds it faster than you can track by eye. The tool that finally makes continuous distillation practical is the same tool that makes its incompleteness dangerous. That is the edge I hit, and it's why a static instructions file is a liability here in particular: the words it pins down are exactly the words you're now changing several times a week. The obvious move, the one I reached for first, is to &lt;em&gt;feed the glossary to the model as context&lt;/em&gt; and trust it to comply. That's UL-as-context, and it's precisely the prose-rule that produced my 737. The wedge isn't "enforce the UL &lt;em&gt;instead of&lt;/em&gt; the architecture." It's that enforcement belongs at the UL layer too, and, once you have the habit, at every layer where two artifacts are supposed to agree.&lt;/p&gt;

&lt;h2&gt;
  
  
  The case study: a linter for meaning
&lt;/h2&gt;

&lt;p&gt;The first gate was the terminology check that produced the 737. I built it because terminology inconsistency was already a known, nagging problem, one I'd written down as something we were struggling with. The same concept kept showing up under two names across code and docs, accumulating faster than I could chase it. So the 737 wasn't a gotcha. It was the first honest measurement of a debt I already knew I was carrying. Mechanically the check is unglamorous: it scans the source (Rust, markdown, config, protos, the TypeScript frontend) for retired names and for terms that don't match the glossary. It's a grep, effectively instant, and it runs automatically, on a hook when the agent reads a spec and again before any commit can land.&lt;/p&gt;

&lt;p&gt;The question worth dwelling on is &lt;em&gt;why it has to exist at all&lt;/em&gt;, given everything already guarding the code. I wasn't relying on the agent's memory alone. I had a compiler, a test suite, a glossary (the written home of the ubiquitous language) that named every old and new term, spec amendments derived from that glossary, and separate implementation sessions that turned those specs into code. That is a lot of discipline. And the residue still got through, because every one of those guardrails operates either above the code, like the glossary and the specs, or below the level of &lt;em&gt;meaning&lt;/em&gt;, like the compiler, which type-checks structure and not vocabulary. A struct called &lt;code&gt;OutcomeCode&lt;/code&gt; and a struct called &lt;code&gt;Disposition&lt;/code&gt; are equally valid Rust; only one of them is the agreed word. The terminology check occupies the band nothing else watched: string forms, enum variants, names in comments and specs, the directory called after a concept that no longer exists, the retired term that &lt;em&gt;reads&lt;/em&gt; fine and so sails through review, including the agent's review of its own work. It is a linter for meaning, and meaning is where domain bugs are born.&lt;/p&gt;

&lt;p&gt;One concrete instance taught me the mechanism better than any amount of theory. I renamed a module, and weeks later noticed, by eye, that a directory was still named after the old one. The rename had reached every place the compiler and the IDE's rename tooling could follow, and stopped at the edge of what they understand: a folder name is just a string. My first reaction is the part that matters. I didn't think &lt;em&gt;I should be more careful.&lt;/em&gt; I thought: &lt;em&gt;did I forget to register the old name as retired, so the check skipped right over it?&lt;/em&gt; That was exactly it. The check hadn't failed. It had faithfully looked for what it had been told to look for, and the residue was sitting where no rule yet pointed. The lesson isn't "even one word doesn't bind." It's sharper than that: a rename is only as complete as the mechanical check that can see the surfaces the compiler can't, and that check sees only what the glossary has taught it to see.&lt;/p&gt;

&lt;p&gt;Which is also why the honest limits matter, and why I keep them on the record. At one point I needed to retire a term whose name collided with an unrelated field in an audit schema, and no mechanical rule could tell the two uses apart. The commit, again in the agent's own words, recorded the decision to &lt;em&gt;deliberately not add that rule&lt;/em&gt;, on the grounds that you skip a rule rather than ship one that over-flags. That cuts in an inconvenient direction for a tidy thesis: defining a good formal check is not always easy, and sometimes it isn't possible. But the reasoning is right, and it's sharper when the check is a hard gate. Because these checks &lt;em&gt;block commits&lt;/em&gt;, a false positive can't simply be tuned out the way a noisy warning is. It does something worse. It forces you to bypass the gate, or to water the rule down until it stops flagging, and either way the gate quietly loses its authority, the very thing that made it worth more than prose. An over-flagging rule isn't a minor annoyance to live with. It's a slow repeal of the mechanism. That's why skipping the rule was the disciplined call. The blind spots, the generic terms you must skip and the placements the scan can't reach, aren't a footnote to apologize for. They're the boundary of the technique, and naming them honestly is what keeps a green checkmark meaning something. The gates are never finished, either: each real miss teaches the glossary to see one more thing, and the boundary moves outward over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scope and honest limits
&lt;/h2&gt;

&lt;p&gt;A word on the scope, because the setting does a lot of the work. This is a personal, still-evolving project, built entirely by directing agents, with one person, me, owning the ubiquitous language end to end. Treat what follows as a worked example, not a benchmark.&lt;/p&gt;

&lt;p&gt;Within that scope the result is clear, and it's the part I find interesting. Enforcing the language mechanically does what prose never managed. It holds the vocabulary consistent across a fast-moving, agent-written codebase, turns a class of silent regressions into loud ones, and gives me back the attention I'd otherwise spend on by-eye policing, attention that goes into design instead. The project is still under active development, and the gates earn their place daily. That is the contribution I stand behind.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;principle&lt;/strong&gt; travels: unenforced rules don't bind, and a ubiquitous language is worth enforcing, not merely describing. What I haven't tested is what &lt;em&gt;shared ownership&lt;/em&gt; does to it. My checks are cheap partly because one mind decides what a word means. The moment several people co-own the glossary, new questions open up: who arbitrates a contested term, what a growing body of checks costs to maintain, whether false-positive fatigue sets in and people start bypassing gates. I raise them because they're genuinely open, not because I think they sink the idea.&lt;/p&gt;

&lt;p&gt;One point cuts against the obvious worry, that all this governance must be expensive in agent tokens. In my experience it ran the other way. Before the gates, keeping the language consistent meant re-running broad, non-deterministic refactor passes and &lt;em&gt;still&lt;/em&gt; not trusting the result: high cost, uneven quality. Formal checks gave a deterministic, good-enough verdict, so I stopped paying for repeat LLM passes I didn't believe in. They also let me &lt;em&gt;shrink&lt;/em&gt; the always-loaded instructions. The rules got short, and the edge-case detail moved into an on-demand handbook that a failing check's output points to, exactly when it's relevant. Cheap scripts standing in for repeated probabilistic passes moved the net cost down, not up, which fits the earlier point about shifting work off the probabilistic plane.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this came from
&lt;/h2&gt;

&lt;p&gt;The starting goal was just a domain-dense Contact Center (CC) platform, built by directing coding agents. The vocabulary kept slipping, and the technique fell out of that collision. I care about methodology in general, which is why I invested in polishing what I found and adding a few pieces of my own, but I didn't go looking for one here. It came from the seam: deep domain language on one side, AI-assisted engineering on the other, and the discovery that the seam holds only when the rules can fail on their own rather than depend on anyone, human or model, remembering them.&lt;/p&gt;

&lt;p&gt;That intersection, Contact Center domain rigor met with agent-driven delivery, is where I spend my time now. If it's a problem you're working on too, I'm reachable on &lt;a href="https://www.linkedin.com/in/vasyl-tretiakov-b850231b/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by directing an AI agent, the same way the platform it describes was built. The editing and the judgment are mine.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Alvin Sng, "Using Linters to Direct Agents," Factory.ai, 5 Sep 2025.
&lt;a href="https://factory.ai/news/using-linters-to-direct-agents" rel="noopener noreferrer"&gt;https://factory.ai/news/using-linters-to-direct-agents&lt;/a&gt; (accessed 1 Jun 2026).&lt;/li&gt;
&lt;li&gt;"Architectural Governance at AI Speed," InfoQ, 26 Mar 2026.
&lt;a href="https://www.infoq.com/articles/architectural-governance-ai-speed/" rel="noopener noreferrer"&gt;https://www.infoq.com/articles/architectural-governance-ai-speed/&lt;/a&gt; (accessed 1 Jun 2026).&lt;/li&gt;
&lt;li&gt;Birgitta Böckeler, "Harness engineering for coding agent users," martinfowler.com,
2 Apr 2026. &lt;a href="https://martinfowler.com/articles/harness-engineering.html" rel="noopener noreferrer"&gt;https://martinfowler.com/articles/harness-engineering.html&lt;/a&gt; (accessed 1 Jun 2026).&lt;/li&gt;
&lt;li&gt;"What is an LLM-as-a-judge?" Braintrust, 26 Feb 2026.
&lt;a href="https://www.braintrust.dev/articles/what-is-llm-as-a-judge" rel="noopener noreferrer"&gt;https://www.braintrust.dev/articles/what-is-llm-as-a-judge&lt;/a&gt; (accessed 1 Jun 2026).&lt;/li&gt;
&lt;li&gt;Eric Evans, &lt;em&gt;Domain-Driven Design: Tackling Complexity in the Heart of Software&lt;/em&gt;.
Addison-Wesley, 2003. ISBN 978-0321125217.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>domaindrivendesign</category>
    </item>
  </channel>
</rss>
