Vasyl Tretiakov

Posted on Jun 22 • Originally published at vasyltretiakov.dev

Make It a Check: compounding a coding agent's lessons as gates, not CLAUDE.md prose

#ai #agents #programming #llm

Compound engineering writes each lesson into the agent's prose. The ones that matter should be checks instead: prose drifts, a gate doesn't.

The canonical guide to compound engineering describes its final step, the one it names compound, in plain instructions: once a task is done, you "Add new patterns into CLAUDE.md, the file the agent reads at the start of every session." I followed that advice for months. The patterns piled up, the agent dutifully re-read them at the start of every session, and a convention I genuinely cared about kept drifting anyway. The rule was right there in the always-loaded file. The agent's output drifted from it under context pressure, fluently and confidently, reaching for whichever term it had seen most recently in the file it was editing rather than the one the rule named. That was never defiance. The model doesn't decide to break a rule; under a crowded prompt it simply weights the tokens in front of it above a line it read far earlier.

The fix that finally held was not a better paragraph. It was a twelve-line script that reads the staged diff and blocks the commit when the wrong word appears. The first time I ran it across the project, it flagged 737 violations of a rule I had already written down and trusted the agent to follow. (I told that terminology-drift story in full in Rails, Not Rules.) That gap between "written down" and "enforced" is the whole essay. The lesson worth keeping should not be added to the prose the agent must remember. It should become the thing that catches the mistake automatically the next time.

The loop is right. The last step isn't.

Compound engineering gets the loop right, and I want to concede that before I disagree with any of it. The premise, that "each unit of engineering work should make subsequent units easier, not harder," is exactly the right ambition for directing an agent. You plan, you work, you review, and then you compound: you feed what you learned back so the next run starts smarter. I run a version of that loop every day, over a codebase of roughly 150,000 lines that a coding agent and I built together, and four published essays came out of watching it run.

My disagreement is narrow and it lives entirely in the last step. Compound engineering's compound step terminates in remembered prose, and remembered prose is the one thing this loop has taught me an agent cannot be trusted to honor. Here is the honest part, though: the same guide, in the very last sub-step of compound, asks a sharper question than its own advice answers. It says to check whether "the system [would] catch this automatically next time." That sentence is the argument I want to make, left sitting as an aspiration next to the instruction that undercuts it. Don't add the lesson to the file the agent reads every session and still applies unevenly. Build the thing that catches the mistake automatically next time.

Why prose loses and a check wins

The reason is an asymmetry I spent a whole companion essay on, so I will state it once and move on. An LLM is a high-throughput, low-consistency generator: prolific, genuinely creative, and unreliable about rule forty-seven when that rule is one line among hundreds of always-loaded instructions competing for its attention. A paragraph in an instruction file leans on that generator to remember and apply the rule every single time. A check does the opposite. It is a cheap, deterministic verifier sitting in front of the generator, and verifying that a constraint holds is far cheaper and far more trustworthy than generating output that satisfies it on the first try, every time. Putting the rule in prose bets on the unreliable half. Putting it in a check bets on the reliable half.

There is a cost asymmetry stacked on top of the reliability one, and it is the part that surprised me. When I worked through whether a slightly redundant check was worth keeping, the accounting came out lopsided: "a doc paragraph costs tokens every session; a redundant gate costs only when it runs." The part of that cost the agent actually pays, tokens in its context, is narrower still. A check that passes is silent; the agent never sees it and pays nothing. Only a failure spends tokens, and only to feed back the one message that steers the next attempt. Prose in the always-loaded file has no such off switch. It is a tax the agent pays on every task, forever, whether or not the rule is relevant to what it is doing, and the prose you accumulate to make the agent smarter slowly crowds out everything else it needs to hold in mind. Compounding lessons as text does not just drift. It bloats the exact surface whose scarcity caused the drift.

The loop, named as a loop

If a check beats a paragraph, the practice is no longer "compound the lessons" but something more specific. Observe the failure as it surfaces in the work, grade it against a cost test, and mechanize the half you can make false-positive-free. The rest stays discipline, on purpose. That loop is not a proposal. It is the thing that, run for a few months, produced four essays, each one a different output of the same step:

A vocabulary that kept drifting became an enforced terminology check (Rails, Not Rules).
Two artifacts that were supposed to agree, and silently didn't, became a bidirectional coupling check (Couple Both Ways).
The question of which lessons earn a check became a cost test (Gates Earned From Failure).
And the realization that a large fraction of these checks govern the process rather than the code became the compiler essay I just cited.

Read that way, the four are not separate ideas. They are four turns of one crank. Compound engineering and I agree on the crank. We disagree on what comes out the end: its version produces more text for the agent to read, mine produces a script that makes the text unnecessary.

Read the transcripts, not just the diffs

The "observe" step has a wrinkle worth its own paragraph, because it is where my version of the loop gets most of its material. Git history tells you what shipped. It is a poor record of why, and it can make deliberate work look accidental and accidental work look deliberate. The session transcripts record the second thing: the reasoning, the dead ends, the moment a convention first cracked. I mine them later like a dated archive, which only works because a transcript is immutable. An idea you didn't act on is dormant in there, not lost.

That habit also keeps me honest, which I learned the embarrassing way. While drafting an earlier essay, a research note of mine attributed a tidy phrase to a named expert. It was load-bearing for a whole section. When I went to verify it against her actual work before publishing, the phrase appeared nowhere she had ever written. I had fabricated it, or my notes had, and the prose felt completely authoritative while being false. The live check caught what my own confident memory did not. That is the same lesson as the terminology gate, turned on myself: the durable, re-runnable verification holds; the remembered version drifts, even when the rememberer is me.

Encode the rule, not the state

Granting that a lesson should become a check, there is a craft to making one that survives. The sharpest framing I have for it came out of a session where I asked the agent, half-seriously, what software could borrow from how DNA encodes a body: "DNA manages to express these body-build instructions in [a] very succinct yet effective way, e.g. each cell lies within 5 cells from a blood vessel." The useful answer was the opposite of compression. The rule it landed on was to "store the generative rule plus a gate that re-derives the invariant, never the enumerated current state."

DNA never stores "every cell sits within five cells of a vessel." That is an emergent property. It encodes a local rule, roughly "a starved cell releases a signal that grows a vessel toward it," and the global property falls out. The software analog is direct. Every place an instruction file enumerates the current state of the world, a count, a list of components, the set of valid destinations, is a place that will rot the moment the world changes and nobody updates the prose. Encode the generating rule and let a check regenerate the fact on demand, and there is nothing to keep in sync. A related move is what I think of as the promoter pattern: "don't express the genome, transcribe the gene when its signal arrives," which in practice means an index that loads the one relevant section only when the task needs it, rather than paying for the whole document every session. I am not alone in converging here. Birgitta Böckeler describes the same shift away from always-on context toward skills where "the agent only loads everything that's in that skill folder when the LLM thinks that this is relevant," and Thoughtworks' latest Technology Radar files "Agent Skills" under feedforward controls for the same reason.

One borrowed instinct is worth refusing. DNA is robust to point mutations because it has no per-cell repair supervisor, so it tolerates redundant, degenerate encodings. A governed agent system wants the opposite. You have a repair agent, so you want fail-closed brittleness: a near-miss should break the build loudly, not quietly still work. Borrowing DNA's fault tolerance would just hide drift behind a green checkmark. Keep the checks brittle.

The half I deliberately didn't automate

A synthesis about mechanizing lessons owes you the place it refused to. The step where a failure gets noticed and a few get promoted into checks is neither automated nor a tidy review pass. There is no batch job that reads a session at close and proposes gates; the judgment happens in real time, in the back-and-forth while the work is underway. Usually I am the one who has to ask "should this become a check?", because the agent does not reliably surface that question on its own; in my experience it is simply not what the model is tuned to reach for. I have looked at automating the front of it, a scanner that reads every transcript and proposes every candidate check, and declined. It would be judgment-heavy and noisy, and a noisy suggestion stream is one you learn to ignore. The cost test that governs the code governs the tooling too: a miss here is cheap and recoverable, so the honest call is to leave the noticing to the live debate and not build the tool. Saying "this is where I chose not to automate" is the anti-hype move the rest of the argument has to earn.

There is a quieter version of the same discipline that I had to be taught mid-session. Having built a wrapper to settle a question once, I caught myself about to re-audit its internals by hand "just to be sure." The correction was crisp: "don't re-orchestrate its steps and don't pre-audit its internals. Verifying loudly fails at runtime if the runner's actually broken; pre-emptive grepping just re-derives what the script exists to settle." If you trust the check, trust it. Re-checking it by hand pays back the exact cost the check was built to eliminate.

What I'm standing on, and what I'm not claiming

None of the pieces here is mine alone, and pretending otherwise would be the tell this audience is right to distrust. The generator/verifier asymmetry is the oldest idea in our tooling: type checkers, linters, continuous integration, and the plain test suite all exist because checking a property is cheaper than producing code guaranteed to have it. The loop is compound engineering's. The front half of it, starting from a written specification, is now a named and tracked technique; spec-driven development has tools like GitHub's Spec Kit and Amazon's Kiro, and Spec Kit even ships a "constitution" of "immutable principles that must always be followed." I would only note that a constitution is still prose the agent must honor, which is precisely the surface I am arguing drifts unless something enforces it. And the cost Thoughtworks names "cognitive debt," the widening gap between humans and the systems agents generate, is the thing this whole practice is trying to pay down. My one narrow claim is the join: that the compound step is cheaper and more durable as a check than as a remembered paragraph, wherever the lesson is regular enough to verify mechanically.

Honest limitations

The deepest limit is that this is not the agent learning anything. It is me externalizing my own attention. The deficit a check repairs is salience, not memory, which is also why I think the practice transfers to human teams who never had a memory problem to begin with. That argument deserves its own essay and is getting one, so I will leave it as a claim here rather than develop it.

Three smaller limits. This is one person on one codebase, where I both own the domain model and write the checks that judge it, so the verifier is never adversarially independent of the thing it verifies. Böckeler puts the same worry one layer up when she notes that "the agent also generated the tests," and a verifier the generator authored is exactly the weak feedback I am warning about. Second, the un-mechanizable half really does stay prose; the argument is "graduate the lessons you can verify cleanly," never "automate everything," and the judgment of which is which is the skill that doesn't reduce to a script. Third, the genre I am positioning against is moving fast, and some of what reads as disagreement today may just be where the practice lands next quarter. Its own "would the system catch this next time?" suggests the distance between us is shorter than the framing implies.

Why this generalizes

Strip away my particular project and the shape is plain. Anthropic's 2026 Agentic Coding Trends Report frames the industry move as a shift "from writing code to orchestrating agents that write code," and orchestration is exactly the context where this arithmetic starts to bite. Any agent working on a governed system accumulates drift faster than a human would, because it has no continuity between sessions, and it will author the check that catches that drift in a few minutes, because writing the check is the kind of small, bounded task it is good at. High drift and nearly-free enforcement is a combination no human team has ever faced, and it changes the arithmetic of the compound step. When a lesson is regular enough to verify, writing it into prose the agent must remember is the more expensive option and the one that fails. So when the loop comes back around to compound, and you are about to add another paragraph to the file the agent reads every morning, ask the better question the guide already buried in its own checklist: would the system catch this automatically next time? If it can, make it. If it can't, that is exactly the prose worth keeping.

This essay was written by directing a coding agent over the project and the four prior essays it synthesizes; I direct and judge, the agent drafts and argues back. The DNA framing and the "encode the rule, not the state" rule came out of one such argument, quoted here close to how it happened.

I build governed agent systems at the intersection of Contact Center software and AI. If that's a problem you're working on, I'm reachable on LinkedIn.

References

Kieran Klaassen and Dan Shipper, "Compound Engineering," Every (accessed 17 Jun 2026). The loop this essay concedes and the compound step it argues against, including "Add new patterns into CLAUDE.md" and "Would the system catch this automatically next time?"
Vasyl Tretiakov, "Rails, Not Rules: Enforcing a Coding Agent's Domain Vocabulary with Checks," vasyltretiakov.dev, 1 Jun 2026 — the terminology check and the 737-violation receipt.
Vasyl Tretiakov, "Couple Both Ways," vasyltretiakov.dev, 9 Jun 2026 — the bidirectional coupling check.
Vasyl Tretiakov, "Gates Earned From Failure," vasyltretiakov.dev, 14 Jun 2026 — the cost test for grading a lesson.
Vasyl Tretiakov, "Compiling the Process," vasyltretiakov.dev, 17 Jun 2026 — the generator/verifier asymmetry, stated in full.
Birgitta Böckeler, "From MCP and Vibe Coding to Harness Engineering," InfoQ podcast, 8 Jun 2026 (accessed 14 Jun 2026). Sources "the agent also generated the tests" and the lazy-loaded-skills observation.
Thoughtworks, "Combat AI cognitive debt — Technology Radar Vol. 34," 15 Apr 2026 (accessed 17 Jun 2026). Source of "cognitive debt" and the feedforward/feedback control framing.
Thoughtworks, "Spec-driven development," Technology Radar (accessed 17 Jun 2026); GitHub Spec Kit's "constitution" of immutable principles. The named prior art for the loop's front half.
Anthropic, "2026 Agentic Coding Trends Report" (accessed 17 Jun 2026). Source of "from writing code to orchestrating agents that write code"; cited for the framing of the shift, not as evidence of results.
Eric Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley, 2003. ISBN 978-0321125217. The ubiquitous-language lineage the terminology check enforces.

Published at vasyltretiakov.dev.

Top comments (1)

ANP2 Network • Jun 22

You split the lessons cleanly into the ones you can verify and the ones that stay prose, and that's the right cut for the endpoints. The case that bites is the middle: a rule that's semantically real but only approximately checkable, where a syntactic gate stands in for the intent. That gate isn't a verifier of the rule, it's a proxy for it, and a proxy the generator commits against becomes a Goodhart target. It learns to satisfy the predicate, not honor the intent, and unlike the drift you started with, that failure looks like success. "Keep the checks brittle" guards a near-miss on a correct predicate; it does nothing when the predicate itself is the approximation. And this is a sharper problem than the generator-authored-verifier worry you raise via Böckeler, because the gameability comes from the proxy gap, not from who wrote the check.

The other thing the check buys is a trust inversion worth naming. A stale paragraph you reread skeptically; a green check you trust, which is the whole point of having it. So a gate still enforcing a convention you've since abandoned fights the new one silently and with authority, and that's harder to catch than prose drift, not easier. "A redundant gate costs only when it runs" undersells the stale gate that runs and is confidently wrong. "If you trust the check, trust it" is right for a check whose rule still holds, and a quiet liability for one whose rule has moved on.

The transcript-mining habit is what closes that gap, if you push it one step further: have the gate carry the failure it was born from, not just the rule. Then when it misfires later you can ask what it originally caught and whether that still applies. A gate without that link is an unexplained deterministic veto. You can't drift from it, but you also can't reason about when to retire it, and a check you can't retire is the prose problem turned inside out.