<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ian Johnson</title>
    <description>The latest articles on DEV Community by Ian Johnson (@tacoda).</description>
    <link>https://dev.to/tacoda</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F171498%2Fb1207a6e-f740-43c4-bb64-c675e3b3ce1d.jpeg</url>
      <title>DEV Community: Ian Johnson</title>
      <link>https://dev.to/tacoda</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tacoda"/>
    <language>en</language>
    <item>
      <title>Harness Debt</title>
      <dc:creator>Ian Johnson</dc:creator>
      <pubDate>Fri, 05 Jun 2026 14:42:47 +0000</pubDate>
      <link>https://dev.to/tacoda/harness-debt-1p88</link>
      <guid>https://dev.to/tacoda/harness-debt-1p88</guid>
      <description>&lt;h1&gt;
  
  
  Harness Debt
&lt;/h1&gt;

&lt;p&gt;I sat down to read our CLAUDE.md end to end last month, the first time I had done it in maybe three months. Two rules contradicted each other. One of them was added in April; the other was added in January. Both reviewers had been right at the time. Neither had noticed the conflict, because nobody reads the whole file when they add a rule. They read the section the rule belongs in, find the right spot, and write.&lt;/p&gt;

&lt;p&gt;The agent had been silently obeying whichever rule appeared first in the file. The other rule was effectively dead. The team had paid the review cost of writing both and the token cost of carrying both, and had gotten the behavior of one.&lt;/p&gt;

&lt;p&gt;That is harness debt. Not bloat, not staleness, not even bad rules. The accumulated cost of a document that grew faster than the discipline of maintaining it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shapes debt takes
&lt;/h2&gt;

&lt;p&gt;The most obvious shape is contradiction. Two rules that point in opposite directions, written months apart, both load-bearing in their original context, neither aware of the other. The agent picks one and the team thinks the other is firing.&lt;/p&gt;

&lt;p&gt;The second shape is orphans. A rule references a file path that no longer exists, a function that was renamed, a hook that was removed, a tool the team stopped using. The rule still reads like advice. The advice is now incoherent. The agent does its best to interpret it, and the result is whatever the agent does when it is improvising against a broken map.&lt;/p&gt;

&lt;p&gt;The third shape is unmotivated rules. Every rule had an incident behind it. The incidents are not in the file. Six months later, no one remembers why "always use the structured logger here" is in the harness, only that it has always been in the harness, and removing it feels risky in a way nobody can articulate. The rule survives on inertia. Half the time it is still right. Half the time the original constraint is gone and the rule is paying a tax for no return.&lt;/p&gt;

&lt;p&gt;The fourth shape is the rule that was right once and is wrong now. The codebase moved; the rule did not. The agent reads the rule, applies it, and produces a diff that fights the current architecture. The reviewer pushes back, the agent insists the harness says to do it this way, and now there is a small fight about a rule that should have been deleted a release ago.&lt;/p&gt;

&lt;p&gt;The debt is the sum of these. It compounds. Each contradiction, each orphan, each unmotivated rule increases the chance that the next person adding a rule will not notice the one they are conflicting with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the debt is invisible
&lt;/h2&gt;

&lt;p&gt;The harness has no failing tests. There is no CI job that flags an orphan reference. There is no metric that says "this rule has not fired in six months." The debt accumulates with no signal, in a document the team trusts, and the trust is what makes it expensive.&lt;/p&gt;

&lt;p&gt;The other reason it stays invisible: each rule was approved on its own merits. The team that approved the rule was not asked "does this conflict with anything in the file." They were asked "is this rule good." The review process for a rule is local; the cost of the rule is global. The mismatch is the debt-producing mechanism.&lt;/p&gt;

&lt;p&gt;The third reason is rotation. The people who added the early rules left the team. The institutional memory of the incidents is gone. The new maintainers read the harness as a finished artifact rather than as a record of decisions, and finished artifacts do not invite questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The audit
&lt;/h2&gt;

&lt;p&gt;Once a quarter, I sit with the harness open and a checklist. The checklist is short and the audit takes about two hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read it top to bottom.&lt;/strong&gt; Not skimming. Reading. The contradictions surface in the second pass; the first pass is just to load the whole thing into your head. Most quarters I find one or two real conflicts and three or four near-conflicts, where two rules do not strictly contradict but pull in opposite enough directions that the agent has to pick.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check every file path, function name, and tool reference.&lt;/strong&gt; A grep against the codebase. The orphans surface immediately. Some of them are easy fixes; the rule still applies, the name just changed. Some of them are signs the rule is obsolete because the thing it referenced no longer exists.&lt;/p&gt;
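
&lt;p&gt;A minimal sketch of the path half of that grep, in Python. It assumes the harness cites paths in backticks and that a cited path contains at least one slash; both are assumptions about the file's conventions, and &lt;code&gt;find_orphans&lt;/code&gt; is a hypothetical helper, not part of any tool. Function names and tool references would need their own check.&lt;/p&gt;

```python
import re
from pathlib import Path

# Matches backticked tokens that look like relative paths, e.g. `src/db/migrate.py`.
# The pattern is an assumption about how a harness file cites paths.
PATH_PATTERN = re.compile(r"`([\w.\-]+(?:/[\w.\-]+)+/?)`")


def find_orphans(harness_text: str, repo_root: Path) -> list[str]:
    """Return cited paths that do not exist under repo_root, in first-seen order."""
    seen: set[str] = set()
    orphans: list[str] = []
    for match in PATH_PATTERN.finditer(harness_text):
        cited = match.group(1)
        if cited in seen:
            continue  # report each orphan once, however often it is cited
        seen.add(cited)
        if not (repo_root / cited).exists():
            orphans.append(cited)
    return orphans
```

&lt;p&gt;Called as &lt;code&gt;find_orphans(Path("CLAUDE.md").read_text(), Path("."))&lt;/code&gt;, it returns the list to triage. Each entry is either a rename to fix or a rule to retire.&lt;/p&gt;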

&lt;p&gt;&lt;strong&gt;For each rule, ask when it last fired.&lt;/strong&gt; The team review notes, the post-mortems, the PR comments where someone cited the rule, the agent's stopping messages. If the rule has not appeared in any of those for six months, it is a candidate. Not a confirmed delete; a candidate. Some rules are quiet because the agent now respects them, and removing them would let the failure mode back in. Others are quiet because the failure mode is gone for other reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For each rule, ask what its origin incident was.&lt;/strong&gt; If no one on the team can answer, the rule is on probation. Either someone reconstructs the reasoning, in which case the reasoning gets a one-line annotation in the rule for the next maintainer, or no one can, in which case the rule should be considered for deletion in the next audit.&lt;/p&gt;

&lt;p&gt;The audit produces a small set of changes: deletions, consolidations, scope adjustments, annotations. The patches land the same week.&lt;/p&gt;

&lt;h2&gt;
  
  
  The annotation rule
&lt;/h2&gt;

&lt;p&gt;The change that pays the most ongoing dividend is the cheapest one: every rule gets a one-line annotation when it is added, saying what incident or class of mistake it prevents.&lt;/p&gt;

&lt;p&gt;Not three paragraphs of reasoning. Not a link to the PR (links rot). One line. "Added after the migration that ran twice." "After the import path mistake in the analytics module." Enough that the next maintainer can ask "is that incident still possible" and get an answer.&lt;/p&gt;

&lt;p&gt;The cost is one line per rule. The benefit is that the next audit can actually answer the origin question for every rule, instead of guessing for half of them. The annotation is the cheapest debt-prevention move in the harness. The cost can be cut further by splitting the reasoning out of the rule and leaving behind a reference that is loaded only on demand. This is how Keystone gets the benefit while paying a minimal cost.&lt;/p&gt;

&lt;p&gt;We added it as a rule about rules. It is the only meta-rule in our CLAUDE.md and it has paid back many times over.&lt;/p&gt;
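
&lt;p&gt;The meta-rule can be checked mechanically once the annotation has a fixed shape. The sketch below assumes a convention that is not from our file: each rule is a &lt;code&gt;- &lt;/code&gt; bullet, and its annotation is the very next line, starting with &lt;code&gt;Why:&lt;/code&gt;. Both the marker and the helper name are hypothetical.&lt;/p&gt;

```python
def unannotated_rules(harness_text: str, marker: str = "Why:") -> list[str]:
    """Return rule bullets that are not immediately followed by an annotation line."""
    lines = harness_text.splitlines()
    missing = []
    for index, line in enumerate(lines):
        if not line.startswith("- "):
            continue  # not a rule bullet under the assumed convention
        # The annotation must be the very next line; the last line has none.
        following = lines[index + 1].strip() if index + 1 != len(lines) else ""
        if not following.startswith(marker):
            missing.append(line[2:].strip())
    return missing
```

&lt;p&gt;An empty result means every rule carries its origin. Anything returned is a rule that the next audit will have to guess about.&lt;/p&gt;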

&lt;h2&gt;
  
  
  The garden
&lt;/h2&gt;

&lt;p&gt;The metaphor I find myself coming back to is gardening, not engineering. The harness is not a system you design once and maintain in place. It is a collection of small living rules, each of which can wither, each of which can crowd the others, each of which needs occasional pruning.&lt;/p&gt;

&lt;p&gt;A garden that is not gardened becomes a thicket. The thicket has plants in it, technically. The plants do not produce.&lt;/p&gt;

&lt;p&gt;The harness without quarterly attention becomes a thicket. The rules are in it, technically. They do not produce the behavior they once did, because contradictions and orphans and unmotivated rules have crowded the signal.&lt;/p&gt;

&lt;p&gt;The discipline is to garden. Not to redesign. Not to migrate. Just to walk the rows, pull the things that should not be there, and feed the things that should.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hardest deletion
&lt;/h2&gt;

&lt;p&gt;The rule I struggle most to delete is the one I added myself, after an incident I remember. The deletion feels like erasing the lesson.&lt;/p&gt;

&lt;p&gt;The framing that helps me: deleting the rule does not erase the lesson. The lesson is in the codebase, in the test suite, in the team's practice. The rule was the scaffolding that taught the lesson; the lesson is now load-bearing on its own. The scaffolding can come down.&lt;/p&gt;

&lt;p&gt;The check I run: if I deleted this rule today, would the next pull request reintroduce the failure mode? If the answer is "no, because the test covers it" or "no, because the lint rule catches it" or "no, because the team would catch it in review," the rule is scaffolding and can go. If the answer is "yes, the failure would come back," the rule is load-bearing and stays.&lt;/p&gt;

&lt;p&gt;Most rules I am attached to fail the test. The lesson stuck; the rule did not need to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keystone has an audit wheel
&lt;/h2&gt;

&lt;p&gt;The audit I described is manual. Two hours a quarter, a checklist, a person reading the file end to end. The discipline is real and the discipline is fragile: the quarter I skip is the quarter the contradictions land.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tacoda.dev/keystone/" rel="noopener noreferrer"&gt;Keystone&lt;/a&gt; solves this by giving the harness an audit wheel. The pruning pass runs on a cadence, not on my memory. Rules that no longer apply get archived; the reasoning behind each rule lives next to the rule, so the audit answers the origin question without guessing.&lt;/p&gt;

&lt;p&gt;The mechanism is forgetting as a first-class operation. The agent does not just accumulate guides; it can drop them. A stale guide moves to an archive with its archival reason recorded. A contradictory pair surfaces during the pass and one of them is retired with a note pointing at the survivor.&lt;/p&gt;

&lt;p&gt;The agent forgets the way a gardener prunes: deliberately, with a reason, and with the cuttings kept somewhere in case the question comes back.&lt;/p&gt;

&lt;p&gt;This is the move I had been doing by hand. Keystone gives it a shape and a cadence. The harness stays small because the wheel turns; the wheel turns because it lives in the tool, not in my willpower.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would tell someone facing a long CLAUDE.md
&lt;/h2&gt;

&lt;p&gt;Open the file. Read it once, end to end, without editing. Notice what surprises you. The surprises are the debt.&lt;/p&gt;

&lt;p&gt;Make a list of the contradictions, the orphans, the rules whose origin you cannot reconstruct, the rules that fire on patterns the agent does not produce. The list is your audit's output before you have written a single change.&lt;/p&gt;

&lt;p&gt;Then patch ten rules. Not all of them. The first ten. The marginal token savings will be noticeable; the marginal coherence improvement will be larger. The agent will behave better the next week, in ways you can attribute to specific deletions.&lt;/p&gt;

&lt;p&gt;Schedule the next audit on your calendar before you close the file. The debt comes back. The discipline is to come back too.&lt;/p&gt;

&lt;p&gt;The harness is a document. Documents rot. The team that gardens the harness has an agent that gets better over time. The team that does not has an agent that gets worse.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Org rules and project rules need different homes</title>
      <dc:creator>Ian Johnson</dc:creator>
      <pubDate>Thu, 04 Jun 2026 19:09:51 +0000</pubDate>
      <link>https://dev.to/tacoda/org-rules-and-project-rules-need-different-homes-4n0h</link>
      <guid>https://dev.to/tacoda/org-rules-and-project-rules-need-different-homes-4n0h</guid>
      <description>&lt;p&gt;I have three repos that all want the same TODO hygiene rules. The first one got them after a review where I caught a &lt;code&gt;TODO: fix this&lt;/code&gt; with no owner. The second one picked them up by copying &lt;code&gt;CLAUDE.md&lt;/code&gt; from the first. The third one is where I noticed I was opening the second project's &lt;code&gt;CLAUDE.md&lt;/code&gt; to copy it a third time.&lt;/p&gt;

&lt;p&gt;That is the exact failure Level 3 is supposed to prevent.&lt;/p&gt;

&lt;p&gt;The taxonomy is easy enough to write down. Level 2 is the project harness; Level 3 is the organization harness. The hard part is the split inside the repo. If both kinds of rule end up in the same files, the boundary collapses every time someone edits anything. Keystone draws the line in the file tree itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The split Keystone draws
&lt;/h2&gt;

&lt;p&gt;A Keystone harness has two halves.&lt;/p&gt;

&lt;p&gt;The project half is &lt;code&gt;harness/corpus/&lt;/code&gt; and &lt;code&gt;harness/guides/&lt;/code&gt;. The team owns it. They edit it freely. It is the reasoning and the rules that describe &lt;em&gt;this codebase&lt;/em&gt;: which test runner, which lint config, which idioms, which domain constraints.&lt;/p&gt;

&lt;p&gt;The org half is &lt;code&gt;harness/policies/&lt;/code&gt;. Each policy is its own namespace, owned by whoever published it. Policies arrive through &lt;code&gt;keystone init --policy &amp;lt;ref&amp;gt;&lt;/code&gt; and update through &lt;code&gt;keystone policy update &amp;lt;name&amp;gt;&lt;/code&gt;. Local edits inside a policy namespace block the next update unless &lt;code&gt;--force&lt;/code&gt; is passed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;harness/
├── corpus/              # project reasoning (Level 2)
├── guides/              # project rules (Level 2)
├── policies/            # Level 3
│   ├── universal/       # ships with the binary
│   └── tacoda/          # installed via --policy
├── sensors/
├── adapters/
├── learning/
└── archive/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things make the boundary stick. First, every policy writes only inside its own subtree; the installer enforces that. Second, the lockfile records per-file hashes, so the update flow can tell whether a policy file has been edited locally. The structure is what keeps a Level 3 rule from quietly becoming a Level 2 rule.&lt;/p&gt;

&lt;p&gt;Sensors deliberately do not live inside a policy. Sensors describe project tooling (lint, type-check, test). A policy can declare &lt;em&gt;what&lt;/em&gt; must be checked; the project decides &lt;em&gt;how&lt;/em&gt;. That is the right cut: principles distribute, commands do not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The example: tacoda-policy
&lt;/h2&gt;

&lt;p&gt;The shape is easier to see in a real one. &lt;a href="https://github.com/tacoda/tacoda-policy" rel="noopener noreferrer"&gt;tacoda-policy&lt;/a&gt; is the example repo I use to test the layer end-to-end. It carries my personal preferences around documentation and TODOs, and it installs as a policy named &lt;code&gt;tacoda&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The repo is small on purpose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tacoda-policy/
├── keystone-policy.yaml
├── README.md
└── policy/
    └── harness/policies/tacoda/
        ├── corpus/
        │   ├── documentation.md
        │   └── todos.md
        └── guides/
            ├── documentation.md
            └── todos.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The manifest is short. Name, version, description:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tacoda&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.1.0&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;Example policy for tacoda projects. Layers on top of the keystone universal&lt;/span&gt;
  &lt;span class="s"&gt;baseline with personal/team preferences around documentation conciseness&lt;/span&gt;
  &lt;span class="s"&gt;and TODO hygiene.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;policy/&lt;/code&gt; directory mirrors how the content lands inside a consumer's harness. The installer rejects anything outside the policy's own namespace (&lt;code&gt;policy/harness/policies/tacoda/&lt;/code&gt;), so this repo cannot accidentally drop files into a consumer's project tree. That guarantee is what makes "trust this policy enough to install it" a smaller ask than "trust this script enough to run it."&lt;/p&gt;

&lt;h2&gt;
  
  
  What &lt;code&gt;--policy&lt;/code&gt; does to your project
&lt;/h2&gt;

&lt;p&gt;Run &lt;code&gt;keystone init&lt;/code&gt; in a fresh repo with the policy flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;keystone init &lt;span class="nt"&gt;--policy&lt;/span&gt; git+https://github.com/tacoda/tacoda-policy.git#v0.1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get the full harness, plus the policy dropped into its namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;harness/policies/tacoda/
├── corpus/
│   ├── documentation.md
│   └── todos.md
└── guides/
    ├── documentation.md
    └── todos.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same shape as the project layer, scoped to &lt;code&gt;tacoda/&lt;/code&gt;. Two pairs. Each pair is a corpus file (the reasoning) and a guide file (the rules). Guides are ambient: the agent loads them every turn. Corpus is on-demand: the agent reaches a corpus file by following the forward-link in the paired guide when it needs the &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Here is the documentation guide that lands in the project, trimmed to the golden rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## GOLDEN RULES&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**Apply the Hemingway test to every sentence.**&lt;/span&gt; Each sentence must change
  what the reader does. If a sentence can be deleted without losing
  instruction or context, delete it.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Default to no comment.**&lt;/span&gt; Add one only when the &lt;span class="ge"&gt;*why*&lt;/span&gt; is non-obvious.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Lead with the why.**&lt;/span&gt; Documentation explains motivation and non-obvious
  behavior. Well-named identifiers explain &lt;span class="ge"&gt;*what*&lt;/span&gt;; comments explain &lt;span class="ge"&gt;*why*&lt;/span&gt;.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The corpus file holds the long-form reasoning: where the Hemingway test comes from, what gets cut, what stays, why bad docs are worse than no docs. The agent only pays for those tokens when it reaches for them.&lt;/p&gt;

&lt;p&gt;That split matters for the same reason it matters at the project layer. Guides are the cheapest tokens you can spend; corpus is on-demand context that earns its keep when the agent has to make a judgment call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authoring your own policy
&lt;/h2&gt;

&lt;p&gt;A policy is a git repo with two things at its root. The manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-org&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.1.0&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;Org-wide engineering standards: vendor list, license rules, release gates.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a &lt;code&gt;policy/&lt;/code&gt; directory that mirrors the consumer layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-policy-repo/
├── keystone-policy.yaml
├── README.md
└── policy/
    └── harness/policies/my-org/
        ├── corpus/
        │   ├── vendors.md
        │   └── licensing.md
        └── guides/
            ├── vendors.md
            └── licensing.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;name&lt;/code&gt; in the manifest is the namespace. Every file under &lt;code&gt;policy/&lt;/code&gt; must live inside &lt;code&gt;policy/harness/policies/my-org/&lt;/code&gt;; anything outside is rejected at install time. The README and the manifest sit at the repo root; the installer reads the manifest for the name but copies neither file into the consumer's harness.&lt;/p&gt;
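
&lt;p&gt;The namespace rule is small enough to sketch. This is not Keystone's installer code; it is an illustration of the check described above, with a hypothetical helper name. It assumes paths are given relative to the policy repo root with forward slashes, and it treats any &lt;code&gt;..&lt;/code&gt; segment as an escape attempt.&lt;/p&gt;

```python
from pathlib import PurePosixPath


def outside_namespace(policy_name: str, files: list[str]) -> list[str]:
    """Return the files under policy/ that fall outside the policy's own namespace."""
    allowed = PurePosixPath("policy/harness/policies") / policy_name
    rejected = []
    for name in files:
        path = PurePosixPath(name)
        # Only paths under policy/ are install candidates; root files are skipped here.
        if path.parts[:1] != ("policy",):
            continue
        # Reject ".." outright instead of resolving it, then compare path segments.
        if ".." in path.parts or not path.is_relative_to(allowed):
            rejected.append(name)
    return rejected
```

&lt;p&gt;Comparing whole path segments rather than string prefixes matters here: a directory named &lt;code&gt;my-org-extras&lt;/code&gt; does not count as being inside &lt;code&gt;my-org&lt;/code&gt;.&lt;/p&gt;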

&lt;p&gt;A guide file in the policy looks the same as a project guide. Short, ambient, full of rules. The bottom line points at the corpus:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Vendors — rules&lt;/span&gt;

The rules from &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;`corpus/vendors.md`&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;../corpus/vendors.md&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;.
Loaded ambient; enforced by the &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;drift sensor&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;../../../../sensors/drift.md&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;.

&lt;span class="gu"&gt;## GOLDEN RULES&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Approved cloud vendors: AWS, GCP, Cloudflare. Anything else needs review.
&lt;span class="p"&gt;-&lt;/span&gt; Storage of customer data outside the approved list is a release blocker.
&lt;span class="p"&gt;
---
&lt;/span&gt;
Traces to: &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;`corpus/vendors.md`&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;../corpus/vendors.md&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The corpus file is where the reasoning lives: why this vendor list, what the review process looks like, the contractual constraints that drove it. Same shape as project corpus. Same load behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Updates without surprises
&lt;/h2&gt;

&lt;p&gt;The interesting part of Level 3 is not the install. It is what happens the second time.&lt;/p&gt;

&lt;p&gt;Each installed policy is pinned in &lt;code&gt;harness/.keystone.lock&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tacoda&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;source_ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;git+https://github.com/tacoda/tacoda-policy.git#v0.1.0"&lt;/span&gt;
    &lt;span class="na"&gt;resolved_sha&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abc1234..."&lt;/span&gt;
    &lt;span class="na"&gt;policy_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.1.0"&lt;/span&gt;
    &lt;span class="na"&gt;keystone_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.9.0"&lt;/span&gt;
    &lt;span class="na"&gt;files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;harness/policies/tacoda/guides/documentation.md"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sha256:..."&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;harness/policies/tacoda/guides/todos.md"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sha256:..."&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;harness/policies/tacoda/corpus/documentation.md"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sha256:..."&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;harness/policies/tacoda/corpus/todos.md"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sha256:..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bump to a new ref:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;keystone policy update tacoda &lt;span class="s1"&gt;'#v0.2.0'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or re-resolve the current ref (useful when tracking &lt;code&gt;#main&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;keystone policy update tacoda
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hashes are the safety net. If a file inside the policy namespace has been edited locally since install, the update refuses to overwrite it. You either revert the local edit or pass &lt;code&gt;--force&lt;/code&gt; and accept the loss. That refusal is what makes "policies are not project-authored" a real constraint, not a convention.&lt;/p&gt;
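
&lt;p&gt;A sketch of how a per-file hash check like this can work, in Python. It assumes each lockfile entry is a SHA-256 of the file's raw bytes, written as &lt;code&gt;sha256:&lt;/code&gt; followed by the hex digest; the excerpt above elides the values, so the exact encoding Keystone uses is an assumption, and &lt;code&gt;locally_edited&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def locally_edited(repo_root: Path, recorded: dict[str, str]) -> list[str]:
    """Return recorded paths whose current content no longer matches its hash.

    `recorded` maps a repo-relative path to a "sha256:HEX" string, mirroring the
    `files` map in the lockfile excerpt. A missing file counts as edited.
    """
    changed = []
    for rel_path, expected in sorted(recorded.items()):
        target = repo_root / rel_path
        if not target.is_file():
            changed.append(rel_path)
            continue
        actual = "sha256:" + hashlib.sha256(target.read_bytes()).hexdigest()
        if actual != expected:
            changed.append(rel_path)
    return changed
```

&lt;p&gt;An update flow built on this would refuse to proceed while the returned list is non-empty, which is the refusal described above.&lt;/p&gt;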

&lt;p&gt;If a project genuinely needs to soften or extend a policy rule, the right move is at the project layer. Add a project guide under &lt;code&gt;harness/guides/&lt;/code&gt; that traces to the policy file by path and records the deviation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Documentation — project deviation&lt;/span&gt;

This project relaxes
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;`policies/tacoda/guides/documentation.md`&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;../policies/tacoda/guides/documentation.md&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;:
multi-paragraph docstrings are permitted on public API surfaces because the
docs site is generated from them.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The policy stays unmodified. The deviation lives where future readers will look. The lockfile keeps updating cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  When a policy cannot be relaxed
&lt;/h2&gt;

&lt;p&gt;The deviation pattern works for most rules. It does not work for the rules a project is not allowed to relax. A vendor list pinned by legal is not a suggestion. A license rule that gates a release is not a starting point for negotiation.&lt;/p&gt;

&lt;p&gt;That is what &lt;code&gt;strict&lt;/code&gt; is for. The flag lives in the policy manifest and defaults to &lt;code&gt;false&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-org&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.2.0&lt;/span&gt;
&lt;span class="na"&gt;strict&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;Org-wide engineering standards. Strict: project deviations do not apply.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A strict policy changes two things about how its guides reach the agent.&lt;/p&gt;

&lt;p&gt;First, load order. Project guides under &lt;code&gt;harness/guides/&lt;/code&gt; normally load after policy guides, so a project guide can sit on top of a policy rule and the agent reads the project's relaxation last. A strict policy reverses that: its guides load last and have the final word on any rule it covers.&lt;/p&gt;

&lt;p&gt;Second, authority. The drift sensor reads &lt;code&gt;strict: true&lt;/code&gt; and marks the policy's rules as non-overridable in the loaded context. A project-layer deviation guide that traces back to a strict policy file is treated as a violation, not a softening. The agent will not quietly apply it.&lt;/p&gt;
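
&lt;p&gt;The precedence can be pictured as a sort. The sketch below is an illustration, not Keystone's loader: the &lt;code&gt;Guide&lt;/code&gt; type and &lt;code&gt;load_order&lt;/code&gt; function are hypothetical, and it assumes that whatever loads later wins. Non-strict policy guides come first, project guides sit on top of them, and strict policy guides come last.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guide:
    path: str
    layer: str            # "project" or "policy"
    strict: bool = False  # only meaningful for policy guides


def load_order(guides: list[Guide]) -> list[Guide]:
    """Order guides so that later entries take precedence over earlier ones."""
    def rank(guide: Guide) -> int:
        if guide.layer == "policy" and guide.strict:
            return 2  # strict policy guides load last and have the final word
        if guide.layer == "project":
            return 1  # project guides sit on top of non-strict policy guides
        return 0      # non-strict policy guides load first
    # sorted() is stable, so guides keep their relative order within a rank.
    return sorted(guides, key=rank)
```

&lt;p&gt;Ordering alone only decides who speaks last. Marking a strict rule as non-overridable is a separate step, which is why the two changes are listed separately.&lt;/p&gt;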

&lt;p&gt;File-level behavior is unchanged. Local edits to a strict policy file still block updates the same way; &lt;code&gt;--force&lt;/code&gt; still works the same way. Strict is not about file ownership; it is about precedence at agent runtime.&lt;/p&gt;

&lt;p&gt;Use it sparingly. Most rules deserve the deviation door because most rules have edge cases worth respecting. Strict is for the rules whose edge cases have already been argued and lost: compliance, licensing, security posture. The flag exists so the policy author can name those rules as different in kind, not just different in topic.&lt;/p&gt;

&lt;p&gt;A strict policy is a promise to the rest of the org. Defaulting to &lt;code&gt;false&lt;/code&gt; keeps that promise rare enough to mean something.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the split is worth the trouble
&lt;/h2&gt;

&lt;p&gt;The argument against splitting Level 2 and Level 3 is that it is one more concept to hold. The argument for it is that the wrong layer is the wrong owner, and the wrong owner is how rules rot.&lt;/p&gt;

&lt;p&gt;A "use Vitest, not Jest" rule belongs to one project. A "every TODO names an owner" rule belongs to me across every project I touch. If both end up in the same &lt;code&gt;CLAUDE.md&lt;/code&gt;, the next person to edit it has to guess which rules can travel and which cannot. They will guess wrong, and the file will drift further every quarter.&lt;/p&gt;

&lt;p&gt;Keystone makes the answer structural. Project rules go in &lt;code&gt;harness/guides/&lt;/code&gt;. Org rules go in &lt;code&gt;harness/policies/&amp;lt;name&amp;gt;/guides/&lt;/code&gt;. The lockfile keeps the org rules in sync. The namespace rule keeps a policy from escaping its lane. The drift sensor enforces both layers the same way at agent runtime, because the agent does not care where a rule came from; only the humans editing the files do.&lt;/p&gt;

&lt;p&gt;That is the rule worth naming: a harness is not just a place to put rules. It is a place to put rules &lt;em&gt;where the right person can edit them&lt;/em&gt;. Level 3 is the layer where the question of who that person is gets answered.&lt;/p&gt;

&lt;p&gt;The example repo is &lt;a href="https://github.com/tacoda/tacoda-policy" rel="noopener noreferrer"&gt;tacoda-policy&lt;/a&gt;. The harness installer is &lt;a href="https://github.com/tacoda/keystone" rel="noopener noreferrer"&gt;keystone&lt;/a&gt;. The thing I keep returning to: the cost of a good split is one extra directory. The cost of no split is a file you stop trusting.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Multi-agent, One Harness</title>
      <dc:creator>Ian Johnson</dc:creator>
      <pubDate>Thu, 04 Jun 2026 14:08:39 +0000</pubDate>
      <link>https://dev.to/tacoda/multi-agent-one-harness-3bld</link>
      <guid>https://dev.to/tacoda/multi-agent-one-harness-3bld</guid>
      <description>&lt;p&gt;Half the team uses Claude Code. A third uses Cursor. Two engineers swear by Aider. One was running codex out of curiosity for a sprint. Five different agents reading what should have been five copies of the same conventions; predictably, they were not. The Claude Code users had the most complete CLAUDE.md. The Cursor users had &lt;code&gt;.cursor/rules&lt;/code&gt; files that overlapped with it. The Aider users had an &lt;code&gt;.aider.conf.yml&lt;/code&gt; plus a CONVENTIONS.md they had drifted from the others.&lt;/p&gt;

&lt;p&gt;The team was getting drift across tools, not because the engineers disagreed about how to write code, but because each agent had a slightly different copy of the rules and nobody was syncing them.&lt;/p&gt;

&lt;p&gt;The fix was the one nobody wanted to do first: a single source of truth, and a strategy for how each agent's tool-specific config pointed at it. That is the post. What goes in the shared file, what stays per-tool, and where the harness has to fork.&lt;/p&gt;

&lt;h2&gt;
  
  
  What translates cleanly
&lt;/h2&gt;

&lt;p&gt;Most of the harness is plain English. "When you touch the migrations directory, run the test suite against a real Postgres, not the mocked client." That sentence is meaningful to Claude Code, to Cursor, to Aider, to any agent that reads natural language and acts on it. The rule is portable because the rule is about the codebase, not about the tool.&lt;/p&gt;

&lt;p&gt;The convention rules port. The architectural constraints port. The escalation patterns port (mostly; the syntax for stopping varies). The "do not edit &lt;code&gt;legacy/&lt;/code&gt; without coordination" rule ports. The file-path scoping ports, since the directory structure is shared.&lt;/p&gt;

&lt;p&gt;The bulk of a healthy CLAUDE.md is portable. The shape that worked for us was to put the portable rules in a file at the project root that every agent could read, and have each tool's config point at that file.&lt;/p&gt;

&lt;p&gt;For Claude Code, that means a top-level CLAUDE.md. For Cursor, a top-level &lt;code&gt;.cursor/rules/&lt;/code&gt; file that imports the same content, or a symlink, or a build step that generates the tool-specific files from the shared source. For Aider, the setting that normally loads CONVENTIONS.md points at the shared file. For codex, the equivalent config.&lt;/p&gt;

&lt;p&gt;The principle is &lt;strong&gt;shared&lt;/strong&gt; content, &lt;em&gt;per-tool&lt;/em&gt; wiring. Not five copies.&lt;/p&gt;
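
&lt;p&gt;The build-step option can be sketched in a few lines. This is a minimal illustration, not any tool's official mechanism: the shared file name &lt;code&gt;CONVENTIONS.md&lt;/code&gt;, the &lt;code&gt;tool-specific/&lt;/code&gt; directory, and the output paths are all assumptions made for the example, and a real Cursor rule file may need extra metadata that the sketch omits.&lt;/p&gt;

```python
from pathlib import Path

# Hypothetical layout: one shared source plus an optional block per tool.
SHARED = "CONVENTIONS.md"
TARGETS = {
    # tool name: generated file (paths are illustrative assumptions)
    "claude": "CLAUDE.md",
    "cursor": ".cursor/rules/shared.mdc",
}


def build(root: Path) -> list[Path]:
    """Regenerate each tool's file as shared content plus that tool's own block."""
    shared = (root / SHARED).read_text(encoding="utf-8").rstrip()
    written = []
    for tool, target in TARGETS.items():
        extra_path = root / "tool-specific" / f"{tool}.md"
        parts = [shared]
        if extra_path.exists():
            # Tool-only rules (slash commands, hooks) are appended, never merged upstream.
            parts.append(extra_path.read_text(encoding="utf-8").rstrip())
        out = root / target
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
        written.append(out)
    return written
```

&lt;p&gt;The generated files are outputs, not sources. Edits go to the shared file or to a tool's own block, and the next build overwrites anything typed directly into a generated file.&lt;/p&gt;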

&lt;h2&gt;
  
  
  What does not translate
&lt;/h2&gt;

&lt;p&gt;Some rules are tool-specific in a way that does not survive the translation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slash commands and skills.&lt;/strong&gt; Claude Code has slash commands and skills; Cursor has its own custom modes; Aider has macros. The rule "run the security audit via the &lt;code&gt;/security-audit&lt;/code&gt; skill before merging" is a Claude Code-only sentence. The equivalent for the other tools is a different sentence pointing at a different command. These do not share; each tool gets its own block.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool-specific failure modes.&lt;/strong&gt; Each agent has its own bad habits. Claude Code tends toward over-eager refactors. Cursor has a different signature failure on multi-file changes. Aider's failure modes around context loading are different from either. The rules that catch tool-specific failure modes belong in tool-specific configs, because they are not relevant to the other agents and they pay a token cost wherever they live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hook syntax and integration.&lt;/strong&gt; Claude Code's settings.json hooks, Cursor's commands, Aider's pre-commit integration; the rules that reference these point at tool-specific machinery. The rule "the pre-commit hook will reject this; do not bypass it" is portable. The rule "the &lt;code&gt;verify&lt;/code&gt; skill runs the checks defined in &lt;code&gt;.claude/skills/verify/SKILL.md&lt;/code&gt;" is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context window limits.&lt;/strong&gt; Each agent has different context window math. A rule that says "if the file is more than 2,000 lines, read the section you need rather than the whole file" might be load-bearing for one tool and irrelevant for another. The rules about how the agent should manage its own context belong to the tool whose context it is.&lt;/p&gt;

&lt;p&gt;The rule of thumb: anything about the codebase shares; anything about the tool stays.&lt;/p&gt;

&lt;h2&gt;
  
  
  The forking question
&lt;/h2&gt;

&lt;p&gt;The hardest part is not the easy splits. The hardest part is the rules that should mostly share, with a small per-tool variation.&lt;/p&gt;

&lt;p&gt;The escalation ladder is the classic example. "Stop and ask before editing more than three files at once" is a portable rule. The way the agent communicates the stop ("ask via the assistant message" vs. "open a side panel" vs. "post to the chat") is tool-specific. Do you write one rule or two?&lt;/p&gt;

&lt;p&gt;The pattern that has worked: write the portable rule in the shared file, with the imperative in plain English. Each tool's config has a one-line addition that says "when this rule applies, do X in this tool" with the tool-specific mechanism.&lt;/p&gt;

&lt;p&gt;So the shared file has the rule. Each tool's file has the implementation hook. The shared file does not balloon with five tool-specific clauses; each tool's file does not duplicate the shared rule. The fork is at the smallest possible scope.&lt;/p&gt;

&lt;p&gt;This works for maybe 80% of the rules with a tool-specific edge. The remaining 20% are genuinely different enough that they get separate rules in separate files. Pretending they are the same does not save effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  The sync problem
&lt;/h2&gt;

&lt;p&gt;The naive solution is a single shared file plus tool-specific configs that all reference it. The reference can be a symlink, an include directive, a build step, or a convention. Every agent then reads the same conventions on every session. Drift becomes structurally impossible, because there is only one file to edit.&lt;/p&gt;

&lt;p&gt;The simpler structure wins for the same reason simpler systems usually win: the failure modes are obvious. If the shared file is missing, every agent fails the same way and you fix it once. If it is wrong, you fix it once and every agent improves on the next session.&lt;/p&gt;

&lt;h2&gt;
  
  
  The team practice
&lt;/h2&gt;

&lt;p&gt;The team practice that holds the structure together is one rule: harness changes that are portable land in the shared file. Harness changes that are tool-specific land in the tool-specific file.&lt;/p&gt;

&lt;p&gt;The PR review enforces this. Someone adding a tool-specific clause to the shared file gets pushed back to move it. Someone adding a portable clause to a tool-specific file gets pushed back to move it the other way. The discipline is small but constant.&lt;/p&gt;

&lt;p&gt;The second practice: when a new tool joins the team, the engineer onboarding it writes the tool-specific config and the team reviews it for what should be moved into the shared file. The review surfaces conventions that had been implicit and tool-specific without anyone noticing.&lt;/p&gt;

&lt;p&gt;The third practice: drift audits. Once a quarter, someone reads all the tool-specific configs and checks for portable content that has snuck in. The audit takes thirty minutes and usually surfaces two or three rules that should be moved.&lt;/p&gt;
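
&lt;p&gt;Part of that audit can be mechanized. The sketch below flags lines in a tool-specific config that also appear in the shared file. It is a rough aid under stated assumptions: the function names are hypothetical, and it only catches near-verbatim duplication. Portable content that exists only in a tool file still needs a human read.&lt;/p&gt;

```python
def normalize(line: str) -> str:
    """Lowercase and strip list markers and spacing so near-identical lines compare equal."""
    return " ".join(line.strip().lstrip("-*# ").lower().split())


def duplicated_rules(shared_text: str, tool_text: str, min_words: int = 4) -> list[str]:
    """Return lines of a tool-specific config that also appear in the shared file."""
    shared = {normalize(line) for line in shared_text.splitlines()}
    hits = []
    for line in tool_text.splitlines():
        key = normalize(line)
        # Skip blanks and very short lines, which would match by accident.
        if len(key.split()) >= min_words and key in shared:
            hits.append(line.strip())
    return hits
```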

&lt;h2&gt;
  
  
  Where the harness has to fork
&lt;/h2&gt;

&lt;p&gt;There are cases where the harness genuinely has to fork, because the agents' capabilities differ. The right rule for an agent that can read a lot of context at once is not the right rule for an agent that can read very little.&lt;/p&gt;

&lt;p&gt;Claude Code with the 1M-context model has a different harness than Claude Code with a smaller model would, if I were running both. The rule "read the full file before editing" is fine in the first case and unworkable in the second.&lt;/p&gt;

&lt;p&gt;If your team is running agents with very different capabilities, the harness fork is real and you cannot abstract it away. The shared file is what they have in common; the tool-specific files do real work, not just integration glue.&lt;/p&gt;

&lt;p&gt;The lesson: aim for shared, accept the fork when it is real, do not pretend a fork is shared.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the shared file should not be
&lt;/h2&gt;

&lt;p&gt;The shared file is not a dumping ground for everything anyone wanted in the harness. It is the rules that genuinely apply to any agent that touches this codebase. The bar is higher, not lower, because the file is being read by every agent on the team.&lt;/p&gt;

&lt;p&gt;I have seen teams put their entire harness in a "shared" file and then complain that the agents behave inconsistently. The agents are reading the same file; the file is not designed for what it is being asked to do. Half the rules are tool-specific and the agents that do not need them are paying the token cost and the attention cost on every session.&lt;/p&gt;

&lt;p&gt;The shared file is small and dense. Each tool's file is larger and has the tool-specific work. That inversion of the typical "shared = bigger" assumption is what makes the structure stable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keystone is agent-agnostic by design
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.tacoda.dev/keystone/" rel="noopener noreferrer"&gt;Keystone&lt;/a&gt; is the scaffolder I built around this split. The harness is one directory of markdown; the agent-specific wiring is one small file per tool.&lt;/p&gt;

&lt;p&gt;The harness lives at the project root. Four layers: corpus, guides, sensors, and flywheels. None of it knows which agent will read it. That is the design. The harness is agent-agnostic at the ground floor, not patched in after the fact.&lt;/p&gt;

&lt;p&gt;The adapter pattern handles the rest. Each agent gets a small directory under &lt;code&gt;harness/adapters/&amp;lt;agent&amp;gt;/&lt;/code&gt; with three files: &lt;code&gt;lifecycle.md&lt;/code&gt;, &lt;code&gt;sensors.md&lt;/code&gt;, and &lt;code&gt;activation.md&lt;/code&gt;. The activation file is what the agent reads first: &lt;code&gt;CLAUDE.md&lt;/code&gt; for Claude Code, &lt;code&gt;AGENTS.md&lt;/code&gt; for Codex CLI, &lt;code&gt;.cursor/rules/000-harness.mdc&lt;/code&gt; for Cursor. Its job is to point the agent at the shared harness by convention.&lt;/p&gt;
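
&lt;p&gt;A layout like this is easy to check mechanically. The following is a minimal sketch, not Keystone's own code: it assumes the &lt;code&gt;harness/adapters/&lt;/code&gt; layout exactly as described above, and the function name is hypothetical.&lt;/p&gt;

```python
from pathlib import Path

# File names as described in the post; the checker itself is hypothetical.
REQUIRED = ("lifecycle.md", "sensors.md", "activation.md")


def missing_adapter_files(harness_root: Path) -> dict[str, list[str]]:
    """Map each adapter directory name to the required files it lacks."""
    adapters = harness_root / "adapters"
    report = {}
    for agent_dir in sorted(p for p in adapters.iterdir() if p.is_dir()):
        missing = [name for name in REQUIRED if not (agent_dir / name).is_file()]
        if missing:
            report[agent_dir.name] = missing
    return report
```

&lt;p&gt;An empty result means every adapter directory carries all three files. Anything else names the stub that still needs filling in.&lt;/p&gt;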

&lt;p&gt;Adapters for Claude Code, Codex CLI, and pi.dev are real. The rest ship with feature gaps: a minimal lifecycle file and a working menu, enough to start. Gaps are handled with warning messages during &lt;code&gt;init&lt;/code&gt;, with suggested actions to resolve the problem. There is also a generic fallthrough. It is a contract for what an adapter must do plus a default instruction to read the shared corpus. A new agent joining the team is not a rewrite; it is filling in the stub.&lt;/p&gt;

&lt;p&gt;The win is sync. There is one directory every agent points at, by convention. If a rule changes, it changes in one place. The drift problem cannot come back because there is nowhere for it to come back from.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would tell someone running multiple agents
&lt;/h2&gt;

&lt;p&gt;Audit your tool-specific configs for content that is actually about the codebase, not the tool. Move the codebase content into one shared file. Have each tool's config reference it.&lt;/p&gt;

&lt;p&gt;Establish the discipline that portable content lands in the shared file and tool-specific content lands in the tool-specific file. Enforce it in PR review. Audit quarterly.&lt;/p&gt;

&lt;p&gt;Accept that some rules genuinely have to fork. The forks are not failures of the architecture; they are the architecture telling you the truth about your tools' differences. The teams that try to flatten the forks end up with rules that fit no agent well.&lt;/p&gt;

&lt;p&gt;The unified harness is one file the team agrees on, plus the tool-specific glue. Not one file pretending to do everything. The discipline is to keep the shared file shared and the specific files specific.&lt;/p&gt;

&lt;p&gt;The drift goes away when the source of truth is singular. The rest is wiring.&lt;/p&gt;

&lt;p&gt;And if you're interested, try out &lt;a href="https://www.tacoda.dev/keystone/" rel="noopener noreferrer"&gt;Keystone&lt;/a&gt;! 👋&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The Harness Has a Token Budget</title>
      <dc:creator>Ian Johnson</dc:creator>
      <pubDate>Wed, 03 Jun 2026 17:39:11 +0000</pubDate>
      <link>https://dev.to/tacoda/the-harness-has-a-token-budget-gcn</link>
      <guid>https://dev.to/tacoda/the-harness-has-a-token-budget-gcn</guid>
      <description>&lt;p&gt;Our project CLAUDE.md crossed 4,000 tokens last quarter, and the agent started missing rules it had been respecting for months. Not the rules at the top. The rules in the middle. The ones buried under three other sections of guidance, the ones the agent could reach if it stayed focused but did not reach reliably when it was deep in a task.&lt;/p&gt;

&lt;p&gt;The story would have ended with "make the file shorter" except for the part where most of those rules had earned their place. Each one had an incident behind it. Each one had a reviewer who would defend it. The harness was not bloated; the harness was honest. And the agent was still missing rules.&lt;/p&gt;

&lt;p&gt;The conclusion that took me too long to reach: the harness has a token budget, and we had blown it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost nobody puts on a P&amp;amp;L
&lt;/h2&gt;

&lt;p&gt;Every CLAUDE.md line costs context. The agent reads the harness on every session, in every window, before any of your task-specific context arrives. A 4,000-token CLAUDE.md is 4,000 tokens the agent is not using for the file you actually asked it to edit.&lt;/p&gt;

&lt;p&gt;The first cost is straightforward: less window for the work. On a small task it does not matter. On a task that spans three files and a long log, it absolutely matters; the agent runs out of room and starts dropping context, which is exactly when you want every rule to be at the top of attention.&lt;/p&gt;

&lt;p&gt;The second cost is more subtle. The agent's attention is not uniform across the input. Rules at the top of the harness fire more reliably than rules buried halfway down. The middle of the file is the worst place for a load-bearing rule, and it is also where rules accumulate by default, because every new rule gets appended to the section where it logically belongs and slowly pushes the older rules further into the middle.&lt;/p&gt;

&lt;p&gt;The third cost is human. The team reads the harness when onboarding, when investigating an agent failure, when proposing a change to a rule. A long file is a file nobody reads top to bottom. The institutional memory of why each rule exists evaporates, and the next maintainer is reading three years of accumulated decisions with no map.&lt;/p&gt;

&lt;h2&gt;
  
  
  The exchange rate
&lt;/h2&gt;

&lt;p&gt;The way I think about it: each rule trades tokens for prevented mistakes. The exchange rate is the rule's worth.&lt;/p&gt;

&lt;p&gt;A rule that prevents one production incident per quarter, and costs 30 tokens to encode, is overwhelmingly worth keeping. A rule that fires on a pattern the agent never produces, and costs 150 tokens because it explains the reasoning in three paragraphs, is a bad trade.&lt;/p&gt;

&lt;p&gt;The hard part is that most rules look fine in isolation. The trade only becomes visible at the budget level. You cannot evaluate a single rule and decide whether to keep it; you have to evaluate it against the total cost of the file it lives in and ask whether you would trade it for one or two rules at the top of attention that you currently do not have.&lt;/p&gt;

&lt;p&gt;The discipline is to think in budget terms rather than line terms. Not "is this rule useful." But "would I rather have this rule, or the 200 tokens of attention it costs me on every session."&lt;/p&gt;
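
&lt;p&gt;The exchange rate can be made concrete with a small calculation. This is a sketch under assumptions: the rule names and incident counts are invented for illustration, the token costs echo the 30 and 150 used above, and the result is a crude ranking rather than a measurement.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    tokens: int                   # cost paid on every session
    incidents_per_quarter: float  # mistakes the rule is believed to prevent


def exchange_rate(rule: Rule) -> float:
    """Tokens spent per prevented incident; infinite when the rule never fires."""
    if rule.incidents_per_quarter == 0:
        return float("inf")
    return rule.tokens / rule.incidents_per_quarter


def worst_first(rules: list[Rule]) -> list[Rule]:
    """Order rules from worst trade to best, so the audit starts at the top."""
    return sorted(rules, key=exchange_rate, reverse=True)


# Illustrative data only.
rules = [
    Rule("real-postgres-for-migrations", tokens=30, incidents_per_quarter=1),
    Rule("three-paragraph-explainer", tokens=150, incidents_per_quarter=0),
]
for rule in worst_first(rules):
    print(rule.name, exchange_rate(rule))
```

&lt;p&gt;The rule that never fires sorts to the top regardless of how reasonable it looks on its own, which is the budget-level view the single-rule view hides.&lt;/p&gt;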

&lt;h2&gt;
  
  
  The four moves
&lt;/h2&gt;

&lt;p&gt;Once you accept the budget, the harness has four moves available, and they are all underused.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consolidate.&lt;/strong&gt; Two rules that say almost the same thing become one rule, half the length, applied where both used to apply. The consolidation usually surfaces redundancy nobody had noticed. The agent reads one rule instead of two and applies it more consistently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compress.&lt;/strong&gt; A rule that explains the reasoning in three paragraphs becomes a rule that states the rule in two lines. The reasoning moves to a comment in the codebase, or to the PR that introduced the rule, or to a feature doc loaded only when needed. The agent does not need the reasoning to apply the rule; the agent needs the rule.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scope down.&lt;/strong&gt; A rule at the project root that only applies to one module moves to that module's CLAUDE.md. The token cost is paid when the agent is touching that module, and zero when the agent is anywhere else. The scoping piece covers this; the token-budget framing is what makes the move feel urgent rather than tidy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delete.&lt;/strong&gt; A rule whose origin nobody remembers, whose pattern the agent does not produce, and whose removal does not change behavior, is gone. The rule-lifecycle piece covers the discipline. The token budget is what makes the deletion mandatory rather than optional.&lt;/p&gt;

&lt;p&gt;The four moves are not refactors. They are accounting.&lt;/p&gt;

&lt;h2&gt;
  
  
  When inline beats CLAUDE.md
&lt;/h2&gt;

&lt;p&gt;The harness is not the only place a rule can live. The codebase has tools for encoding constraints that pay no token cost on every session.&lt;/p&gt;

&lt;p&gt;A rule about how to format dates can be a lint check. A rule about what file to import from can be enforced by &lt;code&gt;eslint-plugin-import&lt;/code&gt; or its equivalent. A rule about which directory a new component goes in can be enforced by a generator command. None of these rules need to live in CLAUDE.md, because the constraint is already encoded somewhere the agent will respect.&lt;/p&gt;

&lt;p&gt;The check I run when adding a rule: is there a place in the codebase where this rule could be enforced mechanically? If yes, the rule goes there. The harness only carries rules that have to be linguistic, the ones that depend on judgment or context the linter cannot see.&lt;/p&gt;

&lt;p&gt;The harness pays its token cost in attention. Lint pays its cost in CI minutes. Lint is cheaper on a per-rule basis, and the cost does not compound against every other rule.&lt;/p&gt;

&lt;h2&gt;
  
  
  The audit that gives the budget back
&lt;/h2&gt;

&lt;p&gt;Once a quarter, I sit with the harness open and a question in mind: where am I overpaying?&lt;/p&gt;

&lt;p&gt;The rules that bleed the most tokens are the ones with the most explanation. Five lines of context, two lines of rule, three lines of edge cases. The agent does not need the context; the team did, once, when the rule was being negotiated. The context is dead weight now. Compress it.&lt;/p&gt;

&lt;p&gt;The second-worst offenders are the rules that should be scoped. Half the rules in a typical project root CLAUDE.md apply to a single module. Each one pays its token cost on every session, including the ones where the agent is nowhere near that module. Scope them down. Once they are scoped, the agent's attention on the API task no longer pays for the frontend rule.&lt;/p&gt;

&lt;p&gt;The third batch is the simplest: rules whose origin nobody remembers and whose pattern the agent does not produce. They go.&lt;/p&gt;

&lt;p&gt;Each quarter the audit returns somewhere between 800 and 1,500 tokens to the budget. The agent gets better. The team can read the file again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape of a budgeted harness
&lt;/h2&gt;

&lt;p&gt;Imagine you had to fit the entire harness on a single screen of your editor. No scrolling. What rules would survive?&lt;/p&gt;

&lt;p&gt;The exercise sounds artificial. It is not. The rules that would survive are the rules that are doing the most work. The rules that would get cut are the rules that look useful but are mostly paying for themselves through inertia.&lt;/p&gt;

&lt;p&gt;You will not actually shrink the harness to one screen. The point of the exercise is to find out which rules pass the test and which ones are surviving because nobody has asked them to justify themselves.&lt;/p&gt;

&lt;p&gt;The budgeted harness is not the smallest harness. It is the one where every rule has earned its tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keystone budgets context by design
&lt;/h2&gt;

&lt;p&gt;Keystone is the open source tool I built for harness engineering, and the budget framing is wired into how it loads context. The design starts from one decision: context is a scarce resource, and the harness has to declare what gets to live in it at each moment.&lt;/p&gt;

&lt;p&gt;Keystone splits the harness into three tiers, each with its own budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always-on guides.&lt;/strong&gt; Short rule files that load on every session. The whole guide layer on a fresh 0.7.0 install runs about 28K tokens across 53 files. That is the ambient cost: the rules the agent has to see every time it picks up work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-demand corpus.&lt;/strong&gt; The reasoning behind a rule, the long examples, the historical context: none of that lives in the always-on layer. It sits in corpus files the agent loads only when a guide forward-links into one. About 1-3K tokens per file. Most tasks never touch them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transient sensors.&lt;/strong&gt; Lint output, test output, audit output. These appear in context only during verification and then drop out. About 1-5K per verify cycle, carrying signal without leaving residue.&lt;/p&gt;

&lt;p&gt;The math works the way the budget framing predicts. A fresh install costs roughly 14% of a 200K window, or 3% of a 1M window. A worst-case full audit, pulling in every corpus file and every sensor at once, lands around 45-55K tokens. The agent still has most of its context free for the work itself.&lt;/p&gt;
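
&lt;p&gt;The percentages follow directly from the roughly 28K always-on figure. A quick check, taking the windows as exactly 200,000 and 1,000,000 tokens (the function name is just for illustration):&lt;/p&gt;

```python
ALWAYS_ON_TOKENS = 28_000  # approximate guide-layer size quoted above


def share_of_window(tokens: int, window: int) -> float:
    """Percentage of a context window consumed by a fixed token cost."""
    return round(100 * tokens / window, 1)


print(share_of_window(ALWAYS_ON_TOKENS, 200_000))    # 14.0
print(share_of_window(ALWAYS_ON_TOKENS, 1_000_000))  # 2.8, which rounds to the 3% above
```

&lt;p&gt;The same function puts the upper end of the worst-case audit, 55K tokens, at 27.5% of a 200K window.&lt;/p&gt;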

&lt;p&gt;The tiering forces the same question the four moves ask: &lt;em&gt;what tier does this rule belong in&lt;/em&gt;? A rule that fires on every task earns its always-on slot. A rule that explains a decision once a quarter belongs in corpus, behind a link. A check that runs only during verify lives in sensors and leaves no residue.&lt;/p&gt;

&lt;p&gt;Keystone lives at &lt;a href="https://www.tacoda.dev/keystone/" rel="noopener noreferrer"&gt;www.tacoda.dev/keystone&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The principle is the one this whole post argues for: every token in the always-on layer is paid every session, so the always-on layer has to be small, deliberate, and audited. The tool is the discipline made mechanical.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would tell someone running the harness today
&lt;/h2&gt;

&lt;p&gt;Open your CLAUDE.md. Read it top to bottom, the way the agent reads it, on the assumption that attention falls off the further down you get.&lt;/p&gt;

&lt;p&gt;Find the rule with the worst exchange rate. Probably it explains itself in five lines, lives in the project root when it should live in a module, and prevents a pattern the agent has not produced in six months. Compress it, move it, or delete it.&lt;/p&gt;

&lt;p&gt;Do that for ten rules. Measure the token savings. The savings will surprise you. The agent's behavior will not get worse. In most cases, it will get better, because the rules that survived will fire more reliably than they did when buried.&lt;/p&gt;
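
&lt;p&gt;For the measuring step, a rough per-section estimate is usually enough to see where the tokens go. The sketch below assumes about four characters per token, which is a coarse rule of thumb and not a real tokenizer; for exact numbers, use the tokenizer that matches your model. It also assumes the file is organized under &lt;code&gt;##&lt;/code&gt; headings.&lt;/p&gt;

```python
CHARS_PER_TOKEN = 4  # crude assumption, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN


def section_costs(markdown: str) -> dict[str, int]:
    """Estimate tokens per '## ' section; text before the first heading is 'preamble'."""
    sections = {}
    title, body = "preamble", []
    for line in markdown.splitlines():
        if line.startswith("## "):
            # Close out the previous section before starting a new one.
            sections[title] = estimate_tokens("\n".join(body))
            title, body = line[3:].strip(), []
        body.append(line)
    sections[title] = estimate_tokens("\n".join(body))
    return sections
```

&lt;p&gt;Run it before and after the ten-rule pass and compare the totals. Sections that share a title overwrite each other in this sketch, so unique headings give the cleanest numbers.&lt;/p&gt;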

&lt;p&gt;The harness is not free. The token budget is real. The discipline is to spend it on what matters.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Pair Programming with an Agent</title>
      <dc:creator>Ian Johnson</dc:creator>
      <pubDate>Wed, 03 Jun 2026 14:13:07 +0000</pubDate>
      <link>https://dev.to/tacoda/pair-programming-with-an-agent-1109</link>
      <guid>https://dev.to/tacoda/pair-programming-with-an-agent-1109</guid>
      <description>&lt;p&gt;Pair programming, in its original sense, was a practice between two humans: one driving, one navigating, both engaged with the same problem at the same time. The driver typed. The navigator watched, asked questions, caught mistakes, thought about what came next. They switched roles. The discipline was that neither could check out — disengagement defeated the point.&lt;/p&gt;

&lt;p&gt;The practice translates better to working with an agent than people usually expect, and the translation reveals what is actually at stake.&lt;/p&gt;

&lt;h2&gt;
  
  
  The risk
&lt;/h2&gt;

&lt;p&gt;Working with a stronger pair partner has always carried a specific risk: you stop thinking. The other person is faster, more confident, more likely to know the answer. The path of least resistance is to type what they say and stop forming your own opinions. The result is short-term productivity and long-term skill atrophy. You finish the task. You also finish the task without learning anything, and the next time you face a similar problem alone, you discover that you cannot.&lt;/p&gt;

&lt;p&gt;This was already a known dynamic with junior-senior pairing. With agents, the asymmetry is sharper. The agent is faster than any pair partner you have worked with. It is more confident, by virtue of being a system that does not hedge. It will produce a full answer to almost any question you ask. The temptation to type what it says and move on is correspondingly stronger.&lt;/p&gt;

&lt;p&gt;The risk is not that the agent will produce bad code. The risk is that you will accept good code without understanding it, and that this becomes the default, and that one year in you are shipping work whose internals you cannot explain.&lt;/p&gt;

&lt;p&gt;The practice of pair programming was, in part, a defense against exactly this dynamic. The defense translates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Driver and navigator
&lt;/h2&gt;

&lt;p&gt;In a human pair, the driver types and the navigator watches. The roles are deliberate. The navigator's job is to ask questions, hold the larger picture, catch the mistake the driver is too close to see.&lt;/p&gt;

&lt;p&gt;In a session with an agent, the agent is the driver. It is the one producing code. You are the navigator. This is the right framing because it puts the cognitive work in the right place: the navigator is not idle. The navigator is the one thinking about whether the direction is correct, whether the change fits the system, whether the test actually tests the thing.&lt;/p&gt;

&lt;p&gt;The failure mode is to be a driver without a navigator. You sit in the seat the agent should have, asking it to type out what you tell it to. This makes you the bottleneck on what you already know how to do, while the agent produces less interesting output than it could. Worse, it leaves no one in the navigator seat. The car is moving and nobody is looking at the road.&lt;/p&gt;

&lt;p&gt;The better arrangement is to let the agent drive — let it propose, generate, sketch — while you hold the navigator's responsibilities. Where are we going? Does this turn make sense? Are we still solving the problem we started with? The agent will not ask these questions. They are yours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Asking why
&lt;/h2&gt;

&lt;p&gt;A navigator's most valuable habit is asking why. Why are we using this library? Why is this function in this file? Why does the test check that and not the other thing? The questions are not adversarial. They are how the navigator stays oriented.&lt;/p&gt;

&lt;p&gt;With an agent, this habit is preserved by asking the agent its reasoning before accepting its output. Not "do this" and then read the diff; "what would you do here, and why?" and then read the answer. The why is what reveals whether the agent has understood the problem or pattern-matched to a superficial similarity.&lt;/p&gt;

&lt;p&gt;This adds friction. That is the point. The friction is the discipline. An agent that has to explain its choices is an agent whose choices are easier to evaluate. A you who is asking why is a you who is staying engaged.&lt;/p&gt;

&lt;p&gt;If you stop asking why, you stop being the navigator. The session continues, but the work is happening without supervision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading the diff before accepting it
&lt;/h2&gt;

&lt;p&gt;The single most important habit, and the one most often skipped: reading the diff before accepting it.&lt;/p&gt;

&lt;p&gt;This sounds obvious. In practice, the rhythm of working with an agent makes it easy to skip. The agent produces a change. You glance at it. It looks right. You accept it. The next prompt is already forming. The diff lives in the codebase now, and you have not actually understood what it did.&lt;/p&gt;

&lt;p&gt;Multiply this by ten interactions a day and you are shipping a meaningful fraction of your output without having read it. The atrophy is silent at first. The bugs that surface from this are bugs you cannot diagnose, because you do not have the model in your head of how the code is supposed to work.&lt;/p&gt;

&lt;p&gt;The fix is the boring one: read the diff. Every time. Even when you trust the agent. Especially when you trust the agent. If the diff is too large to read, the right move is to ask for a smaller diff, not to accept the large one unread.&lt;/p&gt;

&lt;p&gt;The reading is not just verification. It is the part of the session where you build the mental model. Skipping it gives up the learning. The learning is most of the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  The role you can never give up
&lt;/h2&gt;

&lt;p&gt;There is one job the agent cannot do for you, and that is hold the original intent.&lt;/p&gt;

&lt;p&gt;You started a session with a problem to solve. Three exchanges in, the conversation has wandered. The agent has proposed a refactor, suggested a related improvement, identified an adjacent bug. Each suggestion is reasonable in isolation. Together, they have moved you ten degrees off course. The PR you end up with solves a different problem from the one you started with, and the original problem is still there.&lt;/p&gt;

&lt;p&gt;The navigator's last job is to keep the destination in view. The agent will optimize the local terrain. You hold the map. If the suggestions are pulling away from the goal, you bring them back. If a tangent is more interesting than the original task, you make that a deliberate decision rather than a drift.&lt;/p&gt;

&lt;p&gt;This is the role that does not delegate. Whatever else the agent takes off your plate, the holding-of-intent stays. A pair programming session where neither party remembers the goal is a session that produces motion without progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  Staying sharp
&lt;/h2&gt;

&lt;p&gt;The practices add up to something simple: stay engaged. Read the diff. Ask why. Hold the goal. Switch off when you notice yourself accepting work you do not understand.&lt;/p&gt;

&lt;p&gt;These are the same practices that made human pair programming valuable. They were never about typing the right thing. They were about the discipline of two minds on the same problem, neither one allowed to coast. With an agent, you are one of the two minds, and the other one will never get tired. The risk is that you will, and that you will start coasting without noticing.&lt;/p&gt;

&lt;p&gt;The hedge against this is the practice. Pair programming was always a discipline. With an agent in the other seat, it is the only discipline that keeps you in shape.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How to Write a Ticket an Agent Can Act On</title>
      <dc:creator>Ian Johnson</dc:creator>
      <pubDate>Tue, 02 Jun 2026 18:21:14 +0000</pubDate>
      <link>https://dev.to/tacoda/how-to-write-a-ticket-an-agent-can-act-on-52ic</link>
      <guid>https://dev.to/tacoda/how-to-write-a-ticket-an-agent-can-act-on-52ic</guid>
      <description>&lt;p&gt;The ticket said "fix the login bug." The agent shipped a 400-line PR that touched the password reset flow, the session middleware, and a feature flag I had not heard of. None of the changes were technically wrong. None of them fixed the bug.&lt;/p&gt;

&lt;p&gt;The bug was a typo in an error message.&lt;/p&gt;

&lt;p&gt;The agent did exactly what the ticket asked. The ticket asked for a vague thing, so the agent did a vague thing. The fault was not in the model. The fault was in the spec.&lt;/p&gt;

&lt;p&gt;This is the failure mode most teams write off as "the agent got confused." The agent did not get confused. The agent treated the ticket as a contract and delivered against it. The contract was bad.&lt;/p&gt;

&lt;h2&gt;
  
  
  The literal-minded reader
&lt;/h2&gt;

&lt;p&gt;A human picking up that ticket would have done one of two things. They would have asked which login bug, or they would have looked at recent bug reports and figured it out from context. Either way, the gap between "fix the login bug" and the actual fix gets closed by judgment, intuition, and the cheap cost of a Slack message.&lt;/p&gt;

&lt;p&gt;The agent has none of that. It has the ticket, the codebase, and whatever context the harness loads. It does not know which login bug. It does not have a relationship with the bug reporter. It does not pattern-match to "the typo Sarah mentioned in standup." It reads what is on the page, makes a reasonable interpretation, and goes.&lt;/p&gt;

&lt;p&gt;If the ticket leaves room for interpretation, the agent will interpret. If the interpretation is wrong, the PR will be wrong. The cost of a vague ticket has gone up, because the consumer is no longer a human who can ask.&lt;/p&gt;

&lt;p&gt;The shape of the fix is to write tickets that do not require interpretation.&lt;/p&gt;

&lt;h2&gt;
  
  
  A ticket as a contract
&lt;/h2&gt;

&lt;p&gt;A contract has a few properties. It names what is in scope. It names what is out of scope. It names the test for success. It names the parties and the constraints.&lt;/p&gt;

&lt;p&gt;A good agent ticket has the same shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What changes.&lt;/strong&gt; Name the behavior that will be different after the ticket is done. Not "fix the bug." "The error message on the login page should read 'Invalid email or password' instead of 'Invalid emial or password.'" The change is observable. The agent can verify it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What stays the same.&lt;/strong&gt; Name the things you do not want touched. The password reset flow, the session middleware, the feature flag system. The agent reads the negatives and stays out. Without them, the agent treats every related file as fair game, and a 10-line fix becomes a 400-line PR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The acceptance criteria.&lt;/strong&gt; A list of conditions, each verifiable. The error message displays correctly. The existing tests still pass. A new test asserts on the corrected string. No other files are modified. The agent can read the list, do the work, and check each item against the diff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The shape of the test.&lt;/strong&gt; This is the part most teams skip. If the test were to exist, what would it assert? Write the answer into the ticket, and the agent does not have to guess at the verification strategy. The strategy is in the ticket.&lt;/p&gt;

&lt;p&gt;A ticket with those four pieces is a ticket the agent can finish without asking. The interpretation has been moved from the agent's reasoning to the ticket itself, which is where it belongs.&lt;/p&gt;
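
&lt;p&gt;As a rough sketch, the four pieces can be modeled as a small data structure that reports which parts of the contract are still empty. The class and field names here are hypothetical, chosen only to mirror the headings above.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class AgentTicket:
    """Hypothetical ticket shape; the field names are illustrative only."""

    what_changes: str
    stays_the_same: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)
    test_shape: str = ""

    def missing_pieces(self) -> list[str]:
        """Return the names of the contract pieces that are still empty."""
        pieces = {
            "what changes": self.what_changes.strip(),
            "what stays the same": self.stays_the_same,
            "acceptance criteria": self.acceptance_criteria,
            "shape of the test": self.test_shape.strip(),
        }
        return [name for name, value in pieces.items() if not value]


vague = AgentTicket(what_changes="fix the login bug")

precise = AgentTicket(
    what_changes=(
        "The login error reads 'Invalid email or password' "
        "instead of 'Invalid emial or password'."
    ),
    stays_the_same=["password reset flow", "session middleware", "feature flag system"],
    acceptance_criteria=[
        "The error message displays correctly.",
        "The existing tests still pass.",
        "A new test asserts on the corrected string.",
        "No other files are modified.",
    ],
    test_shape="Assert the rendered error equals 'Invalid email or password'.",
)

print(vague.missing_pieces())    # three pieces are missing
print(precise.missing_pieces())  # []
```

&lt;p&gt;A check like this only catches pieces that are absent. It cannot tell that "fix the login bug" is vague, so the judgment about precision still belongs to the writer.&lt;/p&gt;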

&lt;h2&gt;
  
  
  What out-of-scope buys you
&lt;/h2&gt;

&lt;p&gt;The single highest-leverage section of an agent ticket is the explicit out-of-scope list. It is also the section that feels most awkward to write, because it reads like distrust.&lt;/p&gt;

&lt;p&gt;It is not distrust. It is precision.&lt;/p&gt;

&lt;p&gt;When the ticket does not say "leave the session middleware alone," the agent does a quick scan, notices a thing it could improve, and improves it. From the agent's point of view, this is helpful. From the reviewer's point of view, this is a 300-line diff hiding a 10-line bug fix.&lt;/p&gt;

&lt;p&gt;The out-of-scope section is the cheapest way to bound the blast radius. Five lines in the ticket prevent two hours in review. The line "do not modify files outside &lt;code&gt;src/auth/login.tsx&lt;/code&gt;" carries more weight than any rule in CLAUDE.md, because it is task-specific. The harness cannot guess where the boundary lives. The ticket can.&lt;/p&gt;
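
&lt;p&gt;A boundary like that can also be checked mechanically once the PR comes back. The sketch below is one possible check, not something the post's workflow prescribes: it assumes you already have the list of changed paths (for example from &lt;code&gt;git diff --name-only&lt;/code&gt;), and the function name is made up for illustration.&lt;/p&gt;

```python
from pathlib import PurePosixPath


def out_of_scope_changes(changed: list[str], allowed: list[str]) -> list[str]:
    """Return changed paths that are neither an allowed file nor inside an allowed directory."""
    allowed_paths = [PurePosixPath(a) for a in allowed]
    violations = []
    for name in changed:
        path = PurePosixPath(name)
        # In scope if the path is an allowed entry itself or sits under one.
        in_scope = any(path == a or a in path.parents for a in allowed_paths)
        if not in_scope:
            violations.append(name)
    return violations


# Assumed input: the paths a PR touched.
changed = ["src/auth/login.tsx", "src/middleware/session.ts"]
print(out_of_scope_changes(changed, allowed=["src/auth/login.tsx"]))
# ['src/middleware/session.ts']
```

&lt;p&gt;Comparing path components rather than string prefixes keeps a directory such as &lt;code&gt;src/authz&lt;/code&gt; from slipping through an allowance for &lt;code&gt;src/auth&lt;/code&gt;.&lt;/p&gt;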

&lt;p&gt;I have started writing out-of-scope sections for every non-trivial agent task, even when what is in scope is obvious. The discipline forces me to think about what the change is &lt;em&gt;not&lt;/em&gt;, which often surfaces hidden assumptions before the agent does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting from the test
&lt;/h2&gt;

&lt;p&gt;The strongest tickets I write start from the test, not the feature.&lt;/p&gt;

&lt;p&gt;A test names the behavior in the language of cause and effect. "When the user submits the login form with an invalid email, the page displays 'Invalid email or password' in a red alert below the form." There is no room for interpretation. The test fails or passes. The agent's job is to make it pass.&lt;/p&gt;
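
&lt;p&gt;To show how directly that sentence maps onto an assertion, here is a minimal sketch. The real code in this story is a TSX component; &lt;code&gt;login_error_message&lt;/code&gt; is a hypothetical Python stand-in for it, and the red alert styling is left out.&lt;/p&gt;

```python
def login_error_message(email_ok: bool, password_ok: bool) -> str:
    """Hypothetical stand-in for the code under test, not the real login form."""
    if email_ok and password_ok:
        return ""
    return "Invalid email or password"


def test_invalid_email_shows_corrected_message() -> None:
    # Cause: the form is submitted with an invalid email.
    message = login_error_message(email_ok=False, password_ok=True)
    # Effect: the corrected string is displayed, with no typo.
    assert message == "Invalid email or password"


test_invalid_email_shows_corrected_message()
```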

&lt;p&gt;If I cannot articulate the test, I do not yet know what I am asking for. That is the signal to stop and figure it out before I write the ticket. Writing a vague ticket and hoping the agent will refine it is how the 400-line PRs happen.&lt;/p&gt;

&lt;p&gt;Starting from the test also has a second benefit. The test usually points at exactly one file, or at most a handful. The scope falls out of the test naturally. The agent does not need to be told to stay in scope, because the scope is already defined by what the test exercises.&lt;/p&gt;

&lt;h2&gt;
  
  
  The file paths matter
&lt;/h2&gt;

&lt;p&gt;Every ticket I write now includes the file paths. Not "the login form." &lt;code&gt;src/auth/login.tsx&lt;/code&gt;. Not "the session logic." &lt;code&gt;src/middleware/session.ts:34-62&lt;/code&gt;. The agent uses the path as both an instruction and a boundary. The work happens here. Nowhere else.&lt;/p&gt;

&lt;p&gt;The cost of including the path is twenty seconds of poking around the repo before opening the ticket. The cost of not including it is fifteen minutes of the agent searching, plus the risk that the agent finds the wrong file and works there for the rest of the session.&lt;/p&gt;

&lt;p&gt;This is one of those discoveries that feels embarrassingly obvious in retrospect. The harness can encode a thousand conventions. None of them are as load-bearing as the line that says "edit this file."&lt;/p&gt;

&lt;h2&gt;
  
  
  The format I have settled on
&lt;/h2&gt;

&lt;p&gt;The shape that works for me, after a few months of iteration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gs"&gt;**What changes**&lt;/span&gt;
[One or two sentences naming the observable behavior change.]

&lt;span class="gs"&gt;**Where**&lt;/span&gt;
[File paths. Specific. Including line numbers when relevant.]

&lt;span class="gs"&gt;**Out of scope**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [File or area not to touch]
&lt;span class="p"&gt;-&lt;/span&gt; [Refactor not to do]
&lt;span class="p"&gt;-&lt;/span&gt; [Side improvement not to ship]

&lt;span class="gs"&gt;**Acceptance criteria**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [Verifiable condition]
&lt;span class="p"&gt;-&lt;/span&gt; [Verifiable condition]
&lt;span class="p"&gt;-&lt;/span&gt; [Test that should exist]

&lt;span class="gs"&gt;**Notes**&lt;/span&gt;
[Anything the agent needs that is not in the codebase: a link to a related ticket, a constraint from a stakeholder, a deadline that affects the approach.]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a heavy template. It is five sections and most are short. A typical ticket runs twenty lines. A complex one runs forty. Either is shorter than the PR review that follows a vague spec.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this changes about how I work
&lt;/h2&gt;

&lt;p&gt;Writing tickets like this is more work up front. The work pays for itself the first time, every time. The PR comes back in scope. The review takes ten minutes, not an hour. The change is the change I asked for, not the change I have to redirect three times to get to.&lt;/p&gt;

&lt;p&gt;The discipline is also good for me. Half the time, writing the spec surfaces that I do not actually know what I want. That is information. It is cheaper to learn it before the agent has spent forty minutes producing the wrong PR than after.&lt;/p&gt;

&lt;p&gt;The agent is a literal-minded reader. The fix is not to make the agent less literal. The fix is to write tickets that are precise enough to deserve a literal reading. That is harder for the writer. It is faster for everyone else.&lt;/p&gt;

&lt;p&gt;The contract is the unlock.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>productivity</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Keystone: Project Harness Installer</title>
      <dc:creator>Ian Johnson</dc:creator>
      <pubDate>Tue, 02 Jun 2026 15:06:42 +0000</pubDate>
      <link>https://dev.to/tacoda/keystone-project-harness-installer-2b7m</link>
      <guid>https://dev.to/tacoda/keystone-project-harness-installer-2b7m</guid>
      <description>&lt;p&gt;I built a new tool. It is a Level 2 project harness installer (blurs into Level 3). The page is at &lt;a href="https://www.tacoda.dev/keystone/" rel="noopener noreferrer"&gt;tacoda.dev/keystone&lt;/a&gt;. The repo is at &lt;a href="https://github.com/tacoda/keystone" rel="noopener noreferrer"&gt;github.com/tacoda/keystone&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;keystone init&lt;/code&gt; drops a markdown harness into your repo. Five layers: principles, idioms, domain, state, process. Plus an activation file the agent reads first: &lt;code&gt;CLAUDE.md&lt;/code&gt; for Claude Code, &lt;code&gt;AGENTS.md&lt;/code&gt; for Codex CLI, &lt;code&gt;.cursor/rules/000-harness.mdc&lt;/code&gt; for Cursor.&lt;/p&gt;

&lt;p&gt;After init the binary is done. The harness is yours. It is plain markdown checked into your repo, edited by you, owned by your team. Keystone is not a runtime dependency: it is a scaffolder that hands you the artifacts and walks away.&lt;/p&gt;

&lt;p&gt;Install is one binary. Either:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;tacoda/tap/keystone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or the curl bootstrap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/tacoda/keystone/main/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;keystone init&lt;/code&gt; runs once per project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The harness as a corpus, not a config
&lt;/h2&gt;

&lt;p&gt;A config file tells the agent what to do. A corpus tells the agent what kind of code this is. The difference is the difference between a checklist and a vocabulary.&lt;/p&gt;

&lt;p&gt;The five layers answer five different questions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Principles.&lt;/strong&gt; What does good engineering look like, regardless of stack?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idioms.&lt;/strong&gt; How does &lt;em&gt;this&lt;/em&gt; stack express those principles?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain.&lt;/strong&gt; What business rules constrain this codebase?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State.&lt;/strong&gt; What is true about the codebase right now?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process.&lt;/strong&gt; What happens at each phase of the workflow?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Principles travel between projects. Idioms scope to your stack. Domain and state get written by you and the agent together during a bootstrap pass. Process holds the six phases: spec, planning, implementation, verification, review, release.&lt;/p&gt;

&lt;p&gt;The agent reads what is relevant. The harness does not load all at once: process files load when you enter that phase; idiom files load lazily by region. That is the discipline. A monolithic &lt;code&gt;CLAUDE.md&lt;/code&gt; paid for in tokens every turn is the failure mode this is built to avoid.&lt;/p&gt;
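
&lt;p&gt;The idea of loading by phase can be pictured with a few lines of code. This is only an illustration of the selection rule: Keystone does not run after init, and the &lt;code&gt;harness/process/&lt;/code&gt; layout and the function name below are assumptions, not the files it actually writes.&lt;/p&gt;

```python
# The six phases named in the post.
PHASES = ("spec", "planning", "implementation", "verification", "review", "release")


def files_for_turn(phase: str) -> list[str]:
    """Return the files relevant to one phase under a hypothetical layout."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    # Assumed layout: one activation file plus one process file per phase.
    return ["CLAUDE.md", f"harness/process/{phase}.md"]


print(files_for_turn("review"))
# ['CLAUDE.md', 'harness/process/review.md']
```

&lt;p&gt;The point of the sketch is the shape of the cost: each turn pays for the activation file and one process file, not for all six.&lt;/p&gt;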

&lt;h2&gt;
  
  
  Two flywheels keep the harness alive
&lt;/h2&gt;

&lt;p&gt;The harness is not static. It grows.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Learning&lt;/strong&gt; flywheel runs after merge. The agent names a pattern from what changed and proposes an addition to &lt;code&gt;learning/&lt;/code&gt;. You decide whether to promote it into the corpus.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Pruning&lt;/strong&gt; flywheel runs on audit. Rules that no longer apply get archived, with the reasoning preserved. The corpus does not calcify into a list of rules that made sense six months ago and now just get in the way.&lt;/p&gt;

&lt;p&gt;These are the moves I have been writing about for months. Keystone is what they look like in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does not do
&lt;/h2&gt;

&lt;p&gt;It does not run. There is no daemon, no hook, no background process. The binary scaffolds and exits.&lt;/p&gt;

&lt;p&gt;It does not pick your agent for you. Detection reads marker files already in your repo (&lt;code&gt;CLAUDE.md&lt;/code&gt;, &lt;code&gt;AGENTS.md&lt;/code&gt;, &lt;code&gt;.cursor/&lt;/code&gt;, and so on). You can also choose the agent interactively or by passing &lt;code&gt;--agent&lt;/code&gt;. The harness is agent-agnostic; only the activation file and the adapter directory change per agent.&lt;/p&gt;
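
&lt;p&gt;Marker-file detection is simple enough to sketch. The code below is not Keystone's implementation; it only shows the general technique, using the three marker-to-agent pairs named in this post. The function name and the return format are assumptions.&lt;/p&gt;

```python
from pathlib import Path

# Marker-to-agent pairs taken from the post; the real tool may know more markers.
MARKERS = {
    "CLAUDE.md": "Claude Code",
    "AGENTS.md": "Codex CLI",
    ".cursor": "Cursor",
}


def detect_agents(repo: Path) -> list[str]:
    """Return the agents whose marker file or directory exists at the repo root."""
    return [agent for marker, agent in MARKERS.items() if (repo / marker).exists()]


# Usage: run against the current directory.
print(detect_agents(Path(".")))
```

&lt;p&gt;Returning a list rather than a single answer leaves room for a repo that carries more than one marker, which is where an interactive choice or an explicit flag becomes useful.&lt;/p&gt;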

&lt;p&gt;It does not lock you in. The binary is gone after init. The harness is markdown. If keystone disappeared tomorrow your project would not notice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the adapters stand
&lt;/h2&gt;

&lt;p&gt;The Claude Code, Codex CLI, and pi.dev adapters are real. Cursor, Aider, Copilot CLI, Continue, Cline, and Goose ship as stubs: a minimal lifecycle file and a working menu, enough to start. If you write a real adapter for an agent currently listed as a stub, contribute it back. The shape is documented in &lt;code&gt;harness/adapters/README.md&lt;/code&gt;: a &lt;code&gt;lifecycle.md&lt;/code&gt;, &lt;code&gt;sensors.md&lt;/code&gt;, and &lt;code&gt;activation.md&lt;/code&gt; per agent, plus a target directory under &lt;code&gt;targets/&amp;lt;agent&amp;gt;/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The page is at &lt;a href="https://www.tacoda.dev/keystone/" rel="noopener noreferrer"&gt;tacoda.dev/keystone&lt;/a&gt;. The repo is at &lt;a href="https://github.com/tacoda/keystone" rel="noopener noreferrer"&gt;github.com/tacoda/keystone&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The thing I keep returning to: the harness is the artifact, not the tool. Keystone is the cheapest way I know to put a good harness in place. After that, the project owns it.&lt;/p&gt;

</description>
      <category>tooling</category>
      <category>ai</category>
      <category>agents</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Onions and Filters</title>
      <dc:creator>Ian Johnson</dc:creator>
      <pubDate>Tue, 02 Jun 2026 14:07:57 +0000</pubDate>
      <link>https://dev.to/tacoda/onions-and-filters-355d</link>
      <guid>https://dev.to/tacoda/onions-and-filters-355d</guid>
      <description>&lt;p&gt;When I started building my first harness around a coding agent, I did not picture an onion. I pictured a constraint system.&lt;/p&gt;

&lt;p&gt;The LLM, on its own, can do almost anything. It can write code, hallucinate APIs, edit the wrong file, run a shell command in a directory it should not be in, decide a test failure is acceptable and move on. The space of things it might do on any given turn is enormous. The job of the harness, the way I thought about it, was to shrink that space.&lt;/p&gt;

&lt;p&gt;That is the framing I learned in math. You start with a set; you add conditions; the set gets smaller until what remains is what you actually want. Each filter is a predicate. The harness is a sequence of filters.&lt;/p&gt;

&lt;p&gt;Birgitta Böckeler has done more than anyone I have seen to popularize the term &lt;em&gt;harness engineering&lt;/em&gt; and to give it a working vocabulary. Her mental model is the onion: the agent at the center, with concentric layers of harness around it, each one closer or further from the model's reasoning loop. Tools, context, hooks, sandboxes, observability. The model sits in the middle and reaches out through the layers; the layers stand between the model and the world.&lt;/p&gt;

&lt;p&gt;I like the onion. It is a good teaching shape because it gives you somewhere to point. "That belongs in this layer, not that one." "This hook fires here." But the onion is not the model I reach for when I am deciding what to add.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two pictures of the same machine
&lt;/h2&gt;

&lt;p&gt;The onion and the filter system describe the same artifact from opposite directions.&lt;/p&gt;

&lt;p&gt;The onion looks outward from the model. Each layer is something you wrap around the agent so it can do its job: a tool surface, a system prompt, a sandbox, a review step. The vocabulary is additive. You &lt;em&gt;give&lt;/em&gt; the agent tools. You &lt;em&gt;provide&lt;/em&gt; context. You &lt;em&gt;equip&lt;/em&gt; it.&lt;/p&gt;

&lt;p&gt;The filter looks inward at the output. Each filter is a predicate over what the agent could do but should not. You take the full space of agent behaviors and shave off everything that does not survive the predicate. A sandbox is a filter: only filesystem operations inside this directory survive. A type check is a filter: only diffs that compile survive. A required review is a filter: only changes the reviewer agrees with survive.&lt;/p&gt;

&lt;p&gt;Same machine, opposite framing. The onion is about what you add. The filter is about what you remove.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the framing matters
&lt;/h2&gt;

&lt;p&gt;The two framings are equivalent in what they can describe, but they push you toward different decisions.&lt;/p&gt;

&lt;p&gt;When I think in onion layers, I think about &lt;em&gt;capabilities&lt;/em&gt;. "Does the agent have the tool it needs? Does it have the context to use it well?" The instinct is to add. Another tool, another hook, another piece of context loaded into the prompt.&lt;/p&gt;

&lt;p&gt;When I think in filters, I think about &lt;em&gt;constraints&lt;/em&gt;. "What is the agent currently allowed to do that it should not be? What slips through?" The instinct is to remove. A tighter sandbox, a stricter pre-commit, a smaller allowlist, a narrower file scope on the rule that keeps firing in the wrong place.&lt;/p&gt;

&lt;p&gt;Both instincts are right at different moments. An under-capable agent needs more tools. An over-capable agent needs more filters. Most harnesses I have seen fail in the second direction, not the first: the agent has plenty of capability and not enough constraint, and the symptom is that it does the wrong thing confidently.&lt;/p&gt;

&lt;h2&gt;
  
  
  The math habit
&lt;/h2&gt;

&lt;p&gt;The reason I default to the filter framing is that I learned to think this way before I ever wrote a harness.&lt;/p&gt;

&lt;p&gt;In a math problem, you do not list the elements of the answer set. You write the set. Then you add conditions. The integers, then "positive", then "less than 100", then "prime". The answer is whatever survives every condition.&lt;/p&gt;
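
&lt;p&gt;The same move, written out as a small sketch. A finite range stands in for the integers, and each condition is a predicate; the variable names are illustrative.&lt;/p&gt;

```python
from math import isqrt


def is_prime(n: int) -> bool:
    """Trial division, which is fine for small n."""
    if 2 > n:
        return False
    return all(n % d != 0 for d in range(2, isqrt(n) + 1))


# A finite stand-in for "the integers".
candidates = range(-200, 201)

# Each condition is a predicate; the answer is whatever survives all of them.
conditions = [
    lambda n: n > 0,    # positive
    lambda n: 100 > n,  # less than 100
    is_prime,           # prime
]

survivors = [n for n in candidates if all(cond(n) for cond in conditions)]
print(len(survivors))  # 25
print(survivors[:5])   # [2, 3, 5, 7, 11]
```

&lt;p&gt;Nothing in the code lists the primes. It names the starting set and the conditions, and the answer falls out.&lt;/p&gt;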

&lt;p&gt;A harness is the same shape. The set is "things this agent could output". The conditions are the filters you stack. The answer, on any given turn, is whatever output survives all of them. You do not enumerate good behavior; you constrain bad behavior out.&lt;/p&gt;

&lt;p&gt;This is also why a harness composed of filters is easier to reason about than one composed of layers. Filters compose by conjunction: each one is independently true or false of a given output. If something bad gets through, you ask which filter failed, or which filter was missing. If something good gets blocked, you ask which filter is too strict. The debugging move is local.&lt;/p&gt;
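
&lt;p&gt;Here is a minimal sketch of that local debugging move, using the three filters mentioned earlier (sandbox, type check, required review). The &lt;code&gt;Action&lt;/code&gt; type, the sandbox root, and the function name are all hypothetical; a real harness would get these facts from its own tooling.&lt;/p&gt;

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable


@dataclass(frozen=True)
class Action:
    """Hypothetical description of one change the agent wants to make."""

    path: str
    compiles: bool
    reviewer_approved: bool


SANDBOX = PurePosixPath("src/auth")  # assumed sandbox root

# Each filter is an independent predicate over an action.
FILTERS: dict[str, Callable[[Action], bool]] = {
    "sandbox": lambda a: SANDBOX in PurePosixPath(a.path).parents,
    "type check": lambda a: a.compiles,
    "review": lambda a: a.reviewer_approved,
}


def failed_filters(action: Action) -> list[str]:
    """Names of the filters that reject this action; empty means it survives."""
    return [name for name, keep in FILTERS.items() if not keep(action)]


print(failed_filters(Action("src/auth/login.tsx", True, True)))
# []
print(failed_filters(Action("src/middleware/session.ts", True, False)))
# ['sandbox', 'review']
```

&lt;p&gt;Because the filters are combined by conjunction, the list of names is the whole diagnosis: a bad action that returns an empty list means a filter is missing, and a good action that returns a name points at the filter that is too strict.&lt;/p&gt;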

&lt;p&gt;Layers, by contrast, have a position. You have to argue about which layer a new piece of behavior belongs to. Is the validation a tool concern, a sandbox concern, or a review concern? The onion gives you geography, and geography invites turf wars.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the onion still wins
&lt;/h2&gt;

&lt;p&gt;The onion is the better picture when you are explaining the harness to someone who has not built one.&lt;/p&gt;

&lt;p&gt;People understand layers. People understand "the agent sits in the middle and the world is outside". The onion makes it obvious that the agent does not see the world directly, that everything it does passes through something you control. That intuition is load-bearing for anyone new to the idea, and the filter picture is too abstract to carry it.&lt;/p&gt;

&lt;p&gt;The onion is also better when you are drawing the architecture: where does this hook fire, what does it see, who reads its output. Position matters there. The onion gives you a way to put it on a whiteboard.&lt;/p&gt;

&lt;p&gt;But once the architecture is in place and you are tuning the harness day to day, the question is almost always a filter question. What is the agent doing that I do not want? Which constraint, added or tightened, stops it? The work is subtractive even when the picture is additive.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rule is constrain, not add
&lt;/h2&gt;

&lt;p&gt;The unlock for me was realizing that almost every harness decision I make is a decision about constraints, even when it does not look like one.&lt;/p&gt;

&lt;p&gt;A new tool seems additive, but the interesting design question is what the tool &lt;em&gt;cannot&lt;/em&gt; do, what arguments it refuses, what state it will not touch. A new piece of context seems additive, but the question is what behavior it filters out by being present. A new agent in the review pipeline is a filter on the diffs that reach me.&lt;/p&gt;

&lt;p&gt;The onion tells you where the piece goes. The filter tells you what the piece is &lt;em&gt;for&lt;/em&gt;. I want both pictures available, but when I am writing a rule or adding a hook, the filter is the one I am holding in my head.&lt;/p&gt;

&lt;p&gt;The harness does not give the agent power. The agent already has the power. The harness decides what the agent is allowed to do with it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I think this is a conversation the field needs to have. This is my first proposal - not as an inflexible paradigm, but as a first attempt to move this conversation forward. I would love to hear any feedback you have!</title>
      <dc:creator>Ian Johnson</dc:creator>
      <pubDate>Mon, 01 Jun 2026 14:39:37 +0000</pubDate>
      <link>https://dev.to/tacoda/i-think-this-is-a-conversation-the-field-needs-to-have-this-is-my-first-proposal-not-as-3a1m</link>
      <guid>https://dev.to/tacoda/i-think-this-is-a-conversation-the-field-needs-to-have-this-is-my-first-proposal-not-as-3a1m</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/tacoda/the-harness-stack-4a7d" class="crayons-story__hidden-navigation-link"&gt;The Harness Stack&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/tacoda" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F171498%2Fb1207a6e-f740-43c4-bb64-c675e3b3ce1d.jpeg" alt="tacoda profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/tacoda" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Ian Johnson
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Ian Johnson
                
              
              &lt;div id="story-author-preview-content-3791230" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/tacoda" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F171498%2Fb1207a6e-f740-43c4-bb64-c675e3b3ce1d.jpeg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Ian Johnson&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/tacoda/the-harness-stack-4a7d" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jun 1&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/tacoda/the-harness-stack-4a7d" id="article-link-3791230"&gt;
          The Harness Stack
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag crayons-tag--filled  " href="/t/discuss"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;discuss&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/agents"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;agents&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/webdev"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;webdev&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/programming"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;programming&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/tacoda/the-harness-stack-4a7d" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/fire-f60e7a582391810302117f987b22a8ef04a2fe0df7e3258a5f49332df1cec71e.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;4&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/tacoda/the-harness-stack-4a7d#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              21&lt;span class="hidden s:inline"&gt; comments&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            8 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial crayons-icon c-btn__icon"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success crayons-icon c-btn__icon"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>ai</category>
      <category>architecture</category>
      <category>discuss</category>
      <category>llm</category>
    </item>
    <item>
      <title>The Harness Stack</title>
      <dc:creator>Ian Johnson</dc:creator>
      <pubDate>Mon, 01 Jun 2026 14:33:04 +0000</pubDate>
      <link>https://dev.to/tacoda/the-harness-stack-4a7d</link>
      <guid>https://dev.to/tacoda/the-harness-stack-4a7d</guid>
      <description>&lt;p&gt;Ask five developers what an "agent harness" is and you will get five different answers. Some mean the model. Some mean a &lt;code&gt;CLAUDE.md&lt;/code&gt; file. Some mean orchestration infrastructure. Everyone is building something real. But without shared vocabulary, we cannot learn from each other, cannot reason across systems, cannot even agree on where a problem lives when something goes wrong.&lt;/p&gt;

&lt;p&gt;That is where we are with AI agent configuration. The word &lt;em&gt;harness&lt;/em&gt; is everywhere, and it means everything. Which is another way of saying it means nothing precise enough to be useful.&lt;/p&gt;

&lt;p&gt;This is not a minor inconvenience. In a field this young, the words we settle on shape the mental models we build. And mental models shape what we think to build next. Naming things carefully is an act of collective infrastructure.&lt;/p&gt;

&lt;p&gt;This post proposes a taxonomy: &lt;strong&gt;The Harness Stack&lt;/strong&gt;. Five named harnesses, each with a clear scope and responsibility. It is not prescriptive. You do not need all five. It is a shared map, offered as a starting point for a conversation the field needs to have.&lt;/p&gt;




&lt;h2&gt;
  
  
  The harness defined
&lt;/h2&gt;

&lt;p&gt;A harness is the deliberately shaped configuration around an AI coding agent: everything that sits between the raw model and the work it does.&lt;/p&gt;

&lt;p&gt;It spans the tool you chose, the global preferences that travel with you, the project-level scaffolding inside a codebase, the cross-project conventions an organization shares, and the orchestration that coordinates multiple agents at once.&lt;/p&gt;

&lt;p&gt;A harness is not the agent. It is not the code the agent edits. It is the context that decides how the agent behaves when it encounters a task.&lt;/p&gt;




&lt;h2&gt;
  
  
  The five harnesses
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Model Harness
&lt;/h3&gt;

&lt;p&gt;The AI coding tool itself. Claude Code, Cursor, Copilot, Pi, whatever you are running.&lt;/p&gt;

&lt;p&gt;This is the product layer: the capabilities, interfaces, and built-in behaviors the tool ships with. You do not configure the Model Harness. You choose it. And that choice matters more than it might seem, because everything above it is built on assumptions the tool makes about how agents should work, what context they can hold, what hooks they expose.&lt;/p&gt;

&lt;p&gt;The discipline worth cultivating here is loose coupling. Your higher-level configuration should not be written &lt;em&gt;for&lt;/em&gt; a specific tool. It should be written for a &lt;em&gt;class&lt;/em&gt; of tools, one that your current Model Harness happens to belong to. We are not quite at the point where swapping tools is frictionless, but designing toward that portability now is an investment that compounds.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Agent Harness
&lt;/h3&gt;

&lt;p&gt;How the tool is configured globally, across all your work, not just one project.&lt;/p&gt;

&lt;p&gt;This is where memory lives, along with persistent preferences, user-level settings, and the context that travels with you from codebase to codebase. In Claude Code, this is your global &lt;code&gt;CLAUDE.md&lt;/code&gt;. In claude.ai, it is memory and system-level instructions. The Agent Harness answers a deceptively important question: how is this agent configured to behave before it encounters any specific project?&lt;/p&gt;

&lt;p&gt;The distinction between the Model Harness and the Agent Harness is easy to collapse and important to preserve. The tool is what it ships as. The agent is what you have made of it. That gap, between default behavior and deliberately shaped behavior, is where a surprising amount of leverage lives. An agent that understands your preferred coding style, your tolerance for verbosity, your conventions around naming and error handling, arrives at every project already partially oriented. That orientation is the Agent Harness.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Project Harness
&lt;/h3&gt;

&lt;p&gt;The codebase-level scaffolding an agent operates within.&lt;/p&gt;

&lt;p&gt;This is where most developers are actively building right now. It is also where the tooling is most mature. A project harness includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slash commands and MCP servers&lt;/li&gt;
&lt;li&gt;Hook scripts (PreToolUse, PostToolUse, Stop), often matched to specific tools such as Bash&lt;/li&gt;
&lt;li&gt;Subdirectory &lt;code&gt;CLAUDE.md&lt;/code&gt; files scoped to specific modules&lt;/li&gt;
&lt;li&gt;Characterization tests and static analysis configuration&lt;/li&gt;
&lt;li&gt;Skills, sensors, rules, flywheels, and other "code as markdown" artifacts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of the Project Harness as terrain. It shapes what the agent encounters as it moves through your codebase: what guardrails exist, what patterns it is expected to follow, what tools are available and where. A well-designed project harness does not just constrain the agent. It makes the right path the easy path. This is the harness that has had my attention recently.&lt;/p&gt;
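
&lt;p&gt;To make "guardrail" concrete, here is a minimal sketch of a pre-tool hook in Python. It assumes the tool hands the hook a JSON event with &lt;code&gt;tool_name&lt;/code&gt; and &lt;code&gt;tool_input&lt;/code&gt; fields and treats exit code 2 as "block this call"; check your tool's hook documentation before relying on that contract. The deny-list and the &lt;code&gt;decide&lt;/code&gt; and &lt;code&gt;run&lt;/code&gt; helpers are hypothetical.&lt;/p&gt;

```python
import json
from typing import TextIO

# Hypothetical deny-list: command fragments this project never wants run.
BLOCKED_FRAGMENTS = ("rm -rf", "git push --force")

# Assumed exit-code contract: 0 lets the call proceed, 2 blocks it.
ALLOW, BLOCK = 0, 2


def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, message) for one tool-call event."""
    if event.get("tool_name") != "Bash":
        return ALLOW, ""
    command = event.get("tool_input", {}).get("command", "")
    for fragment in BLOCKED_FRAGMENTS:
        if fragment in command:
            return BLOCK, f"Blocked by project harness: '{fragment}' is not allowed."
    return ALLOW, ""


def run(stdin: TextIO, stderr: TextIO) -> int:
    """Read one JSON event from stdin and return the hook's exit code."""
    code, message = decide(json.load(stdin))
    if message:
        # The reason goes to stderr so the agent can see why it was stopped.
        print(message, file=stderr)
    return code
```

&lt;p&gt;Wired up as a script, the last line would be &lt;code&gt;sys.exit(run(sys.stdin, sys.stderr))&lt;/code&gt;. Keeping the decision in a pure function means the guardrail itself can be unit tested like any other code.&lt;/p&gt;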

&lt;p&gt;The open questions here are genuinely interesting. How granular should subdirectory context be before it becomes noise? When does a hook encode wisdom and when does it encode fear? How do you keep a project harness from calcifying, from becoming a set of rules that made sense six months ago and now just get in the way? These are craft questions, and we are only beginning to develop shared answers.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Organization Harness
&lt;/h3&gt;

&lt;p&gt;The cross-project consistency layer. And the most underbuilt harness in the stack.&lt;/p&gt;

&lt;p&gt;If the Project Harness is the terrain of a single project, the Organization Harness is the survey that makes multiple terrains legible to the same agent. Its purpose, at any scale, is to make sure an agent moving from one project to another does not have to relearn the fundamentals. Shared conventions. Common tool configurations. Policies that apply everywhere so they do not have to be restated anywhere.&lt;/p&gt;

&lt;p&gt;The Organization Harness does not require an enterprise. In a monorepo, it might be nothing more than a root-level &lt;code&gt;CLAUDE.md&lt;/code&gt; and a shared lint config. For larger organizations it scales up to approved tool registries, compliance guardrails, and governance policies. But the intent is the same whether you are a solo developer across multiple repos or a platform team serving dozens of product teams.&lt;/p&gt;

&lt;p&gt;Here is the honest state of things: almost nobody is building the Organization Harness deliberately yet. Most teams have it accidentally. A convention that emerged organically. A root &lt;code&gt;CLAUDE.md&lt;/code&gt; someone added and others quietly inherited. That is not nothing, but it is not design.&lt;/p&gt;

&lt;p&gt;Purpose-built tooling for this harness does not really exist yet. But the primitives do, and they are ones developers already know. A version-controlled shared repo can hold your org-level &lt;code&gt;CLAUDE.md&lt;/code&gt;, hook templates, and lint configs. Package managers can distribute them. For teams managing multiple separate repos today, &lt;strong&gt;git submodules&lt;/strong&gt; are an underrated pragmatic option: pull the org configuration into each project as a submodule, update it centrally, and let projects inherit changes on their own schedule.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP servers&lt;/strong&gt; are another workaround worth considering: an internal MCP server can expose org-wide tools, prompts, and resources to any agent that connects, without each project needing to vendor the configuration. It solves the distribution problem in a different way than submodules. It does not solve the harder problems: how an org-level harness gets &lt;em&gt;authored&lt;/em&gt;, how conflicts with project-level configuration get resolved, or how drift gets detected. Those gaps remain wherever the bytes live.&lt;/p&gt;
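
&lt;p&gt;The mechanical half of drift detection can at least start small. A minimal sketch, assuming each project keeps a vendored copy of the org-level files (for example, through a submodule or a package) next to a checkout of the central copy: fingerprint both and report what differs. The directory layout and function names are illustrative, not part of any tool.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def digest(path: Path) -> str:
    """SHA-256 of a file's bytes, used as a cheap content fingerprint."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(central: Path, vendored: Path) -> dict[str, str]:
    """Compare every file in the central copy with the vendored copy.

    Returns a mapping of relative path to "missing" or "modified".
    Files that exist only in the vendored copy are ignored here.
    """
    drift: dict[str, str] = {}
    for source in sorted(central.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(central)
        target = vendored / relative
        if not target.is_file():
            drift[relative.as_posix()] = "missing"
        elif digest(source) != digest(target):
            drift[relative.as_posix()] = "modified"
    return drift
```

&lt;p&gt;This only surfaces differences. Deciding which of them are intentional project-level overrides and which are accidents is still the authoring and conflict-resolution problem described above.&lt;/p&gt;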

&lt;p&gt;The real gap is semantic, not technical. Which makes it exactly the kind of gap that shared vocabulary can close.&lt;/p&gt;

&lt;p&gt;This is the most interesting empty harness in the stack. As agentic workflows mature and projects multiply, inconsistency compounds quietly. The team that invests in the Organization Harness early is building something that will pay dividends in ways that are hard to attribute but impossible to miss.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Orchestration Harness
&lt;/h3&gt;

&lt;p&gt;Fleet-level coordination of agents. The harness where the products and frameworks are arriving faster than the patterns.&lt;/p&gt;

&lt;p&gt;Devin lives here. So do CrewAI, AutoGen, LangGraph, and swarm frameworks. So does any infrastructure that treats individual agents as nodes in a larger graph: routing work between them, managing their lifecycles, composing their outputs into something coherent. This is not configuration in the traditional sense. It is choreography. The Orchestration Harness does not shape how an agent thinks. It shapes how agents &lt;em&gt;relate&lt;/em&gt; to each other.&lt;/p&gt;

&lt;p&gt;LangGraph makes this concrete: you define a graph of agent nodes, edges that represent conditional routing between them, and state that flows through the graph as work progresses. The harness is the graph itself, the encoded decisions about which agent handles what, under what conditions, and what happens when something fails. Devin operates similarly in spirit, if not in implementation: a task enters the system, gets decomposed, gets distributed, gets reassembled. The Orchestration Harness is what holds that process together.&lt;/p&gt;
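
&lt;p&gt;The shape is easier to see in code. The sketch below is plain Python, not LangGraph's API: a hypothetical &lt;code&gt;Graph&lt;/code&gt; class, three stand-in "agents" written as ordinary functions, and routers that pick the next node from the current state. It only illustrates nodes, conditional edges, shared state, and one failure path (a retry).&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable

State = dict  # shared state that flows through the graph
Node = Callable[[State], State]
Router = Callable[[State], str]

END = "END"


@dataclass
class Graph:
    nodes: dict[str, Node] = field(default_factory=dict)
    routers: dict[str, Router] = field(default_factory=dict)

    def add_node(self, name: str, node: Node, router: Router) -> None:
        self.nodes[name] = node
        self.routers[name] = router

    def run(self, start: str, state: State, max_steps: int = 20) -> State:
        current = start
        for _ in range(max_steps):
            state = self.nodes[current](state)
            state.setdefault("trace", []).append(current)
            current = self.routers[current](state)
            if current == END:
                return state
        raise RuntimeError("graph did not reach END; possible routing loop")


def planner(state: State) -> State:
    # Stand-in for an agent that decomposes the task.
    return {**state, "plan": ["write code", "write tests"]}


def coder(state: State) -> State:
    # Stand-in for an agent that fails once, then succeeds.
    attempts = state.get("attempts", 0) + 1
    return {**state, "attempts": attempts, "ok": attempts == 2}


def reviewer(state: State) -> State:
    return {**state, "approved": True}


graph = Graph()
graph.add_node("planner", planner, lambda s: "coder")
# Conditional edge: retry the coder until it succeeds, then hand off.
graph.add_node("coder", coder, lambda s: "reviewer" if s["ok"] else "coder")
graph.add_node("reviewer", reviewer, lambda s: END)

result = graph.run("planner", {"task": "add endpoint"})
print(result["trace"])  # ['planner', 'coder', 'coder', 'reviewer']
```

&lt;p&gt;The routers are where the harness lives: each one is an encoded decision about who handles what next, and the step limit is a crude answer to "what happens when something fails."&lt;/p&gt;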

&lt;p&gt;What makes the Orchestration Harness genuinely hard is not the tooling; LangGraph and its peers are increasingly capable. The hard part is the &lt;em&gt;design&lt;/em&gt; questions, which do not have settled answers yet. When a fleet of agents is doing something you did not intend, how do you know? How do you trace causation across spawned instances? How do you encode organizational intent in a way that survives decomposition into subtasks? How do you reason about failure when the failing component is itself an agent with its own harness?&lt;/p&gt;

&lt;p&gt;These are not small questions. The Orchestration Harness is where the absence of shared vocabulary is most costly, because the systems are complex enough that imprecise language leads directly to imprecise design. And imprecise design at this scale fails in ways that are hard to diagnose and expensive to untangle.&lt;/p&gt;




&lt;h2&gt;
  
  
  Products do not respect the taxonomy
&lt;/h2&gt;

&lt;p&gt;The reason "harness" gets muddled is that real products do not sit cleanly in one harness. They span two or three at once.&lt;/p&gt;

&lt;p&gt;Claude Code is primarily a Model Harness, but it ships Project Harness primitives: skills, commands, the &lt;code&gt;.claude/&lt;/code&gt; directory shape. Cursor straddles the Model Harness and the Project Harness. CrewAI and AutoGen blur the Agent Harness and the Orchestration Harness at the same time: they define how one agent runs and how many coordinate. LangChain sprawls across the Agent Harness, the Project Harness, and sometimes the Orchestration Harness. Devin reaches into all five.&lt;/p&gt;

&lt;p&gt;This is why the word collapses. The products are not lying. They really do span harnesses. The fix is not to pretend they do not. The fix is to name &lt;em&gt;which harness&lt;/em&gt; a product touches when we talk about it.&lt;/p&gt;




&lt;h2&gt;
  
  
  A debugging ladder
&lt;/h2&gt;

&lt;p&gt;The taxonomy earns its keep when something goes wrong.&lt;/p&gt;

&lt;p&gt;When an agent behaves unexpectedly, the instinct is to poke at whatever is most visible, usually a prompt or a config file. But the question "which harness is this a problem in?" is more useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the tool itself underperforming for this task? &lt;em&gt;(Model Harness)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Is global memory or agent configuration incomplete or contradictory? &lt;em&gt;(Agent Harness)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Is a hook misconfigured, or is a subdirectory &lt;code&gt;CLAUDE.md&lt;/code&gt; missing critical context? &lt;em&gt;(Project Harness)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Are there conflicting conventions across projects that this agent is inheriting inconsistently? &lt;em&gt;(Organization Harness)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Is the orchestration logic routing or spawning incorrectly? &lt;em&gt;(Orchestration Harness)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Five questions. Five places to look. That is not a debugging methodology. It is what shared vocabulary makes possible.&lt;/p&gt;
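
&lt;p&gt;Written down as data, the ladder looks like the sketch below. It assumes the rungs are checked in the order listed, which is a convenience rather than a rule, and the &lt;code&gt;first_suspect&lt;/code&gt; helper is hypothetical.&lt;/p&gt;

```python
from enum import Enum
from typing import Optional


class Harness(Enum):
    MODEL = "Model Harness"
    AGENT = "Agent Harness"
    PROJECT = "Project Harness"
    ORGANIZATION = "Organization Harness"
    ORCHESTRATION = "Orchestration Harness"


# One question per rung, in the order listed above.
LADDER = [
    (Harness.MODEL, "Is the tool itself underperforming for this task?"),
    (Harness.AGENT, "Is global memory or agent configuration incomplete or contradictory?"),
    (Harness.PROJECT, "Is a hook misconfigured, or is a subdirectory CLAUDE.md missing context?"),
    (Harness.ORGANIZATION, "Are conflicting cross-project conventions being inherited?"),
    (Harness.ORCHESTRATION, "Is the orchestration logic routing or spawning incorrectly?"),
]


def first_suspect(answers: dict[Harness, bool]) -> Optional[Harness]:
    """Walk the ladder in order and return the first rung answered "yes"."""
    for harness, _question in LADDER:
        if answers.get(harness, False):
            return harness
    return None


suspect = first_suspect({Harness.PROJECT: True, Harness.ORCHESTRATION: True})
print(suspect.value if suspect else "no rung matched")  # Project Harness
```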




&lt;h2&gt;
  
  
  The attention map
&lt;/h2&gt;

&lt;p&gt;The taxonomy also makes the field's attention map visible. Most of the work right now is happening in the Model Harness (the tool wars), the Project Harness (the explosion of project-level scaffolding), and the Orchestration Harness (the multi-agent frameworks). The Agent Harness is catching up. The Organization Harness is empty.&lt;/p&gt;

&lt;p&gt;If you are looking for where the next interesting work lives, look at the empty harness.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why naming this matters
&lt;/h2&gt;

&lt;p&gt;We are, collectively, in a period of rapid accumulation. Patterns are emerging faster than they are being named. The result is that knowledge stays local: buried in individual &lt;code&gt;CLAUDE.md&lt;/code&gt; files, undocumented hook scripts, tribal conventions that do not survive team changes.&lt;/p&gt;

&lt;p&gt;Taxonomies feel like housekeeping until suddenly they are load-bearing. The goal of the Harness Stack is not to add ceremony to a field that is moving fast. It is to give the field something specific to argue about. "We need a better harness" is unanswerable today, because the next person is allowed to interpret it however they want. "We need a better Organization Harness" is an argument you can act on.&lt;/p&gt;

&lt;p&gt;I hold this loosely. The edges are genuinely blurry. The Agent Harness and the Project Harness blur when global memory starts referencing project-specific context. The Organization Harness and the Orchestration Harness blur when org policies begin governing agent spawning behavior. That is fine. A taxonomy does not need to be perfect to be useful. It needs to be shared.&lt;/p&gt;

&lt;p&gt;The rule is: when you say "harness," say which one. The taxonomy is wrong somewhere. It is a first attempt. I would rather argue about whether the Organization Harness should be called something else than keep watching engineers nod at each other and walk out of the room with five different mental models.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Does this map to how you are building, or does it break somewhere meaningful? I am curious where the names hold and where they need to be argued with. If you are working in this space, I would rather have a conversation than be right.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>webdev</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>When Vibe Coding Stops Working</title>
      <dc:creator>Ian Johnson</dc:creator>
      <pubDate>Fri, 29 May 2026 20:21:31 +0000</pubDate>
      <link>https://dev.to/tacoda/when-vibe-coding-stops-working-3nkc</link>
      <guid>https://dev.to/tacoda/when-vibe-coding-stops-working-3nkc</guid>
      <description>&lt;p&gt;Vibe coding is real, useful, and produces working software. It is also, past a certain point, a way of generating a mess faster than any one person can clean it.&lt;/p&gt;

&lt;p&gt;The term has come to mean a specific way of working with an agent: open the chat, describe what you want, look at what comes back, run it, describe the next thing, repeat. No upfront design. No structured workflow. No tests, often. Just iteration through conversation until the program does the thing.&lt;/p&gt;

&lt;p&gt;The argument against vibe coding has usually been moral: "real engineers don't do this." That argument is not interesting and it is not even right. The interesting argument is structural. Vibe coding works inside a specific envelope, and outside that envelope it produces predictable failure modes. Knowing where the envelope ends is more useful than disapproving of the technique.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where vibe coding works
&lt;/h2&gt;

&lt;p&gt;There are real categories of work where vibe coding is the correct approach.&lt;/p&gt;

&lt;p&gt;Throwaway scripts. A one-off data transformation, a quick automation, something you will run twice and never touch again. The cost of getting it slightly wrong is bounded. The cost of building any scaffolding would never be recovered. Vibe coding is the rational choice.&lt;/p&gt;

&lt;p&gt;Exploration in a new domain. You do not know what you want yet. You are sketching, trying things, building intuition. Imposing structure on a problem you have not understood is premature. Vibe coding lets you find the shape of the problem before you commit to a shape for the solution.&lt;/p&gt;

&lt;p&gt;Small, isolated additions to a codebase you understand. A new endpoint shaped like the existing endpoints. A new component shaped like the existing components. You know the patterns; the agent can fill them in. The supervision is light because the risk is light.&lt;/p&gt;

&lt;p&gt;Solo projects with no collaborators. You are the entire audience for the code. Future-you might be annoyed, but future-you does not have to coordinate with anyone else about how the code is organized. The cost of incoherence is internal.&lt;/p&gt;

&lt;p&gt;In each of these cases, the upper bound on what can go wrong is small. The codebase is not going to grow much. The team is not going to onboard people who need to understand it. The systems it depends on are not going to evolve out from under it. Vibe coding stays inside its envelope because the envelope is tight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it stops working
&lt;/h2&gt;

&lt;p&gt;The same approach, in a larger context, fails in a specific way.&lt;/p&gt;

&lt;p&gt;It does not fail catastrophically. There is no single moment when the vibe stops working. It fails gradually, as the codebase accumulates choices that were never coordinated, conventions that were never documented, abstractions that were introduced because the model felt like introducing them, and tests that were never written because the chat was about getting the thing to work, not about proving it would keep working.&lt;/p&gt;

&lt;p&gt;The signs are recognizable:&lt;/p&gt;

&lt;p&gt;Every new feature takes longer than the last one, not because the system is more complex but because the agent has to spend more context reading the inconsistencies before it can add anything.&lt;/p&gt;

&lt;p&gt;Two changes in different files produce conflicting patterns, and nobody can say which one is "right" because there is no document that says.&lt;/p&gt;

&lt;p&gt;A regression appears that nobody can trace, because the change that introduced it was made in a chat session that nobody saved.&lt;/p&gt;

&lt;p&gt;A new team member joins and asks where the conventions are written down, and the answer is "ask the agent" or "look at recent code," which is the answer that means the conventions do not actually exist.&lt;/p&gt;

&lt;p&gt;The agent itself starts producing worse output, because the codebase it is pattern-matching against has become a patchwork of patterns rather than a single coherent style.&lt;/p&gt;

&lt;p&gt;None of these are failures of vibe coding as a technique. They are the cost of using a technique past its envelope.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thresholds
&lt;/h2&gt;

&lt;p&gt;The envelope ends at a few specific thresholds. Crossing any of them is the signal to switch modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codebase size.&lt;/strong&gt; Once the project is large enough that nobody on the team can hold the whole thing in their head, the agent cannot either. The convention has to be on disk, not in the conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team size.&lt;/strong&gt; Once more than one person is contributing, the cost of incoherence falls on people who were not in the original chat. Vibe coding becomes a tax on collaborators. Convention has to be shared, which means written down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time horizon.&lt;/strong&gt; Code that will run in production next year is code that will need to be modified next year, by someone who is not currently in the conversation. The conversation does not survive. The code does. Convention has to be in the code or near it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regression risk.&lt;/strong&gt; The moment a bug in this code could affect a customer, a system, or a number that matters, the cost of "we'll fix it if we notice" exceeds the cost of having tests and a real PR process.&lt;/p&gt;

&lt;p&gt;Crossing any one of these is enough. The mistake is to keep using the same approach past the threshold because it worked yesterday.&lt;/p&gt;

&lt;h2&gt;
  
  
  The upgrade path
&lt;/h2&gt;

&lt;p&gt;If you have been vibe coding and you have crossed the threshold, the upgrade is not a return to "real engineering" in some pre-agent sense. It is a transition to a different mode of working with the agent. One in which the harness, the conventions, and the sensors do the work that the conversation was doing implicitly.&lt;/p&gt;

&lt;h3&gt;
  
  
  The minimum upgrade
&lt;/h3&gt;

&lt;p&gt;Write down the conventions the agent has been inferring. Move them out of the chat history and into a file the next session loads.&lt;/p&gt;

&lt;p&gt;Add the tests for the parts you have been treating as "I'll check it works manually." Future sessions cannot check it manually; the test is what proves it across changes.&lt;/p&gt;

&lt;p&gt;Adopt a PR practice. Even for a solo project, the PR is where the change is reviewed and the checks run. It is the gate that turns vibes into commits.&lt;/p&gt;

&lt;p&gt;Start a rules file, even a small one. The first three rules in it should be the three things you have corrected the agent on most recently.&lt;/p&gt;

&lt;p&gt;None of this is a renunciation of vibe coding. It is the recognition that the technique works inside its envelope, and that outside the envelope a different set of techniques does the same job for a different problem. Use the right one for the situation.&lt;/p&gt;

&lt;p&gt;The teams that get into trouble are not the ones that vibe coded. They are the ones that vibe coded past the point where it stopped working and did not notice.&lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>programming</category>
      <category>agents</category>
      <category>ai</category>
    </item>
    <item>
      <title>Code Review When Half the Diffs Are From Agents</title>
      <dc:creator>Ian Johnson</dc:creator>
      <pubDate>Fri, 29 May 2026 18:14:49 +0000</pubDate>
      <link>https://dev.to/tacoda/code-review-when-half-the-diffs-are-from-agents-2ph0</link>
      <guid>https://dev.to/tacoda/code-review-when-half-the-diffs-are-from-agents-2ph0</guid>
      <description>&lt;p&gt;Code review was invented for a world in which a human wrote each diff, slowly, and another human read it, also slowly. The ratio held. Reviewers could plausibly read every line of every change without becoming the bottleneck. The practice scaled because both sides scaled at roughly the same rate.&lt;/p&gt;

&lt;p&gt;Agents break the ratio. The author writes faster than the reviewer can read. If the reviewer's practice does not change, one of two things happens: the reviewer becomes the bottleneck and the agent's throughput is wasted, or the reviewer keeps the cadence and starts rubber-stamping. Most teams quietly choose option two and pretend they chose option one.&lt;/p&gt;

&lt;p&gt;This is fixable. But it requires admitting that the review practice itself has to change.&lt;/p&gt;

&lt;h2&gt;
  
  
  What code review is actually for
&lt;/h2&gt;

&lt;p&gt;Strip code review down and it serves a small number of real purposes: catching bugs the author missed, verifying the change does what the PR description claims, sharing knowledge across the team, enforcing standards the tools cannot, and holding contributors to a baseline of care.&lt;/p&gt;

&lt;p&gt;When the contributor was a human, all of these blurred together because the same act (reading the diff line by line) addressed most of them. The reviewer would catch a typo, notice the missing edge case, learn how the new module worked, and check that the naming was consistent, all in the same pass.&lt;/p&gt;

&lt;p&gt;This worked because human authorship is &lt;em&gt;slow and uneven&lt;/em&gt;. The cost of writing the next line was high enough that the author was thinking, which meant the diff carried signal about intent on every line. Reading every line was reading every decision.&lt;/p&gt;

&lt;p&gt;Agents do not author that way. They produce a lot of correct-looking code very fast, and the cost of any individual line is low. Reading every line of an agent-authored PR is not reading every decision; it is reading a lot of plausible boilerplate looking for the few places where a decision actually happened.&lt;/p&gt;

&lt;p&gt;The shift in review practice has to follow the shift in authorship.&lt;/p&gt;

&lt;h2&gt;
  
  
  Higher-level questions
&lt;/h2&gt;

&lt;p&gt;The questions that matter on an agent-authored PR are not "does line 47 do the right thing?" They are:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Did the change touch the files it should have touched, and only those?&lt;/em&gt; Agents often produce edits in surprising places: files they did not need to modify, configs they decided to update "while I'm here." The diff scope itself is signal.&lt;/p&gt;
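
&lt;p&gt;A scope check like this is cheap to automate. A minimal sketch, assuming you can state up front which paths the change is expected to touch; the file names and the &lt;code&gt;out_of_scope&lt;/code&gt; helper are hypothetical, and in practice the changed-file list could come from something like &lt;code&gt;git diff --name-only&lt;/code&gt;.&lt;/p&gt;

```python
from fnmatch import fnmatch


def out_of_scope(changed_files: list[str], expected_patterns: list[str]) -> list[str]:
    """Return changed files that match none of the expected glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in expected_patterns)
    ]


changed = [
    "api/orders/views.py",
    "api/orders/tests/test_views.py",
    "config/settings.py",  # the "while I'm here" edit
]
expected = ["api/orders/*"]

print(out_of_scope(changed, expected))  # ['config/settings.py']
```

&lt;p&gt;An empty result does not prove the change is right. A non-empty one is a prompt to ask why those files moved.&lt;/p&gt;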

&lt;p&gt;&lt;em&gt;Does the test prove the thing it claims to prove?&lt;/em&gt; Agents are great at producing tests that pass. They are less great at producing tests that would &lt;em&gt;fail&lt;/em&gt; if the code were wrong. A test that asserts the function returned, without asserting what it returned, is a test that adds confidence without adding evidence.&lt;/p&gt;
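
&lt;p&gt;The difference is easy to show with a deliberately broken function. In this sketch, &lt;code&gt;apply_discount&lt;/code&gt; and both tests are hypothetical; the checks are written as plain functions returning booleans so the contrast is visible without a test runner.&lt;/p&gt;

```python
def apply_discount(price: float, percent: float) -> float:
    """Buggy on purpose: adds the discount instead of subtracting it."""
    return price + price * percent / 100


def weak_test() -> bool:
    # Only checks that something came back; passes for almost any implementation.
    result = apply_discount(100.0, 10.0)
    return result is not None


def strong_test() -> bool:
    # Checks the actual value, so it fails when the arithmetic is wrong.
    return apply_discount(100.0, 10.0) == 90.0


print(weak_test())    # True, despite the bug
print(strong_test())  # False, the bug is caught
```

&lt;p&gt;A quick review habit follows from this: for each new test, ask what change to the code would make it fail. If the answer is "almost none," the test is decoration.&lt;/p&gt;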

&lt;p&gt;&lt;em&gt;Does the change follow the patterns the rest of the codebase uses?&lt;/em&gt; Or does it introduce a new pattern, in a way that suggests the agent did not see, or did not follow, the existing one? New patterns introduced without discussion are the source of most agent-induced drift.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Is the abstraction the change introduces actually needed?&lt;/em&gt; Agents have a strong bias toward extracting helpers, adding flags, and parameterizing things. Most of those abstractions are pre-emptive and wrong. The reviewer's job is to push back on speculative generality.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Does the PR description match the diff?&lt;/em&gt; A surprisingly cheap check that catches a real category of problem: the agent solved a different problem than the one stated, and the description was written to match the solution rather than the original intent.&lt;/p&gt;

&lt;p&gt;None of these questions require reading every line. All of them require understanding the shape of the change. The shift is from line-by-line to shape-first.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to read every line anyway
&lt;/h2&gt;

&lt;p&gt;Some changes still warrant the old practice. The categories that earn line-level review:&lt;/p&gt;

&lt;p&gt;Changes to security-sensitive code. Auth, permissions, anything touching user data, anything that talks to the outside world. The cost of a missed bug is too high to leave to shape-first review.&lt;/p&gt;

&lt;p&gt;Changes to code the team is unfamiliar with. If the diff is in a module nobody on the team has touched in a year, the act of reading it line by line &lt;em&gt;is&lt;/em&gt; the knowledge-sharing. Skipping it loses the only chance to update the team's mental model.&lt;/p&gt;

&lt;p&gt;Changes that the harness sensors flag with anything other than green. Linter warnings, test flakes, type errors in adjacent files, and so on. All of these are reasons to slow down and read.&lt;/p&gt;

&lt;p&gt;Changes that look too simple. If a one-line diff makes a hard problem disappear, the most likely explanation is that the problem is still there and the diff just stopped reporting it.&lt;/p&gt;

&lt;p&gt;For everything else, shape-first is the rational allocation of attention. Pretending you can read line by line at agent throughput is the way you end up rubber-stamping the cases that matter.&lt;/p&gt;
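
&lt;p&gt;The triage above is simple enough to write down as a function. This is a sketch, not a policy engine: the &lt;code&gt;Change&lt;/code&gt; fields are hypothetical flags a reviewer or a bot would have to fill in, and the rule is simply "any one category is enough to slow down."&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Change:
    touches_security_code: bool  # auth, permissions, user data, external calls
    team_knows_module: bool      # someone has worked in this code recently
    sensors_green: bool          # linters, tests, and type checks all clean
    suspiciously_simple: bool    # small diff that makes a hard problem vanish


def review_mode(change: Change) -> str:
    """Return "line-by-line" when any slow-down category applies."""
    if (
        change.touches_security_code
        or not change.team_knows_module
        or not change.sensors_green
        or change.suspiciously_simple
    ):
        return "line-by-line"
    return "shape-first"


routine = Change(
    touches_security_code=False,
    team_knows_module=True,
    sensors_green=True,
    suspiciously_simple=False,
)
print(review_mode(routine))  # shape-first
```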

&lt;h2&gt;
  
  
  Don't let the agent review its own work
&lt;/h2&gt;

&lt;p&gt;A particular anti-pattern worth naming: using an agent to review an agent's PR. This is appealing (the throughput finally matches) and it produces almost zero signal. The reviewing agent has the same priors as the authoring agent, the same blind spots, and the same training. It will find typos and miss the architectural mistake. It will praise the test coverage and not notice that the tests do not actually test the thing.&lt;/p&gt;

&lt;p&gt;Agent-assisted review is fine. The agent can summarize the diff, flag suspicious patterns, run extra checks, draft a first-pass comment. But the final review must come from a human who reads the shape of the change and asks the questions above. Outsourcing that to another model is outsourcing the only step that was load-bearing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the team has to give up
&lt;/h2&gt;

&lt;p&gt;The hardest part of changing review practice is admitting that the old practice (read every line, comment on what you see) was &lt;em&gt;always&lt;/em&gt; partly theater. Reviewers caught some bugs, missed many more, and got most of their value from forcing the author to think harder before posting. Agents are not embarrassed by review. They will happily produce another version. The forcing function that human-on-human review provided is gone.&lt;/p&gt;

&lt;p&gt;What replaces it is a review practice that does not pretend. Read the shape. Ask the architectural questions. Trust the sensors for the line-level stuff. Slow down for the categories that warrant it. Stop measuring reviewer effectiveness in comments per PR; start measuring it in problems caught before merge.&lt;/p&gt;

&lt;p&gt;The point of review is not to read every line. It never was. With agents, that finally becomes obvious.&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>agents</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
