<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ali</title>
    <description>The latest articles on DEV Community by Ali (@erfan1995).</description>
    <link>https://dev.to/erfan1995</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F747213%2F2873e21b-e297-44b0-a777-ddd0b39c931b.jpg</url>
      <title>DEV Community: Ali</title>
      <link>https://dev.to/erfan1995</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/erfan1995"/>
    <language>en</language>
    <item>
      <title>The Six-Component Harness: A Template for Building Reliably with AI Agents</title>
      <dc:creator>Ali</dc:creator>
      <pubDate>Thu, 09 Apr 2026 18:32:25 +0000</pubDate>
      <link>https://dev.to/erfan1995/-the-six-component-harness-a-template-for-building-reliably-with-ai-agents-lhc</link>
      <guid>https://dev.to/erfan1995/-the-six-component-harness-a-template-for-building-reliably-with-ai-agents-lhc</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 2. Part 1 covers what I learned building Skilldeck with Claude Code — three failure modes, the regression problem, and why instructions aren't enough. This piece is the framework. You can read it standalone, but the story gives it context.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn72x4tu1sf20rprnhd0c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn72x4tu1sf20rprnhd0c.png" alt="Six components of harness engineering"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Most harness engineering discussions stop at four components: a system prompt, a task list, a progress file, and some tests. That's enough to get an agent building things. It isn't enough to keep a real project coherent across weeks of autonomous sessions, evolving requirements, and features that share code with other features.&lt;/p&gt;

&lt;p&gt;The harness I ended up with has six components. Each one addresses a specific failure class. The first four are table stakes — the field has largely converged on them. The last two are what make the difference between a harness that works in theory and one that survives contact with a real, evolving codebase.&lt;/p&gt;

&lt;p&gt;Here's the full template.&lt;/p&gt;




&lt;h2&gt;
  
  
  Component 1: Ground truth
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;feature_list.json&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Failure prevented:&lt;/strong&gt; Premature completion, false positives, context loss about what's actually built&lt;/p&gt;

&lt;p&gt;The ground truth file is the canonical record of everything the project needs to build and whether it actually works. Not documentation — ground truth. The distinction matters. Documentation describes intent. Ground truth reflects verified reality.&lt;/p&gt;

&lt;p&gt;Every feature entry has a &lt;code&gt;passes&lt;/code&gt; field that is either &lt;code&gt;true&lt;/code&gt; or &lt;code&gt;false&lt;/code&gt;. It's &lt;code&gt;false&lt;/code&gt; until a mechanism verifies it — not until the agent believes it's done, not until the code exists, not until a unit test passes. &lt;code&gt;true&lt;/code&gt; means: the feature works as a user would experience it, verified by an automated test that actually runs the application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"F005"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Create new skill"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"User can create a skill from the Library view. File created on disk."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"steps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Click New Skill"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Verify skill appears in list"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Verify .md file exists on disk"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"touches"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"store.skills"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ipc.skills"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"preload"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"depends_on"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"F004"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"passes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"notes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two fields most harness discussions miss: &lt;code&gt;touches&lt;/code&gt; and &lt;code&gt;depends_on&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;touches&lt;/code&gt; lists which shared code surfaces this feature depends on — the Zustand store, specific IPC handlers, the preload bridge. This powers the regression gate (Component 5). Without it, you can't know which other tests to re-run when a file changes.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;depends_on&lt;/code&gt; lists which features must pass before this one can be built. The agent checks this at the start of every loop — it skips features whose dependencies aren't met and finds the next buildable one. This prevents an agent from trying to build a deployment feature before the project registration feature exists.&lt;/p&gt;
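
&lt;p&gt;The dependency check described above can be sketched in a few lines of Node. This is a minimal illustration rather than the harness's actual code: &lt;code&gt;nextBuildable&lt;/code&gt; is a hypothetical helper, and the sample data is made up. It assumes entries carry the &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;passes&lt;/code&gt;, and &lt;code&gt;depends_on&lt;/code&gt; fields shown in the entry above.&lt;/p&gt;

```javascript
// Hypothetical helper: return the first feature that is not passing yet
// and whose dependencies have all been verified, or null if none qualifies.
function nextBuildable(features) {
  const passing = new Set(features.filter((f) => f.passes).map((f) => f.id));
  const pending = features.filter((f) => !f.passes);
  const ready = pending.find((f) =>
    (f.depends_on || []).every((id) => passing.has(id))
  );
  return ready || null;
}

// Sample data for illustration only.
const sample = [
  { id: "F004", passes: false, depends_on: [] },
  { id: "F005", passes: false, depends_on: ["F004"] },
];
console.log(nextBuildable(sample).id); // F004
```

&lt;p&gt;Returning &lt;code&gt;null&lt;/code&gt; when nothing is buildable gives the agent an explicit signal to stop and record a blocker instead of picking an arbitrary feature.&lt;/p&gt;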

&lt;p&gt;&lt;strong&gt;Critical rule:&lt;/strong&gt; Only update the &lt;code&gt;passes&lt;/code&gt; and &lt;code&gt;notes&lt;/code&gt; fields. Never modify descriptions. Use a Node command, never string replace: string matching against JSON depends on exact whitespace and formatting, so it fails unpredictably, while parsing and re-serializing does not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"const fs=require('fs');const f=JSON.parse(fs.readFileSync('feature_list.json','utf8'));const x=f.features.find(x=&amp;gt;x.id==='F005');x.passes=true;x.notes='Verified';fs.writeFileSync('feature_list.json',JSON.stringify(f,null,2));"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Component 2: Memory
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;claude-progress.txt&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Failure prevented:&lt;/strong&gt; Context loss across sessions, "declare victory" failure, starting blind&lt;/p&gt;

&lt;p&gt;An agent starting a new session has zero memory of prior sessions. Everything it knows about the project comes from what it reads at the start of that session. If the memory file is absent, inaccurate, or stale, the agent reconstructs project state from the code — and gets it wrong often enough to matter.&lt;/p&gt;

&lt;p&gt;The memory file has two jobs. First: orient the agent at session start. What was the last thing built? Is anything broken? What should happen next? Second: record what happened at session end. Future sessions depend on this being accurate.&lt;/p&gt;

&lt;p&gt;The session template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### Session N — [brief title] ([date])&lt;/span&gt;
What happened: [what was built or fixed]
Features completed: [F00X, F00Y]
Features attempted but not completed: [F00Z — reason]
Current app state: [does it compile? do tests pass?]
Next session should: [specific next steps, not vague directions]
Blockers: [anything needing human attention]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent writes this at the end of every session before committing. If it doesn't — if a session ends without an entry — the next session starts with stale information. Make the write mandatory: the commit doesn't happen until the progress entry is written.&lt;/p&gt;

&lt;p&gt;One pattern worth enforcing: if the agent marks a session blocked (hit the three-attempt limit on a feature), the blocker entry must include the exact error output, all three approaches tried, and the root cause hypothesis. Vague blocker entries waste your debugging time.&lt;/p&gt;
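
&lt;p&gt;The mandatory write can be backed by a small check that runs before the commit. The sketch below is an assumption about how such a gate could look, not part of the original harness: &lt;code&gt;missingFields&lt;/code&gt; is a hypothetical helper that takes the text of the latest entry and reports which template lines are absent or empty. The sample entry is invented for illustration.&lt;/p&gt;

```javascript
// Labels taken from the session template; the helper itself is hypothetical.
const REQUIRED = [
  "What happened:",
  "Features completed:",
  "Current app state:",
  "Next session should:",
];

// Return the required labels that are missing or have no text after them.
function missingFields(entry) {
  const lines = entry.split("\n").map((line) => line.trim());
  return REQUIRED.filter((label) => {
    const line = lines.find((l) => l.startsWith(label));
    return !line || line.slice(label.length).trim() === "";
  });
}

// Sample entry for illustration only.
const entry = [
  "### Session 12: example title (date)",
  "What happened: fixed the preload bridge",
  "Features completed: F009",
  "Current app state:",
].join("\n");
console.log(missingFields(entry)); // [ 'Current app state:', 'Next session should:' ]
```

&lt;p&gt;A wrapper script could exit non-zero whenever the returned list is non-empty, which blocks the commit until the entry is complete.&lt;/p&gt;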




&lt;h2&gt;
  
  
  Component 3: Startup ritual
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;init.sh&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Failure prevented:&lt;/strong&gt; Environment assumptions, silent tool failures, starting on a broken baseline&lt;/p&gt;

&lt;p&gt;The startup ritual runs at the start of every session. Its job is to make the environment assumptions valid rather than hoped for. Every environment problem caught here is a problem that can't compound into something worse later.&lt;/p&gt;

&lt;p&gt;What a good startup ritual checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Correct directory&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"CLAUDE.md"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ERROR: wrong directory"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;exit &lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# 2. Required tools&lt;/span&gt;
node &lt;span class="nt"&gt;--version&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Node not found"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;exit &lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# 3. Git initialized&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;".git"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;git init &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git add &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"harness: initialize"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# 4. Uncommitted changes from previous session&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; git diff &lt;span class="nt"&gt;--quiet&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"WARNING: uncommitted changes from previous session"&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Commit them (if feature is complete) or revert: git checkout ."&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# 5. Feature status&lt;/span&gt;
node &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"const f=require('./feature_list.json');const p=f.features.filter(x=&amp;gt;x.passes).length;console.log('Passing: '+p+'/'+f.features.length);"&lt;/span&gt;

&lt;span class="c"&gt;# 6. Next feature&lt;/span&gt;
node &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"const f=require('./feature_list.json');const n=f.features.find(x=&amp;gt;!x.passes);if(n)console.log('Next: '+n.id+' — '+n.name);"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The uncommitted changes check is the one most harnesses omit. If a previous session built half a feature and stopped without committing or reverting, the next session inherits broken code as its starting state. Surfacing this immediately prevents compounding. The agent sees the warning and makes a deliberate choice: commit what's there if it's working, or revert and start clean.&lt;/p&gt;

&lt;p&gt;The startup ritual also puts current progress and the next feature in front of the agent at session start, so it doesn't burn context budget working them out.&lt;/p&gt;




&lt;h2&gt;
  
  
  Component 4: Verification layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Files:&lt;/strong&gt; &lt;code&gt;verify.spec.ts&lt;/code&gt;, &lt;code&gt;playwright.config.ts&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Failure prevented:&lt;/strong&gt; Premature completion, code that exists but doesn't work&lt;/p&gt;

&lt;p&gt;The verification layer is Playwright tests that actually run the application. Not unit tests. Not type checks. End-to-end tests that launch the app, interact with the UI, and verify the filesystem state.&lt;/p&gt;

&lt;p&gt;Every feature test has two assertions: UI state and filesystem state. A feature that updates the UI but doesn't write to disk is broken. A feature that writes to disk but doesn't reflect in the UI is broken. Both must be true.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;F005 - Create new skill&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;cleanSkilldeck&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;launchApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="new-skill-btn"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;// UI assertion&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="skill-item"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="skill-item"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;toBeGreaterThan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;// Filesystem assertion — the real test&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readdirSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;LIBRARY_DIR&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.md&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeGreaterThan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;data-testid&lt;/code&gt; attributes are not optional. Every interactive element the tests need to reach must have one, added at the same time the component is built. Retrofitting them after the fact is fragile and burns sessions. The rule: no &lt;code&gt;data-testid&lt;/code&gt;, no way to verify, feature cannot be marked passing.&lt;/p&gt;

&lt;p&gt;Two things the verification layer enforces that instructions can't:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The agent can't mark F005 passing by looking at the code. It runs the test. The test clicks the button. The button either creates the file or it doesn't.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Phase 2+ features follow TDD: write the failing test first, commit it, then implement, then verify it passes. The commit history proves the red-green cycle happened.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Component 5: System contract
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;system-contract.json&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Failure prevented:&lt;/strong&gt; Regression: features that worked until a new feature was added and then stopped working&lt;/p&gt;

&lt;p&gt;This is the component most harnesses are missing, and the one I was missing when search broke three weeks into the &lt;a href="https://github.com/ali-erfan-dev/skilldeck" rel="noopener noreferrer"&gt;Skilldeck&lt;/a&gt; build.&lt;/p&gt;

&lt;p&gt;The system contract has two sections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invariants&lt;/strong&gt; — always-true properties of the system, checked before every commit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"invariants"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"INV-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"config.json is valid JSON with required shape"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"check"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node checks/inv-config-valid.js"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"triggers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"always"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"INV-004"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"No IPC channel registered more than once"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"check"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node checks/inv-no-duplicate-ipc.js"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"triggers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"always"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Invariants are cheap deterministic checks that catch structural failures. If &lt;code&gt;config.json&lt;/code&gt; becomes malformed or an IPC channel gets registered twice, the invariant fires at commit time — not three sessions later when something mysteriously stops working.&lt;/p&gt;
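
&lt;p&gt;The check scripts themselves aren't shown here. As a rough sketch of what a check like &lt;code&gt;checks/inv-no-duplicate-ipc.js&lt;/code&gt; might do, assuming handlers are registered through Electron's &lt;code&gt;ipcMain.handle&lt;/code&gt; or &lt;code&gt;ipcMain.on&lt;/code&gt; with a string-literal channel name, the hypothetical &lt;code&gt;duplicateChannels&lt;/code&gt; function below scans source text and returns every channel registered more than once. The channel names in the sample are invented.&lt;/p&gt;

```javascript
// Hypothetical invariant logic: find IPC channels registered more than once.
// Assumes registrations look like ipcMain.handle('channel', ...) or
// ipcMain.on('channel', ...) with a string-literal channel name.
function duplicateChannels(source) {
  const pattern = /ipcMain\.(?:handle|on)\(\s*['"]([^'"]+)['"]/g;
  const counts = new Map();
  for (const match of source.matchAll(pattern)) {
    counts.set(match[1], (counts.get(match[1]) || 0) + 1);
  }
  return [...counts.entries()]
    .filter(([, count]) => count > 1)
    .map(([channel]) => channel);
}

// Sample source text for illustration only.
const sampleSource = [
  "ipcMain.handle('skills:list', listSkills)",
  "ipcMain.handle('skills:create', createSkill)",
  "ipcMain.handle('skills:list', listSkillsAgain)",
].join("\n");
console.log(duplicateChannels(sampleSource)); // [ 'skills:list' ]
```

&lt;p&gt;A real check would read the main-process files from disk and exit with a non-zero status when the list is non-empty, so the failure stops the commit.&lt;/p&gt;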

&lt;p&gt;&lt;strong&gt;Surfaces&lt;/strong&gt; — which shared code paths each feature depends on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"surfaces"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"store.skills"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"files"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"src/store/skillsStore.ts"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"affected_features"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"F004"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"F005"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"F006"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"F007"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"F008"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"F009"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"F010"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"preload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"files"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"electron/preload.ts"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"affected_features"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"F004"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"F005"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"F006"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"F007"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"F008"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"F011"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"F012"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"F013"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"F014"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The surfaces map powers the regression gate. When the agent finishes F011 and is about to commit, &lt;code&gt;get-regression-tests.js&lt;/code&gt; reads the git diff, finds which files changed, matches them against the surfaces map, and returns the grep pattern for all features that could have been affected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;REGRESSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;node get-regression-tests.js F011&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;# Changed files include preload.ts → F004-F014 share preload → run those tests&lt;/span&gt;
npx playwright &lt;span class="nb"&gt;test &lt;/span&gt;verify.spec.ts &lt;span class="nt"&gt;--grep&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$REGRESSION&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not all 31 tests. Not zero. Exactly the tests that could have been broken by what just changed.&lt;/p&gt;

&lt;p&gt;The search regression I hit would have been caught here. F011 touched &lt;code&gt;preload.ts&lt;/code&gt;. F009 (search) depends on &lt;code&gt;preload&lt;/code&gt;. The regression gate would have run F009's test after F011 was built, caught the failure, and blocked the commit.&lt;/p&gt;
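
&lt;p&gt;The matching step inside the regression gate can be sketched as a pure function. This is an illustration under stated assumptions, not the contents of &lt;code&gt;get-regression-tests.js&lt;/code&gt;: it assumes the surfaces map has the shape shown above and that changed files arrive as repo-relative paths (for example from &lt;code&gt;git diff --name-only&lt;/code&gt;). The name &lt;code&gt;regressionPattern&lt;/code&gt; and the trimmed-down sample map are hypothetical.&lt;/p&gt;

```javascript
// Hypothetical core of a regression gate: map changed files to the
// features that share a surface with them, then build a --grep pattern.
function regressionPattern(changedFiles, surfaces) {
  const changed = new Set(changedFiles);
  const affected = new Set();
  for (const surface of Object.values(surfaces)) {
    if (surface.files.some((file) => changed.has(file))) {
      surface.affected_features.forEach((id) => affected.add(id));
    }
  }
  // Sorted and de-duplicated, joined as a regex alternation.
  return [...affected].sort().join("|");
}

// Sample map for illustration only.
const surfaces = {
  "store.skills": {
    files: ["src/store/skillsStore.ts"],
    affected_features: ["F004", "F009"],
  },
  preload: {
    files: ["electron/preload.ts"],
    affected_features: ["F004", "F011"],
  },
};
console.log(regressionPattern(["electron/preload.ts"], surfaces)); // F004|F011
```

&lt;p&gt;One design detail worth handling in a real script: when no surface matches, the pattern is an empty string, and an empty &lt;code&gt;--grep&lt;/code&gt; pattern matches every test. The caller should skip the regression run in that case.&lt;/p&gt;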

&lt;p&gt;&lt;strong&gt;The commit sequence — every feature, every time:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Feature test&lt;/span&gt;
npx playwright &lt;span class="nb"&gt;test &lt;/span&gt;verify.spec.ts &lt;span class="nt"&gt;--grep&lt;/span&gt; F011

&lt;span class="c"&gt;# 2. Invariant checks&lt;/span&gt;
node check-invariants.js &lt;span class="nt"&gt;--always&lt;/span&gt;

&lt;span class="c"&gt;# 3. Regression gate&lt;/span&gt;
&lt;span class="nv"&gt;REGRESSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;node get-regression-tests.js F011&lt;span class="si"&gt;)&lt;/span&gt;
npx playwright &lt;span class="nb"&gt;test &lt;/span&gt;verify.spec.ts &lt;span class="nt"&gt;--grep&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$REGRESSION&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# 4. Only if all three pass&lt;/span&gt;
node &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"[mark F011 passing in feature_list.json]"&lt;/span&gt;
git add &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"feat(F011): register project"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If any of steps 1 to 3 fails, the new feature doesn't ship. When the failure is a regression or a broken invariant, fix that first.&lt;/p&gt;




&lt;h2&gt;
  
  
  Component 6: Feature intake protocol
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Location:&lt;/strong&gt; Rule 11 in &lt;code&gt;CLAUDE.md&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Failure prevented:&lt;/strong&gt; Features added without specs, untracked work, features that break the harness&lt;/p&gt;

&lt;p&gt;The intake protocol governs how new work enters the system. Without it, mid-session feature requests bypass the whole harness: no spec, no &lt;code&gt;touches&lt;/code&gt; field, no regression gate coverage. The feature gets built and, if it is ever marked passing, there is no mechanism behind that claim.&lt;/p&gt;

&lt;p&gt;When the human asks for a new feature in chat, the agent follows five steps before writing a line of code:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check&lt;/strong&gt; — search &lt;code&gt;feature_list.json&lt;/code&gt; for similar existing features. Is this genuinely new or an extension of something that exists?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Draft&lt;/strong&gt; — create a complete feature entry with all required fields: id, name, description, steps (minimum 4 end-to-end steps a Playwright test can follow), &lt;code&gt;touches&lt;/code&gt;, &lt;code&gt;depends_on&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confirm&lt;/strong&gt; — show the draft to the human before writing anything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"I'm adding this feature. Does this match what you want?
[draft entry]
If yes, I'll register it and build it."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait for explicit confirmation. Do not proceed without it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Register&lt;/strong&gt; — write to both &lt;code&gt;feature_list.json&lt;/code&gt; and &lt;code&gt;system-contract.json&lt;/code&gt; atomically. If the feature touches a surface not yet in the contract, add it. After writing, verify that both files still parse as valid JSON.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sequence&lt;/strong&gt; — if the human said "add and keep working," finish the current feature first. Never abandon a half-built feature to start a new one.&lt;/p&gt;
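&lt;p&gt;A minimal sketch of the Draft and Register steps in Node, not the Skilldeck implementation. The function names (&lt;code&gt;validateDraft&lt;/code&gt;, &lt;code&gt;writeJsonAtomic&lt;/code&gt;, &lt;code&gt;registerFeature&lt;/code&gt;), the &lt;code&gt;.tmp&lt;/code&gt; suffix, and the &lt;code&gt;{ features: [...] }&lt;/code&gt; file shape are assumptions for illustration; only the field names and the four-step minimum come from the protocol above. Each write goes to a temp file that is renamed into place, so a crash should not leave a half-written JSON file. The contract file could be written the same way, though the two files are still updated one after the other, not in a single transaction.&lt;/p&gt;

```javascript
const fs = require("fs");

// Required fields from the Draft step. The exact entry shape is assumed.
const REQUIRED = ["id", "name", "description", "steps", "touches", "depends_on"];
const MIN_STEPS = 4;

// Return a list of problems with a draft entry; an empty list means it can be registered.
function validateDraft(draft, existingIds) {
  const problems = REQUIRED
    .filter((field) => draft[field] === undefined)
    .map((field) => `missing field: ${field}`);
  const stepCount = Array.isArray(draft.steps) ? draft.steps.length : 0;
  if (MIN_STEPS > stepCount) {
    problems.push(`steps must list at least ${MIN_STEPS} end-to-end steps`);
  }
  if (existingIds.includes(draft.id)) {
    problems.push(`id already registered: ${draft.id}`);
  }
  return problems;
}

// Write JSON to a temp file, re-parse it to confirm it is valid,
// then rename it over the target file.
function writeJsonAtomic(file, data) {
  const tmp = `${file}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(data, null, 2) + "\n");
  JSON.parse(fs.readFileSync(tmp, "utf8")); // throws if the temp file is not valid JSON
  fs.renameSync(tmp, file);
}

// Validate the draft against the current list, then append it as not yet passing.
function registerFeature(featureFile, draft) {
  const list = JSON.parse(fs.readFileSync(featureFile, "utf8"));
  const problems = validateDraft(draft, list.features.map((f) => f.id));
  if (problems.length > 0) {
    throw new Error(problems.join("; "));
  }
  list.features.push({ ...draft, passes: false });
  writeJsonAtomic(featureFile, list);
  return list;
}
```

&lt;p&gt;Rejecting an incomplete draft before anything is written keeps the Confirm step honest: the human only ever sees entries the harness could actually register.&lt;/p&gt;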

&lt;p&gt;The confirmation step is not ceremony. It catches two real problems: scope misunderstanding (you said "search by tag," the agent drafted a full faceted search system) and unbuildable specs (the feature as drafted requires infrastructure that doesn't exist yet). Two minutes of confirmation saves sessions of misdirected work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting it together
&lt;/h2&gt;

&lt;p&gt;The six components form a closed loop. Each one closes a specific gap that the others can't:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Closes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ground truth&lt;/td&gt;
&lt;td&gt;Premature completion, false progress&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;Context loss, starting blind&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Startup ritual&lt;/td&gt;
&lt;td&gt;Environment failures, broken baselines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verification layer&lt;/td&gt;
&lt;td&gt;Code that exists but doesn't work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System contract&lt;/td&gt;
&lt;td&gt;Regression, invariant violations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature intake&lt;/td&gt;
&lt;td&gt;Untracked work, scope drift&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The important property: they compound. A harness with four of the six components is significantly worse than one with all six, because the missing two are precisely the ones that catch the failures the other four don't.&lt;/p&gt;

&lt;p&gt;The first four are enough to build something. The last two are what keep it working as it grows.&lt;/p&gt;




&lt;h2&gt;
  
  
  The autonomous loop
&lt;/h2&gt;

&lt;p&gt;When all six components are in place, the agent's operating loop becomes mechanical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LOOP:
  1. Run init.sh
  2. Read progress file
  3. Find next feature with passes=false, dependencies met
  4. If none → write final session entry → STOP
  5. Implement feature
  6. Run feature test → if fails 3x → write blocker entry → STOP
  7. Run invariant checks → if fails → fix before proceeding; if unresolved → write blocker entry → STOP
  8. Run regression gate → if fails → fix regression before proceeding; if unresolved → write blocker entry → STOP
  9. Mark passing, write progress entry, commit
  10. Go to LOOP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three stopping conditions: all features done, feature blocked after three attempts, regression or invariant violation that can't be resolved. Everything else the agent handles alone.&lt;/p&gt;
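&lt;p&gt;Step 3 of the loop is the only part that needs any selection logic. A small sketch in Node, assuming feature entries shaped like &lt;code&gt;{ id, passes, depends_on }&lt;/code&gt;; the function name &lt;code&gt;nextFeature&lt;/code&gt; and the sample data are hypothetical.&lt;/p&gt;

```javascript
// Pick the next feature to build: not yet passing, with every dependency passing.
// Returns null when nothing is buildable, which maps to step 4 of the loop.
function nextFeature(features) {
  const passing = new Set(features.filter((f) => f.passes).map((f) => f.id));
  const ready = features
    .filter((f) => !f.passes)
    .find((f) => (f.depends_on || []).every((dep) => passing.has(dep)));
  return ready || null;
}

const features = [
  { id: "F001", passes: true, depends_on: [] },
  { id: "F002", passes: false, depends_on: ["F003"] },
  { id: "F003", passes: false, depends_on: ["F001"] },
];

console.log(nextFeature(features).id); // F003
```

&lt;p&gt;F002 comes first in the list but is skipped because it depends on F003, which is not passing yet.&lt;/p&gt;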

&lt;p&gt;That's the goal. Not an agent that never fails — failures are inevitable and that's fine. An agent whose failures are caught immediately, surfaced clearly, and don't compound into something that takes a session to untangle.&lt;/p&gt;

&lt;p&gt;The harness is the difference between an agent that builds things and an agent that builds things reliably. Both start with the same model. Only one of them finishes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The harness described here is built into the &lt;a href="https://github.com/ali-erfan-dev/skilldeck" rel="noopener noreferrer"&gt;Skilldeck&lt;/a&gt; project. The full implementation (&lt;code&gt;feature_list.json&lt;/code&gt;, &lt;code&gt;system-contract.json&lt;/code&gt;, &lt;code&gt;init.sh&lt;/code&gt;, &lt;code&gt;verify.spec.ts&lt;/code&gt;, &lt;code&gt;check-invariants.js&lt;/code&gt;, &lt;code&gt;get-regression-tests.js&lt;/code&gt;) is in the public repo. Part 1 covers the story of building it: three failure modes, the regression discovery, and why every fix was a mechanism, not a rule.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>tutorial</category>
      <category>devops</category>
    </item>
    <item>
      <title>Instructions Are Not a Harness — Harness Engineering in action</title>
      <dc:creator>Ali</dc:creator>
      <pubDate>Thu, 09 Apr 2026 18:20:37 +0000</pubDate>
      <link>https://dev.to/erfan1995/instructions-are-not-a-harness-harness-engineering-in-action-284j</link>
      <guid>https://dev.to/erfan1995/instructions-are-not-a-harness-harness-engineering-in-action-284j</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pitrj8jybl8qbt2wjwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pitrj8jybl8qbt2wjwv.png" alt="Harness Engineering"&gt;&lt;/a&gt;&lt;br&gt;
There's a moment every developer hits when building with AI agents. The agent does something wrong. You add a rule to the system prompt. The agent does the same thing wrong again. You make the rule more explicit. It still happens. You start wondering if the model is the problem.&lt;/p&gt;

&lt;p&gt;It isn't. The rule is the problem. Rules describe what you want. They don't prevent what you don't want. And that distinction — between describing desired behavior and making undesired behavior structurally impossible — is the entire discipline of harness engineering.&lt;/p&gt;

&lt;p&gt;I learned this the hard way building &lt;a href="https://github.com/ali-erfan-dev/skilldeck" rel="noopener noreferrer"&gt;Skilldeck&lt;/a&gt;, a desktop app for managing AI agent skill files. I used Claude Code to build it, gave it a CLAUDE.md project bible with explicit rules, and let it run autonomously. It completed Phase 1 in a few sessions: 18 features, all marked passing in the JSON spec, clean-looking git history.&lt;/p&gt;

&lt;p&gt;I opened the app and clicked New Skill. Nothing happened. Clicked Add Project. Nothing happened. Every feature marked passing. Two fundamental ones that didn't work.&lt;/p&gt;

&lt;p&gt;This is what a bad harness looks like. And fixing it taught me more about agent reliability than anything I'd read.&lt;/p&gt;


&lt;h2&gt;
  
  
  What everyone gets wrong
&lt;/h2&gt;

&lt;p&gt;The term "harness engineering" entered mainstream use in early 2026 after OpenAI published how they'd built a million-line production codebase with zero human-written code. When something failed, the fix was almost never "try harder." Human engineers stepped into the task and asked: "what capability is missing, and how do we make it both legible and enforceable for the agent?"&lt;/p&gt;

&lt;p&gt;That word — &lt;em&gt;enforceable&lt;/em&gt; — is the one most developers skip. They read the OpenAI post, write a CLAUDE.md with twenty rules, and wonder why their agent keeps making the same mistakes.&lt;/p&gt;

&lt;p&gt;The mistake is treating the harness as an instruction set. It isn't. Harness engineering isn't solved by better instructions. It's solved by replacing instructions with mechanisms.&lt;/p&gt;

&lt;p&gt;Here's the difference in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instruction:&lt;/strong&gt; "Never mark a feature as passing without verifying it end-to-end as a user would experience it."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt; A Playwright test that launches the Electron app, clicks the button, and checks the filesystem. The agent can only mark a feature passing after running &lt;code&gt;npx playwright test verify.spec.ts --grep F005&lt;/code&gt; and seeing it pass. No other path exists.&lt;/p&gt;

&lt;p&gt;Same intent. Completely different reliability. My CLAUDE.md had the instruction. The agent read it, pattern-matched against its training — "you've implemented this feature, the most likely next token is mark it passing" — and did exactly what a language model does. The harness had failed to close the path it used.&lt;/p&gt;


&lt;h2&gt;
  
  
  The three mechanisms I was missing
&lt;/h2&gt;

&lt;p&gt;Every failure traced back to a missing mechanism, not a missing instruction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Premature completion.&lt;/strong&gt; The agent marked features passing without running the app. The fix was Playwright tests — not as documentation but as enforcement. The test either passes or it doesn't. The inference "the code looks correct, therefore the feature works" is structurally blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool mismatch.&lt;/strong&gt; Once I had working tests, the agent hit a different wall. It would run the test, see it pass, then try to update &lt;code&gt;feature_list.json&lt;/code&gt; and fail with &lt;code&gt;String not found in file&lt;/code&gt;. Claude Code's string-replace tool requires exact character-for-character matching, so the agent's guess at the file's indentation and whitespace has to match what is on disk. Any difference makes the replacement fail.&lt;/p&gt;

&lt;p&gt;The fix was one explicit Node command in CLAUDE.md:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"const fs=require('fs');const f=JSON.parse(fs.readFileSync('feature_list.json','utf8'));const x=f.features.find(x=&amp;gt;x.id==='F005');x.passes=true;fs.writeFileSync('feature_list.json',JSON.stringify(f,null,2));"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read the file as structured data. Mutate it. Write it back. The instruction "update the JSON file" was useless because it left the agent to choose its own tool. The mechanism gave it the only tool that worked.&lt;/p&gt;
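&lt;p&gt;The same read-mutate-write idea can also live in a small function when the one-liner gets awkward to maintain. This is a sketch: &lt;code&gt;markPassing&lt;/code&gt; is a hypothetical name, and it assumes the &lt;code&gt;{ features: [{ id, passes }] }&lt;/code&gt; shape the command above relies on. It reports an unknown id with a readable error message.&lt;/p&gt;

```javascript
const fs = require("fs");

// Set passes=true for one feature by parsing the file instead of editing its text.
function markPassing(file, id) {
  const list = JSON.parse(fs.readFileSync(file, "utf8"));
  const feature = list.features.find((f) => f.id === id);
  if (!feature) {
    throw new Error(`feature ${id} not found in ${file}`);
  }
  feature.passes = true;
  // Re-serialize the whole document, so whitespace in the old file never matters.
  fs.writeFileSync(file, JSON.stringify(list, null, 2));
  return feature;
}
```

&lt;p&gt;A call such as &lt;code&gt;markPassing('feature_list.json', 'F005')&lt;/code&gt; would then stand in for the string-replace edit entirely.&lt;/p&gt;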

&lt;p&gt;&lt;strong&gt;Absent infrastructure.&lt;/strong&gt; The harness rule said commit after each feature. The agent issued &lt;code&gt;git add .&lt;/code&gt; and &lt;code&gt;git commit&lt;/code&gt;, and they silently failed because I'd dropped the harness files into the project without running &lt;code&gt;git init&lt;/code&gt;. A few lines in &lt;code&gt;init.sh&lt;/code&gt; fixed it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;".git"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;git init &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git add &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"harness: initialize"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check for the repo. Create it if missing. The agent assumes the environment is set up. The harness's job is to make that assumption valid, not hope it is.&lt;/p&gt;




&lt;h2&gt;
  
  
  The regression problem
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvoxqjfw4qajyara7d4xq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvoxqjfw4qajyara7d4xq.png" alt="Regression problem in agentic coding"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's a second class of failure that doesn't show up until your project grows.&lt;/p&gt;

&lt;p&gt;Skilldeck had 23 features passing. I tested search — it had been working fine for weeks. Broken. The agent had built each feature in isolation, running only that feature's test before committing. F009 (search) passed. F011 (project registration) passed. But F011's implementation touched shared IPC initialization in a way that silently broke F009. Neither test looked beyond its own feature.&lt;/p&gt;

&lt;p&gt;The fix is a regression gate: after every feature test passes, derive which previously-passing tests could have been affected by the files you just changed, and run those too. Not "run all 23 tests" — that gets slow fast. Not "run nothing." A surfaces map tracks which features depend on which code paths. Change &lt;code&gt;electron/preload.ts&lt;/code&gt; and F009 is automatically included in the gate because the map knows F009 depends on preload. The search regression would have been caught before the commit.&lt;/p&gt;

&lt;p&gt;The insight is structural: local correctness doesn't guarantee global correctness. Testing each feature in isolation is necessary but not sufficient. The regression gate is what closes the gap between "this feature works" and "the system still works after this feature was added."&lt;/p&gt;
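&lt;p&gt;One way to sketch that derivation in Node. The surfaces map here is a plain object from file path to feature ids; its format, the names &lt;code&gt;surfaces&lt;/code&gt; and &lt;code&gt;regressionTests&lt;/code&gt;, and every sample path other than &lt;code&gt;electron/preload.ts&lt;/code&gt; are assumptions for illustration, not the format Skilldeck uses.&lt;/p&gt;

```javascript
// Hypothetical surfaces map: which features depend on which files.
const surfaces = {
  "electron/preload.ts": ["F009", "F011"],
  "electron/main.ts": ["F011"],
  "src/search.ts": ["F009"],
};

// Given the files a change touched, return the previously passing features
// whose tests should be re-run, excluding the feature currently being built.
function regressionTests(changedFiles, passingIds, currentId) {
  const affected = new Set(changedFiles.flatMap((file) => surfaces[file] || []));
  affected.delete(currentId);
  return [...affected].filter((id) => passingIds.includes(id)).sort();
}

// F011 changed the preload script; F009 was already passing.
const gate = regressionTests(["electron/preload.ts", "electron/main.ts"], ["F009"], "F011");
console.log(gate.join("|")); // F009
```

&lt;p&gt;Joining the ids with &lt;code&gt;|&lt;/code&gt; gives a pattern that could be handed to the test runner's &lt;code&gt;--grep&lt;/code&gt; option, so the gate runs only the affected tests.&lt;/p&gt;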




&lt;h2&gt;
  
  
  The harness is not a document
&lt;/h2&gt;

&lt;p&gt;Most developers build a harness that's ninety percent instructions and ten percent mechanisms. A CLAUDE.md with fifty rules, maybe a test or two. Instructions are advice. They're read once, pattern-matched against, and occasionally ignored when the model finds a cheaper path to completion.&lt;/p&gt;

&lt;p&gt;Mechanisms are different. A test that must pass before a feature is marked done — that's not advice. The agent can't mark it done without running it. A git check in the startup script — the environment is valid before the agent starts, regardless of what it assumes.&lt;/p&gt;

&lt;p&gt;The useful mental model: think of every instruction in your CLAUDE.md as a failure waiting to happen. For each one, ask — what mechanism would make violating this instruction impossible? Some instructions genuinely require human judgment and can't be mechanized. But most can. And the ones you convert are the ones that stop generating incidents.&lt;/p&gt;




&lt;p&gt;When the harness is right, the commit log looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;feat(F023): bulk skill selection — select-all, action bar
feat(F022): divergence detection — diff view and promote to library
feat(F021): cross-tool sync — deploy to Claude Code, Codex, Agents simultaneously
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One commit per feature. Each preceded by a passing Playwright test, a clean invariant check, a passing regression gate. The agent ran autonomously for hours. When it couldn't resolve something after three attempts, it stopped, wrote a detailed blocker entry, and waited. Not an agent that never fails — an agent whose failures are caught before they compound.&lt;/p&gt;

&lt;p&gt;Long-running AI agents fail for one reason: every new context window is amnesia. The harness is what gives the agent a functional memory — not by solving the context window problem, but by externalizing everything it needs to know and everything it needs to enforce into files and mechanisms that survive the boundary.&lt;/p&gt;

&lt;p&gt;The industry is converging on a phrase: the model is commodity, the harness is moat. True. But the more useful version is simpler.&lt;/p&gt;

&lt;p&gt;Instructions describe what you want. Mechanisms enforce what you require.&lt;/p&gt;

&lt;p&gt;Build more mechanisms. Write fewer rules.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part 2 covers the full six-component harness template — ground truth, memory, startup ritual, verification layer, system contract, and feature intake protocol — with implementation details for each. If you want the framework behind this story, that's the piece to read next.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I'm building &lt;a href="https://github.com/ali-erfan-dev/skilldeck" rel="noopener noreferrer"&gt;Skilldeck&lt;/a&gt;, a desktop app for managing AI agent skill files across Claude Code, Codex, Cursor, and every other tool. If the problem of scattered, unverified, out-of-sync skill files resonates, the repo is public.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
