DEV Community: Ali

The Six-Component Harness: A Template for Building Reliably with AI Agents

Ali — Thu, 09 Apr 2026 18:32:25 +0000

This is Part 2. Part 1 covers what I learned building Skilldeck with Claude Code — three failure modes, the regression problem, and why instructions aren't enough. This piece is the framework. You can read it standalone, but the story gives it context.

Most harness engineering discussions stop at four components: a system prompt, a task list, a progress file, and some tests. That's enough to get an agent building things. It isn't enough to keep a real project coherent across weeks of autonomous sessions, evolving requirements, and features that share code with other features.

The harness I ended up with has six components. Each one addresses a specific failure class. The first four are table stakes — the field has largely converged on them. The last two are what make the difference between a harness that works in theory and one that survives contact with a real, evolving codebase.

Here's the full template.

Component 1: Ground truth

File: feature_list.json

Failure prevented: Premature completion, false positives, context loss about what's actually built

The ground truth file is the canonical record of everything the project needs to build and whether it actually works. Not documentation — ground truth. The distinction matters. Documentation describes intent. Ground truth reflects verified reality.

Every feature entry has a passes field that is either true or false. It's false until a mechanism verifies it — not until the agent believes it's done, not until the code exists, not until a unit test passes. true means: the feature works as a user would experience it, verified by an automated test that actually runs the application.

{
  "id": "F005",
  "name": "Create new skill",
  "description": "User can create a skill from the Library view. File created on disk.",
  "steps": ["Click New Skill", "Verify skill appears in list", "Verify .md file exists on disk"],
  "touches": ["store.skills", "ipc.skills", "preload"],
  "depends_on": ["F004"],
  "passes": false,
  "notes": ""
}

Two fields most harness discussions miss: touches and depends_on.

touches lists which shared code surfaces this feature depends on — the Zustand store, specific IPC handlers, the preload bridge. This powers the regression gate (Component 5). Without it, you can't know which other tests to re-run when a file changes.

depends_on lists which features must pass before this one can be built. The agent checks this at the start of every loop — it skips features whose dependencies aren't met and finds the next buildable one. This prevents an agent from trying to build a deployment feature before the project registration feature exists.
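The dependency check described above can be sketched in a few lines of Node. This is an illustration under stated assumptions, not the project's code: the function name nextBuildable and the sample data are hypothetical, and it assumes each entry has the id, passes, and depends_on fields shown in the example entry.

```javascript
// Hypothetical helper: return the first feature that is not yet passing
// and whose depends_on entries are all passing, or null if none qualifies.
function nextBuildable(features) {
  const passing = new Set(features.filter(f => f.passes).map(f => f.id));
  const candidate = features.find(
    f => !f.passes && (f.depends_on || []).every(id => passing.has(id))
  );
  return candidate || null;
}

// Illustrative data only.
const sampleFeatures = [
  { id: 'F004', passes: true, depends_on: [] },
  { id: 'F005', passes: false, depends_on: ['F007'] }, // blocked: F007 is not passing
  { id: 'F006', passes: false, depends_on: ['F004'] }, // buildable
  { id: 'F007', passes: false, depends_on: ['F006'] },
];

console.log(nextBuildable(sampleFeatures).id); // F006
```

F005 comes first in the list but is skipped because its dependency has not passed yet, which is the behavior the paragraph above describes.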

Critical rule: Only update the passes and notes fields. Never modify descriptions. Use a Node command, never string replace: exact string matching breaks on any whitespace difference in the file, so text-based edits to JSON fail unpredictably.

node -e "const fs=require('fs');const f=JSON.parse(fs.readFileSync('feature_list.json','utf8'));const x=f.features.find(x=>x.id==='F005');x.passes=true;x.notes='Verified';fs.writeFileSync('feature_list.json',JSON.stringify(f,null,2));"
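The "only passes and notes" rule can itself be turned into a mechanism. The sketch below is one possible guard, not something the article's repo is known to contain: the function name illegalChanges is hypothetical, and it assumes both versions of the file have the { features: [...] } shape used by the command above.

```javascript
// Hypothetical guard: compare two versions of the feature list and report
// every change that falls outside the "passes" and "notes" fields.
const MUTABLE_FIELDS = new Set(['passes', 'notes']);

function illegalChanges(before, after) {
  const problems = [];
  const afterIds = new Set(after.features.map(f => f.id));
  const beforeById = new Map(before.features.map(f => [f.id, f]));

  // Entries must not disappear.
  for (const id of beforeById.keys()) {
    if (!afterIds.has(id)) problems.push(id + ': entry was removed');
  }
  // Existing entries may only differ in the mutable fields.
  for (const next of after.features) {
    const prev = beforeById.get(next.id);
    if (!prev) continue; // new entries are left to the intake protocol
    const keys = new Set([...Object.keys(prev), ...Object.keys(next)]);
    for (const key of keys) {
      if (MUTABLE_FIELDS.has(key)) continue;
      if (JSON.stringify(prev[key]) !== JSON.stringify(next[key])) {
        problems.push(next.id + ': "' + key + '" was modified');
      }
    }
  }
  return problems;
}

// Illustrative data only.
const beforeList = { features: [{ id: 'F005', description: 'Create skill', passes: false, notes: '' }] };
const okEdit = { features: [{ id: 'F005', description: 'Create skill', passes: true, notes: 'Verified' }] };
const badEdit = { features: [{ id: 'F005', description: 'Create skill (simplified)', passes: true, notes: '' }] };

console.log(illegalChanges(beforeList, okEdit).length);  // 0
console.log(illegalChanges(beforeList, badEdit).length); // 1
```

A commit script could load the committed version and the working copy, call this function, and refuse to commit when the returned list is not empty.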

Component 2: Memory

File: claude-progress.txt

Failure prevented: Context loss across sessions, "declare victory" failure, starting blind

An agent starting a new session has zero memory of prior sessions. Everything it knows about the project comes from what it reads at the start of that session. If the memory file is absent, inaccurate, or stale, the agent reconstructs project state from the code — and gets it wrong often enough to matter.

The memory file has two jobs. First: orient the agent at session start. What was the last thing built? Is anything broken? What should happen next? Second: record what happened at session end. Future sessions depend on this being accurate.

The session template:

### Session N — [brief title] ([date])
What happened: [what was built or fixed]
Features completed: [F00X, F00Y]
Features attempted but not completed: [F00Z — reason]
Current app state: [does it compile? do tests pass?]
Next session should: [specific next steps, not vague directions]
Blockers: [anything needing human attention]

The agent writes this at the end of every session before committing. If it doesn't — if a session ends without an entry — the next session starts with stale information. Make the write mandatory: the commit doesn't happen until the progress entry is written.
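One way to make the write mandatory is a check that runs before the commit and fails when the latest entry is incomplete. The sketch below assumes entries follow the session template above; the function name missingProgressFields is hypothetical and the check only looks at the most recent entry.

```javascript
// Field labels taken from the session template.
const REQUIRED_FIELDS = [
  'What happened:',
  'Features completed:',
  'Features attempted but not completed:',
  'Current app state:',
  'Next session should:',
  'Blockers:',
];

// Hypothetical check: return the template fields that are absent or empty
// in the latest "### Session N" entry of the progress file text.
function missingProgressFields(progressText) {
  const entries = progressText.split(/^### Session \d+/m).slice(1);
  if (entries.length === 0) return REQUIRED_FIELDS.slice();
  const lines = entries[entries.length - 1].split('\n');
  return REQUIRED_FIELDS.filter(field =>
    !lines.some(line => line.startsWith(field) && line.slice(field.length).trim() !== '')
  );
}

// Illustrative entry only.
const sampleProgress = [
  '### Session 3 - create skill (2026-04-09)',
  'What happened: built F005',
  'Features completed: F005',
  'Features attempted but not completed: none',
  'Current app state: compiles, all tests pass',
  'Next session should: build F006',
  'Blockers:',
].join('\n');

console.log(missingProgressFields(sampleProgress)); // the empty Blockers field is reported
```

In a commit script, a non-empty result would translate to a non-zero exit code so the commit step is never reached.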

One pattern worth enforcing: if the agent marks a session blocked (hit the three-attempt limit on a feature), the blocker entry must include the exact error output, all three approaches tried, and the root cause hypothesis. Vague blocker entries waste your debugging time.

Component 3: Startup ritual

File: init.sh

Failure prevented: Environment assumptions, silent tool failures, starting on a broken baseline

The startup ritual runs at the start of every session. Its job is to make the environment assumptions valid rather than hoped for. Every environment problem caught here is a problem that can't compound into something worse later.

What a good startup ritual checks:

# 1. Correct directory
if [ ! -f "CLAUDE.md" ]; then echo "ERROR: wrong directory"; exit 1; fi

# 2. Required tools
node --version || { echo "Node not found"; exit 1; }

# 3. Git initialized
if [ ! -d ".git" ]; then
  git init && git add . && git commit -m "harness: initialize"
fi

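# 4. Uncommitted changes from previous session
# git status --porcelain also reports staged and untracked files,
# which git diff --quiet alone would miss
if [ -n "$(git status --porcelain)" ]; then
  echo "WARNING: uncommitted changes from previous session"
  echo "Commit them (if feature is complete) or revert: git checkout . (git clean -fd also removes untracked files)"
fi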

# 5. Feature status
node -e "const f=require('./feature_list.json');const p=f.features.filter(x=>x.passes).length;console.log('Passing: '+p+'/'+f.features.length);"

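# 6. Next feature (first non-passing feature whose depends_on entries all pass)
node -e "const f=require('./feature_list.json');const ok=new Set(f.features.filter(x=>x.passes).map(x=>x.id));const n=f.features.find(x=>!x.passes&&(x.depends_on||[]).every(d=>ok.has(d)));if(n)console.log('Next: '+n.id+' - '+n.name);"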

The uncommitted changes check is the one most harnesses omit. If a previous session built half a feature and stopped without committing or reverting, the next session inherits broken code as its starting state. Surfacing this immediately prevents compounding. The agent sees the warning and makes a deliberate choice: commit what's there if it's working, or revert and start clean.
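If the warning should also say what kind of leftover work exists, the output of git status --porcelain can be classified in Node. This is a sketch under assumptions: the function name classifyPorcelain and the sample paths are hypothetical, and it handles only the default (v1) porcelain format, where each line is two status characters, a space, and a path.

```javascript
// Hypothetical helper: split `git status --porcelain` output into staged,
// unstaged, and untracked paths. Do not trim the output before passing it
// in: a leading space is a meaningful status character.
function classifyPorcelain(output) {
  const summary = { staged: [], unstaged: [], untracked: [] };
  for (const line of output.split('\n')) {
    if (line.length < 4) continue; // skip blank lines
    const index = line[0];    // status in the staging area
    const worktree = line[1]; // status in the working tree
    const file = line.slice(3);
    if (index === '?' && worktree === '?') {
      summary.untracked.push(file);
      continue;
    }
    if (index !== ' ') summary.staged.push(file);
    if (worktree !== ' ') summary.unstaged.push(file);
  }
  return summary;
}

// Illustrative output only.
const sampleStatus = [
  ' M src/store/skillsStore.ts', // modified, not staged
  'A  electron/ipc/projects.ts', // new file, staged
  '?? src/components/Draft.tsx', // untracked
].join('\n');

console.log(classifyPorcelain(sampleStatus));
```

In a real check the input would come from running git, for example with child_process.execSync('git status --porcelain', { encoding: 'utf8' }).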

The startup ritual also makes the progress file and next feature visible without the agent having to ask — it sees both at session start without burning context budget figuring them out.

Component 4: Verification layer

Files: verify.spec.ts, playwright.config.ts

Failure prevented: Premature completion, code that exists but doesn't work

The verification layer is Playwright tests that actually run the application. Not unit tests. Not type checks. End-to-end tests that launch the app, interact with the UI, and verify the filesystem state.

Every feature test has two assertions: UI state and filesystem state. A feature that updates the UI but doesn't write to disk is broken. A feature that writes to disk but doesn't reflect in the UI is broken. Both must be true.

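import { test, expect } from '@playwright/test'
import * as fs from 'fs'
// cleanSkilldeck, launchApp, and LIBRARY_DIR are project helpers, assumed to be defined elsewhere in the spec

test('F005 - Create new skill', async () => {
  cleanSkilldeck()
  const { app, window } = await launchApp()
  try {
    await window.click('[data-testid="new-skill-btn"]')

    // UI assertion
    await window.waitForSelector('[data-testid="skill-item"]')
    expect(await window.locator('[data-testid="skill-item"]').count()).toBeGreaterThan(0)

    // Filesystem assertion: the real test
    const files = fs.readdirSync(LIBRARY_DIR).filter(f => f.endsWith('.md'))
    expect(files.length).toBeGreaterThan(0)
  } finally {
    // Close the app even when an assertion fails
    await app.close()
  }
})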

data-testid attributes are not optional. Every interactive element the tests need to reach must have one, added at the same time the component is built. Retrofitting them after the fact is fragile and burns sessions. The rule: no data-testid, no way to verify, feature cannot be marked passing.

Two things the verification layer enforces that instructions can't:

The agent can't mark F005 passing by looking at the code. It runs the test. The test clicks the button. The button either creates the file or it doesn't.
Phase 2+ features follow TDD: write the failing test first, commit it, then implement, then verify it passes. The commit history proves the red-green cycle happened.

Component 5: System contract

File: system-contract.json

Failure prevented: Regression — features that worked before a new feature was added, now don't

This is the component most harnesses are missing and the one I was missing when search broke three weeks into the Skilldeck build.

The system contract has two sections.

Invariants — always-true properties of the system, checked before every commit:

{
  "invariants": [
    {
      "id": "INV-001",
      "description": "config.json is valid JSON with required shape",
      "check": "node checks/inv-config-valid.js",
      "triggers": "always"
    },
    {
      "id": "INV-004", 
      "description": "No IPC channel registered more than once",
      "check": "node checks/inv-no-duplicate-ipc.js",
      "triggers": "always"
    }
  ]
}

Invariants are cheap deterministic checks that catch structural failures. If config.json becomes malformed or an IPC channel gets registered twice, the invariant fires at commit time — not three sessions later when something mysteriously stops working.
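To make "cheap deterministic check" concrete, here is a sketch of what a duplicate-channel check could look like. The contents of checks/inv-no-duplicate-ipc.js are not shown in this piece, so everything here is an assumption: the function name, the sample file paths, and the premise that channels are registered with ipcMain.handle and a string-literal channel name.

```javascript
// Hypothetical invariant: find IPC channels that are registered more than
// once. Input is a map of file path to source text.
function findDuplicateChannels(sourceByFile) {
  const pattern = /ipcMain\.handle\(\s*['"]([^'"]+)['"]/g;
  const seen = new Map(); // channel name -> files that register it
  for (const [file, source] of Object.entries(sourceByFile)) {
    for (const match of source.matchAll(pattern)) {
      const files = seen.get(match[1]) || [];
      files.push(file);
      seen.set(match[1], files);
    }
  }
  return [...seen]
    .filter(([, files]) => files.length > 1)
    .map(([channel, files]) => ({ channel, files }));
}

// Illustrative sources only.
const sampleSources = {
  'electron/ipc/skills.ts':
    "ipcMain.handle('skills:list', listSkills)\nipcMain.handle('skills:create', createSkill)",
  'electron/ipc/projects.ts': "ipcMain.handle('skills:list', listAgain)",
};

const duplicates = findDuplicateChannels(sampleSources);
console.log(duplicates);
// A real check script would read the files from disk and then exit with
// process.exit(duplicates.length ? 1 : 0) so the commit sequence can react.
```

A regex scan like this misses channels built from variables, which is the usual trade-off for a check that runs in milliseconds.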

Surfaces — which shared code paths each feature depends on:

{
  "surfaces": {
    "store.skills": {
      "files": ["src/store/skillsStore.ts"],
      "affected_features": ["F004","F005","F006","F007","F008","F009","F010"]
    },
    "preload": {
      "files": ["electron/preload.ts"],
      "affected_features": ["F004","F005","F006","F007","F008","F009","F010","F011","F012","F013","F014"]
    }
  }
}

The surfaces map powers the regression gate. When the agent finishes F011 and is about to commit, get-regression-tests.js reads the git diff, finds which files changed, matches them against the surfaces map, and returns the grep pattern for all features that could have been affected:

REGRESSION=$(node get-regression-tests.js F011)
# Changed files include preload.ts → F004-F014 share preload → run those tests
npx playwright test verify.spec.ts --grep "$REGRESSION"

Not all 31 tests. Not zero. Exactly the tests that could have been broken by what just changed.

The search regression I hit would have been caught here. F011 touched preload.ts. F009 (search) depends on preload. The regression gate would have run F009's test after F011 was built, caught the failure, and blocked the commit.
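The selection logic can be sketched as a pure function. This is not the real get-regression-tests.js, which isn't reproduced here; the function name regressionPattern and the trimmed-down surfaces map are hypothetical, and the list of changed files is assumed to come from something like git diff --name-only.

```javascript
// Hypothetical core of a regression-test selector: map changed files to
// surfaces, collect the features on those surfaces, and build a pattern
// suitable for Playwright's --grep option.
function regressionPattern(changedFiles, surfaces, currentFeatureId) {
  const affected = new Set();
  for (const surface of Object.values(surfaces)) {
    if (surface.files.some(file => changedFiles.includes(file))) {
      surface.affected_features.forEach(id => affected.add(id));
    }
  }
  affected.delete(currentFeatureId); // the new feature's own test runs separately
  return [...affected].sort().join('|');
}

// Illustrative surfaces map, shortened.
const sampleSurfaces = {
  'store.skills': { files: ['src/store/skillsStore.ts'], affected_features: ['F004', 'F005', 'F009'] },
  preload: { files: ['electron/preload.ts'], affected_features: ['F004', 'F009', 'F011'] },
};

console.log(regressionPattern(['electron/preload.ts'], sampleSurfaces, 'F011')); // F004|F009
console.log(regressionPattern(['README.md'], sampleSurfaces, 'F011') === ''); // true
```

An empty result means no shared surface changed, so the caller would want to skip the Playwright run in that case rather than pass an empty pattern to --grep.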

The commit sequence — every feature, every time:

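# When run as a script: stop at the first failing step so later steps never run
set -e

# 1. Feature test
npx playwright test verify.spec.ts --grep F011

# 2. Invariant checks
node check-invariants.js --always

# 3. Regression gate
REGRESSION=$(node get-regression-tests.js F011)
npx playwright test verify.spec.ts --grep "$REGRESSION"

# 4. Only if all three pass
node -e "const fs=require('fs');const f=JSON.parse(fs.readFileSync('feature_list.json','utf8'));const x=f.features.find(x=>x.id==='F011');x.passes=true;x.notes='Verified';fs.writeFileSync('feature_list.json',JSON.stringify(f,null,2));"
git add . && git commit -m "feat(F011): register project"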

If Step 2 or Step 3 fails, the new feature doesn't ship. Fix the regression first.
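The same stop-at-first-failure behavior can be sketched in Node, which is handy if the gate should report which step failed. The runGate function and the step objects are hypothetical; the demo uses stand-in commands that simply exit with a chosen status instead of the real Playwright and invariant commands.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical gate runner: run each step in order and stop at the first
// non-zero exit status, so "mark passing" and "commit" are never reached.
function runGate(steps) {
  for (const step of steps) {
    const result = spawnSync(step.cmd, step.args, { stdio: 'inherit' });
    if (result.status !== 0) {
      return { ok: false, failedAt: step.name };
    }
  }
  return { ok: true, failedAt: null };
}

// Stand-in commands: each one is a Node process that exits with a fixed code.
const exitWith = code => ({ cmd: process.execPath, args: ['-e', 'process.exit(' + code + ')'] });

const gateResult = runGate([
  { name: 'feature test', ...exitWith(0) },
  { name: 'invariant checks', ...exitWith(1) },
  { name: 'regression gate', ...exitWith(0) },
]);

console.log(gateResult); // ok is false, failedAt is 'invariant checks'
```

Only when ok is true would the caller go on to update feature_list.json and commit.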

Component 6: Feature intake protocol

Location: Rule 11 in CLAUDE.md

Failure prevented: Features added without specs, untracked work, features that break the harness

The intake protocol governs how new work enters the system. Without it, mid-session feature requests bypass the whole harness: no spec, no touches fields, no regression gate coverage. The feature gets built and, if it ever gets marked passing, nothing mechanical stands behind that mark.

When the human asks for a new feature in chat, the agent follows five steps before writing a line of code:

Check — search feature_list.json for similar existing features. Is this genuinely new or an extension of something that exists?

Draft — create a complete feature entry with all required fields: id, name, description, steps (minimum 4 end-to-end steps a Playwright test can follow), touches, depends_on.

Confirm — show the draft to the human before writing anything:

"I'm adding this feature. Does this match what you want?
[draft entry]
If yes, I'll register it and build it."

Wait for explicit confirmation. Do not proceed without it.

Register — write to both feature_list.json and system-contract.json atomically. If the feature touches a surface not yet in the contract, add it. Validate both files are valid JSON after writing.

Sequence — if the human said "add and keep working," finish the current feature first. Never abandon a half-built feature to start a new one.
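The Draft and Register steps lend themselves to a small mechanism. The sketch below makes several assumptions: the function names, the F032 draft, and the contract contents are all hypothetical, and "atomically" is approximated by validating temp files first and renaming them into place. Each rename replaces a file in one step on POSIX filesystems, but the pair of renames is not a single transaction, so this narrows the window for inconsistency without eliminating it.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

const REQUIRED_KEYS = ['id', 'name', 'description', 'steps', 'touches', 'depends_on'];

// Hypothetical check for the Draft step: required fields and minimum steps.
function draftProblems(entry) {
  const problems = REQUIRED_KEYS.filter(k => !(k in entry)).map(k => 'missing ' + k);
  if (Array.isArray(entry.steps) && entry.steps.length < 4) {
    problems.push('steps needs at least 4 entries');
  }
  return problems;
}

// Hypothetical Register step: write both files to temp paths, confirm each
// parses as JSON, and only then rename them over the real files.
function registerFeature(dir, featureList, contract) {
  const staged = [
    ['feature_list.json', featureList],
    ['system-contract.json', contract],
  ].map(([name, data]) => {
    const tmp = path.join(dir, name + '.tmp');
    fs.writeFileSync(tmp, JSON.stringify(data, null, 2));
    JSON.parse(fs.readFileSync(tmp, 'utf8')); // throws if the file is not valid JSON
    return [tmp, path.join(dir, name)];
  });
  staged.forEach(([tmp, dest]) => fs.renameSync(tmp, dest));
}

// Demo in a throwaway directory with an illustrative draft.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'intake-'));
const draft = {
  id: 'F032', name: 'Search by tag', description: 'Filter skills by tag.',
  steps: ['Open Library', 'Type a tag', 'Verify list is filtered'],
  touches: ['store.skills'], depends_on: ['F009'],
};
console.log(draftProblems(draft)); // reports that steps is too short

draft.steps.push('Clear the tag and verify the full list returns');
if (draftProblems(draft).length === 0) {
  registerFeature(demoDir, { features: [draft] }, { invariants: [], surfaces: {} });
}
console.log(fs.readdirSync(demoDir).sort());
```

The human confirmation between Draft and Register stays a conversation; the code only guarantees that what gets registered is complete and parseable.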

The confirmation step is not ceremony. It catches two real problems: scope misunderstanding (you said "search by tag," the agent drafted a full faceted search system) and unbuildable specs (the feature as drafted requires infrastructure that doesn't exist yet). Two minutes of confirmation saves sessions of misdirected work.

Putting it together

The six components form a closed loop. Each one closes a specific gap that the others can't:

Component	Closes
Ground truth	Premature completion, false progress
Memory	Context loss, starting blind
Startup ritual	Environment failures, broken baselines
Verification layer	Code that exists but doesn't work
System contract	Regression, invariant violations
Feature intake	Untracked work, scope drift

The important property: they compound. A harness with four of the six components is significantly worse than one with all six, because the missing two are precisely the ones that catch the failures the other four don't.

The first four are enough to build something. The last two are what keep it working as it grows.

The autonomous loop

When all six components are in place, the agent's operating loop becomes mechanical:

LOOP:
  1. Run init.sh
  2. Read progress file
  3. Find next feature with passes=false, dependencies met
  4. If none → write final session entry → STOP
  5. Implement feature
  6. Run feature test → if fails 3x → write blocker entry → STOP
  7. Run invariant checks → if fails → fix before proceeding → if unresolvable → STOP
  8. Run regression gate → if fails → fix regression before proceeding → if unresolvable → STOP
  9. Mark passing, write progress entry, commit
  10. Go to LOOP

Three stopping conditions: all features done, feature blocked after three attempts, regression or invariant violation that can't be resolved. Everything else the agent handles alone.
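The control flow of the loop, including its three exits, can be sketched as a function with every operation injected. Nothing here is the project's real implementation: runLoop and the ops names are hypothetical, reading and writing the progress file is left out, and a failed invariant or regression check is treated as unresolved right away, where a fuller version would let the agent attempt a fix first.

```javascript
// Hypothetical sketch of the operating loop with its three stopping conditions.
function runLoop(ops, maxAttempts = 3) {
  for (;;) {
    ops.init();
    const feature = ops.nextFeature();
    if (!feature) return { stopped: 'all features done' };

    // Up to maxAttempts implement-and-test cycles for this feature.
    let passed = false;
    for (let attempt = 1; attempt <= maxAttempts && !passed; attempt++) {
      ops.implement(feature, attempt);
      passed = ops.featureTest(feature);
    }
    if (!passed) return { stopped: 'blocked', feature };

    if (!ops.invariants() || !ops.regressionGate(feature)) {
      return { stopped: 'unresolved regression or invariant violation', feature };
    }
    ops.markPassingAndCommit(feature);
  }
}

// Demo with stubs: F001 passes on the first try, F002 never passes.
const queue = ['F001', 'F002'];
const committed = [];
const outcome = runLoop({
  init: () => {},
  nextFeature: () => queue.find(id => !committed.includes(id)) || null,
  implement: () => {},
  featureTest: id => id === 'F001',
  invariants: () => true,
  regressionGate: () => true,
  markPassingAndCommit: id => committed.push(id),
});

console.log(outcome, committed);
```

In the demo the loop commits F001, exhausts three attempts on F002, and returns the blocked outcome instead of continuing.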

That's the goal. Not an agent that never fails — failures are inevitable and that's fine. An agent whose failures are caught immediately, surfaced clearly, and don't compound into something that takes a session to untangle.

The harness is the difference between an agent that builds things and an agent that builds things reliably. Both start with the same model. Only one of them finishes.

The harness described here is built into the Skilldeck project. The full implementation — feature_list.json, system-contract.json, init.sh, verify.spec.ts, check-invariants.js, get-regression-tests.js — is in the public repo. Part 1 covers the story of building it: three failure modes, the regression discovery, and why every fix was a mechanism not a rule.

Instructions Are Not a Harness — Harness Engineering in action

Ali — Thu, 09 Apr 2026 18:20:37 +0000

There's a moment every developer hits when building with AI agents. The agent does something wrong. You add a rule to the system prompt. The agent does the same thing wrong again. You make the rule more explicit. It still happens. You start wondering if the model is the problem.

It isn't. The rule is the problem. Rules describe what you want. They don't prevent what you don't want. And that distinction — between describing desired behavior and making undesired behavior structurally impossible — is the entire discipline of harness engineering.

I learned this the hard way building Skilldeck, a desktop app for managing AI agent skill files. I used Claude Code to build it, gave it a CLAUDE.md project bible with explicit rules, and let it run autonomously. It completed Phase 1 in a few sessions: 18 features, all marked passing in the JSON spec, clean-looking git history.

I opened the app and clicked New Skill. Nothing happened. Clicked Add Project. Nothing happened. Every feature was marked passing. Two fundamental ones didn't work.

This is what a bad harness looks like. And fixing it taught me more about agent reliability than anything I'd read.

What everyone gets wrong

The term "harness engineering" entered mainstream use in early 2026 after OpenAI published how they'd built a million-line production codebase with zero human-written code. When something failed, the fix was almost never "try harder." Human engineers stepped into the task and asked: "what capability is missing, and how do we make it both legible and enforceable for the agent?"

That word — enforceable — is the one most developers skip. They read the OpenAI post, write a CLAUDE.md with twenty rules, and wonder why their agent keeps making the same mistakes.

The mistake is treating the harness as an instruction set. It isn't. Harness engineering isn't solved by better instructions. It's solved by replacing instructions with mechanisms.

Here's the difference in practice.

Instruction: "Never mark a feature as passing without verifying it end-to-end as a user would experience it."

Mechanism: A Playwright test that launches the Electron app, clicks the button, and checks the filesystem. The agent can only mark a feature passing after running npx playwright test verify.spec.ts --grep F005 and seeing it pass. No other path exists.

Same intent. Completely different reliability. My CLAUDE.md had the instruction. The agent read it, pattern-matched against its training — "you've implemented this feature, the most likely next token is mark it passing" — and did exactly what a language model does. The harness had failed to close the path it used.

The three mechanisms I was missing

Every failure traced back to a missing mechanism, not a missing instruction.

Premature completion. The agent marked features passing without running the app. The fix was Playwright tests — not as documentation but as enforcement. The test either passes or it doesn't. The inference "the code looks correct, therefore the feature works" is structurally blocked.

Tool mismatch. Once I had working tests, the agent hit a different wall. It would run the test, see it pass, then try to update feature_list.json and fail with String not found in file. Claude Code's string-replace tool requires exact character-for-character matching, and the whitespace in a JSON file rarely matches what the agent types from memory. Any difference breaks the operation.

The fix was one explicit Node command in CLAUDE.md:

node -e "const fs=require('fs');const f=JSON.parse(fs.readFileSync('feature_list.json','utf8'));const x=f.features.find(x=>x.id==='F005');x.passes=true;fs.writeFileSync('feature_list.json',JSON.stringify(f,null,2));"

Read the file as structured data. Mutate it. Write it back. The instruction "update the JSON file" was useless because it left the agent to choose its own tool. The mechanism gave it the only tool that worked.
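A small demonstration of why the structured edit is more robust than a text match. This illustrates the general failure, not the exact behavior of Claude Code's tool; the markPassing function and the sample data are hypothetical.

```javascript
// Two serializations of the same data, differing only in whitespace.
const compact = '{"features":[{"id":"F005","passes":false}]}';
const pretty = JSON.stringify(JSON.parse(compact), null, 2);

// Text-based edit: depends on one exact spelling of the field.
const needle = '"passes": false';
console.log(compact.includes(needle)); // false
console.log(pretty.includes(needle));  // true

// Structured edit: independent of how the file happens to be formatted.
function markPassing(jsonText, id) {
  const data = JSON.parse(jsonText);
  const feature = data.features.find(f => f.id === id);
  if (!feature) throw new Error('Unknown feature: ' + id);
  feature.passes = true;
  return JSON.stringify(data, null, 2);
}

// Both inputs produce the same updated document.
console.log(markPassing(compact, 'F005') === markPassing(pretty, 'F005')); // true
```

The text match succeeds or fails depending on formatting the agent can't see reliably, while the parse, mutate, serialize path gives the same result for both inputs.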

Absent infrastructure. The harness rule said commit after each feature. The agent issued git add . and git commit, and they silently failed because I'd dropped the harness files into the project without running git init. Three lines in init.sh fixed it:

if [ ! -d ".git" ]; then
  git init && git add . && git commit -m "harness: initialize"
fi

Check for the repo. Create it if missing. The agent assumes the environment is set up. The harness's job is to make that assumption valid, not hope it is.

The regression problem

There's a second class of failure that doesn't show up until your project grows.

Skilldeck had 23 features passing. I tested search — it had been working fine for weeks. Broken. The agent had built each feature in isolation, running only that feature's test before committing. F009 (search) passed. F011 (project registration) passed. But F011's implementation touched shared IPC initialization in a way that silently broke F009. Neither test looked beyond its own feature.

The fix is a regression gate: after every feature test passes, derive which previously-passing tests could have been affected by the files you just changed, and run those too. Not "run all 23 tests" — that gets slow fast. Not "run nothing." A surfaces map tracks which features depend on which code paths. Change electron/preload.ts and F009 is automatically included in the gate because the map knows F009 depends on preload. The search regression would have been caught before the commit.

The insight is structural: local correctness doesn't guarantee global correctness. Testing each feature in isolation is necessary but not sufficient. The regression gate is what closes the gap between "this feature works" and "the system still works after this feature was added."

The harness is not a document

Most developers build a harness that's ninety percent instructions and ten percent mechanisms. A CLAUDE.md with fifty rules, maybe a test or two. Instructions are advice. They're read once, pattern-matched against, and occasionally ignored when the model finds a cheaper path to completion.

Mechanisms are different. A test that must pass before a feature is marked done — that's not advice. The agent can't mark it done without running it. A git check in the startup script — the environment is valid before the agent starts, regardless of what it assumes.

The useful mental model: think of every instruction in your CLAUDE.md as a failure waiting to happen. For each one, ask — what mechanism would make violating this instruction impossible? Some instructions genuinely require human judgment and can't be mechanized. But most can. And the ones you convert are the ones that stop generating incidents.

When the harness is right, the commit log looks like this:

feat(F023): bulk skill selection — select-all, action bar
feat(F022): divergence detection — diff view and promote to library
feat(F021): cross-tool sync — deploy to Claude Code, Codex, Agents simultaneously

One commit per feature. Each preceded by a passing Playwright test, a clean invariant check, a passing regression gate. The agent ran autonomously for hours. When it couldn't resolve something after three attempts, it stopped, wrote a detailed blocker entry, and waited. Not an agent that never fails — an agent whose failures are caught before they compound.

Long-running AI agents fail for one reason: every new context window is amnesia. The harness is what gives the agent a functional memory — not by solving the context window problem, but by externalizing everything it needs to know and everything it needs to enforce into files and mechanisms that survive the boundary.

The industry is converging on a phrase: the model is commodity, the harness is moat. True. But the more useful version is simpler.

Instructions describe what you want. Mechanisms enforce what you require.

Build more mechanisms. Write fewer rules.

Part 2 covers the full six-component harness template — ground truth, memory, startup ritual, verification layer, system contract, and feature intake protocol — with implementation details for each. If you want the framework behind this story, that's the piece to read next.

I'm building Skilldeck — a desktop app for managing AI agent skill files across Claude Code, Codex, Cursor, and every other tool. If the problem of scattered, unverified, out-of-sync skill files resonates, the repo is public.