DEV Community

Ali

Instructions Are Not a Harness — Harness Engineering in action

There's a moment every developer hits when building with AI agents. The agent does something wrong. You add a rule to the system prompt. The agent does the same thing wrong again. You make the rule more explicit. It still happens. You start wondering if the model is the problem.

It isn't. The rule is the problem. Rules describe what you want. They don't prevent what you don't want. And that distinction — between describing desired behavior and making undesired behavior structurally impossible — is the entire discipline of harness engineering.

I learned this the hard way building Skilldeck, a desktop app for managing AI agent skill files. I used Claude Code to build it, gave it a CLAUDE.md project bible with explicit rules, and let it run autonomously. It completed Phase 1 in a few sessions: 18 features, all marked passing in the JSON spec, clean-looking git history.

I opened the app and clicked New Skill. Nothing happened. Clicked Add Project. Nothing happened. Eighteen features marked passing. Two fundamental ones that didn't work.

This is what a bad harness looks like. And fixing it taught me more about agent reliability than anything I'd read.


What everyone gets wrong

The term entered mainstream use in early 2026 after OpenAI published how they'd built a million-line production codebase with zero human-written code. When something failed, the fix was almost never "try harder." Human engineers stepped into the task and asked: "what capability is missing, and how do we make it both legible and enforceable for the agent?"

That word — enforceable — is the one most developers skip. They read the OpenAI post, write a CLAUDE.md with twenty rules, and wonder why their agent keeps making the same mistakes.

The mistake is treating the harness as an instruction set. It isn't. Harness engineering isn't solved by better instructions. It's solved by replacing instructions with mechanisms.

Here's the difference in practice.

Instruction: "Never mark a feature as passing without verifying it end-to-end as a user would experience it."

Mechanism: A Playwright test that launches the Electron app, clicks the button, and checks the filesystem. The agent can only mark a feature passing after running npx playwright test verify.spec.ts --grep F005 and seeing it pass. No other path exists.

Same intent. Completely different reliability. My CLAUDE.md had the instruction. The agent read it, pattern-matched against its training — "you've implemented this feature, so the most likely next token is marking it passing" — and did exactly what a language model does. The harness had left that cheap path open, and the agent took it.


The three mechanisms I was missing

Every failure traced back to a missing mechanism, not a missing instruction.

Premature completion. The agent marked features passing without running the app. The fix was Playwright tests — not as documentation but as enforcement. The test either passes or it doesn't. The inference "the code looks correct, therefore the feature works" is structurally blocked.

Tool mismatch. Once I had working tests, the agent hit a different wall. It would run the test, see it pass, then try to update feature_list.json and fail with String not found in file. Claude Code's string-replace tool requires exact character-for-character matching. JSON files are sensitive to whitespace. Any difference breaks the operation silently.
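The failure mode is easy to reproduce in plain JavaScript — a toy illustration of exact-match replacement, not the actual tool internals:

```javascript
// The file on disk uses two-space indentation...
const onDisk = '{\n  "passes": false\n}';
// ...but the agent reconstructs the snippet with four spaces.
const agentGuess = '{\n    "passes": false\n}';

// Exact string matching finds nothing, so the "edit" changes nothing.
const result = onDisk.replace(agentGuess, '{\n  "passes": true\n}');
console.log(result === onDisk); // true: no match, no change, no error
```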

The fix was one explicit Node command in CLAUDE.md:

node -e "const fs=require('fs');const f=JSON.parse(fs.readFileSync('feature_list.json','utf8'));const x=f.features.find(x=>x.id==='F005');x.passes=true;fs.writeFileSync('feature_list.json',JSON.stringify(f,null,2));"

Read the file as structured data. Mutate it. Write it back. The instruction "update the JSON file" was useless because it left the agent to choose its own tool. The mechanism gave it the only tool that worked.

Absent infrastructure. The harness rule said commit after each feature. The agent issued git add . and git commit — and they silently failed because I'd dropped the harness files into the project without running git init. Four lines in init.sh fixed it:

if [ ! -d ".git" ]; then
  git init && git add . && git commit -m "harness: initialize"
fi

Check for the repo. Create it if missing. The agent assumes the environment is set up. The harness's job is to make that assumption valid, not hope it is.


The regression problem


There's a second class of failure that doesn't show up until your project grows.

Skilldeck had 23 features passing. I tested search — it had been working fine for weeks. Broken. The agent had built each feature in isolation, running only that feature's test before committing. F009 (search) passed. F011 (project registration) passed. But F011's implementation touched shared IPC initialization in a way that silently broke F009. Neither test looked beyond its own feature.

The fix is a regression gate: after every feature test passes, derive which previously-passing tests could have been affected by the files you just changed, and run those too. Not "run all 23 tests" — that gets slow fast. Not "run nothing." A surfaces map tracks which features depend on which code paths. Change electron/preload.ts and F009 is automatically included in the gate because the map knows F009 depends on preload. The search regression would have been caught before the commit.

The insight is structural: local correctness doesn't guarantee global correctness. Testing each feature in isolation is necessary but not sufficient. The regression gate is what closes the gap between "this feature works" and "the system still works after this feature was added."
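A surfaces-map gate fits in a few lines. The map contents and file names below are illustrative — a sketch of the idea, not Skilldeck's actual implementation:

```javascript
// Hypothetical surfaces map: which passing features depend on which files.
const surfaces = {
  'electron/preload.ts': ['F009', 'F011'],
  'src/search/index.ts': ['F009'],
  'src/projects/register.ts': ['F011'],
};

// Given the files the current feature touched, derive the set of
// previously-passing tests that must be re-run before committing.
function regressionGate(changedFiles) {
  const affected = new Set();
  for (const file of changedFiles) {
    for (const id of surfaces[file] ?? []) affected.add(id);
  }
  return [...affected].sort();
}

// F011's work touched shared IPC setup, so F009 lands in the gate too.
console.log(regressionGate(['electron/preload.ts', 'src/projects/register.ts']));
// → [ 'F009', 'F011' ]
```

The harness would then run the feature test for each returned ID and allow the commit only when the whole set passes.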


The harness is not a document

Most developers build a harness that's ninety percent instructions and ten percent mechanisms. A CLAUDE.md with fifty rules, maybe a test or two. Instructions are advice. They're read once, pattern-matched against, and occasionally ignored when the model finds a cheaper path to completion.

Mechanisms are different. A test that must pass before a feature is marked done — that's not advice. The agent can't mark it done without running it. A git check in the startup script — the environment is valid before the agent starts, regardless of what it assumes.

The useful mental model: think of every instruction in your CLAUDE.md as a failure waiting to happen. For each one, ask — what mechanism would make violating this instruction impossible? Some instructions genuinely require human judgment and can't be mechanized. But most can. And the ones you convert are the ones that stop generating incidents.


When the harness is right, the commit log looks like this:

feat(F023): bulk skill selection — select-all, action bar
feat(F022): divergence detection — diff view and promote to library
feat(F021): cross-tool sync — deploy to Claude Code, Codex, Agents simultaneously

One commit per feature. Each preceded by a passing Playwright test, a clean invariant check, a passing regression gate. The agent ran autonomously for hours. When it couldn't resolve something after three attempts, it stopped, wrote a detailed blocker entry, and waited. Not an agent that never fails — an agent whose failures are caught before they compound.

Long-running AI agents fail for one reason: every new context window is amnesia. The harness is what gives the agent a functional memory — not by solving the context window problem, but by externalizing everything it needs to know and everything it needs to enforce into files and mechanisms that survive the boundary.

The industry is converging on a phrase: the model is commodity, the harness is moat. True. But the more useful version is simpler.

Instructions describe what you want. Mechanisms enforce what you require.

Build more mechanisms. Write fewer rules.


Part 2 covers the full six-component harness template — ground truth, memory, startup ritual, verification layer, system contract, and feature intake protocol — with implementation details for each. If you want the framework behind this story, that's the piece to read next.

I'm building Skilldeck — a desktop app for managing AI agent skill files across Claude Code, Codex, Cursor, and every other tool. If the problem of scattered, unverified, out-of-sync skill files resonates, the repo is public.
