Distilled from the Learn Harness Engineering course by WalkingLabs, which synthesizes harness engineering theory and practice from OpenAI, Anthropic, and industry practitioners.
The model is smart. The harness makes it reliable.
A well-harnessed agent spent $200 over 6 hours and built a working game. Without a harness, the same model spent $9 over 20 minutes and produced garbage. The model didn't change. The harness did.
Hard numbers:
- SWE-bench Verified top agents: ~50-60% pass rate, and that's on curated tasks with clear descriptions. In real repos? Even lower.
- OpenAI's million-line experiment: 3 engineers + Codex shipped ~1,500 PRs over 5 months (~3.5 PRs/person/day), but only after investing heavily in harness infrastructure.
- A team added `AGENTS.md` to a FastAPI project: the same model went from failing all 3 runs to succeeding in all 3, with ~60% better context efficiency.
- A TypeScript + React project went from a 20% success rate (bare repo) to 100% (full harness). The model never changed.
## Table of Contents
- What Is Harness Engineering?
- The 5 Subsystems of a Harness
- Quick Start: Your Minimal Harness (Do This Today)
- The 12 Core Principles
  - Principle 1: Strong Models ≠ Reliable Execution
  - Principle 2: A Harness Is 5 Subsystems, Not a Better Prompt
  - Principle 3: The Repo Is the Single Source of Truth
  - Principle 4: Split Instructions, Not One Giant File
  - Principle 5: Persist Context Across Sessions
  - Principle 6: Initialize Before Every Session
  - Principle 7: One Feature at a Time, No Overreach (WIP=1)
  - Principle 8: Feature Lists Are Harness Primitives
  - Principle 9: Don't Let Agents Declare Victory Early
  - Principle 10: Only Full-Pipeline Verification Counts
  - Principle 11: Make the Agent's Runtime Observable
  - Principle 12: Every Session Must Leave a Clean State
- The Agent Session Lifecycle
- Without Harness vs. With Harness
- Additional Templates Worth Adding
- Key Takeaways
- How to Diagnose Harness Quality
- References
## What Is Harness Engineering?
Harness engineering is building a complete working environment around an AI coding agent so it produces reliable results. It's NOT about writing better prompts. It's about designing the system the model operates inside.
```
You give task → Agent reads harness files → Agent executes
                        |
        harness governs every step:
        ├── Instructions: what to do, in what order
        ├── Scope: one feature at a time
        ├── State: progress log, feature list, git history
        ├── Verification: tests, lint, type-check
        └── Lifecycle: init at start, clean state at end
                        |
                        v
            Agent stops ONLY when
            verification passes
```
## The 5 Subsystems of a Harness
Every effective harness has exactly five parts:
| # | Subsystem | Job | Key Files |
|---|---|---|---|
| 1 | Instructions | Tell the agent what to do, in what order, what to read first | `AGENTS.md`, `CLAUDE.md`, `docs/` |
| 2 | State | Track what's done, in progress, and next. Persisted to disk so the next session picks up exactly where the last left off | `claude-progress.md`, `feature_list.json`, `git log` |
| 3 | Verification | Only passing tests count as evidence. The agent cannot declare victory without proof | tests, lint, type-check, smoke runs |
| 4 | Scope | Constrain the agent to ONE feature at a time. No overreach. No half-finishing three things | feature boundaries, definition of done |
| 5 | Session Lifecycle | Initialize at start. Clean up at end. Leave a clean restart path | `init.sh`, handoff notes, clean commits |
## Quick Start: Your Minimal Harness (Do This Today)
Drop these 4 files into your project root:
```
YOUR PROJECT ROOT
├── AGENTS.md           ← the agent's operating manual
├── init.sh             ← runs install + verify + health check
├── feature_list.json   ← what features exist, which are done
├── claude-progress.md  ← what happened each session
└── src/                ← your actual code
```
### 1. `AGENTS.md`: The Operating Manual

```markdown
# Agent Instructions

## Before Starting Any Work
1. Run `./init.sh` to verify environment health
2. Read `claude-progress.md` for context from last session
3. Read `feature_list.json` to see what's done and what's next
4. Check `git log --oneline -10` for recent changes

## Rules
- Work on exactly ONE feature at a time
- Never declare "done" without passing tests
- Run the full test suite before committing
- Update `claude-progress.md` after every session
- Update `feature_list.json` when a feature status changes
- Commit only when the project is in a clean, resumable state

## Verification Checklist
- [ ] All tests pass
- [ ] Linter passes
- [ ] Type-check passes
- [ ] Feature works as specified
```
### 2. `init.sh`: Environment Health Check

```bash
#!/bin/bash
set -e
echo "=== Installing dependencies ==="
npm install
echo "=== Running tests ==="
npm test
echo "=== Type checking ==="
npx tsc --noEmit
echo "=== Environment healthy ==="
```
### 3. `feature_list.json`: Machine-Readable Scope

```json
{
  "features": [
    { "id": "F001", "name": "User login", "status": "done", "tests": "src/auth.test.ts" },
    { "id": "F002", "name": "Document import", "status": "in-progress", "tests": "src/import.test.ts" },
    { "id": "F003", "name": "Search", "status": "not-started", "tests": null }
  ]
}
```
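With the list in machine-readable form, picking the next unit of work becomes a one-liner. A sketch, assuming `jq` is installed and the status values shown above:

```shell
# Print the first feature that isn't done yet, so a new session
# knows exactly what to pick up next (assumes jq is on PATH).
jq -r '[.features[] | select(.status != "done")][0]
       | "\(.id): \(.name) [\(.status)]"' feature_list.json
# → F002: Document import [in-progress]
```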
### 4. `claude-progress.md`: Session Memory

```markdown
# Progress Log

## Session 3 — 2026-04-20
- Completed: F001 (user login), all tests passing
- In progress: F002 (document import), parser done, validation pending
- Blocked: none
- Next session should: finish F002 validation logic, then run full test suite
```
## The 12 Core Principles
### Principle 1: Strong Models ≠ Reliable Execution
Problem: Models ace benchmarks but fail on real multi-file engineering tasks.
Why: Real tasks need multi-step coordination, not one-shot answers. Agents fail at five specific layers:
- Task specification → vague requirements → agent guesses wrong
- Context provision → implicit conventions not written down → agent violates them
- Execution environment → missing deps, wrong versions → agent wastes context on setup
- Verification feedback → no tests → agent says "done" when it's not
- State management → no progress tracking → next session starts from zero
Action: When things fail, fix the harness first, not the model. Attribute every failure to one of these five layers. One AGENTS.md file might be more effective than upgrading to a more expensive model.
Diagnostic Loop: Execute → observe failure → attribute to a specific harness layer → fix that layer → re-execute. After a few rounds, your harness gets stronger and agent performance stabilizes.
### Principle 2: A Harness Is 5 Subsystems, Not a Better Prompt
Problem: People think "better prompt = better results."
Why: Prompts don't persist state, verify work, or control scope.
Action: Implement all 5 subsystems: instructions, state, verification, scope, lifecycle.
Real data: A team added subsystems one at a time to a TypeScript + React project:
- Stage 1 (bare repo): 20% success rate
- Stage 2 (+AGENTS.md): 60%
- Stage 3 (+verification commands): 80%
- Stage 4 (+progress files): 80-100%
The feedback/verification subsystem has the lowest investment and highest ROI. Start there.
Constrain, don't micromanage. Use executable rules, not step-by-step instructions. OpenAI: "enforce invariants, don't micromanage implementation."
Harness debt is real. A harness rots like code does. Audit regularly: remove outdated rules, update stale docs.
### Principle 3: The Repo Is the Single Source of Truth
Problem: If the agent can't see it in the repo, it doesn't exist. Your Slack history, Jira tickets, Confluence pages, verbal agreements: the agent sees none of it.
Why: Agents don't remember conversations. They read files. They can't ask a colleague.
Action: Put everything the agent needs INTO the repo: instructions, progress, feature definitions, docs.
Cold-Start Test: open a fresh agent session (no verbal context) and see if it can answer:
- What is this system?
- How is it organized?
- How do I run it?
- How do I verify it?
- What's the current progress?
If it can't answer, your repo has blind spots. Where the map is blank, the agent guesses, and guessing creates bugs.
ACID Principles for Agent State:
- Atomicity: each logical operation gets one git commit. If it fails midway, `git stash` to roll back.
- Consistency: all tests pass and lint reports zero errors before committing. Inconsistent states don't get committed.
- Isolation: multiple agents use separate progress files or git branches.
- Durability: cross-session knowledge must be persisted to files. What's only in memory doesn't count.
Place knowledge near code. A 50-line ARCHITECTURE.md in src/api/ is more useful than a 500-page Confluence doc nobody maintains.
### Principle 4: Split Instructions, Not One Giant File
Problem: One massive instruction file overwhelms the agent's context. A team's AGENTS.md grew from 50 to 600 lines; their agent success rate dropped from 72% to 45%.
Why: Three killers:
- Lost in the Middle Effect (Liu et al., 2023): LLMs use info in the middle of long texts far less effectively than at the beginning or end. A critical rule at line 300 of 600 will likely be ignored.
- Context budget eaten alive: A 600-line file consumes 10-20K tokens before the agent even starts working.
- Priority conflicts: Hard constraints, soft guidelines, and historical notes all look identical. The agent can't distinguish.
Action: Keep AGENTS.md at 50-200 lines as a routing file. Split details into topic docs:
```
AGENTS.md (50-200 lines: overview + hard constraints + links)
├── docs/api-patterns.md       ← read when adding endpoints
├── docs/database-rules.md     ← read when modifying DB operations
├── docs/testing-standards.md  ← reference when writing tests
└── docs/deployment.md         ← read only for deployment tasks
```
Rules for the entry file:
- Put critical rules at the top or bottom, never the middle
- Max 15 non-negotiable hard constraints
- Every instruction should have: a source (why), applicability (when), and expiry (when to remove)
- Audit regularly: delete outdated rules like you delete unused dependencies
Result: After splitting, a team's success rate went from 45% to 72%, and security constraint compliance went from 60% to 95%.
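The 200-line budget is easy to enforce mechanically; here is a sketch you could wire into CI (the exact threshold is the assumption):

```shell
# agents_md_budget.sh -- fail when AGENTS.md outgrows its routing-file
# budget, the signal that details should move into docs/.
max=200
lines=$(wc -l < AGENTS.md | tr -d ' ')
if [ "$lines" -gt "$max" ]; then
  echo "AGENTS.md has $lines lines (budget: $max) -- split details into docs/"
  exit 1
fi
echo "AGENTS.md within budget ($lines lines)"
```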
### Principle 5: Persist Context Across Sessions
Problem: Session 2 starts fresh. Agent has no memory of Session 1. Tasks over 30 minutes see failure rates spike sharply without state.
Why: Without persisted state, the agent re-does work, reverses deliberate decisions, or drifts from requirements like a game of telephone.
Context Anxiety (discovered by Anthropic): When agents sense context is running low, they rush, skipping verification and choosing simple solutions over correct ones. Like guessing on remaining exam questions when time is almost up.
Action: Use three continuity artifacts:
- `claude-progress.md`: what's done, what's in progress, what's blocked, next steps
- `DECISIONS.md`: why option B was chosen over A, with date and reasoning. Prevents the next session from reversing deliberate choices.
- Git commits as checkpoints: commit after each atomic unit of work with clear messages explaining what and why.
Key metric: Rebuild Cost, i.e. how long a new session takes to reach an executable state. Good harnesses: ~3 minutes. Bad harnesses: 15-20 minutes.
Real data: A 12-feature blog system over 5 sessions:
- Without progress files: 58% features completed, 43% hidden defect rate
- With progress files: 100% features completed, 8% hidden defect rate, rebuild time reduced ~78%
```
WITHOUT STATE                     WITH STATE
Session 1: does work              Session 1: does work, writes progress
Session 2: starts from zero       Session 2: reads progress, continues
Result: rework & drift            Result: steady forward progress
```
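The "writes progress" step can be semi-automated so no session ends without an entry. A sketch; the angle-bracket values are placeholders the agent fills in before committing:

```shell
# append_progress.sh -- stamp a fresh session entry onto claude-progress.md.
# The <...> values are placeholders the agent replaces before committing.
cat >> claude-progress.md <<EOF

## Session $(date +%Y-%m-%d)
- Completed: <feature ids>
- In progress: <feature ids>
- Blocked: <blockers, or none>
- Next session should: <one concrete next step>
EOF
echo "Progress entry appended"
```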
### Principle 6: Initialize Before Every Session
Problem: Agent starts coding in a broken environment (missing deps, failing tests). Code written before the test framework is configured = code without verification.
Why: Initialization and implementation have different goals. Mixing them = doing both poorly. Anthropic's data: a dedicated initialization phase yields 31% higher feature completion in multi-session scenarios.
Action: Treat initialization as a separate phase. The first session does ONLY initialization, no feature code.
Bootstrap Contract: initialization is complete when four conditions are met:
- ✅ Can start (`make setup` succeeds)
- ✅ Can test (at least one example test passes)
- ✅ Can see progress (task breakdown file exists)
- ✅ Can pick up next steps (progress file is readable)
Warm start >> Cold start. Use project templates to preset standard structure. Starting from a template is 10x better than starting from an empty directory.
Time invested in initialization is fully recovered in the next 3-4 sessions.
### Principle 7: One Feature at a Time, No Overreach (WIP=1)
Problem: Agents try to do 3 things at once, finish none of them properly. Context capacity C divided by k tasks = each task gets C/k attention. When that drops below minimum, nothing finishes.
Why: Overreach and under-finish amplify each other. More code written ≠ more features completed; Anthropic found they're negatively correlated. Agents using "small next step" (WIP=1) show 37% higher task completion.
Action: Enforce WIP=1 (Work-in-Progress Limit from Kanban). Write it explicitly in AGENTS.md:
```markdown
## Work Rules
- Work on one feature at a time
- Only start the next feature after the current one passes end-to-end verification
- Don't "also refactor" feature B while implementing feature A
```
Real data: REST API with 8 features:
- Unconstrained: 5 features activated, ~800 lines across 12 files, 20% end-to-end pass → 37.5% completion by session 3
- WIP=1: 1 feature at a time, ~200 lines across 4 files, 100% pass → 87.5% completion by session 4
"Do less but finish" always beats "do more but leave half-done."
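WIP=1 is also enforceable from the feature list itself, so it's a rule the harness checks rather than one the agent remembers. A sketch, assuming `jq` and the `in-progress` status value from the Quick Start example:

```shell
# wip_gate.sh -- refuse to start anything new while a feature is in progress.
wip=$(jq '[.features[] | select(.status == "in-progress")] | length' feature_list.json)
if [ "$wip" -ge 1 ]; then
  echo "WIP limit hit: $wip feature(s) in progress -- finish before starting more"
  exit 1
fi
echo "Nothing in progress -- safe to pick up the next feature"
```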
### Principle 8: Feature Lists Are Harness Primitives
Problem: Vague scope = vague results. "Shopping cart mostly done": what does "mostly" mean? Which tests passed? Nobody knows.
Why: Feature lists aren't memos; they're the backbone of the harness. The scheduler, verifier, and handoff reporter all depend on them.
Action: Every feature entry must be a triple: (behavior description, verification command, current state)
```json
{
  "id": "F03",
  "behavior": "POST /cart/items with {product_id, quantity} returns 201",
  "verification": "curl -X POST localhost:3000/api/cart/items -d '{\"product_id\":1}' | jq .status",
  "state": "not_started"
}
```
State machine: four states, controlled by the harness:
- `not_started` → `active` (agent picks it up)
- `active` → `passing` (only when the verification command succeeds: pass-state gating)
- `active` → `blocked` (dependency issue)
- `passing` is irreversible: once verified, it stays verified
Granularity rule: Each feature should be completable in one session. "User can add items to cart" = good. "Implement the shopping cart" = too broad. "Create the name field on the Cart model" = too narrow.
Result: Structured feature lists show 45% higher completion rate than free-form tracking, with zero duplicate implementations.
### Principle 9: Don't Let Agents Declare Victory Early
Problem: Agent says "done!" but tests fail, edge cases are broken, or the code doesn't compile. Anthropic found that agents confidently praise their own work; you must separate "the person who does the work" from "the person who checks the work."
Why: Verification Gap = the gap between the agent's confidence and actual correctness. This is the #1 failure mode.
Action: Write an explicit Definition of Done for every task. Not "add a search feature" but:
```
Completion criteria:
- New endpoint GET /api/search?q=xxx
- Supports pagination, default 20 items
- Results include highlighted snippets
- All new code passes pytest
- Type checking passes (mypy --strict)
```
"Done" = verification passes. "The code looks fine" does NOT count. `curl` returning 201 DOES count.
### Principle 10: Only Full-Pipeline Verification Counts
Problem: Agent runs one unit test. Claims everything works.
Why: Partial verification misses integration issues, type errors, and regressions.
Action: Require the full pipeline: tests + lint + type-check + build + smoke run. All must pass.
### Principle 11: Make the Agent's Runtime Observable
Problem: You can't fix what you can't see. Missing observability wastes 30-50% of session time on redundant diagnosis.
Why: Without observability: agents can't distinguish "correct" from "looks correct," retries become blind guesses, and evaluation becomes subjective.
Action: Build two layers of observability:
Layer 1: Runtime signals. Application lifecycle, feature path execution, errors with full context. The harness collects these automatically; don't rely on the agent to log its own actions.
Layer 2: Process observability.
- Sprint contracts: before each task, define what to change, what NOT to change, pass criteria, and exclusions
- Evaluator rubrics: turn "is it good?" into structured scoring:
| Dimension | A | B | C | D |
|---|---|---|---|---|
| Code correctness | All tests pass | Main flow passes | Partial pass | Build fails |
| Test coverage | Main + edge cases | Main flow only | Skeleton only | No tests |
Real data: A "dark mode" task without observability took 3-4 blind retries and 45 minutes. With a sprint contract + rubric: 1 iteration, 15 minutes. 3x efficiency.
### Principle 12: Every Session Must Leave a Clean State
Problem: Agent leaves half-committed code, broken tests, uncommitted changes. "Clean up later" = never clean up.
Why: Entropy grows by default (Lehman's Laws). Without cleanup, a 12-week project degrades:
| | Week 1 | Week 12 (no cleanup) | Week 12 (with cleanup) |
|---|---|---|---|
| Build pass rate | 100% | 68% | 97% |
| Test pass rate | 100% | 61% | 95% |
| Session startup | 5 min | 60+ min | 9 min |
Action: Session completion = task passes verification AND clean state check passes. Five dimensions:
```markdown
## Session Exit Checklist
- [ ] Build passes (npm run build)
- [ ] All tests pass (npm test)
- [ ] Feature list + progress updated
- [ ] No debug code remaining (console.log, debugger, TODO)
- [ ] Standard startup path works (npm run dev)
```
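The debug-code item is the easiest to automate. A sketch, assuming a JS/TS codebase under `src/`; widen the pattern for your stack:

```shell
# exit_check.sh -- fail the session if debug artifacts linger in src/.
if grep -rnE 'console\.log|debugger|TODO' src/ 2>/dev/null; then
  echo "Debug code found -- clean it up before ending the session"
  exit 1
fi
echo "No debug code remaining"
```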
Quality Document: maintain an active scorecard for each module (A/B/C/D). New sessions read it and know where to prioritize.
Periodically simplify the harness. As models improve, some constraints become unnecessary overhead. Monthly: disable one harness component and run benchmarks. If results don't degrade, remove it permanently.
## The Agent Session Lifecycle (Follow This Every Time)
```
START
  1. Agent reads AGENTS.md
  2. Agent runs init.sh (install, verify, health check)
  3. Agent reads claude-progress.md (what happened last time)
  4. Agent reads feature_list.json (what's done, what's next)
  5. Agent checks git log (recent changes)

SELECT
  6. Agent picks exactly ONE unfinished feature
  7. Agent works ONLY on that feature

EXECUTE
  8. Agent implements the feature
  9. Agent runs verification (tests, lint, type-check)
 10. If verification fails → fix and re-run
 11. If verification passes → record evidence

WRAP UP
 12. Agent updates claude-progress.md
 13. Agent updates feature_list.json
 14. Agent records what's still broken or unverified
 15. Agent commits (only when safe to resume)
 16. Agent leaves a clean restart path for the next session
```
## Without Harness vs. With Harness
| | Without Harness | With Harness |
|---|---|---|
| Session start | Agent starts fresh, no context | Agent reads progress, picks up where it left off |
| Scope | Agent does random things | Agent works on one specific feature |
| "Done" | Agent says "looks good" | Tests pass, lint clean, types check |
| Session end | Half-committed mess | Clean state, progress logged, ready for next session |
| Your role | Rescue & cleanup | Review & approve |
| Result | You spend more time fixing than if you did it yourself | Agent does the work, you verify the result |
## Additional Templates Worth Adding
### DECISIONS.md: Prevent the Next Session From Reversing Your Choices

```markdown
# Design Decisions

## 2026-04-15: Use Redis for user preferences caching
- Reason: High read frequency (every API call), small data size
- Rejected: PostgreSQL materialized view (high change frequency)
- Constraint: Cache TTL of 5 minutes, active invalidation on write

## 2026-04-18: Use Vitest over Jest
- Reason: Native ESM support, faster execution
- Constraint: All test files use .test.ts extension
```
### Sprint Contract: For Complex Tasks

```markdown
# Sprint Contract: Dark Mode Support

## Scope
- Modify the theme toggle component
- Update global CSS variables
- Add dark mode tests

## Verification Standards
- Visual regression tests pass
- Main flow E2E tests pass
- No flash of unstyled content (FOUC)

## Exclusions
- NOT handling print styles
- NOT handling third-party component dark mode
```
### Richer feature_list.json: With Verification Evidence

```json
{
  "features": [
    {
      "id": "F01",
      "behavior": "POST /api/register with {email, password} returns 201",
      "verification": "curl -X POST /api/register -d '{\"email\":\"test@example.com\"}' | jq .status",
      "state": "passing",
      "evidence": "commit abc123, test output log"
    },
    {
      "id": "F02",
      "behavior": "GET /api/search?q=xxx returns paginated results (default 20)",
      "verification": "pytest tests/test_search.py -x",
      "state": "active",
      "evidence": null
    }
  ]
}
```
## Key Takeaways
- The harness doesn't make the model smarter; it makes its output reliable
- Everything the agent needs must live in the repo (if it can't see it, it doesn't exist)
- One feature at a time: scope is the most underrated lever
- Never trust "done" without verification evidence: tests must pass
- Every session must leave a clean state; the next session's success depends on it
- State must persist to disk: memory dies between sessions, files don't
- Initialize before work, verify during work, clean up after work: this is the lifecycle
- When things fail, fix the harness first: attribute failures to one of five layers, fix that layer, re-run
- WIP=1: finish one feature before starting the next. Less code, more completed features
- Harness debt is real: audit and simplify regularly. Rules that helped last month may be unnecessary overhead today
- "Do less but finish" beats "do more but leave half-done", always
## How to Diagnose Harness Quality
Isometric Model Control: Keep the model fixed. Remove harness components one at a time. Measure which removal causes the biggest performance drop. That's your bottleneck; focus your effort there.
| Remove this... | If performance drops significantly... |
|---|---|
| AGENTS.md | Your instructions are critical |
| Verification commands | Your feedback loop is carrying the weight |
| Progress files | Your state management is essential |
| Feature list | Your scope control is doing the heavy lifting |
| init.sh | Your initialization is preventing cascading failures |
## References
- OpenAI: Harness Engineering
- Anthropic: Effective Harnesses for Long-Running Agents
- Anthropic: Harness Design for Long-Running Apps
- LangChain: The Anatomy of an Agent Harness
- Thoughtworks: Harness Engineering
- HumanLayer: Skill Issue โ Harness Engineering for Coding Agents
- Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023)
- Course Website
- GitHub Repo
If you found this helpful, let me know by leaving a like or a comment, and if you think this post could help someone, feel free to share it! Thank you very much!