The model is smart. The harness makes it reliable. Build the environment around Claude Code, Codex, or any coding agent so multi-session work finishes with proof — not vibes.
What you’ll understand
- Why the same model fails or succeeds based on harness — not IQ
- The five subsystems: instructions, state, verification, scope, lifecycle
- AGENTS.md as map, not encyclopedia: progressive disclosure via docs/
- The 16-step session lifecycle agents should follow
- Planner/generator/evaluator splits for long runs
- Copy-ready templates to drop into your repo today
Introduction — it’s a harness problem
You give Claude or GPT a real task. It reads files, writes code, looks productive. Then it skips a step, breaks tests, says “done” — and nothing works. You spend more time rescuing than if you’d coded it yourself.
That’s not a model problem. It’s a harness problem.
Anthropic ran a controlled experiment: same model (Opus 4.5), same prompt (“build a 2D retro game editor”). Without harness: ~$9 in 20 minutes, broken output. With harness (planner + generator + evaluator): ~$200 in 6 hours, playable game. The model didn’t change. The environment did.
OpenAI reported the same shift with Codex: in a well-harnessed repo, reliability moves from “unreliable” to production-grade — not a marginal tweak, a qualitative jump.
Harness engineering = designing everything the model runs inside: instructions, state files, verification gates, scope boundaries, session lifecycle, hooks, sandboxes, observability.
Agent = Model + Harness
If you're not the model, you're the harness.
Figure: Harness pattern, from task to verified done
Modern agent platforms such as OpenClaw extend this idea by providing persistent agent sessions, structured workflows, and runtime orchestration around foundation models. In practice, the harness often determines whether an agent completes work reliably or simply generates plausible output.
Link: https://techlatest.net/support/openclaw-support/
Part 1 — The harness pattern
You give a task. The agent:
- Reads harness files (not your Slack thread)
- Runs init.sh — install, health check
- Picks one unfinished feature
- Implements with verification loop
- Stops only when tests/lint/types pass
The model decides what code to write.
The harness governs when, where, and how — and when “done” is allowed.
Part 2 — Five subsystems
Figure: Five subsystems, from instructions through lifecycle
| Subsystem | Job | Artifacts |
|-----------|-----|-----------|
| Instructions | What to do, in what order, what to read first | `AGENTS.md`, `CLAUDE.md`, `docs/` |
| State | What's done, in progress, next | `feature_list.json`, `claude-progress.md`, git log |
| Verification | Proof before victory | tests, lint, typecheck, smoke, e2e |
| Scope | One feature at a time; real definition of done | feature list as machine-readable boundary |
| Lifecycle | Clean start and handoff | `init.sh`, wrap-up checklist, safe commit |
The harness doesn’t make the model smarter. It makes output reliable.
Part 3 — Without harness vs with harness
Figure: Without vs with harness, a two-session story
Without: Session 2 has no memory. Agent re-does work or wanders. You merge broken code.
With: Session 2 reads claude-progress.md, continues feature F03, verifies before claiming done. You review, not rescue.
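To make the Session 2 handoff concrete, here is a minimal sketch of a progress log that one session appends to and the next session reads first. The entry layout (a `## date feature: status` heading followed by free-text notes) and the helper names `append_progress` and `last_entry` are assumptions for illustration; the article does not prescribe a format for claude-progress.md.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def append_progress(path: Path, feature_id: str, status: str,
                    notes: str, today: date) -> str:
    """Append one dated entry to the progress log and return the entry text."""
    # Hypothetical entry layout: a level-2 heading plus a notes line.
    entry = f"## {today.isoformat()} {feature_id}: {status}\n{notes}\n\n"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry


def last_entry(path: Path) -> Optional[str]:
    """Return the most recent entry, or None if the log is missing or empty."""
    if not path.exists():
        return None
    lines = path.read_text(encoding="utf-8").splitlines()
    starts = [i for i, line in enumerate(lines) if line.startswith("## ")]
    if not starts:
        return None
    return "\n".join(lines[starts[-1]:]).strip()
```

A new session would call `last_entry` before touching code, so it resumes from what is written on disk rather than from memory it does not have.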
Part 4 — AGENTS.md: map, not encyclopedia
The “one giant AGENTS.md” approach fails predictably:
- Context is scarce — a 1,000-line manual crowds out the task
- Everything “important” means nothing is
- It rots — agents can’t tell what’s still true
Fix: ~100 lines AGENTS.md as table of contents. Deep truth lives in structured docs/ — design docs, architecture, exec plans, quality grades. Agent starts small, reads on demand.
OpenAI’s Codex team treats docs/ as a system of record; linters and doc-gardening agents keep it fresh.
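One way to keep AGENTS.md a map is to lint it. The sketch below checks a line budget and flags references to docs/ files that no longer exist. The 100-line default mirrors the "~100 lines" guideline above; the function name `check_agents_md` and the regex for spotting docs/ paths are assumptions, not anything the Codex team has published.

```python
import re
from pathlib import Path

MAX_LINES = 100  # assumed budget, taken from the "~100 lines" guideline
DOC_REF = re.compile(r"docs/[\w./-]+")  # naive pattern for docs/ paths


def check_agents_md(repo: Path, max_lines: int = MAX_LINES) -> list[str]:
    """Return a list of problems found in AGENTS.md (empty means clean)."""
    agents = repo / "AGENTS.md"
    if not agents.is_file():
        return ["AGENTS.md is missing"]
    text = agents.read_text(encoding="utf-8")
    problems = []
    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(f"AGENTS.md has {line_count} lines (budget {max_lines})")
    # Strip a trailing sentence period so "see docs/x.md." still resolves.
    refs = sorted({ref.rstrip(".") for ref in DOC_REF.findall(text)})
    for ref in refs:
        if not (repo / ref).exists():
            problems.append(f"dangling reference: {ref}")
    return problems
```

Run as a CI step or pre-commit check, this gives the "it rots" failure mode a mechanical tripwire instead of relying on someone noticing.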
Part 5 — Session lifecycle (16 steps)
Figure: Session lifecycle flow
Start: Read harness → init.sh → progress log → feature list → git log
Select: Pick exactly one unfinished feature
Execute: Implement → verify → fix loop until green → record evidence
Wrap: Update progress + feature list → note broken/unverified → commit when safe to resume
Without a harness, the “verify” step becomes “agent says it looks fine.” With a harness, it means tests pass, lint is clean, and types check.
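The Select, Execute, and Wrap phases can be sketched as a small loop. This is an illustration under stated assumptions, not the lifecycle any particular tool implements: `implement` and `verify` are hypothetical callables you would wire to your agent and your test pipeline, `verify` is assumed to return an `(ok, evidence)` pair, and the three-attempt cap is arbitrary.

```python
from typing import Callable, Optional, Tuple


def select_feature(feature_list: dict) -> Optional[dict]:
    """Pick exactly one unfinished feature: the one named by "next",
    falling back to the first unfinished entry."""
    unfinished = [f for f in feature_list["features"] if not f["passes"]]
    by_id = {f["id"]: f for f in unfinished}
    chosen = by_id.get(feature_list.get("next"))
    if chosen is not None:
        return chosen
    return unfinished[0] if unfinished else None


def run_session(feature_list: dict,
                implement: Callable[[dict, int], None],
                verify: Callable[[dict], Tuple[bool, str]],
                max_attempts: int = 3) -> str:
    """Implement, verify, and fix until green, then record evidence."""
    feature = select_feature(feature_list)
    if feature is None:
        return "nothing to do"
    for attempt in range(1, max_attempts + 1):
        implement(feature, attempt)
        ok, evidence = verify(feature)
        if ok:
            # Only a passing verification may flip the flag.
            feature["passes"] = True
            feature["evidence"] = evidence
            feature.pop("notes", None)
            return f"{feature['id']} verified"
        # Leave a note so the next session sees what is still broken.
        feature["notes"] = f"attempt {attempt} failed: {evidence}"
    return f"{feature['id']} unverified"
```

The point of the shape is that "done" is decided by `verify`, never by the code that did the implementing.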
Part 6 — Scope and feature lists
feature_list.json is a harness primitive — machine-readable scope the agent can't hand-wave away.
Rules:
- Only one `passes: false` feature is active at a time
- No rewriting the list to hide unfinished work
- `passes: true` only with evidence (test name, date, log snippet)
Example feature_list.json:
{
  "app": "knowledge-base-desktop",
  "features": [
    {
      "id": "F01",
      "title": "Import local markdown files",
      "passes": true,
      "evidence": "tests/import.test.ts — 2026-06-01"
    },
    {
      "id": "F02",
      "title": "Document library list view",
      "passes": true,
      "evidence": "e2e/library.spec.ts"
    },
    {
      "id": "F03",
      "title": "Index documents for search",
      "passes": false,
      "notes": "in progress — indexer stub only"
    },
    {
      "id": "F04",
      "title": "Grounded Q&A with citations",
      "passes": false,
      "notes": "blocked on F03"
    }
  ],
  "next": "F03"
}
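The rules for this file are simple enough to check mechanically. The sketch below is one possible validator, assuming the field names shown in the example. It reads "active" as the feature named by `next`, which is an interpretation rather than something the rules spell out, and the helper names are hypothetical.

```python
def validate_feature_list(data: dict) -> list[str]:
    """Return rule violations for a parsed feature_list.json (empty means OK)."""
    errors = []
    features = {f["id"]: f for f in data["features"]}
    # Rule: passes may be true only when evidence is recorded.
    for fid, feature in features.items():
        if feature["passes"] and not feature.get("evidence"):
            errors.append(f"{fid}: passes is true but evidence is missing")
    # Rule: the single active feature ("next") must still be unfinished.
    unfinished = [fid for fid, f in features.items() if not f["passes"]]
    nxt = data.get("next")
    if unfinished and nxt not in unfinished:
        errors.append(f"next={nxt!r} is not an unfinished feature")
    return errors


def hidden_work(old: dict, new: dict) -> list[str]:
    """Return ids of unfinished features that vanished between two versions."""
    new_ids = {f["id"] for f in new["features"]}
    return [f["id"] for f in old["features"]
            if not f["passes"] and f["id"] not in new_ids]
```

Comparing the committed version against the working copy with `hidden_work` catches the "rewrite the list" escape hatch; `validate_feature_list` catches unearned `passes: true` flags.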
Part 7 — Verification and early victory
Agents declare victory too early because confidence ≠ correctness. Fixes:
- Runnable proof required (not “I think it works”)
- Full pipeline runs — unit + lint + typecheck + smoke
- Separate evaluator agent — generation ≠ grading (Anthropic harness pattern)
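"Runnable proof" and "full pipeline runs" can be reduced to a gate runner that executes each check as a subprocess and reports exit codes. This is a minimal sketch: the function name `run_gates` and the report format are assumptions, and the commands you pass in depend on your project.

```python
import subprocess


def run_gates(gates: list[tuple[str, list[str]]]) -> tuple[bool, list[str]]:
    """Run every gate command and return (all_passed, per-gate report lines).

    Every gate runs even after a failure, so the report is complete.
    """
    all_ok = True
    report = []
    for name, cmd in gates:
        result = subprocess.run(cmd, capture_output=True, text=True)
        ok = result.returncode == 0
        all_ok = all_ok and ok
        status = "pass" if ok else "FAIL"
        report.append(f"{name}: {status} (exit {result.returncode})")
    return all_ok, report
```

For an npm project, the gate list might start with `("unit", ["npm", "test"])` and `("lint", ["npm", "run", "lint"])`. The report lines double as the evidence the agent records, which keeps the claim of "done" tied to exit codes rather than to the agent's confidence.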
Figure: Planner, generator, evaluator
Part 8 — Hooks and the ratchet
Hooks enforce what prompts merely suggest: pre-commit typecheck, block `rm -rf`, grep for `.skip(`, require approval before push.
Ratchet rule: every agent mistake becomes a permanent constraint:
- Agent commented out a test → AGENTS.md rule + hook
- Agent ignored architecture layer → custom linter
- Stale docs → doc-gardening agent opens fix PR
Harness is shaped by your failure history — you can’t download someone else’s.
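A ratchet rule usually ends up as a small check a hook can run. The sketch below scans a unified diff for newly added lines that skip a test or call `rm -rf`, two of the examples above. The pattern table and the function name `scan_diff` are illustrative assumptions; your own table would grow out of your own failure history.

```python
import re

# Hypothetical ratchet table: each past mistake becomes one pattern.
BANNED = {
    "skipped test": re.compile(r"\.skip\("),
    "rm -rf": re.compile(r"\brm\s+-rf\b"),
}


def scan_diff(diff_text: str) -> list[str]:
    """Return findings for banned patterns on lines the diff adds."""
    findings = []
    for line in diff_text.splitlines():
        # Only added lines count; "+++" is the file header, not content.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for label, pattern in BANNED.items():
            if pattern.search(line):
                findings.append(f"{label}: {line[1:].strip()}")
    return findings
```

In a pre-commit hook, the input would be the output of `git diff --cached`, and any non-empty result would block the commit. Checking only added lines means removing an old `.skip(` is never penalized.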
Part 9 — Agent legibility
If the agent can’t see it in-repo at runtime, it doesn’t exist. Slack decisions, Google Docs, tribal knowledge — illegible. Versioned markdown, schemas, plans, generated DB docs — legible.
Push context into the repo over time. Boring, composable stacks often beat clever abstractions agents can’t inspect.
Agents can only reason over information they can access at runtime. Retrieval systems such as Instant RAGFlow help surface relevant documentation, knowledge bases, and project context without forcing every detail into the model’s context window.
Techlatest.net - Instant RAGFlow: Ready-to-Use AI Knowledge Retrieval Engine
Part 10 — Production patterns (Codex / Claude Code)
Mature harnesses add:
- Per-worktree app boot — agent drives UI via Chrome DevTools MCP
- Local observability stack — LogQL/PromQL in the loop
- Layered architecture — mechanical dependency rules + structural tests
- Garbage collection — golden principles + recurring refactor agents
- Minimal merge gates — high throughput; fix forward when agent volume exceeds human attention
Humans steer at intent and acceptance criteria. Agents execute and self-review in loops.
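"Mechanical dependency rules + structural tests" can be as small as a check that no module imports from a layer above it. The sketch below assumes a Python codebase purely for illustration (the capstone app in this article is Electron), and the layer names `domain`, `services`, and `ui` are hypothetical.

```python
import ast

# Hypothetical layers, lowest first: a layer may import only from
# itself or from layers earlier in this list.
LAYERS = ["domain", "services", "ui"]


def layer_violations(module_layer: str, source: str,
                     layers: list[str] = LAYERS) -> list[str]:
    """Return imports in `source` that reach into a higher layer."""
    rank = {name: i for i, name in enumerate(layers)}
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        else:
            continue
        for target in targets:
            top = target.split(".")[0]  # top-level package names the layer
            if top in rank and rank[top] > rank[module_layer]:
                violations.append(f"{module_layer} must not import {target}")
    return violations
```

Wrapped in a test that walks the source tree, a rule like this fails the build the moment an agent takes an architectural shortcut, which is cheaper than catching it in review.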
As teams move beyond single-agent workflows, orchestration platforms such as CrewAI Studio help coordinate planners, implementers, reviewers, and specialized agents while maintaining visibility into long-running tasks.
Techlatest.net - AI Agents using CrewAI Studio & Jupyter with GPU support
Workflow platforms such as Dify AI provide a practical way to package harnessed agents into production applications, combining tool integrations, evaluation flows, and operational monitoring.
Techlatest.net - Dify AI: Build & Launch GenAI Apps
Part 11 — Quick start (four files)
Drop into project root:
├── AGENTS.md
├── init.sh
├── feature_list.json
└── claude-progress.md
Figure: Drop templates into repo
Copy the templates from examples/. Compared with prompt-only use, sessions stabilize right away.
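A quick sanity check that the four files are actually in place can run at the top of a session. This is a minimal sketch; the helper name `missing_harness_files` is an assumption, and the file names are the four listed above.

```python
from pathlib import Path

# The four quick-start files named in this section.
HARNESS_FILES = ("AGENTS.md", "init.sh", "feature_list.json", "claude-progress.md")


def missing_harness_files(root: Path) -> list[str]:
    """Return the quick-start files that are absent from the project root."""
    return [name for name in HARNESS_FILES if not (root / name).is_file()]
```

An empty result means the agent has everything it needs to start; anything else is worth fixing before the first prompt.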
Part 12 — Hands-on session
./init.sh # bootstrap + health
# agent picks ONE feature
npm test && npm run lint # verification gate
# update progress + feature_list
git commit # clean handoff
Figures: init.sh session start; verification gate (fail, then pass); commit handoff
Many teams develop and validate harness workflows inside reproducible AI workbenches with integrated notebooks, terminals, and GPU access before deploying them into production agent environments.
Techlatest.net - Jupyter Python Notebook
Part 13 — Capstone context (knowledge base app)
The learn-harness-engineering course builds one Electron knowledge-base app across six projects — import docs, index, grounded Q&A with citations. Each project adds harness mechanisms; the app evolves as skills grow.
The same pattern works for any real repo: judge a harness by the measured difference between weak-harness and strong-harness runs, not by how many docs it contains.
Part 14 — Learning path (12 + 6)
Lectures L01–L12: capability gap → harness definition → repo as truth → progressive disclosure → multi-session state → init phase → scope → feature lists → verification → e2e → observability → clean handoff
Projects P01–P06: prompt-only vs rules-first → agent-readable workspace → continuity → runtime feedback → self-verification → full capstone
Part 15 — Who this is for
Yes: engineers using coding agents daily; tech leads owning agent reliability; builders who’ll let agents edit real repos
No: zero-code AI intro; prompt-only hobbyists; teams unwilling to add harness files to git
Requires: terminal, git, at least one of Claude Code / Codex / comparable agent CLI
Summary
Harness engineering is the discipline of making agents finish real work: map-not-encyclopedia instructions, disk-persisted state, verification before “done”, one-feature scope, structured session lifecycle, hooks that ratchet on every failure. The model gets the headlines. The harness gets the merge.