Agent = Model + Harness. If you're not the model, you're the harness.
If you have shipped an AI agent that worked brilliantly in a demo and then fell apart in production, you have already met the harness problem. The model was probably fine. The scaffolding around it — instructions, tools, loops, verification, logging — was not.
1. Start Here
New to AI agents? Read this section first. Skip ahead if you already know what an LLM, tool call, or agent loop is.
What is an LLM? (The brain)
An LLM (Large Language Model) — like GPT-4o, Claude, or Gemini — is a powerful text prediction system. You give it a message, it gives you a reply. By itself, an LLM:
- Cannot remember anything between conversations
- Cannot open files, run code, or browse the web on its own
- Cannot take actions in the real world
- Cannot know if its answer was actually correct
Analogy: Think of an LLM as an incredibly smart consultant locked in a room with no phone, no computer, and no memory of your last meeting. They can answer any question brilliantly — but only using what's on the piece of paper you slide under the door.
What is an AI Agent? (Brain + hands)
An AI Agent is an LLM that has been given the ability to take actions: read files, write code, run commands, call APIs. The key differences from a chatbot:
| Chatbot | Agent | |
|---|---|---|
| Steps | One reply | Multi-step chain |
| Tools | None | Shell, browser, APIs, files |
| Memory | None | Persisted across steps |
| Verification | None | Checks its own work |
| Goal | Answer once | Complete a task |
Analogy: Now the consultant has a phone, a laptop, and a to-do list. They can look things up, send emails, and keep working until the project is finished.
What is the Harness?
The harness is everything you build around the LLM to make an agent work safely and reliably. The LLM is the brain; the harness is the operating system, the safeguards, the toolbox, and the rulebook.
Analogy — the robot suit: Imagine Iron Man. Tony Stark (the LLM) is the genius — but without the suit (the harness), he's just a man in a t-shirt. The suit provides tools (repulsors, flight), safety systems (shields, ejection), memory (JARVIS), constraints (altitude limits), and observability (HUD). The suit doesn't make Tony smarter — it makes his intelligence actually useful in the world.
The five things a harness does:
- Gives the agent its instructions and knowledge — system prompts, documentation links, task files
- Gives the agent its tools — shell, file editor, API caller; controls which tools are available per step
- Runs the loop — calls the model, executes tool calls, feeds results back, repeats until done; sets step/cost/time limits
- Checks the work — runs tests, linters, type-checkers before the agent says "done"; the agent can't fake this
- Keeps records — every tool call, every model response, every error is logged for debugging and auditing
According to LangChain's research: simply changing the harness while keeping the same model moved a coding agent from rank #30 to rank #5 on a public benchmark.
Quick check before continuing:
- What can an LLM not do by itself? (take actions, remember things, access the internet)
- What makes something an "agent" vs a chatbot? (uses tools and loops until task is done)
- What is the "harness"? (everything around the LLM — prompts, tools, loop, checks, logging)
How the agent loop works
2. What Is Harness Engineering?
Harness engineering is the practice of designing everything around an AI model — the instructions you give it, the tools you connect, the memory system, the sandbox it runs in, the verification checks, the cost limits, and the logging — so that a raw LLM becomes a dependable agent that can complete real tasks reliably.
Shared insight across all sources: The gap between what today's models can do and what you see them do is largely a harness gap. A decent model with a great harness beats a great model with a bad one.
3. The Agent Formula
Agent = Model + Harness
A raw LLM produces text. It becomes an agent only when a harness supplies:
- State — memory across steps
- Tool execution — the ability to act
- Termination condition — what "done" means
- Feedback loops — knowing if actions worked
- Enforceable constraints — rules it cannot break
Anthropic is explicit: when you evaluate "an agent" you are evaluating harness + model together — the two cannot be separated in practice.
4. Anatomy of an Agent Harness
The harness is every piece of code, configuration, and execution logic that is not the model itself.
| Component | What it does |
|---|---|
| System prompts | Role definition, output format constraints, planning reminders injected each step |
| Tools & MCP | APIs, filesystem, browser, DB queries — with clear schemas and failure contracts |
| Sandbox / runtime | Docker containers, VMs — isolated and disposable per task |
| Orchestration | Subagent spawning, handoffs, model routing, parallel worker coordination |
| Hooks & middleware | Compaction triggers, lint/test gates, continuation checks between steps |
| Observability | Full traces, span timings, cost per run, tool-level failure attribution |
5. LangChain Harness Diagrams
LangChain's anatomy post (Vivek Trivedy, March 2026) uses three core figures:
Figure 1 — Agent Harness: hub around the model
Figure 2 — Desired behaviour → harness design
Figure 3 — Model–harness co-evolution loop
Figure 4 — ReAct loop
Figure 5 — Long-horizon stack (compounding primitives)
Key insight: The best harness for your task is not always the one the model was post-trained on. Swapping only the harness moved a coding agent from Top 30 to Top 5 on Terminal Bench 2.0.
6. CAR Model: Control · Agency · Runtime
A rigorous lens for designing and reporting agent systems:
| Dimension | Plain English | Examples |
|---|---|---|
| Control | The rules — what's the agent allowed to do? | System prompt, guardrails, budget limits, stop conditions |
| Agency | The hands — what tools can it use? | Tool schemas, permission levels, planning logic |
| Runtime | The environment — where does it run? | Loop, sandboxes, memory files, crash recovery |
Debugging heuristic: When your agent misbehaves, ask — is this a Control problem (bad rules), an Agency problem (wrong tools), or a Runtime problem (environment issue)?
7. Four Levers of Harness Design
Four independent levers you can tune without swapping the underlying model:
Lever 1 — Context Design (what the model knows and when)
Context is finite. Treat it as a curated, perishable resource:
- Short, stable entry point:
AGENTS.md(~100 lines) as table of contents, not encyclopedia -
todo.mdpattern: agent re-reads its plan each step to prevent goal drift - Structured
docs/directory enforced by linters and a recurring "doc-gardening" agent
Lever 2 — Tool Selection (least privilege per phase)
A framework registers tools; a harness decides which tools this agent may use for this specific task:
- Gate access by workflow phase: plan → execute → verify
- Specify what happens on failure: retry with backoff, fall back, or escalate to human
- Tool allowlists, retry budgets, and clear failure contracts are harness-layer concerns
Lever 3 — Constraint Management (budgets, loops, "done")
- Max iterations and time budgets injected into the prompt
- Loop detection — if the same action repeats N times, escalate
- Self-verification gate —
declare_completeonly after deterministic checks pass - Cost cap — abort or ask for human approval when token spend exceeds threshold
Lever 4 — Production Hardening
- Sandboxing — ephemeral containers per task; if one dies, re-provision
- Authentication — inject credentials securely, never hardcode
- Memory compaction — summarise old turns; evict stale content on policy
- Structured logging — every tool call with input, output, latency, cost
- Eval suites — regression tests before any prompt or model change
- Human-on-the-loop — escalation path when confidence or budget thresholds trip
8. The Agentic Loop
The harness owns the loop — its termination rules, its stuck-detection, its escalation paths. "Call the model until done" is not a harness; it is a bug.
Harness Maturity Ladder
| Level | Name | What it produces |
|---|---|---|
| H0 | Minimal | Model output only. No traces, no failure attribution. OK for low-stakes prototypes. |
| H1 | Basic tools | Reproduction logs and tool-call traces. You can replay what happened. |
| H2 | Verification | Deterministic requirement checks and failure attribution. Diffs are reviewable. |
| H3 | Full substrate | Structured verification reports, intervention logs, entropy audits, signed episode package. |
For beginners: Start at H1 (basic traces). Move to H2 once you add verification gates. Target H3 only for production or auditability requirements.
9. Evaluation Harness vs Agent Harness
Two distinct concepts that share vocabulary:
| Agent harness | Evaluation harness | |
|---|---|---|
| Purpose | Run agents in dev or production | Score agents on fixed task suites |
| Output | User-facing results, PRs, real actions | Pass/fail rates, regression diffs, metrics |
| Graders | Inline verification (tests, linters) | Deterministic checks + LLM-as-judge + human rubrics |
| Isolation | Per-task sandboxes in production | Fresh containers per episode; no shared state |
| Tools | LangGraph, Claude Agent SDK, custom loops | LangSmith, Braintrust, Phoenix |
Anthropic's advice: Run end-to-end evals in fully isolated environments. Maintain both capability evals (can the agent do the task?) and regression evals (did a harness or prompt change break what already worked?). Invest most energy in high-quality test cases — not the framework.
10. Framework vs Harness
A common mistake is treating an agent framework as a complete harness. They operate at different layers:
| Agent Framework | Your Harness (on top) |
|---|---|
| LangChain, LangGraph, CrewAI, AutoGen, BAML | What "done" means and max iterations |
| Tool registration and memory APIs | Context eviction & compaction policy |
| Reusable primitives and integrations | Per-task tool allowlists & retry semantics |
| Provides the loop — does not decide loop rules | Verification gates before committing output |
| Excellent, but not sufficient alone | Observability, cost caps, escalation paths |
Key finding: A well-built harness can make a weaker model outperform a stronger one in a poorly-scaffolded system. Reliability is a property of the full model–harness–environment system — not of weights alone.
11. SPDD: Structured Prompt-Driven Development
From the Thoughtworks / Martin Fowler article: while individual developers get faster with AI, teams often don't — because more code is generated faster than it can be aligned, reviewed, and governed. SPDD treats prompts as first-class engineering artifacts.
The REASONS Canvas
A seven-section template that forces clarity before any code is generated:
| Section | Dimension | What it captures |
|---|---|---|
| R | Requirements | What problem are we solving? What is the definition of done? |
| E | Entities | Domain objects and their relationships |
| A | Approach | Strategy and design pattern choice |
| S | Structure | Where in the system does this change live? |
| O | Operations | Concrete, testable implementation steps with method signatures |
| N | Norms | Cross-cutting standards: naming, observability, defensive coding |
| S | Safeguards | Non-negotiable invariants: security rules, performance limits |
The SPDD Workflow
Golden rule: When reality diverges, fix the prompt first — then update the code.
Three Core Skills
Abstraction first — design before you generate. Clarify what objects exist, how they collaborate, and where boundaries are, before writing any prompt. Beginner version: draw a box diagram before writing a single prompt.
Alignment — lock intent before you write code. Make "what we will do / won't do" explicit upfront. Agree on standards and hard constraints. Beginner version: write acceptance criteria in the prompt before asking for code.
Iterative review — turn output into a controlled loop. Reviews focus on intent, not "spot the bug." Beginner version: always read the generated code against the spec before running it.
When SPDD fits
| Rating | Scenario |
|---|---|
| ⭐⭐⭐⭐⭐ | Repeated business logic, many similar APIs, long-term maintainability |
| ⭐⭐⭐⭐⭐ | High compliance, strict architectural or security rules |
| ⭐⭐⭐⭐ | Team delivery — traceable, reviewable changes |
| ⭐⭐ | Emergency hotfixes (speed > discipline) |
| ⭐ | Exploratory spikes, one-off scripts, poorly-defined domains |
12. Proven Production Patterns
Pattern 1 — The Tight Loop (Ralph loop)
Each iteration spawns a fresh agent instance with clean context:
- Pick one story from
prd.json(passes: false) - Implement it
- Run tests + typecheck + lint
- On pass: git commit, mark story done
- Repeat
Memory persists outside the context window — in git, progress.txt, and AGENTS.md — never in bloated chat history.
Pattern 2 — Deterministic Back-Pressure
Tests, type-checkers, and linters act as governors: the model cannot declare victory without hard signals from the environment. Per OpenAI's Codex experiment:
- Enforce boundaries centrally (custom lints with remediation messages baked into the error text)
- Allow autonomy locally
- Humans steer, agents execute, infrastructure enforces taste
Pattern 3 — Agent-Legible Repositories (OpenAI)
Anything the agent cannot see in-context effectively does not exist — Slack threads, verbal agreements, Google Docs. Push all knowledge into:
- Versioned
docs/with structured sub-folders - Execution plans with decision logs
- Auto-generated schema files
- Observable logs (LogQL, PromQL) the agent can query directly
- A dedicated "doc-gardening" agent that opens PRs when docs drift from code
Pattern 4 — Sensors with Self-Correction Guidance (Fowler)
From Böckeler's sensors article — sensor error messages must include how to fix the problem, not just flag it:
# Bad sensor message (human-readable only):
ERROR: Function exceeds complexity threshold
# Good sensor message (agent-readable with guidance):
ERROR: Function exceeds complexity threshold.
[Guidance: Refactor into smaller functions with single responsibilities.
If a refactoring is truly not possible in this case, you may increase
the threshold by 1, but document why in a comment.]
Sensor schedule:
| When | Type | Examples |
|---|---|---|
| During coding session | Fast, continuous | TypeScript checker, ESLint, dependency-cruiser, test suite |
| CI pipeline | Post-integration | Same sensors run on clean infrastructure |
| Recurring / scheduled | Drift detection | Security review, modularity review, dependency freshness |
13. Examples & Code Snippets
AGENTS.md — progressive disclosure
# Agent operating instructions
## Repository map
- Architecture: docs/ARCHITECTURE.md
- Quality bar: docs/QUALITY.md
- Active plan: docs/exec-plans/active/<task-id>.md
## Rules (always apply)
1. Run tests before declaring a task complete.
2. Never commit secrets or edit .env files.
3. Update todo.md every step — re-read it before each tool call.
## Tools
- Use bash only inside the project sandbox.
- For UI bugs: use browser skill (screenshot + DOM snapshot).
## When stuck
- Read docs/references/ for library-specific LLM guides.
- If verification fails twice, stop and summarize blockers.
docs/ layout (OpenAI agent-first repo)
AGENTS.md # ~100 lines — table of contents only
docs/
├── design-docs/
│ ├── index.md
│ └── core-beliefs.md
├── exec-plans/
│ ├── active/
│ ├── completed/
│ └── tech-debt-tracker.md
├── generated/
│ └── db-schema.md
├── references/ # LLM-friendly library guides
├── DESIGN.md
├── FRONTEND.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md
todo.md — fight goal drift
# Task: Add rate limiting to POST /api/ingest
## Global goal
Ship rate limiting with tests; no breaking changes to existing clients.
## Current focus
- [ ] Implement token-bucket middleware
- [x] Add unit tests for limiter
- [ ] Update OpenAPI spec
- [ ] Run full test suite + lint
## Notes from last step
- Used 100 req/min per API key; config via RATE_LIMIT_RPM env.
Phase-based tool gating
from enum import Enum
class Phase(Enum):
PLAN = "plan"
EXECUTE = "execute"
VERIFY = "verify"
TOOLS_BY_PHASE = {
Phase.PLAN: ["read_file", "search_repo", "write_todo"],
Phase.EXECUTE: ["read_file", "write_file", "run_shell", "run_tests"],
Phase.VERIFY: ["read_file", "run_shell", "run_tests", "declare_complete"],
}
def tools_for_phase(phase: Phase, all_tools: list) -> list:
allowed = set(TOOLS_BY_PHASE[phase])
return [t for t in all_tools if t.name in allowed]
Minimal harness loop
MAX_STEPS = 25
MAX_COST_USD = 2.00
def run_agent(task: str, model, tools, sandbox) -> AgentResult:
state = HarnessState(task=task, phase=Phase.PLAN, step=0, cost=0.0)
messages = [system_prompt(), user_message(task), load_todo_md()]
while state.step < MAX_STEPS and state.cost < MAX_COST_USD:
state.step += 1
active_tools = tools_for_phase(state.phase, tools)
response = model.complete(messages, tools=active_tools)
state.cost += response.usage.cost_usd
trace.record(response)
if response.tool_calls:
for call in response.tool_calls:
if violates_policy(call):
messages.append(policy_violation_message(call))
continue
result = sandbox.execute(call)
messages.append(tool_result_message(call, result))
state.phase = Phase.VERIFY if should_verify(state) else Phase.EXECUTE
continue
if state.phase == Phase.VERIFY and verification_passed(sandbox):
return AgentResult.ok(trace=trace.export())
if loop_detected(trace):
return AgentResult.escalate("Stuck in loop", trace=trace.export())
return AgentResult.timeout(trace=trace.export())
Verification gate — declare_complete
def declare_complete(sandbox) -> tuple[bool, str]:
"""Harness-owned: model cannot skip this."""
checks = [
("pytest", ["pytest", "-q", "--tb=short"]),
("lint", ["ruff", "check", "."]),
("types", ["mypy", "src/"]),
]
failures = []
for name, cmd in checks:
proc = sandbox.run(cmd, timeout=300)
if proc.exit_code != 0:
failures.append(f"{name} failed:\n{proc.stderr[-2000:]}")
if failures:
return False, "\n\n".join(failures)
return True, "All verification checks passed."
ESLint sensor with self-correction guidance
// .eslintrc — AI failure-mode rules
{
"rules": {
"max-params": ["error", 4],
"max-lines": ["error", 300],
"max-lines-per-function": ["error", 50],
"complexity": ["error", 10]
}
}
// Custom formatter message for no-explicit-any:
// "We want things to be typed to make it easier to avoid errors, especially
// for key concepts. But avoid cluttering the codebase with unnecessary types.
// Make a judgment call. If you choose not to introduce a type, suppress it
// with: // eslint-disable-next-line @typescript-eslint/no-explicit-any -- (reason)"
dependency-cruiser layer rule
// .dependency-cruiser.js
{
name: "clients-no-services",
comment: "API clients must not depend on the orchestration layer above them. " +
"[Layers: routes -> services -> clients + domain]",
severity: "error",
from: { path: "^server/clients/", pathNot: "/__tests__/" },
to: { path: "^server/services/" },
}
14. Harness Engineering Checklist
Use this before shipping any agent to production or before changing a prompt or model:
| Area | Question to answer |
|---|---|
| Termination | What counts as "done"? Is there a max-step and cost budget? |
| Context | What is always in-window vs progressively loaded on demand? |
| Tools | Least privilege per task phase? Retry semantics on failure? |
| Verification | Deterministic checks gate completion — not just model confidence? |
| Observability | Full trace: tool calls, inputs, outputs, latency, cost? |
| Evals | Regression suite runs before any prompt or model change? |
| Safety | Guardrails intercept before any user-visible or prod-affecting action? |
| Recovery | Agent can resume after tool or environment failure? |
| Memory | Context compaction policy defined? Eviction priority order explicit? |
| Reporting | Documenting model AND harness together, not model alone? |
Takeaway: Harness engineering is unglamorous — it is the scaffolding, feedback loops, and governance work that turns a capable model into a reliable, auditable agent. The teams doing it seriously are the ones building things that last.
15. Glossary
Core Concepts
| Term |
|---|
| LLM (Large Language Model) |
| Foundation Model |
| AI Agent |
| Harness |
| System Prompt |
| Tool / Tool Call |
| MCP (Model Context Protocol) |
| Context Window |
Harness Mechanics
| Term |
|---|
| Agentic Loop |
| Guardrail |
| Verification Gate |
| Sandbox |
| Compaction |
| Observability / Tracing |
| Orchestration |
| Hook |
Files and Patterns
| Term |
|---|
| AGENTS.md |
| todo.md |
| prd.json |
| Ralph Loop |
| ReAct Loop |
| Ratchet (Osmani) |
| Eval / Evaluation |
| HarnessCard |
Fowler's Vocabulary
| Term |
|---|
| Feedforward / Guides |
| Feedback / Sensors |
| Computational sensor |
| Inferential sensor |
| Harnessability |
16. Further Reading
Primary Sources
- LangChain — The Anatomy of an Agent Harness (Vivek Trivedy, March 2026)
- Martin Fowler — Harness engineering for coding agent users (Birgitta Böckeler, April 2026)
- Martin Fowler — Maintainability sensors for coding agents (Birgitta Böckeler, May 2026)
- Martin Fowler — Structured Prompt-Driven Development (SPDD) (Thoughtworks, 2026)
- OpenAI — Harness engineering: leveraging Codex in an agent-first world (Ryan Lopopolo, February 2026)
- Addy Osmani — Agent Harness Engineering (synthesis, 2026)















Top comments (1)
This is an amazing breakdown of harness engineering! I love the formula 'Agent = Model + Harness.' It makes so much sense because people always blame the AI model when a project fails, but usually, it's just the tools and infrastructure around it that broke.
Since building a production-ready harness layer with tools and sandboxes is so much work, developers should check out vectoralix.com. It is a managed platform built specifically for publishing and hosting MCP servers. It handles the infrastructure part of the harness so you can just focus on your agent's core logic. Thanks for writing this!