Akash Thakur

Posted on May 28

Harness Engineering for AI Agents

#ai #agents #llm

Agent = Model + Harness. If you're not the model, you're the harness.

If you have shipped an AI agent that worked brilliantly in a demo and then fell apart in production, you have already met the harness problem. The model was probably fine. The scaffolding around it — instructions, tools, loops, verification, logging — was not.

1. Start Here

New to AI agents? Read this section first. Skip ahead if you already know what an LLM, tool call, or agent loop is.

What is an LLM? (The brain)

An LLM (Large Language Model) — like GPT-4o, Claude, or Gemini — is a powerful text prediction system. You give it a message, it gives you a reply. By itself, an LLM:

Cannot remember anything between conversations
Cannot open files, run code, or browse the web on its own
Cannot take actions in the real world
Cannot know if its answer was actually correct

Analogy: Think of an LLM as an incredibly smart consultant locked in a room with no phone, no computer, and no memory of your last meeting. They can answer any question brilliantly — but only using what's on the piece of paper you slide under the door.

What is an AI Agent? (Brain + hands)

An AI Agent is an LLM that has been given the ability to take actions: read files, write code, run commands, call APIs. The key differences from a chatbot:

	Chatbot	Agent
Steps	One reply	Multi-step chain
Tools	None	Shell, browser, APIs, files
Memory	None	Persisted across steps
Verification	None	Checks its own work
Goal	Answer once	Complete a task

Analogy: Now the consultant has a phone, a laptop, and a to-do list. They can look things up, send emails, and keep working until the project is finished.

What is the Harness?

The harness is everything you build around the LLM to make an agent work safely and reliably. The LLM is the brain; the harness is the operating system, the safeguards, the toolbox, and the rulebook.

Analogy — the robot suit: Imagine Iron Man. Tony Stark (the LLM) is the genius — but without the suit (the harness), he's just a man in a t-shirt. The suit provides tools (repulsors, flight), safety systems (shields, ejection), memory (JARVIS), constraints (altitude limits), and observability (HUD). The suit doesn't make Tony smarter — it makes his intelligence actually useful in the world.

The five things a harness does:

Gives the agent its instructions and knowledge — system prompts, documentation links, task files
Gives the agent its tools — shell, file editor, API caller; controls which tools are available per step
Runs the loop — calls the model, executes tool calls, feeds results back, repeats until done; sets step/cost/time limits
Checks the work — runs tests, linters, type-checkers before the agent says "done"; the agent can't fake this
Keeps records — every tool call, every model response, every error is logged for debugging and auditing

According to LangChain's research: simply changing the harness while keeping the same model moved a coding agent from rank #30 to rank #5 on a public benchmark.

Quick check before continuing:

What can an LLM not do by itself? (take actions, remember things, access the internet)

What makes something an "agent" vs a chatbot? (uses tools and loops until task is done)

What is the "harness"? (everything around the LLM — prompts, tools, loop, checks, logging)

How the agent loop works

2. What Is Harness Engineering?

Harness engineering is the practice of designing everything around an AI model — the instructions you give it, the tools you connect, the memory system, the sandbox it runs in, the verification checks, the cost limits, and the logging — so that a raw LLM becomes a dependable agent that can complete real tasks reliably.

Shared insight across all sources: The gap between what today's models can do and what you see them do is largely a harness gap. A decent model with a great harness beats a great model with a bad one.

3. The Agent Formula

Agent = Model + Harness

A raw LLM produces text. It becomes an agent only when a harness supplies:

State — memory across steps
Tool execution — the ability to act
Termination condition — what "done" means
Feedback loops — knowing if actions worked
Enforceable constraints — rules it cannot break

Anthropic is explicit: when you evaluate "an agent" you are evaluating harness + model together — the two cannot be separated in practice.

4. Anatomy of an Agent Harness

The harness is every piece of code, configuration, and execution logic that is not the model itself.

Component	What it does
System prompts	Role definition, output format constraints, planning reminders injected each step
Tools & MCP	APIs, filesystem, browser, DB queries — with clear schemas and failure contracts
Sandbox / runtime	Docker containers, VMs — isolated and disposable per task
Orchestration	Subagent spawning, handoffs, model routing, parallel worker coordination
Hooks & middleware	Compaction triggers, lint/test gates, continuation checks between steps
Observability	Full traces, span timings, cost per run, tool-level failure attribution

5. LangChain Harness Diagrams

LangChain's anatomy post (Vivek Trivedy, March 2026) uses three core figures:

Figure 1 — Agent Harness: hub around the model

Figure 2 — Desired behaviour → harness design

Figure 3 — Model–harness co-evolution loop

Figure 4 — ReAct loop

Figure 5 — Long-horizon stack (compounding primitives)

Key insight: The best harness for your task is not always the one the model was post-trained on. Swapping only the harness moved a coding agent from Top 30 to Top 5 on Terminal Bench 2.0.

6. CAR Model: Control · Agency · Runtime

A rigorous lens for designing and reporting agent systems:

Dimension	Plain English	Examples
Control	The rules — what's the agent allowed to do?	System prompt, guardrails, budget limits, stop conditions
Agency	The hands — what tools can it use?	Tool schemas, permission levels, planning logic
Runtime	The environment — where does it run?	Loop, sandboxes, memory files, crash recovery

Debugging heuristic: When your agent misbehaves, ask — is this a Control problem (bad rules), an Agency problem (wrong tools), or a Runtime problem (environment issue)?

7. Four Levers of Harness Design

Four independent levers you can tune without swapping the underlying model:

Lever 1 — Context Design (what the model knows and when)

Context is finite. Treat it as a curated, perishable resource:

Short, stable entry point: AGENTS.md (~100 lines) as table of contents, not encyclopedia
todo.md pattern: agent re-reads its plan each step to prevent goal drift
Structured docs/ directory enforced by linters and a recurring "doc-gardening" agent

Lever 2 — Tool Selection (least privilege per phase)

A framework registers tools; a harness decides which tools this agent may use for this specific task:

Gate access by workflow phase: plan → execute → verify
Specify what happens on failure: retry with backoff, fall back, or escalate to human
Tool allowlists, retry budgets, and clear failure contracts are harness-layer concerns

Lever 3 — Constraint Management (budgets, loops, "done")

Max iterations and time budgets injected into the prompt
Loop detection — if the same action repeats N times, escalate
Self-verification gate — declare_complete only after deterministic checks pass
Cost cap — abort or ask for human approval when token spend exceeds threshold

Lever 4 — Production Hardening

Sandboxing — ephemeral containers per task; if one dies, re-provision
Authentication — inject credentials securely, never hardcode
Memory compaction — summarise old turns; evict stale content on policy
Structured logging — every tool call with input, output, latency, cost
Eval suites — regression tests before any prompt or model change
Human-on-the-loop — escalation path when confidence or budget thresholds trip

8. The Agentic Loop

The harness owns the loop — its termination rules, its stuck-detection, its escalation paths. "Call the model until done" is not a harness; it is a bug.

Harness Maturity Ladder

Level	Name	What it produces
H0	Minimal	Model output only. No traces, no failure attribution. OK for low-stakes prototypes.
H1	Basic tools	Reproduction logs and tool-call traces. You can replay what happened.
H2	Verification	Deterministic requirement checks and failure attribution. Diffs are reviewable.
H3	Full substrate	Structured verification reports, intervention logs, entropy audits, signed episode package.

For beginners: Start at H1 (basic traces). Move to H2 once you add verification gates. Target H3 only for production or auditability requirements.

9. Evaluation Harness vs Agent Harness

Two distinct concepts that share vocabulary:

	Agent harness	Evaluation harness
Purpose	Run agents in dev or production	Score agents on fixed task suites
Output	User-facing results, PRs, real actions	Pass/fail rates, regression diffs, metrics
Graders	Inline verification (tests, linters)	Deterministic checks + LLM-as-judge + human rubrics
Isolation	Per-task sandboxes in production	Fresh containers per episode; no shared state
Tools	LangGraph, Claude Agent SDK, custom loops	LangSmith, Braintrust, Phoenix

Anthropic's advice: Run end-to-end evals in fully isolated environments. Maintain both capability evals (can the agent do the task?) and regression evals (did a harness or prompt change break what already worked?). Invest most energy in high-quality test cases — not the framework.

10. Framework vs Harness

A common mistake is treating an agent framework as a complete harness. They operate at different layers:

Agent Framework	Your Harness (on top)
LangChain, LangGraph, CrewAI, AutoGen, BAML	What "done" means and max iterations
Tool registration and memory APIs	Context eviction & compaction policy
Reusable primitives and integrations	Per-task tool allowlists & retry semantics
Provides the loop — does not decide loop rules	Verification gates before committing output
Excellent, but not sufficient alone	Observability, cost caps, escalation paths

Key finding: A well-built harness can make a weaker model outperform a stronger one in a poorly-scaffolded system. Reliability is a property of the full model–harness–environment system — not of weights alone.

11. SPDD: Structured Prompt-Driven Development

From the Thoughtworks / Martin Fowler article: while individual developers get faster with AI, teams often don't — because more code is generated faster than it can be aligned, reviewed, and governed. SPDD treats prompts as first-class engineering artifacts.

The REASONS Canvas

A seven-section template that forces clarity before any code is generated:

Section	Dimension	What it captures
R	Requirements	What problem are we solving? What is the definition of done?
E	Entities	Domain objects and their relationships
A	Approach	Strategy and design pattern choice
S	Structure	Where in the system does this change live?
O	Operations	Concrete, testable implementation steps with method signatures
N	Norms	Cross-cutting standards: naming, observability, defensive coding
S	Safeguards	Non-negotiable invariants: security rules, performance limits

The SPDD Workflow

Golden rule: When reality diverges, fix the prompt first — then update the code.

Three Core Skills

Abstraction first — design before you generate. Clarify what objects exist, how they collaborate, and where boundaries are, before writing any prompt. Beginner version: draw a box diagram before writing a single prompt.
Alignment — lock intent before you write code. Make "what we will do / won't do" explicit upfront. Agree on standards and hard constraints. Beginner version: write acceptance criteria in the prompt before asking for code.
Iterative review — turn output into a controlled loop. Reviews focus on intent, not "spot the bug." Beginner version: always read the generated code against the spec before running it.

When SPDD fits

Rating	Scenario
⭐⭐⭐⭐⭐	Repeated business logic, many similar APIs, long-term maintainability
⭐⭐⭐⭐⭐	High compliance, strict architectural or security rules
⭐⭐⭐⭐	Team delivery — traceable, reviewable changes
⭐⭐	Emergency hotfixes (speed > discipline)
⭐	Exploratory spikes, one-off scripts, poorly-defined domains

12. Proven Production Patterns

Pattern 1 — The Tight Loop (Ralph loop)

Each iteration spawns a fresh agent instance with clean context:

Pick one story from prd.json (passes: false)
Implement it
Run tests + typecheck + lint
On pass: git commit, mark story done
Repeat

Memory persists outside the context window — in git, progress.txt, and AGENTS.md — never in bloated chat history.

Pattern 2 — Deterministic Back-Pressure

Tests, type-checkers, and linters act as governors: the model cannot declare victory without hard signals from the environment. Per OpenAI's Codex experiment:

Enforce boundaries centrally (custom lints with remediation messages baked into the error text)
Allow autonomy locally
Humans steer, agents execute, infrastructure enforces taste

Pattern 3 — Agent-Legible Repositories (OpenAI)

Anything the agent cannot see in-context effectively does not exist — Slack threads, verbal agreements, Google Docs. Push all knowledge into:

Versioned docs/ with structured sub-folders
Execution plans with decision logs
Auto-generated schema files
Observable logs (LogQL, PromQL) the agent can query directly
A dedicated "doc-gardening" agent that opens PRs when docs drift from code

Pattern 4 — Sensors with Self-Correction Guidance (Fowler)

From Böckeler's sensors article — sensor error messages must include how to fix the problem, not just flag it:

# Bad sensor message (human-readable only):
ERROR: Function exceeds complexity threshold

# Good sensor message (agent-readable with guidance):
ERROR: Function exceeds complexity threshold.
[Guidance: Refactor into smaller functions with single responsibilities.
 If a refactoring is truly not possible in this case, you may increase
 the threshold by 1, but document why in a comment.]

Sensor schedule:

When	Type	Examples
During coding session	Fast, continuous	TypeScript checker, ESLint, dependency-cruiser, test suite
CI pipeline	Post-integration	Same sensors run on clean infrastructure
Recurring / scheduled	Drift detection	Security review, modularity review, dependency freshness

13. Examples & Code Snippets

AGENTS.md — progressive disclosure

# Agent operating instructions

## Repository map
- Architecture: docs/ARCHITECTURE.md
- Quality bar: docs/QUALITY.md
- Active plan: docs/exec-plans/active/<task-id>.md

## Rules (always apply)
1. Run tests before declaring a task complete.
2. Never commit secrets or edit .env files.
3. Update todo.md every step — re-read it before each tool call.

## Tools
- Use bash only inside the project sandbox.
- For UI bugs: use browser skill (screenshot + DOM snapshot).

## When stuck
- Read docs/references/ for library-specific LLM guides.
- If verification fails twice, stop and summarize blockers.

docs/ layout (OpenAI agent-first repo)

AGENTS.md              # ~100 lines — table of contents only
docs/
├── design-docs/
│   ├── index.md
│   └── core-beliefs.md
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── references/          # LLM-friendly library guides
├── DESIGN.md
├── FRONTEND.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md

todo.md — fight goal drift

# Task: Add rate limiting to POST /api/ingest

## Global goal
Ship rate limiting with tests; no breaking changes to existing clients.

## Current focus
- [ ] Implement token-bucket middleware
- [x] Add unit tests for limiter
- [ ] Update OpenAPI spec
- [ ] Run full test suite + lint

## Notes from last step
- Used 100 req/min per API key; config via RATE_LIMIT_RPM env.

Phase-based tool gating

from enum import Enum

class Phase(Enum):
    PLAN = "plan"
    EXECUTE = "execute"
    VERIFY = "verify"

TOOLS_BY_PHASE = {
    Phase.PLAN:    ["read_file", "search_repo", "write_todo"],
    Phase.EXECUTE: ["read_file", "write_file", "run_shell", "run_tests"],
    Phase.VERIFY:  ["read_file", "run_shell", "run_tests", "declare_complete"],
}

def tools_for_phase(phase: Phase, all_tools: list) -> list:
    allowed = set(TOOLS_BY_PHASE[phase])
    return [t for t in all_tools if t.name in allowed]

Minimal harness loop

MAX_STEPS = 25
MAX_COST_USD = 2.00

def run_agent(task: str, model, tools, sandbox) -> AgentResult:
    state = HarnessState(task=task, phase=Phase.PLAN, step=0, cost=0.0)
    messages = [system_prompt(), user_message(task), load_todo_md()]

    while state.step < MAX_STEPS and state.cost < MAX_COST_USD:
        state.step += 1
        active_tools = tools_for_phase(state.phase, tools)

        response = model.complete(messages, tools=active_tools)
        state.cost += response.usage.cost_usd
        trace.record(response)

        if response.tool_calls:
            for call in response.tool_calls:
                if violates_policy(call):
                    messages.append(policy_violation_message(call))
                    continue
                result = sandbox.execute(call)
                messages.append(tool_result_message(call, result))
            state.phase = Phase.VERIFY if should_verify(state) else Phase.EXECUTE
            continue

        if state.phase == Phase.VERIFY and verification_passed(sandbox):
            return AgentResult.ok(trace=trace.export())
        if loop_detected(trace):
            return AgentResult.escalate("Stuck in loop", trace=trace.export())

    return AgentResult.timeout(trace=trace.export())

Verification gate — declare_complete

def declare_complete(sandbox) -> tuple[bool, str]:
    """Harness-owned: model cannot skip this."""
    checks = [
        ("pytest", ["pytest", "-q", "--tb=short"]),
        ("lint",   ["ruff", "check", "."]),
        ("types",  ["mypy", "src/"]),
    ]
    failures = []
    for name, cmd in checks:
        proc = sandbox.run(cmd, timeout=300)
        if proc.exit_code != 0:
            failures.append(f"{name} failed:\n{proc.stderr[-2000:]}")
    if failures:
        return False, "\n\n".join(failures)
    return True, "All verification checks passed."

ESLint sensor with self-correction guidance

// .eslintrc — AI failure-mode rules
{
  "rules": {
    "max-params": ["error", 4],
    "max-lines": ["error", 300],
    "max-lines-per-function": ["error", 50],
    "complexity": ["error", 10]
  }
}

// Custom formatter message for no-explicit-any:
// "We want things to be typed to make it easier to avoid errors, especially
//  for key concepts. But avoid cluttering the codebase with unnecessary types.
//  Make a judgment call. If you choose not to introduce a type, suppress it
//  with: // eslint-disable-next-line @typescript-eslint/no-explicit-any -- (reason)"

dependency-cruiser layer rule

// .dependency-cruiser.js
{
  name: "clients-no-services",
  comment: "API clients must not depend on the orchestration layer above them. " +
           "[Layers: routes -> services -> clients + domain]",
  severity: "error",
  from: { path: "^server/clients/", pathNot: "/__tests__/" },
  to:   { path: "^server/services/" },
}

14. Harness Engineering Checklist

Use this before shipping any agent to production or before changing a prompt or model:

Area	Question to answer
Termination	What counts as "done"? Is there a max-step and cost budget?
Context	What is always in-window vs progressively loaded on demand?
Tools	Least privilege per task phase? Retry semantics on failure?
Verification	Deterministic checks gate completion — not just model confidence?
Observability	Full trace: tool calls, inputs, outputs, latency, cost?
Evals	Regression suite runs before any prompt or model change?
Safety	Guardrails intercept before any user-visible or prod-affecting action?
Recovery	Agent can resume after tool or environment failure?
Memory	Context compaction policy defined? Eviction priority order explicit?
Reporting	Documenting model AND harness together, not model alone?

Takeaway: Harness engineering is unglamorous — it is the scaffolding, feedback loops, and governance work that turns a capable model into a reliable, auditable agent. The teams doing it seriously are the ones building things that last.

15. Glossary

Core Concepts

Term
LLM (Large Language Model)
Foundation Model
AI Agent
Harness
System Prompt
Tool / Tool Call
MCP (Model Context Protocol)
Context Window

Harness Mechanics

Term
Agentic Loop
Guardrail
Verification Gate
Sandbox
Compaction
Observability / Tracing
Orchestration
Hook

Files and Patterns

Term
AGENTS.md
todo.md
prd.json
Ralph Loop
ReAct Loop
Ratchet (Osmani)
Eval / Evaluation
HarnessCard

Fowler's Vocabulary

Term
Feedforward / Guides
Feedback / Sensors
Computational sensor
Inferential sensor
Harnessability

16. Further Reading

Primary Sources

LangChain — The Anatomy of an Agent Harness (Vivek Trivedy, March 2026)
Martin Fowler — Harness engineering for coding agent users (Birgitta Böckeler, April 2026)
Martin Fowler — Maintainability sensors for coding agents (Birgitta Böckeler, May 2026)
Martin Fowler — Structured Prompt-Driven Development (SPDD) (Thoughtworks, 2026)
OpenAI — Harness engineering: leveraging Codex in an agent-first world (Ryan Lopopolo, February 2026)
Addy Osmani — Agent Harness Engineering (synthesis, 2026)

Related Engineering & Research

Top comments (2)

Eugene Maiorov • May 28

This is an amazing breakdown of harness engineering! I love the formula 'Agent = Model + Harness.' It makes so much sense because people always blame the AI model when a project fails, but usually, it's just the tools and infrastructure around it that broke.

Since building a production-ready harness layer with tools and sandboxes is so much work, developers should check out vectoralix.com. It is a managed platform built specifically for publishing and hosting MCP servers. It handles the infrastructure part of the harness so you can just focus on your agent's core logic. Thanks for writing this!

Harjot Singh • May 31

"Agent = Model + Harness. If you're not the model, you're the harness." That's the whole thing in one line, and almost nobody internalizes it until they've shipped a demo that died in production. The reason it's so counterintuitive is that the demo makes the model look like the product, so when it falls apart people reach for a better model, when the actual failure was instructions, tools, loops, verification, logging, exactly your list. The harness is where engineering lives now, the model is a component you rent. The two pieces I'd put at the top of the harness hierarchy: verification (the agent must be able to check its own work or abstain, not just emit confidently) and bounded loops (most production blowups are an agent retrying forever or escalating its own scope). Get those two and the rest is plumbing. This is the entire bet behind what I'm building with Moonshift, the model is commoditizing, the durable moat is the harness around it. Genuinely one of the clearest framings I've seen on dev.to. Where do you draw the line on how much the harness should constrain vs let the model decide, fixed graph for known flows, free loop only for the unknown?