DEV Community

Cover image for Harness Engineering for AI Agents
Akash Thakur
Akash Thakur

Posted on

Harness Engineering for AI Agents

Agent = Model + Harness. If you're not the model, you're the harness.

Model + Harness description


If you have shipped an AI agent that worked brilliantly in a demo and then fell apart in production, you have already met the harness problem. The model was probably fine. The scaffolding around it — instructions, tools, loops, verification, logging — was not.


1. Start Here

New to AI agents? Read this section first. Skip ahead if you already know what an LLM, tool call, or agent loop is.

iamge 1 description

What is an LLM? (The brain)

An LLM (Large Language Model) — like GPT-4o, Claude, or Gemini — is a powerful text prediction system. You give it a message, it gives you a reply. By itself, an LLM:

  • Cannot remember anything between conversations
  • Cannot open files, run code, or browse the web on its own
  • Cannot take actions in the real world
  • Cannot know if its answer was actually correct

Analogy: Think of an LLM as an incredibly smart consultant locked in a room with no phone, no computer, and no memory of your last meeting. They can answer any question brilliantly — but only using what's on the piece of paper you slide under the door.

What is an AI Agent? (Brain + hands)

An AI Agent is an LLM that has been given the ability to take actions: read files, write code, run commands, call APIs. The key differences from a chatbot:

Chatbot Agent
Steps One reply Multi-step chain
Tools None Shell, browser, APIs, files
Memory None Persisted across steps
Verification None Checks its own work
Goal Answer once Complete a task

Analogy: Now the consultant has a phone, a laptop, and a to-do list. They can look things up, send emails, and keep working until the project is finished.

What is the Harness?

The harness is everything you build around the LLM to make an agent work safely and reliably. The LLM is the brain; the harness is the operating system, the safeguards, the toolbox, and the rulebook.

Analogy — the robot suit: Imagine Iron Man. Tony Stark (the LLM) is the genius — but without the suit (the harness), he's just a man in a t-shirt. The suit provides tools (repulsors, flight), safety systems (shields, ejection), memory (JARVIS), constraints (altitude limits), and observability (HUD). The suit doesn't make Tony smarter — it makes his intelligence actually useful in the world.

Harness description

The five things a harness does:

  1. Gives the agent its instructions and knowledge — system prompts, documentation links, task files
  2. Gives the agent its tools — shell, file editor, API caller; controls which tools are available per step
  3. Runs the loop — calls the model, executes tool calls, feeds results back, repeats until done; sets step/cost/time limits
  4. Checks the work — runs tests, linters, type-checkers before the agent says "done"; the agent can't fake this
  5. Keeps records — every tool call, every model response, every error is logged for debugging and auditing

According to LangChain's research: simply changing the harness while keeping the same model moved a coding agent from rank #30 to rank #5 on a public benchmark.

Quick check before continuing:

  • What can an LLM not do by itself? (take actions, remember things, access the internet)
  • What makes something an "agent" vs a chatbot? (uses tools and loops until task is done)
  • What is the "harness"? (everything around the LLM — prompts, tools, loop, checks, logging)

How the agent loop works

ahent loop


2. What Is Harness Engineering?

Harness engineering is the practice of designing everything around an AI model — the instructions you give it, the tools you connect, the memory system, the sandbox it runs in, the verification checks, the cost limits, and the logging — so that a raw LLM becomes a dependable agent that can complete real tasks reliably.

Shared insight across all sources: The gap between what today's models can do and what you see them do is largely a harness gap. A decent model with a great harness beats a great model with a bad one.

harness engennimg image


3. The Agent Formula

Agent = Model + Harness
Enter fullscreen mode Exit fullscreen mode

A raw LLM produces text. It becomes an agent only when a harness supplies:

  • State — memory across steps
  • Tool execution — the ability to act
  • Termination condition — what "done" means
  • Feedback loops — knowing if actions worked
  • Enforceable constraints — rules it cannot break

Anthropic is explicit: when you evaluate "an agent" you are evaluating harness + model together — the two cannot be separated in practice.

agent formula

4. Anatomy of an Agent Harness

The harness is every piece of code, configuration, and execution logic that is not the model itself.

Component What it does
System prompts Role definition, output format constraints, planning reminders injected each step
Tools & MCP APIs, filesystem, browser, DB queries — with clear schemas and failure contracts
Sandbox / runtime Docker containers, VMs — isolated and disposable per task
Orchestration Subagent spawning, handoffs, model routing, parallel worker coordination
Hooks & middleware Compaction triggers, lint/test gates, continuation checks between steps
Observability Full traces, span timings, cost per run, tool-level failure attribution

Anatomy description

5. LangChain Harness Diagrams

LangChain's anatomy post (Vivek Trivedy, March 2026) uses three core figures:

Figure 1 — Agent Harness: hub around the model

hub harness

Figure 2 — Desired behaviour → harness design

harness design

Figure 3 — Model–harness co-evolution loop

model harness

Figure 4 — ReAct loop

react loop

Figure 5 — Long-horizon stack (compounding primitives)

Long-horizon stack

Key insight: The best harness for your task is not always the one the model was post-trained on. Swapping only the harness moved a coding agent from Top 30 to Top 5 on Terminal Bench 2.0.


6. CAR Model: Control · Agency · Runtime

A rigorous lens for designing and reporting agent systems:

Dimension Plain English Examples
Control The rules — what's the agent allowed to do? System prompt, guardrails, budget limits, stop conditions
Agency The hands — what tools can it use? Tool schemas, permission levels, planning logic
Runtime The environment — where does it run? Loop, sandboxes, memory files, crash recovery

Debugging heuristic: When your agent misbehaves, ask — is this a Control problem (bad rules), an Agency problem (wrong tools), or a Runtime problem (environment issue)?

car model


7. Four Levers of Harness Design

Four independent levers you can tune without swapping the underlying model:

Lever 1 — Context Design (what the model knows and when)

Context is finite. Treat it as a curated, perishable resource:

  • Short, stable entry point: AGENTS.md (~100 lines) as table of contents, not encyclopedia
  • todo.md pattern: agent re-reads its plan each step to prevent goal drift
  • Structured docs/ directory enforced by linters and a recurring "doc-gardening" agent

Lever 2 — Tool Selection (least privilege per phase)

A framework registers tools; a harness decides which tools this agent may use for this specific task:

  • Gate access by workflow phase: plan → execute → verify
  • Specify what happens on failure: retry with backoff, fall back, or escalate to human
  • Tool allowlists, retry budgets, and clear failure contracts are harness-layer concerns

Lever 3 — Constraint Management (budgets, loops, "done")

  • Max iterations and time budgets injected into the prompt
  • Loop detection — if the same action repeats N times, escalate
  • Self-verification gate — declare_complete only after deterministic checks pass
  • Cost cap — abort or ask for human approval when token spend exceeds threshold

Lever 4 — Production Hardening

  • Sandboxing — ephemeral containers per task; if one dies, re-provision
  • Authentication — inject credentials securely, never hardcode
  • Memory compaction — summarise old turns; evict stale content on policy
  • Structured logging — every tool call with input, output, latency, cost
  • Eval suites — regression tests before any prompt or model change
  • Human-on-the-loop — escalation path when confidence or budget thresholds trip

8. The Agentic Loop

The harness owns the loop — its termination rules, its stuck-detection, its escalation paths. "Call the model until done" is not a harness; it is a bug.

Agentic Loop description

Harness Maturity Ladder

Level Name What it produces
H0 Minimal Model output only. No traces, no failure attribution. OK for low-stakes prototypes.
H1 Basic tools Reproduction logs and tool-call traces. You can replay what happened.
H2 Verification Deterministic requirement checks and failure attribution. Diffs are reviewable.
H3 Full substrate Structured verification reports, intervention logs, entropy audits, signed episode package.

For beginners: Start at H1 (basic traces). Move to H2 once you add verification gates. Target H3 only for production or auditability requirements.


9. Evaluation Harness vs Agent Harness

Two distinct concepts that share vocabulary:

Agent harness Evaluation harness
Purpose Run agents in dev or production Score agents on fixed task suites
Output User-facing results, PRs, real actions Pass/fail rates, regression diffs, metrics
Graders Inline verification (tests, linters) Deterministic checks + LLM-as-judge + human rubrics
Isolation Per-task sandboxes in production Fresh containers per episode; no shared state
Tools LangGraph, Claude Agent SDK, custom loops LangSmith, Braintrust, Phoenix

Anthropic's advice: Run end-to-end evals in fully isolated environments. Maintain both capability evals (can the agent do the task?) and regression evals (did a harness or prompt change break what already worked?). Invest most energy in high-quality test cases — not the framework.

eval des

10. Framework vs Harness

A common mistake is treating an agent framework as a complete harness. They operate at different layers:

Agent Framework Your Harness (on top)
LangChain, LangGraph, CrewAI, AutoGen, BAML What "done" means and max iterations
Tool registration and memory APIs Context eviction & compaction policy
Reusable primitives and integrations Per-task tool allowlists & retry semantics
Provides the loop — does not decide loop rules Verification gates before committing output
Excellent, but not sufficient alone Observability, cost caps, escalation paths

Key finding: A well-built harness can make a weaker model outperform a stronger one in a poorly-scaffolded system. Reliability is a property of the full model–harness–environment system — not of weights alone.


11. SPDD: Structured Prompt-Driven Development

From the Thoughtworks / Martin Fowler article: while individual developers get faster with AI, teams often don't — because more code is generated faster than it can be aligned, reviewed, and governed. SPDD treats prompts as first-class engineering artifacts.

The REASONS Canvas

A seven-section template that forces clarity before any code is generated:

Section Dimension What it captures
R Requirements What problem are we solving? What is the definition of done?
E Entities Domain objects and their relationships
A Approach Strategy and design pattern choice
S Structure Where in the system does this change live?
O Operations Concrete, testable implementation steps with method signatures
N Norms Cross-cutting standards: naming, observability, defensive coding
S Safeguards Non-negotiable invariants: security rules, performance limits

Canvas

The SPDD Workflow

SPDD Workflow

Golden rule: When reality diverges, fix the prompt first — then update the code.

Three Core Skills

  1. Abstraction first — design before you generate. Clarify what objects exist, how they collaborate, and where boundaries are, before writing any prompt. Beginner version: draw a box diagram before writing a single prompt.

  2. Alignment — lock intent before you write code. Make "what we will do / won't do" explicit upfront. Agree on standards and hard constraints. Beginner version: write acceptance criteria in the prompt before asking for code.

  3. Iterative review — turn output into a controlled loop. Reviews focus on intent, not "spot the bug." Beginner version: always read the generated code against the spec before running it.

When SPDD fits

Rating Scenario
⭐⭐⭐⭐⭐ Repeated business logic, many similar APIs, long-term maintainability
⭐⭐⭐⭐⭐ High compliance, strict architectural or security rules
⭐⭐⭐⭐ Team delivery — traceable, reviewable changes
⭐⭐ Emergency hotfixes (speed > discipline)
Exploratory spikes, one-off scripts, poorly-defined domains

12. Proven Production Patterns

Pattern 1 — The Tight Loop (Ralph loop)

Each iteration spawns a fresh agent instance with clean context:

  1. Pick one story from prd.json (passes: false)
  2. Implement it
  3. Run tests + typecheck + lint
  4. On pass: git commit, mark story done
  5. Repeat

Memory persists outside the context window — in git, progress.txt, and AGENTS.md — never in bloated chat history.

memory

Pattern 2 — Deterministic Back-Pressure

Tests, type-checkers, and linters act as governors: the model cannot declare victory without hard signals from the environment. Per OpenAI's Codex experiment:

  • Enforce boundaries centrally (custom lints with remediation messages baked into the error text)
  • Allow autonomy locally
  • Humans steer, agents execute, infrastructure enforces taste

Pattern 3 — Agent-Legible Repositories (OpenAI)

Anything the agent cannot see in-context effectively does not exist — Slack threads, verbal agreements, Google Docs. Push all knowledge into:

  • Versioned docs/ with structured sub-folders
  • Execution plans with decision logs
  • Auto-generated schema files
  • Observable logs (LogQL, PromQL) the agent can query directly
  • A dedicated "doc-gardening" agent that opens PRs when docs drift from code

Pattern 4 — Sensors with Self-Correction Guidance (Fowler)

From Böckeler's sensors article — sensor error messages must include how to fix the problem, not just flag it:

# Bad sensor message (human-readable only):
ERROR: Function exceeds complexity threshold

# Good sensor message (agent-readable with guidance):
ERROR: Function exceeds complexity threshold.
[Guidance: Refactor into smaller functions with single responsibilities.
 If a refactoring is truly not possible in this case, you may increase
 the threshold by 1, but document why in a comment.]
Enter fullscreen mode Exit fullscreen mode

Sensor schedule:

When Type Examples
During coding session Fast, continuous TypeScript checker, ESLint, dependency-cruiser, test suite
CI pipeline Post-integration Same sensors run on clean infrastructure
Recurring / scheduled Drift detection Security review, modularity review, dependency freshness

13. Examples & Code Snippets

AGENTS.md — progressive disclosure

# Agent operating instructions

## Repository map
- Architecture: docs/ARCHITECTURE.md
- Quality bar: docs/QUALITY.md
- Active plan: docs/exec-plans/active/<task-id>.md

## Rules (always apply)
1. Run tests before declaring a task complete.
2. Never commit secrets or edit .env files.
3. Update todo.md every step — re-read it before each tool call.

## Tools
- Use bash only inside the project sandbox.
- For UI bugs: use browser skill (screenshot + DOM snapshot).

## When stuck
- Read docs/references/ for library-specific LLM guides.
- If verification fails twice, stop and summarize blockers.
Enter fullscreen mode Exit fullscreen mode

docs/ layout (OpenAI agent-first repo)

AGENTS.md              # ~100 lines — table of contents only
docs/
├── design-docs/
│   ├── index.md
│   └── core-beliefs.md
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── references/          # LLM-friendly library guides
├── DESIGN.md
├── FRONTEND.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md
Enter fullscreen mode Exit fullscreen mode

todo.md — fight goal drift

# Task: Add rate limiting to POST /api/ingest

## Global goal
Ship rate limiting with tests; no breaking changes to existing clients.

## Current focus
- [ ] Implement token-bucket middleware
- [x] Add unit tests for limiter
- [ ] Update OpenAPI spec
- [ ] Run full test suite + lint

## Notes from last step
- Used 100 req/min per API key; config via RATE_LIMIT_RPM env.
Enter fullscreen mode Exit fullscreen mode

Phase-based tool gating

from enum import Enum

class Phase(Enum):
    PLAN = "plan"
    EXECUTE = "execute"
    VERIFY = "verify"

TOOLS_BY_PHASE = {
    Phase.PLAN:    ["read_file", "search_repo", "write_todo"],
    Phase.EXECUTE: ["read_file", "write_file", "run_shell", "run_tests"],
    Phase.VERIFY:  ["read_file", "run_shell", "run_tests", "declare_complete"],
}

def tools_for_phase(phase: Phase, all_tools: list) -> list:
    allowed = set(TOOLS_BY_PHASE[phase])
    return [t for t in all_tools if t.name in allowed]
Enter fullscreen mode Exit fullscreen mode

Minimal harness loop

MAX_STEPS = 25
MAX_COST_USD = 2.00

def run_agent(task: str, model, tools, sandbox) -> AgentResult:
    state = HarnessState(task=task, phase=Phase.PLAN, step=0, cost=0.0)
    messages = [system_prompt(), user_message(task), load_todo_md()]

    while state.step < MAX_STEPS and state.cost < MAX_COST_USD:
        state.step += 1
        active_tools = tools_for_phase(state.phase, tools)

        response = model.complete(messages, tools=active_tools)
        state.cost += response.usage.cost_usd
        trace.record(response)

        if response.tool_calls:
            for call in response.tool_calls:
                if violates_policy(call):
                    messages.append(policy_violation_message(call))
                    continue
                result = sandbox.execute(call)
                messages.append(tool_result_message(call, result))
            state.phase = Phase.VERIFY if should_verify(state) else Phase.EXECUTE
            continue

        if state.phase == Phase.VERIFY and verification_passed(sandbox):
            return AgentResult.ok(trace=trace.export())
        if loop_detected(trace):
            return AgentResult.escalate("Stuck in loop", trace=trace.export())

    return AgentResult.timeout(trace=trace.export())
Enter fullscreen mode Exit fullscreen mode

Verification gate — declare_complete

def declare_complete(sandbox) -> tuple[bool, str]:
    """Harness-owned: model cannot skip this."""
    checks = [
        ("pytest", ["pytest", "-q", "--tb=short"]),
        ("lint",   ["ruff", "check", "."]),
        ("types",  ["mypy", "src/"]),
    ]
    failures = []
    for name, cmd in checks:
        proc = sandbox.run(cmd, timeout=300)
        if proc.exit_code != 0:
            failures.append(f"{name} failed:\n{proc.stderr[-2000:]}")
    if failures:
        return False, "\n\n".join(failures)
    return True, "All verification checks passed."
Enter fullscreen mode Exit fullscreen mode

ESLint sensor with self-correction guidance

// .eslintrc — AI failure-mode rules
{
  "rules": {
    "max-params": ["error", 4],
    "max-lines": ["error", 300],
    "max-lines-per-function": ["error", 50],
    "complexity": ["error", 10]
  }
}

// Custom formatter message for no-explicit-any:
// "We want things to be typed to make it easier to avoid errors, especially
//  for key concepts. But avoid cluttering the codebase with unnecessary types.
//  Make a judgment call. If you choose not to introduce a type, suppress it
//  with: // eslint-disable-next-line @typescript-eslint/no-explicit-any -- (reason)"
Enter fullscreen mode Exit fullscreen mode

dependency-cruiser layer rule

// .dependency-cruiser.js
{
  name: "clients-no-services",
  comment: "API clients must not depend on the orchestration layer above them. " +
           "[Layers: routes -> services -> clients + domain]",
  severity: "error",
  from: { path: "^server/clients/", pathNot: "/__tests__/" },
  to:   { path: "^server/services/" },
}
Enter fullscreen mode Exit fullscreen mode

14. Harness Engineering Checklist

Use this before shipping any agent to production or before changing a prompt or model:

Area Question to answer
Termination What counts as "done"? Is there a max-step and cost budget?
Context What is always in-window vs progressively loaded on demand?
Tools Least privilege per task phase? Retry semantics on failure?
Verification Deterministic checks gate completion — not just model confidence?
Observability Full trace: tool calls, inputs, outputs, latency, cost?
Evals Regression suite runs before any prompt or model change?
Safety Guardrails intercept before any user-visible or prod-affecting action?
Recovery Agent can resume after tool or environment failure?
Memory Context compaction policy defined? Eviction priority order explicit?
Reporting Documenting model AND harness together, not model alone?

Takeaway: Harness engineering is unglamorous — it is the scaffolding, feedback loops, and governance work that turns a capable model into a reliable, auditable agent. The teams doing it seriously are the ones building things that last.


15. Glossary

Core Concepts

Term
LLM (Large Language Model)
Foundation Model
AI Agent
Harness
System Prompt
Tool / Tool Call
MCP (Model Context Protocol)
Context Window

Harness Mechanics

Term
Agentic Loop
Guardrail
Verification Gate
Sandbox
Compaction
Observability / Tracing
Orchestration
Hook

Files and Patterns

Term
AGENTS.md
todo.md
prd.json
Ralph Loop
ReAct Loop
Ratchet (Osmani)
Eval / Evaluation
HarnessCard

Fowler's Vocabulary

Term
Feedforward / Guides
Feedback / Sensors
Computational sensor
Inferential sensor
Harnessability

16. Further Reading

Primary Sources

Related Engineering & Research

Top comments (1)

Collapse
 
eugene_maiorov profile image
Eugene Maiorov

This is an amazing breakdown of harness engineering! I love the formula 'Agent = Model + Harness.' It makes so much sense because people always blame the AI model when a project fails, but usually, it's just the tools and infrastructure around it that broke.

Since building a production-ready harness layer with tools and sandboxes is so much work, developers should check out vectoralix.com. It is a managed platform built specifically for publishing and hosting MCP servers. It handles the infrastructure part of the harness so you can just focus on your agent's core logic. Thanks for writing this!