Collin Wilkins

Posted on Jun 1 • Originally published at collinwilkins.com

What Is an AI Agent Harness?

#productivity #ai #programming #devtools

This is for people trying to understand the infrastructure around large language models: Claude Code, Codex, Cursor, LangGraph, MCP servers, repo instructions, permissions, hooks, and all the agent plumbing that has popped up.

The short answer:

The model is the brain. The harness is the operating environment that makes the brain useful.

Or shorter:

Models generate answers. Harnesses generate trust.

Start with the agent

At the simplest level, an AI agent is a model inside a loop:

You give it a task
The model thinks about what to do
It answers or calls a tool
The tool does something
The result goes back to the model
The loop repeats until the task is done

That loop can be a few dozen lines of code or a whole product with permissions, memory, tools, logs, tests, and UI around it. The loop is the seed. The harness is what makes it safe and useful in real work.

Same model, different results

The easiest mistake with agents is blaming the model for everything. Sometimes the model really is the problem but it often isn't.

The same model and task can produce different output because agents are probabilistic plus the environment changes what the model sees, what it can do, how it gets checked, and when it is allowed to stop.

One setup produces a clean pull request, runs the tests, catches the edge case, and leaves a useful summary.

While another edits the wrong file, forgets the project conventions, says "done" without verification, and hands you a pile of slop.

A better model can help, a better harness narrows the variance.

Model vs. agent vs. harness

People blur these words together, so separate them first.

Term	Plain-English meaning	Example
Model	The brain that predicts, reasons, and writes	Claude, GPT, Gemini
Agent	The model plus a loop that lets it act	Claude Code fixing a bug
Harness	The system around the agent that guides and checks the work	Instructions, tools, memory, tests, hooks
Tool	Something the agent can use	Shell, browser, file search, calculator, MCP server
Memory	Context that survives beyond one prompt	`CLAUDE.md`, `AGENTS.md`, project memory, handoff notes

If you only remember one line:

A model thinks. An agent acts. A harness keeps the agent from acting like an idiot.

A harness is not a framework

A framework helps humans assemble agents. LangGraph, LangChain, and similar tools give you graphs, state, nodes, tool bindings, memory, middleware, and routing. You can use those pieces to build a harness.

But the framework itself is not automatically the harness.

A harness is the working environment around the agent. It runs the loop, exposes tools, injects project context, enforces permissions, verifies outputs, and keeps useful memory.

That distinction matters. If you buy or build a framework, you still own the harness design. If you use Claude Code, Codex, Cursor, or Windsurf, you are already working inside a harness. The question is how well it fits your codebase, risk tolerance, and workflow.

The simplest harness you already know: CLAUDE.md

Claude Code is a useful doorway into this idea because the harness is visible.

Every serious Claude Code setup has a CLAUDE.md file or the cross-tool version AGENTS.md. Anthropic's docs describe memory as persistent instruction Claude reads at the start of a session. It carries commands, structure, standards, workflow preferences, and recurring mistakes.

That is harness work.

Your CLAUDE.md isn't the agent or the model. It is one guide inside the harness.

A good one feels like a brief you would give a sharp contractor before they touched your codebase:

# CLAUDE.md

## Project
One sentence on what this project does and who uses it.

## Commands
- Dev: `npm run dev`
- Build: `npm run build`
- Type check: `npx tsc --noEmit`

## Architecture
- `src/lib/services/` - business logic
- `src/components/` - UI components

## Rules
- Never commit `.env` files or secrets
- Make minimal changes and avoid unrelated refactors
- Run type check after code changes
- Static export only, no server-side features

## Workflow
- Ask before making architectural changes
- Run tests before saying the task is done
- When unsure, explain the tradeoff

That file doesn't impact model intelligence. It makes the environment stricter and gives the model fewer ways to wander off.

The harness decides what context the model gets, what it can touch, how it proves the work, and what survives after the session ends.

What a harness actually does

A harness has two jobs:

Help the agent get it right the first time
Catch problems early enough that the agent can correct itself

Martin Fowler's framing is useful here. He splits a harness into guides and sensors.

Piece	What it does	Examples
Guides	Shape the agent before it acts	`CLAUDE.md`, `AGENTS.md`, skills, tool descriptions, project docs
Sensors	Check the agent after it acts	tests, linters, type checks, screenshots, review agents
Memory	Carries forward what should survive the session	project memory, handoff notes, session recaps, index files

Guides steer the agent before it acts. Sensors tell you whether the work held up. Memory is the notebook for future reference. A useful harness needs all three.

Pay attention to the last arrow. A harness controls how the agent acts this time and how the next run starts a little less cold.

The architecture underneath

Once you move past the simplest version, a modern harness starts to look like a small control plane.

The Arize primer frames nine pieces as one system: loop, context, tools, system prompt assembly, permissions, hooks, persistence, built-in skills, and sub-agent management.

That is what makes this more than prompt engineering. A prompt can ask the model to be careful. A harness can route the model through checks, permissions, and recovery paths that make care more likely.

The best harnesses push judgment to the model and keep control in the system. The model decides which files to work on. The harness decides whether it is allowed to modify them.

Same model, better harness

This is not just semantics.

LangChain published a useful example in February 2026. They kept the model fixed and changed the harness around their coding agent. The score moved from 52.8 to 66.5 on Terminal Bench 2.0.

A better harness improved the same model by roughly 14 points.

The changes included better prompts, tool setup, verification middleware, trace analysis, and loop detection. That is the pattern teams should steal.

Don't start by asking, "Which model will solve this?" Ask, "What does the model need around it to do this reliably?"

A simple example

Say you ask an agent to fix a failing test. Without much harness, the session often looks like this:

The model reads your prompt
It scans a few files
It changes some code
It rereads its own code
It says "done"

That sounds fine until you notice the missing step: it never ran the tests.

With a basic harness, the flow changes:

The agent starts and loads the project instructions
It sees the commands, conventions, and files that matter
It changes the code
The harness tells it to run the test command before exiting
If tests fail, the agent keeps going
When it finishes, it writes a short handoff note

The minimum viable harness

For most teams, a minimum viable harness has four layers:

Layer	Small version	Why it matters
Instructions	`CLAUDE.md` or `AGENTS.md`	The agent knows the project before you re-explain it
Tools	One or two tools it actually needs	The agent can act instead of only talk
Verification	Test, lint, type check, screenshot, review pass	The agent has to prove the work
Memory	Handoff note, index file, project memory	The next session doesn't start cold

That is already a harness. Keep it simple - the point is reliability, not complexity.

What this looks like in an engineering org

For an enterprise team, the harness becomes part of engineering operations. It's a shared operating agreement:

A repo-level AGENTS.md or CLAUDE.md with architecture, commands, boundaries, and review expectations
A small approved tool set: search, edits, shell, browser, test runner, docs lookup
Permission modes for read-only work, workspace edits, and dangerous actions
Hooks that block secrets, destructive commands, production credentials, or unreviewed deploy paths
Verification gates that make the agent run the same checks a human engineer would run
Session notes that explain what changed
A small eval set of real tasks that shows whether the harness is getting better or just louder

How to write the instruction file

The blunder with CLAUDE.md is treating it like a wish list.

"Be a senior engineer."

"Think step by step."

"Write clean code."

Fine, but mostly wasted space. Use the file for things the agent would otherwise get wrong:

Critical commands: build, test, lint, type check, run one file.
Architecture map: where things live and what belongs where.
Hard rules: the specific mistakes the agent must not make.
Workflow preferences: when to ask, act, change, and verify.
Out of scope: files, systems, or integrations the agent should not touch.

Keep it short.

Anthropic's current memory docs emphasize hierarchy, imports, recursive lookup, and specific instructions over vague guidance. A CLAUDE.md shapes behavior. It does not physically prevent bad actions.

For enforcement, you need settings, permissions, tests, hooks, or human review. That is the difference between a guide and a guardrail.

Where tools fit

Tools are the agent's hands.

They let the model search files, run commands, query APIs, calculate, open a browser, read a spreadsheet, or edit a document.

The common mistake is thinking more tools equals a smarter agent. Usually it means a more confused one. Tool-selection accuracy drops from around 43% to under 14% as the tool count grows, and every tool definition spends roughly 300-1,400 tokens whether it gets used or not.

Start with the smallest tool set that can do the job. Rewriting probably needs no tools. Fixing a bug needs file access and a test command. Researching current prices needs web search or an API.

One tool should have one clear job.

Bad:

manage_files(action, file, destination, overwrite, format, permissions)

Better:

read_file(path)
write_file(path, content)
delete_file(path)

Tool design is part of the harness.

Where memory fits

Memory is where people make this sound harder than it needs to be. There are two simple versions: what has been said in this session, and what should survive across sessions.

Early on, longer-term memory can be boring markdown:

what changed
what is still broken
what command to run next
what the next session should not repeat

That can live in a handoff note, an INDEX.md, a session recap, or a project memory file. In my note system, the agent reads an index, topic map, and CLAUDE.md before touching anything.

Not glamorous. Useful.

Workflows before agents

Not every problem needs a fully autonomous agent.

Anthropic makes a useful distinction: workflows follow predefined code paths, while agents dynamically decide their process and tool use. If the steps are predictable, use a workflow:

write outline
check outline
write draft
review draft

If the steps are not predictable, use an agent:

inspect this unfamiliar codebase
figure out why the tests fail
decide which files to change
verify the fix

Start with the simplest pattern that works. Most useful setups begin as workflows and only become agents when the model genuinely needs to choose the path.

When multiple agents make sense

Start with one agent and a good harness. Multiple agents make sense when the roles are meaningfully different:

one researches, one writes
one implements, one reviews
one can read sensitive data, one can execute actions
one routes tasks, specialists handle narrow domains

If the second agent doesn't have a different job, permission set, or evaluation role, you probably added complexity for vibes. The safer pattern is a supervisor:

User -> Main agent -> Specialist agent only when needed -> Verification

There is one multi-agent pattern worth knowing early: separate the builder from the judge.

Anthropic Labs described this in their long-running application harness: planner, generator, evaluator. The generator built. The evaluator inspected the result with Playwright, graded it, and sent feedback into the next sprint.

The principle scales down:

Small setup	Bigger setup
Write code, then run tests	Generator builds, evaluator tests in browser
Ask a second agent to review	Dedicated evaluator grades each sprint
Write a checklist before starting	Planner and evaluator negotiate what "done" means

The point is not "use three agents." The point is: don't let the same session that made the thing be the only judge.

What not to overbuild

Don't turn your first harness into a platform. Start with one job, one agent, one clear instruction file, one or two tools, one verification step, and a few test prompts. Then watch where it fails. When the same failure repeats, tweak the harness, iterate.

The practical build path

If you want to build your first harness today:

Write one sentence describing the agent's job.
List the tools it truly needs (follow least privilege!)
Write the rules it must follow.
Define the output format.
Add one verification step.
Test it on five real examples.
Add memory only when the next session needs something from this one.

You can ask an LLM to help:

I want to build an AI agent.

Goal:
[what I want it to do]

Example user requests:
[5 messy examples]

Tools it may use:
[web search / files / calculator / custom API / none]

Rules it must follow:
[non-negotiables]

It must never:
[boundaries]

Please turn this into an agent spec, system prompt, minimal tool list, verification checklist, and 10 test cases.

The formula is simple:

Agent = Role + Goal + Tools + Rules + Output format
Harness = Instructions + Tools + Verification + Memory

That is enough to start.

Takeaway

A harness is the setup around an agent that makes its work more reliable.

It isn't just an SDK or just a prompt. It isn't just CLAUDE.md.

It is the operating environment: instructions, tools, tests, memory, permissions, hooks, recovery paths, and feedback loops.

The model gives you capability. The harness decides whether you can trust it.

Some of the harness is durable: verification discipline, context preparation, running tests before calling something done. Some is scaffolding that dissolves as models improve. Know which half you're building on.

DEV Community