mufeng

Posted on Jun 11

Your AI Agent Is Underperforming Because of Your Harness, Not the Model

#ai #claude #llm #agents

The pattern is familiar: your AI agent produces garbage output, so you switch to a better model. Things improve for a few days, then the same problems resurface. You upgrade again.

Here's what you're probably missing: the model is just one input. The rest is harness — and that's almost always where the real problem lives.

What Is a Harness?

The cleanest definition comes from engineer Vtrivedy, who coined the term:

Agent = Model + Harness. If you're not the model, you're the harness.

A harness encompasses everything except the model itself:

System prompts, CLAUDE.md / AGENTS.md files, Skill definitions
Tool descriptions, MCP servers, and their technical specifications
Execution environment: filesystem, sandboxes, headless browsers
Subagent orchestration: spawning logic, task handoffs, routing
Hooks: deterministic enforcement layers (linting, formatting, permission checks)
Observability: cost monitoring, latency tracking, logs

This entire surface area is yours to design, not the model provider's.

Claude Code, Cursor, Codex, Cline — these tools might run on identical underlying models, but the behavior you experience is dominated by the harness each one provides. The underlying model might be identical across two setups; the behavior you see will be completely different.

This leads to a counterintuitive but well-supported finding:

A decent model with a great harness consistently outperforms a great model with a bad harness.

Why Engineers Default to Model-Blaming

When an agent does something nonsensical, blaming the model is the path of least resistance. It's the most visible component, and failures often look like reasoning problems.

But most failures are legible if you look closely:

Agent ignored a coding convention → Add it to AGENTS.md
Agent ran a destructive command → Write a Hook to block it
Agent got lost in a 40-step task → Split into Planner and Executor subagents
Agent consistently ships broken types → Wire a type-checker signal into the loop

As HumanLayer frames it: "It's not a model problem. It's a configuration problem."

Consider the performance benchmarks: a leading model running inside an off-the-shelf framework often scores dramatically lower than the exact same model running in a custom, highly-tuned harness. The model's capabilities didn't change — the harness is what unlocks them.

The Ratchet: Every Failure Becomes a Rule

The most important habit in harness engineering is treating agent failures as permanent signals, not one-off flukes to retry and forget.

Think of a mechanical ratchet: it only moves forward, never backward.

When an agent makes a mistake, you don't retry and hope for better luck. You engineer a permanent fix so the same exact failure cannot happen again.

Example: An agent submits a PR with commented-out tests. It gets merged into main.

Wrong response: Fix it manually. Move on.

Harness response:

Add to AGENTS.md: "Never comment out tests. Delete or fix them."
Add a pre-commit Hook that flags .skip( in any diff automatically.
Update the Reviewer subagent's instructions: commented-out tests are a blocking issue.

Three layers. Same failure is structurally impossible now.

Constraints should be added when you observe a real failure, and removed when a more capable model makes them redundant. Every line in a good system prompt should trace back to a specific, historical failure. A harness that grows without bound is just as broken as one that never grows.

CLAUDE.md Is a Failure Log, Not Documentation

This is the mistake I see most often. Engineers treat CLAUDE.md like a README written for an AI: project overview, tech stack, coding conventions. Useful — but incomplete.

Mature harnesses treat CLAUDE.md differently: every rule should trace back to a specific, real incident. If you can't remember the failure that generated a rule, it's probably noise that dilutes the signal of the rules that actually matter.

Examples of rules with provenance:

"Never use any type without explicit authorization" → From a production bug after TypeScript checks were bypassed.
"Run the full test suite before committing, even for one-line changes" → From a regression where a small fix touched adjacent logic without running tests.
"Back up configuration files before modifying" → From an agent that overwrote a production config.

Rules derived from real incidents carry weight in the agent's reasoning. Rules written speculatively get treated as suggestions — not because the model is bad, but because they lack the contextual authority that real constraints carry.

Context Engineering: The Harness Layer People Miss

There's a component of harness design that gets less attention than it deserves: context management.

Antonio Gullí, Engineering Director at Google, defines Context Engineering in Agentic Design Patterns:

Not information dumping. Carefully selecting, trimming, and packaging context. To get AI to peak accuracy, you must give it short, focused, powerful context.

This distinguishes Context Engineering from the more common Prompt Engineering. Prompt Engineering asks: How should I phrase this request? Context Engineering asks: What should already be in front of the agent before it even sees the request?

The discipline applies to every part of the harness:

Tool descriptions: Concise and precise, not comprehensive
Skill files: Exact schemas and templates the agent needs, not everything
System prompts: Specific constraints from real failures, not generic guidelines

An agent drowning in context doesn't perform better — it performs worse. Every line in your CLAUDE.md or system prompt is doing Context Engineering. Noise in equals noise in the agent's reasoning.

Two-Tier Configuration: Team Brain + Personal Brain

Claude Code's configuration architecture is worth understanding as a design pattern applicable to any agent harness.

Project .claude/ — lives in the repo, committed to Git
Team-shared rules, hooks, security policies, workflow definitions. Every engineer who clones the repo inherits the full agent behavior constraints automatically. This is an engineering asset, maintained alongside code.

Global ~/.claude/ — personal directory, stays out of Git
Personal coding style preferences, cross-project shortcuts, individual tool configurations.

The separation enforces the right ownership boundaries: team standards are reliable and shared, personal preferences are free and local. New team members inherit your agent setup the moment they clone the repository.

What Changes When You See It This Way

Once you internalize Agent = Model + Harness, the questions you ask about AI tools shift.

Before:

Which model has better code generation?
What's the context window size?
What's the price per token?

After:

How mature is this harness?
What does the failure recovery path look like?
How are harness rules maintained over time?
What's the observability story?

The model is table stakes at this point. The harness is the differentiator.

Anthropic's engineering team published this framing directly:

The gap between what today's models can theoretically do and what you actually see them doing is largely a harness gap.

The ceiling isn't the model. The floor you're operating at is almost entirely determined by your harness.

Start Here

Open your CLAUDE.md, or create one if it doesn't exist.

Think about the last thing your agent got wrong. Not a model failure — a behavioral failure. Something it did that violated an expectation.

Write one rule. Note where the failure came from. One sentence is enough.

That's the first notch on the ratchet. Over months, this file becomes a compressed history of your collaboration — every line representing a mistake that was never repeated.

The harness isn't designed. It's earned.

I write about practical AI engineering, agent design, and building production systems with Claude. Follow for more.

Top comments (1)

Kaspar von Grünberg • Jun 15

A harness runs one agent. Someone still has to provision the racks: identity, observability, governance, the context layer, path specs that define what "done" actually means. That's platform engineering, and it's why most enterprise AI programs keep failing despite good models and decent harness work. The substrate isn't there. just my thoughts when reading this.