
ShipWithAI

Posted on • Originally published at shipwithai.io

Harness Engineering: Why the System Around AI Matters More Than the AI Itself

Harness engineering is everything around your AI agent except the model: memory, tools, permissions, hooks, observability. LangChain gained 13.7 benchmark points by changing only the harness (52.8% to 66.5%, same model). Most developers only have Layer 1 (CLAUDE.md). Production needs all 5.


Two lines of config. Same AI model. Completely different reliability:

```
# CLAUDE.md approach (can be ignored)
"Never delete production database tables."
# Claude reads this, weighs it against 200K tokens of context, may ignore it.

# Hook approach (always enforced)
# PreToolUse hook: command contains "DROP TABLE" + env=production → exit 2 → BLOCKED.
```

The first is advice. The second is enforcement.

One lives in a markdown file that competes with thousands of other tokens for the model's attention. The other is a shell script that runs before every command and cannot be bypassed. The gap between these two approaches is the gap most teams don't know exists.
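To make the enforcement side concrete, here's a minimal sketch of the decision logic such a hook could use. The function name and the `env` argument are illustrative, not part of Claude Code's API; a real PreToolUse hook receives the tool call as JSON on stdin and signals a block by exiting with code 2.

```shell
#!/bin/bash
# Illustrative guardrail logic (names are hypothetical, not Claude Code's API).
# Returns 2 (block) when a command contains destructive SQL and the
# environment is production; returns 0 (allow) otherwise.
check_command() {
  local cmd="$1" env="$2"
  if [[ "$cmd" == *"DROP TABLE"* && "$env" == "production" ]]; then
    echo "BLOCKED: destructive SQL in production" >&2
    return 2
  fi
  return 0
}

# A real hook would parse stdin with jq, then:
#   check_command "$COMMAND" "$APP_ENV"; exit $?
```

The point is that this check runs as code on every tool call, regardless of how crowded the model's context is.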

That gap has a name now: harness engineering.


What is harness engineering? (And why prompt engineering isn't enough)

Harness engineering is the discipline of building constraints, tools, feedback loops, and observability around an AI agent to make it reliable in production. The formula, popularized by LangChain and refined on Martin Fowler's site:

Agent = Model + Harness

The model is a commodity. The harness is your competitive advantage.

Mitchell Hashimoto, creator of Terraform and Ghostty, defined the core idea: anytime you find an agent makes a mistake, you engineer a solution so the agent never makes that mistake again. In Ghostty's repository, each line in the AGENTS.md file corresponds to a specific past agent failure that's now prevented.

The industry has moved through three distinct eras:

| Era | Years | Focus | Key Question | Limitation |
|-----|-------|-------|--------------|------------|
| Prompt Engineering | 2022–2024 | Crafting better instructions | "How do I phrase this?" | Instructions get diluted in long contexts |
| Context Engineering | 2025 | Curating what the model sees | "What information does it need?" | Knowing isn't doing |
| Harness Engineering | 2026 | Building systems around the agent | "What can it do, and what can't it?" | Emerging discipline |

Prompt engineering shapes what the agent tries. Context engineering shapes what the agent knows. Harness engineering shapes what the agent can and cannot do.


How did LangChain gain 13.7 benchmark points without changing the model?

By improving three harness components, LangChain jumped from 52.8% to 66.5% on Terminal Bench 2.0 (a benchmark of 89 real-world terminal tasks) while keeping the same model, gpt-5.2-codex. They went from Top 30 to Top 5. No fine-tuning. No model swap. Just harness changes.

Here are the three changes:

1. Context injection. LangChain's LocalContextMiddleware maps the environment upfront and injects it directly into the agent's context. Before this change, the agent wasted steps trying to understand its surroundings.

2. Self-verification loops. After each action, the agent verifies its output against task-specific criteria before moving on. Not just "run the tests." The agent checks whether the output matches what the task actually asked for.

3. Compute allocation. This one is counterintuitive: running at maximum reasoning budget (xhigh) scored only 53.9%, while the high setting scored 63.6%. More compute caused timeouts that hurt overall performance.
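The self-verification idea in point 2 can be sketched in a few lines. This is the pattern only, not LangChain's actual middleware; in a real harness the criterion comes from the task definition rather than a hardcoded string:

```shell
#!/bin/bash
# Pattern sketch of a self-verification step: after an action, check its
# output against a task-specific criterion before declaring the step done.
verify_output() {
  local output="$1" expected_pattern="$2"
  if echo "$output" | grep -q "$expected_pattern"; then
    echo "PASS"
  else
    echo "FAIL: output does not match '$expected_pattern'" >&2
    return 1
  fi
}
```

The agent loops back and retries when verification fails, instead of carrying a bad result into the next step.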

| Setting | Score | Notes |
|---------|-------|-------|
| Before harness changes | 52.8% | Baseline, Top 30 |
| After harness changes (high reasoning) | 66.5% | Top 5, +13.7pp |
| Max reasoning (xhigh) | 53.9% | Worse than baseline, timeouts |

If you're evaluating AI coding tools by comparing model benchmarks alone, you're measuring the wrong variable.


What are the 5 layers of an AI agent harness?

A production harness has five layers. Most developers I talk to in the Claude Code community have Layer 1 and maybe part of Layer 2. That leaves three layers of reliability on the table.

| Layer | What It Is | Problem It Solves | Claude Code Implementation |
|-------|------------|-------------------|----------------------------|
| 1. Memory | Persistent context across sessions | Agent "forgets" your conventions every session | CLAUDE.md, MEMORY.md, `.claude/commands/` |
| 2. Tools | Extended capabilities beyond built-ins | Agent can't access your APIs, databases, or services | MCP servers, custom tools |
| 3. Permissions | What the agent is allowed to do | Agent edits sensitive files or runs dangerous commands | `settings.json` allow/deny lists |
| 4. Hooks | Automated enforcement at lifecycle points | Instructions get ignored under context pressure | PreToolUse/PostToolUse hooks |
| 5. Observability | Knowing what the agent actually did | No visibility into agent decisions or cost | Session logs, cost tracking, action audit |

Think of it like your CI/CD pipeline. You built that infrastructure once, and the whole team benefits on every push. A harness works the same way for AI agent sessions.
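For Layer 3, a minimal permissions block in `.claude/settings.json` might look like this. The specific entries are examples, not a recommended policy; tailor the lists to your own project:

```json
{
  "permissions": {
    "allow": [
      "Bash(npm run test:*)",
      "Bash(git diff:*)"
    ],
    "deny": [
      "Bash(rm -rf:*)",
      "Read(.env*)"
    ]
  }
}
```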

OpenAI demonstrated this at scale. Their Codex team shipped roughly one million lines of production code, with zero lines written by human hands, over five months. Their harness included AGENTS.md files, reproducible dev environments, and mechanical invariants in CI. Development throughput was roughly one-tenth the time a human team would have needed.


Where is your harness right now?

Run this checklist:

| # | Question | Layer |
|---|----------|-------|
| 1 | Do you have a CLAUDE.md with project conventions and constraints? | Memory |
| 2 | Do you have MCP servers connecting Claude Code to external tools? | Tools |
| 3 | Do you have settings.json with explicit allow/deny lists? | Permissions |
| 4 | Do you have at least one PreToolUse hook that blocks dangerous actions? | Hooks |
| 5 | Can you see what Claude did in each session and how much it cost? | Observability |

Your score:

- **1/5**: You're in the majority. Most developers stop at CLAUDE.md.
- **2–3/5**: Ahead of most. You've started building real infrastructure.
- **4–5/5**: Production-ready. You're doing harness engineering whether you knew the name or not.

Be honest about question 4. If the answer is no, your agent can still `rm -rf` your project directory. CLAUDE.md says "don't do that." A hook actually prevents it.

Here's why this matters: an ETH Zurich study (Feb 2026) tested context files across 138 real-world tasks from 12 Python repositories. Human-written context files improved agent success by only about 4%. LLM-generated ones actually reduced success by ~3% while increasing inference costs by over 20%. Instructions alone aren't enough. You need enforcement layers.


How do you start building a harness today?

You don't need all 5 layers at once. Start with three high-impact changes that take less than 30 minutes total.

Quick Win 1: Create a MEMORY.md (5 minutes)

MEMORY.md is a lightweight index that points to where knowledge lives in your project. Unlike CLAUDE.md (which holds static rules), MEMORY.md tracks evolving state: recent decisions, architectural changes, active work.

```markdown
- [Auth](src/lib/auth/) — Clerk, not NextAuth. Migrated March 2026.
- [DB](prisma/schema.prisma) — PostgreSQL on Supabase. All queries via Prisma.
- [Deploy](docs/deploy.md) — Vercel preview for PRs, production on main.
- [Testing](vitest.config.ts) — Vitest unit, Playwright E2E. Min 80% coverage.
- [API](src/app/api/) — Server Actions preferred over API routes for mutations.
```

Quick Win 2: Add one PreToolUse guardrail hook (15 minutes)

This hook blocks Claude Code from editing sensitive files. Copy-paste ready:

```bash
#!/bin/bash
# .claude/hooks/block-sensitive-files.sh
# Blocks edits to .env, credentials, and CI config

INPUT=$(cat)
FILE_PATH=$(echo "$INPUT" | jq -r '.tool_input.file_path // empty')

SENSITIVE=('.env' 'credentials' '.github/workflows' 'secrets')

for pattern in "${SENSITIVE[@]}"; do
  if [[ "$FILE_PATH" == *"$pattern"* ]]; then
    echo "BLOCKED: Cannot edit sensitive file: $FILE_PATH" >&2
    exit 2
  fi
done

exit 0
```
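Before relying on the hook in a live session, you can sanity-check the `jq` extraction it depends on by feeding it a fake tool call. This dry run inlines one pattern check rather than invoking the script file (it requires `jq`, same as the hook itself):

```shell
#!/bin/bash
# Self-contained dry run of the guardrail's parsing step (requires jq).
# The JSON shape mirrors what the script above reads from stdin.
FILE_PATH=$(echo '{"tool_input":{"file_path":".env.local"}}' \
  | jq -r '.tool_input.file_path // empty')

if [[ "$FILE_PATH" == *'.env'* ]]; then
  echo "would block: $FILE_PATH"
fi
```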

Register it in .claude/settings.json:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "bash .claude/hooks/block-sensitive-files.sh"
          }
        ]
      }
    ]
  }
}
```

Quick Win 3: Enable cost awareness (10 minutes)

Track what each session costs so you notice anomalies early. Boris Cherny, creator of Claude Code, calls verification "probably the most important thing" for quality:

> "Give Claude a way to verify its work. If Claude has that feedback loop, it will 2-3x the quality of the final result."

Start simple: review ~/.claude/projects/ after each session to check what Claude did and how much it cost.
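A first pass at this can be a one-liner that surfaces recent session logs. This sketch assumes the default `~/.claude/projects/` layout; the `CLAUDE_LOG_DIR` override is hypothetical, added only so you can point it elsewhere:

```shell
#!/bin/bash
# List the five most recently modified session logs with their sizes.
# CLAUDE_LOG_DIR is an illustrative override; the assumed default
# location is ~/.claude/projects/.
recent_sessions() {
  local dir="${CLAUDE_LOG_DIR:-$HOME/.claude/projects}"
  find "$dir" -type f -name '*.jsonl' 2>/dev/null \
    | xargs -r ls -lt 2>/dev/null | head -5
}
recent_sessions
```

Even this crude view tells you which sessions ran, when, and roughly how much context they accumulated.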


FAQ

What is the difference between harness engineering and prompt engineering?

Prompt engineering shapes what the agent tries. Context engineering shapes what the agent knows. Harness engineering shapes what the agent can and cannot do. They're not replacements — they're layers. A production AI workflow uses all three, but harness engineering provides the strongest reliability guarantees because it uses enforcement (hooks, permissions) rather than suggestions (prompts, context).

Do I need harness engineering for Claude Code?

Yes. Claude Code is itself a harness that Anthropic built around their model. But it's the inner harness. You need an outer harness tailored to your project: CLAUDE.md for conventions, hooks for guardrails, MCP servers for tools, permissions for safety boundaries, and observability for cost control.

Is harness engineering only for Claude Code?

No. The principles apply to any AI coding agent: Cursor, GitHub Copilot, OpenAI Codex, Windsurf, Cline. Claude Code happens to offer the most programmable harness surface (17 hook events, MCP protocol, skills system), which is why examples here use it. The concepts transfer directly to other tools.


Try it now: Pick one quick win above and implement it before your next Claude Code session. Quick Win 2 is copy-paste ready and takes about 15 minutes.

What's your harness score right now? Drop it in the comments — I'm curious how many devs have gone beyond Layer 1.


Originally published on ShipWithAI. I write about Claude Code workflows, AI-assisted development, and shipping software faster with structured AI.
