"Harness Engineering" Has a Definition Problem
In February 2026, OpenAI published "Harness engineering: leveraging Codex in an agent-first world," and the term exploded overnight.
Within weeks, Anthropic released two guides. LangChain defined it on their official blog. Birgitta Böckeler wrote a deep analysis on martinfowler.com. An arXiv paper formalized the concept.
But here's the thing: they're all saying slightly different things.
Same word. Different metaphors. Different starting points. Different conclusions.
I read all five. Here's what I found.
The One Thing Everyone Agrees On
There's a nesting structure no one disputes:
Harness ⊃ Context ⊃ Prompt
SmartScope's article captures it best:
Writing "run the linter" in CLAUDE.md versus enforcing linter execution via hooks is the difference between "almost every time" and "every time, no exceptions."
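To make the difference concrete: enforcing the linter via a hook might look roughly like this in Claude Code's `.claude/settings.json`. This is a sketch based on the documented hooks schema -- treat the `matcher` pattern and the `npm run lint` command as placeholders for whatever your project actually uses:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npm run lint" }
        ]
      }
    ]
  }
}
```

The instruction in CLAUDE.md is a request the model can forget; the hook fires on every file edit whether the model remembers or not.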
Beyond this? It gets messy.
OpenAI: "Write Declarations. Don't Write Code."
OpenAI's article dropped a bombshell.
For five months, their engineers wrote zero lines of code. Over 1 million lines of production application code, all built by Codex agents. Build time: one-tenth that of handwritten code.
"Humans steer. Agents execute."
For OpenAI, a harness is a declarative constraint system. You describe "what should be." The agent figures out "how."
Focus areas:
- Scaling massive projects
- Parallel agent execution
- Sandboxed safety guarantees
Anthropic: "Manage the Context. Your Model Gets Anxious."
Anthropic took a completely different starting point.
Where OpenAI started with "let's automate an entire large-scale project," Anthropic started with "how do we keep a long-running agent stable?"
Their unique concept: context anxiety. When the context window fills up with information, model output quality degrades. Like a human in a 3-hour meeting with no agenda -- the AI starts making worse decisions.
With Claude Sonnet 4.5, context anxiety was strong enough that compaction (summarization) alone could not maintain performance on long tasks. Context resets became essential.
Their solution: periodic resets, with claude-progress.txt and git history carrying state to the next session.
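The reset-and-handoff pattern can be sketched in a few lines. The file name `claude-progress.txt` comes from the article; `run_agent_session` is a hypothetical stand-in for a real agent call, and the rest is illustrative scaffolding:

```python
# Sketch of Anthropic-style context resets: instead of compacting a
# bloated context, start each round fresh and carry state forward
# through a progress file (plus git history in the article's setup).
from pathlib import Path

PROGRESS_FILE = Path("claude-progress.txt")

def run_agent_session(task: str, prior_state: str) -> str:
    # Placeholder: a real implementation would start a fresh agent
    # session and return a summary of what it accomplished.
    return f"progress on {task} (carried {len(prior_state)} chars forward)"

def run_with_resets(task: str, sessions: int = 3) -> str:
    for _ in range(sessions):
        prior = PROGRESS_FILE.read_text() if PROGRESS_FILE.exists() else ""
        summary = run_agent_session(task, prior)  # fresh context window
        PROGRESS_FILE.write_text(summary)         # handoff to next session
    return summary
```

Each session starts with an empty window plus the distilled state, rather than a full window of stale detail.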
Another Anthropic distinction: simplification to single agents. They originally designed multi-agent architectures, but as models got smarter, a single agent with a proper harness became sufficient.
Focus areas:
- Context management (avoiding anxiety)
- Lifecycle management (session handoffs)
- GAN-inspired Generator-Evaluator structure
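The Generator-Evaluator structure from that last bullet reduces to a short loop. Both functions here are hypothetical placeholders -- a real generator would call a coding agent, and a real evaluator would run tests, linters, and type checks:

```python
# Minimal sketch of the GAN-inspired Generator-Evaluator harness loop:
# one role produces candidates, the other criticizes them, and the
# critique feeds back into the next generation round.
def generate(task: str, feedback: str) -> str:
    # Placeholder generator.
    return f"solution for {task}" + (" (revised)" if feedback else "")

def evaluate(candidate: str) -> tuple[bool, str]:
    # Placeholder evaluator: accepts only revised candidates.
    ok = "(revised)" in candidate
    return ok, "" if ok else "needs revision"

def harness_loop(task: str, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        ok, feedback = evaluate(candidate)
        if ok:
            return candidate
    return candidate  # best effort after max_rounds
```

The point of the structure is that acceptance is decided by the evaluator's checks, not by the generator's confidence.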
LangChain: "Agent = Model + Harness. Here's the Proof."
LangChain's official blog post, "The Anatomy of an Agent Harness," has the simplest definition:
Agent = Model + Harness
The model provides intelligence. The harness makes that intelligence useful.
Then they did something nobody else did -- they showed the numbers. Harness improvements alone pushed benchmark accuracy from 52.8% to 66.5%. Same model. Only the harness changed.
That's a 13.7-point improvement without touching the model. Hard to argue with.
Focus areas:
- Model-agnostic harness design principles
- Quantitative evidence
- LangGraph (orchestration) + LangSmith (observability)
Birgitta Böckeler (martinfowler.com): "Your Codebase IS the Harness"
Böckeler's angle is entirely different from the others.
A strongly-typed codebase naturally turns type checking into a sensor. Well-defined module boundaries provide architectural constraints. The framework implicitly raises the agent's success rate.
In other words: before you write AGENTS.md, your codebase itself is already part of the harness.
TypeScript strict mode acts as an unintentional quality gate for agents. Rust's borrow checker is the strongest implicit harness. Next.js App Router conventions are an implicit harness too.
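The "compiler as sensor" idea is easy to demonstrate. Here a plain Python syntax check stands in for the TypeScript or mypy checks Böckeler has in mind (the tool choice is my substitution; the shape of the signal is the same):

```python
# Sketch of the codebase-as-sensor idea: feed agent-written code
# through a checker and return a pass/fail signal plus a diagnostic
# the agent can act on -- no extra harness machinery required.
def sensor_check(source: str) -> tuple[bool, str]:
    try:
        compile(source, "<agent-edit>", "exec")
        return True, ""
    except SyntaxError as e:
        return False, f"line {e.lineno}: {e.msg}"
```

A stricter codebase (types, tests, lint rules) simply makes this sensor denser: more agent mistakes trip a check instead of slipping through.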
Where others discuss "how to build a harness," Böckeler asks "how to build a codebase that's harness-friendly."
Focus areas:
- Constraints inherent in code
- Rediscovered value of type safety, tests, linters
- "Harness isn't bolted on -- it's built in"
arXiv: "Formalize the Harness as Specification"
Academic research takes yet another cut. The arXiv paper (2603.25723) proposes:
Externalize harness pattern logic as readable and executable objects.
Instead of treating harness as "best practices that feel right," formalize it as a verifiable specification.
A key insight from the paper:
Even as models grow more capable, harness-level controls -- roles, contracts, verification gates, persistent state, delegation boundaries -- remain important when specified in natural language.
AGENTS.md constraints don't lose value as models get smarter, because they're harness specifications, not prompts.
The Comparison: What's Same, What's Different
| | OpenAI | Anthropic | LangChain | Böckeler (mf.com) | Academic |
|---|---|---|---|---|---|
| Metaphor | Steering wheel | Horse reins | Car chassis | Code types | Spec document |
| Starting point | 1M-line experiment | Stability issues | Benchmarks | Code quality | Research |
| Focus | Declarative constraints | Context management | Model-agnostic | Implicit constraints | Formalization |
| Unique concept | Agent-first | Context anxiety | Agent=M+H | Harness-friendliness | Delegation boundaries |
| Agent model | Parallel scaling | Single-agent | Framework | Codebase-dependent | Pattern externalization |
Where Everyone Agrees
- What's outside the model matters more than what's inside
- Constraints should be enforced, not suggested
- Feedback loops are non-negotiable
- Prompt engineering doesn't disappear (it's contained within the harness)
Where They Disagree
Multi vs. single agent:
- OpenAI: parallel scaling is the future
- Anthropic: one smart agent with proper harness is enough
Harness granularity:
- OpenAI: one harness wraps the entire project
- Böckeler: a single type check counts as harness
"Replacement" vs. "evolution":
- Replacement camp: "Agents have outgrown what prompts can handle"
- Evolution camp: "No fundamental difference -- just reflecting increased LLM capability"
So What Should You Actually Do Tomorrow?
The interpretation differences are interesting, but your next steps are straightforward:
Step 1: Write AGENTS.md / CLAUDE.md
(If you haven't yet, do it today. 500 words is enough.)
Step 2: Automate quality gates
(Force linters, type checks, and tests through hooks.)
Step 3: Run the feedback loop
(Agent makes a mistake → add a constraint to AGENTS.md → the mistake doesn't recur.)
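Step 3 can even be a one-function habit. The file name AGENTS.md comes from the article; `record_mistake` is a hypothetical helper I'm sketching, not a tool from any of the five sources:

```python
# Sketch of the Step 3 feedback loop: turn an observed agent mistake
# into a standing constraint appended to AGENTS.md.
from pathlib import Path

AGENTS_MD = Path("AGENTS.md")

def record_mistake(description: str, rule: str) -> None:
    entry = f"\n## Constraint: {description}\n{rule}\n"
    existing = AGENTS_MD.read_text() if AGENTS_MD.exists() else "# AGENTS.md\n"
    if rule not in existing:  # idempotent: never record the same rule twice
        AGENTS_MD.write_text(existing + entry)
```

The discipline is the point: every repeated mistake becomes a written constraint instead of a repeated correction.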
These three steps mean you're already practicing harness engineering. Worrying about which company's interpretation is "correct" can wait until you've run Step 3 for three months.
📚 I wrote a book that goes deeper into all five interpretations. 14 chapters covering the 6 core components, hooks/lifecycle design, feedback loops, and Self-Evolving Agents.
👉 Harness Engineering -- From Using AI to Controlling AI (Kindle)
References
- OpenAI: Harness engineering: leveraging Codex in an agent-first world
- Anthropic: Claude Code: Best practices for agentic coding
- Anthropic: Building effective agents
- LangChain: The Anatomy of an Agent Harness
- Birgitta Böckeler (martinfowler.com): Harness Engineering
- arXiv: Agent Harness Pattern Logic (2603.25723)
