"Harness Engineering" Has a Definition Problem
In February 2026, OpenAI published "Harness engineering: leveraging Codex in an agent-first world," and the term exploded overnight.
Within weeks, Anthropic released two guides. LangChain defined it on their official blog. Birgitta Böckeler wrote a deep analysis on martinfowler.com. An arXiv paper formalized the concept.
But here's the thing: they're all saying slightly different things.
Same word. Different metaphors. Different starting points. Different conclusions.
I read all five. Here's what I found.
The One Thing Everyone Agrees On
There's a nesting structure no one disputes:
Harness ⊃ Context ⊃ Prompt
SmartScope's article captures it best:
Writing "run the linter" in CLAUDE.md versus enforcing linter execution via hooks is the difference between "almost every time" and "every time, no exceptions."
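To make the difference concrete: enforcing the linter via a hook might look roughly like this in Claude Code's `.claude/settings.json`. This is a sketch based on the documented hooks schema -- treat the `matcher` pattern and the `npm run lint` command as placeholders for whatever your project actually uses:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npm run lint" }
        ]
      }
    ]
  }
}
```

The instruction in CLAUDE.md is a request the model can forget; the hook fires on every file edit whether the model remembers or not.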
Beyond this? It gets messy.
OpenAI: "Write Declarations. Don't Write Code."
OpenAI's article dropped a bombshell.
For five months, their engineers wrote zero lines of code. Over 1 million lines of production application code, all built by Codex agents. Build time: one-tenth that of handwritten code.
"Humans steer. Agents execute."
For OpenAI, a harness is a declarative constraint system. You describe "what should be." The agent figures out "how."
Focus areas:
- Scaling massive projects
- Parallel agent execution
- Sandboxed safety guarantees
Anthropic: "Manage the Context. Your Model Gets Anxious."
Anthropic took a completely different starting point.
Where OpenAI started with "let's automate an entire large-scale project," Anthropic started with "how do we keep a long-running agent stable?"
Their unique concept: context anxiety. When the context window fills up with information, model output quality degrades. Like a human in a 3-hour meeting with no agenda -- the AI starts making worse decisions.
With Claude Sonnet 4.5, context anxiety was strong enough that compaction (summarization) alone could not maintain performance on long tasks. Context resets became essential.
Their solution: periodic resets, with claude-progress.txt and git history carrying state to the next session.
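The reset-and-handoff pattern can be sketched in a few lines. The file name `claude-progress.txt` comes from the article; `run_agent_session` is a hypothetical stand-in for a real agent call, and the rest is illustrative scaffolding:

```python
# Sketch of Anthropic-style context resets: instead of compacting a
# bloated context, start each round fresh and carry state forward
# through a progress file (plus git history in the article's setup).
from pathlib import Path

PROGRESS_FILE = Path("claude-progress.txt")

def run_agent_session(task: str, prior_state: str) -> str:
    # Placeholder: a real implementation would start a fresh agent
    # session and return a summary of what it accomplished.
    return f"progress on {task} (carried {len(prior_state)} chars forward)"

def run_with_resets(task: str, sessions: int = 3) -> str:
    for _ in range(sessions):
        prior = PROGRESS_FILE.read_text() if PROGRESS_FILE.exists() else ""
        summary = run_agent_session(task, prior)  # fresh context window
        PROGRESS_FILE.write_text(summary)         # handoff to next session
    return summary
```

Each session starts with an empty window plus the distilled state, rather than a full window of stale detail.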
Another Anthropic distinction: simplification to single agents. They originally designed multi-agent architectures, but as models got smarter, a single agent with a proper harness became sufficient.
Focus areas:
- Context management (avoiding anxiety)
- Lifecycle management (session handoffs)
- GAN-inspired Generator-Evaluator structure
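The Generator-Evaluator structure from that last bullet reduces to a short loop. Both functions here are hypothetical placeholders -- a real generator would call a coding agent, and a real evaluator would run tests, linters, and type checks:

```python
# Minimal sketch of the GAN-inspired Generator-Evaluator harness loop:
# one role produces candidates, the other criticizes them, and the
# critique feeds back into the next generation round.
def generate(task: str, feedback: str) -> str:
    # Placeholder generator.
    return f"solution for {task}" + (" (revised)" if feedback else "")

def evaluate(candidate: str) -> tuple[bool, str]:
    # Placeholder evaluator: accepts only revised candidates.
    ok = "(revised)" in candidate
    return ok, "" if ok else "needs revision"

def harness_loop(task: str, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        ok, feedback = evaluate(candidate)
        if ok:
            return candidate
    return candidate  # best effort after max_rounds
```

The point of the structure is that acceptance is decided by the evaluator's checks, not by the generator's confidence.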
LangChain: "Agent = Model + Harness. Here's the Proof."
LangChain's official blog post, "The Anatomy of an Agent Harness," has the simplest definition:
Agent = Model + Harness
The model provides intelligence. The harness makes that intelligence useful.
Then they did something nobody else did -- they showed the numbers. Harness improvements alone pushed benchmark accuracy from 52.8% to 66.5%. Same model. Only the harness changed.
That's a 13.7-point improvement without touching the model. Hard to argue with.
Focus areas:
- Model-agnostic harness design principles
- Quantitative evidence
- LangGraph (orchestration) + LangSmith (observability)
Birgitta Böckeler (martinfowler.com): "Your Codebase IS the Harness"
Böckeler's angle is entirely different from the others.
A strongly-typed codebase naturally turns type checking into a sensor. Well-defined module boundaries provide architectural constraints. The framework implicitly raises the agent's success rate.
In other words: before you write AGENTS.md, your codebase itself is already part of the harness.
TypeScript strict mode acts as an unintentional quality gate for agents. Rust's borrow checker is the strongest implicit harness. Next.js App Router conventions are an implicit harness too.
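The "compiler as sensor" idea is easy to demonstrate. Here a plain Python syntax check stands in for the TypeScript or mypy checks Böckeler has in mind (the tool choice is my substitution; the shape of the signal is the same):

```python
# Sketch of the codebase-as-sensor idea: feed agent-written code
# through a checker and return a pass/fail signal plus a diagnostic
# the agent can act on -- no extra harness machinery required.
def sensor_check(source: str) -> tuple[bool, str]:
    try:
        compile(source, "<agent-edit>", "exec")
        return True, ""
    except SyntaxError as e:
        return False, f"line {e.lineno}: {e.msg}"
```

A stricter codebase (types, tests, lint rules) simply makes this sensor denser: more agent mistakes trip a check instead of slipping through.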
Where others discuss "how to build a harness," Böckeler asks "how to build a codebase that's harness-friendly."
Focus areas:
- Constraints inherent in code
- Rediscovered value of type safety, tests, linters
- "Harness isn't bolted on -- it's built in"
arXiv: "Formalize the Harness as Specification"
Academic research takes yet another cut. The arXiv paper (2603.25723) proposes:
Externalize harness pattern logic as readable and executable objects.
Instead of treating harness as "best practices that feel right," formalize it as a verifiable specification.
A key insight from the paper:
Even as models grow more capable, harness-level controls -- roles, contracts, verification gates, persistent state, delegation boundaries -- remain important when specified in natural language.
AGENTS.md constraints don't lose value as models get smarter, because they're harness specifications, not prompts.
The Comparison: What's Same, What's Different
| | OpenAI | Anthropic | LangChain | Böckeler (mf.com) | Academic |
|---|---|---|---|---|---|
| Metaphor | Steering wheel | Horse reins | Car chassis | Code types | Spec document |
| Starting point | 1M-line experiment | Stability issues | Benchmarks | Code quality | Research |
| Focus | Declarative constraints | Context management | Model-agnostic | Implicit constraints | Formalization |
| Unique concept | Agent-first | Context anxiety | Agent=M+H | Harness-friendliness | Delegation boundaries |
| Agent model | Parallel scaling | Single-agent | Framework | Codebase-dependent | Pattern externalization |
Where Everyone Agrees
- What's outside the model matters more than what's inside
- Constraints should be enforced, not suggested
- Feedback loops are non-negotiable
- Prompt engineering doesn't disappear (it's contained within the harness)
Where They Disagree
Multi vs. single agent:
- OpenAI: parallel scaling is the future
- Anthropic: one smart agent with proper harness is enough
Harness granularity:
- OpenAI: one harness wraps the entire project
- Böckeler: a single type check counts as harness
"Replacement" vs. "evolution":
- Replacement camp: "Agents have outgrown what prompts can handle"
- Evolution camp: "No fundamental difference -- just reflecting increased LLM capability"
So What Should You Actually Do Tomorrow?
The interpretation differences are interesting, but your next steps are straightforward:
Step 1: Write AGENTS.md / CLAUDE.md
(If you haven't yet, do it today. 500 words is enough.)
Step 2: Automate quality gates
(Force linters, type checks, and tests through hooks.)
Step 3: Run the feedback loop
(Agent makes a mistake → add a constraint to AGENTS.md → the mistake doesn't recur.)
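Step 3 can even be a one-function habit. The file name AGENTS.md comes from the article; `record_mistake` is a hypothetical helper I'm sketching, not a tool from any of the five sources:

```python
# Sketch of the Step 3 feedback loop: turn an observed agent mistake
# into a standing constraint appended to AGENTS.md.
from pathlib import Path

AGENTS_MD = Path("AGENTS.md")

def record_mistake(description: str, rule: str) -> None:
    entry = f"\n## Constraint: {description}\n{rule}\n"
    existing = AGENTS_MD.read_text() if AGENTS_MD.exists() else "# AGENTS.md\n"
    if rule not in existing:  # idempotent: never record the same rule twice
        AGENTS_MD.write_text(existing + entry)
```

The discipline is the point: every repeated mistake becomes a written constraint instead of a repeated correction.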
These three steps mean you're already practicing harness engineering. Worrying about which company's interpretation is "correct" can wait until you've run Step 3 for three months.
📚 I wrote a book that goes deeper into all five interpretations. 14 chapters covering the 6 core components, hooks/lifecycle design, feedback loops, and Self-Evolving Agents.
👉 Harness Engineering -- From Using AI to Controlling AI (Kindle)
References
- OpenAI: Harness engineering: leveraging Codex in an agent-first world
- Anthropic: Claude Code: Best practices for agentic coding
- Anthropic: Building effective agents
- LangChain: The Anatomy of an Agent Harness
- Birgitta Böckeler (martinfowler.com): Harness Engineering
- arXiv: Agent Harness Pattern Logic (2603.25723)
