The Problem: Code Review Can't Keep Up
AI agents now write code 10x faster than humans.
OpenAI's Codex team generated over 1 million lines of code in 5 months, with 3 engineers merging an average of 3.5 PRs/day each. Anthropic's long-running agents code continuously for 6+ hours.
New problem: code review can't keep up.
Imagine a factory line running 10x faster, but the quality inspectors are the same headcount. The inspection queue stretches out the door.
OpenAI's answer: agent-to-agent review -- AI reviews AI-written code. But "just ask AI to review" doesn't work. You need a control system. That system is the harness, and the discipline of designing it is harness engineering.
What Is Harness Engineering?
Mitchell Hashimoto (HashiCorp founder) defined it:
"When an agent makes a mistake, improve the environment so it never makes the same mistake again."
HumanLayer positions this as a subset of Context Engineering.
Harness engineering = designing the configuration that manages an agent's context window. We went from tweaking prompts (prompt engineering) to designing entire environments (harness engineering).
OpenAI Codex Team's Approach
AGENTS.md as a "Table of Contents"
OpenAI initially built a massive AGENTS.md -- coding conventions, architecture decisions, project context, everything in one file. It failed.
Context is a scarce resource. A giant instruction file pushes out task details, code, and relevant documentation. When everything is "important," nothing is.
The fix: keep AGENTS.md to ~100 lines as a table of contents.
# AGENTS.md (~100 lines)
## Architecture
→ docs/architecture/overview.md
## API Conventions
→ docs/api/conventions.md
## Testing
→ docs/testing/strategy.md
## Security
→ docs/security/guidelines.md
Details live in docs/. The agent references them only when needed. This is Progressive Disclosure applied to AI context.
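The table-of-contents pattern can be sketched as a loader that always includes the slim index and pulls in detail docs only when the task touches them. This is a minimal illustration, assuming the topic keys and docs/ paths from the example above; it is not OpenAI's actual implementation.

```python
# Hypothetical index parsed from the ~100-line AGENTS.md above:
# each topic points at a detail doc that is loaded only on demand.
INDEX = {
    "architecture": "docs/architecture/overview.md",
    "api": "docs/api/conventions.md",
    "testing": "docs/testing/strategy.md",
    "security": "docs/security/guidelines.md",
}

def build_context(task_topics):
    """Always include the slim index; add detail docs only for topics
    the current task actually touches (progressive disclosure)."""
    context = ["AGENTS.md"]
    for topic in task_topics:
        if topic in INDEX:
            context.append(INDEX[topic])
    return context
```

A security-sensitive change loads only `docs/security/guidelines.md` on top of the index, leaving the rest of the window for the task and the code.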
The Agent-to-Agent Review Loop
Here's OpenAI's actual flow:
- Codex generates code changes
- Codex runs its own local review
- Requests additional agent reviews (local + cloud)
- Responds to feedback and fixes
- Loops until all agent reviewers pass
- Humans intervene only on escalation
Humans step in for exactly 3 cases: new architecture decisions, security-sensitive changes, and product direction calls. Everything mechanical is agent-to-agent.
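The loop above can be sketched as a simple control flow. The reviewer interface, fix step, and escalation triggers here are illustrative placeholders, not OpenAI's actual code.

```python
# The three cases that always go to a human, per the flow above.
ESCALATION_TRIGGERS = {"new_architecture", "security_sensitive", "product_direction"}

def review_loop(change, reviewers, apply_fixes, max_rounds=5):
    """Loop until every agent reviewer passes; escalate to a human only
    for the trigger cases, or when the loop stalls."""
    if change["tags"] & ESCALATION_TRIGGERS:
        return "escalate_to_human"
    for _ in range(max_rounds):
        findings = [f for review in reviewers for f in review(change)]
        if not findings:
            return "merged"              # all agent reviewers pass
        change = apply_fixes(change, findings)
    return "escalate_to_human"           # stalled: hand off to a human
```

Everything mechanical stays inside the loop; only tagged changes or stalled reviews ever reach a person.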
Educational Linter Design
OpenAI's custom linters embed "why" and "how to fix" in every error message:
ERROR: Module 'payments' imports from 'users' internal package.
WHY: Cross-module internal imports break module boundaries.
See docs/architecture/module-boundaries.md
FIX: Use the public API: import { getUserById } from '@app/users'
Error messages = teaching moments. The agent doesn't need to understand the entire architecture. It just needs clear feedback when it crosses a boundary.
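A toy linter in this spirit might look like the following. The module layout, regex, and docs path are illustrative assumptions, not OpenAI's tooling.

```python
import re

BOUNDARY_DOC = "docs/architecture/module-boundaries.md"

def check_imports(module, source):
    """Flag imports that reach into another module's internal package,
    embedding the 'why' and the fix in the error message itself."""
    errors = []
    for other in re.findall(r"from '(\w+)/internal/", source):
        if other != module:
            errors.append(
                f"ERROR: Module '{module}' imports from '{other}' internal package.\n"
                f"WHY: Cross-module internal imports break module boundaries.\n"
                f"     See {BOUNDARY_DOC}\n"
                f"FIX: Use the public API: import {{ ... }} from '@app/{other}'"
            )
    return errors
```

Same-module internal imports pass; cross-module ones produce the full teaching-moment message.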
Anthropic's Two-Phase Approach
Anthropic tackles a different angle: the "memory gap" problem in long-running agents.
Initializer Agent + Coding Agent
Session 1 (Initializer Agent):
→ Creates init.sh
→ Creates claude-progress.txt
→ Generates a 200+ item feature list as JSON (all passes: false)
→ Initial git commit
Session 2+ (Coding Agent):
→ Reads claude-progress.txt + git history
→ Implements exactly 1 feature
→ Confirms tests pass
→ Updates passes: true
→ Clean git commit
→ Hands off to next session
The key is one feature at a time, incrementally. This structurally prevents the "try to do everything at once" failure mode.
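The two-phase split can be sketched with the feature list and progress notes modeled in memory. The file names follow the article; the session logic itself is an illustrative assumption, not Anthropic's code.

```python
def initializer_session(descriptions):
    """Session 1: set up the progress file and the feature list."""
    return {
        "progress": "claude-progress.txt: initialized",
        "features": [{"description": d, "passes": False} for d in descriptions],
    }

def coding_session(state, run_tests):
    """Session 2+: implement exactly one unfinished feature, flipping
    `passes` only after its tests actually run green."""
    for feature in state["features"]:
        if not feature["passes"]:
            # ...implementation work happens here...
            if run_tests(feature):
                feature["passes"] = True
            break  # one feature at a time, then hand off
    return state
```

Each session advances the list by at most one feature, which is exactly the structural guard against doing everything at once.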
JSON Over Markdown for Feature Lists
An interesting finding: Anthropic manages feature lists in JSON, not Markdown. The reason: "LLMs tend to improperly rewrite Markdown files, but JSON's strict structure makes it harder to tamper with."
{
  "category": "functional",
  "description": "New chat button creates a new conversation",
  "steps": [
    "Navigate to main interface",
    "Click new chat button",
    "Verify new conversation is created"
  ],
  "passes": false
}
They pair this with a strong instruction: "Do NOT edit or delete tests. Do NOT change passes to true without actually running the test." They call unauthorized test modification "unacceptable."
You wouldn't want a student grading their own exam and reporting "100% correct!" JSON's strict format plus firm instructions: trust but verify.
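The "verify" half can be enforced mechanically: a harness-side check that a session only flipped `passes` flags and did not edit or delete the test definitions themselves. This check is an illustrative assumption, not Anthropic's code.

```python
def strip_passes(features):
    """Everything except the `passes` flag must be immutable."""
    return [{k: v for k, v in f.items() if k != "passes"} for f in features]

def verify_update(before, after):
    """Reject the session's update if anything other than `passes` changed,
    including any added or deleted feature entries."""
    return strip_passes(before) == strip_passes(after)
```

A session that rewrites a description, or deletes a feature to dodge a failing test, fails this check before its commit is accepted.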
HumanLayer's 6 Levers
HumanLayer organizes harness components into 6 levers:
| Lever | Role | Code Review Application |
|---|---|---|
| System Prompt | Base instructions | Define review criteria |
| Tools / MCP | External tool integration | Invoke SAST/linters |
| Context | Reference information | Architecture docs |
| Sub-agents | Task isolation | Parallel review by concern |
| Hooks | Automatic triggers | Auto-review on PR creation |
| Skills | Knowledge modules | Security/performance review skills |
Sub-agents deserve special attention. HumanLayer calls them "context firewalls." Run security review and performance review as separate sub-agents, and intermediate noise never pollutes the parent thread.
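The firewall idea can be sketched as follows: each reviewer runs against its own isolated context, and only a compact verdict crosses back to the parent. The review functions here are stand-ins for real LLM sub-agent calls.

```python
def run_subagent(concern, diff, reviewer):
    """Each sub-agent sees only its own concern's context window."""
    context = {"concern": concern, "diff": diff}  # isolated window
    findings = reviewer(context)                  # intermediate noise stays in here
    return {"concern": concern,
            "verdict": "pass" if not findings else "fail",
            "summary": f"{len(findings)} finding(s)"}

def parallel_review(diff, reviewers):
    """Parent thread receives only the verdicts, never the raw transcripts."""
    return [run_subagent(c, diff, r) for c, r in reviewers.items()]
```

The parent's context grows by a few summary dicts per review, regardless of how much exploration each sub-agent did internally.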
Getting Started: 4 Steps
Step 1: Build Deterministic Checks First
Before involving LLMs, automate what's automatable. Type checking, import boundary validation, naming conventions, test coverage thresholds. These are deterministic: same input, same output, every time.
LLMs handle what deterministic tools can't: design review, readability assessment, security pattern recognition. Use both layers together -- deterministic as the foundation, LLM-as-Judge on top.
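The two layers can be wired as a gate: deterministic checks run first, and only a clean result reaches the LLM reviewer. The commands below are illustrative; substitute your project's actual tools.

```python
import subprocess

# Hypothetical deterministic layer: same input, same output, every time.
DETERMINISTIC_CHECKS = [
    ["mypy", "."],                        # type checking
    ["ruff", "check", "."],               # lint / naming conventions
    ["pytest", "--cov-fail-under=80"],    # coverage threshold
]

def deterministic_gate(run=subprocess.run):
    """Fail fast before any LLM is involved."""
    return all(run(cmd).returncode == 0 for cmd in DETERMINISTIC_CHECKS)

def review(diff, llm_review, run=subprocess.run):
    if not deterministic_gate(run):
        return "fix deterministic failures first"
    return llm_review(diff)  # design, readability, security patterns
```

The `run` parameter is injectable so the gate itself is testable without installing the tools.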
Step 2: Define Review Criteria with PASS/FAIL
Not "check security" but "check for SQL injection, XSS, and auth bypass. FAIL if any are found." Explicit criteria that leave no room for interpretation.
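Explicit criteria can be encoded as a prompt plus a strict verdict parser. The criterion texts follow the example above; the prompt shape and parser are assumptions, not a specific vendor's API.

```python
# Hypothetical PASS/FAIL criteria for a security sub-agent.
SECURITY_CRITERIA = [
    "SQL injection: query built by string concatenation -> FAIL",
    "XSS: user input rendered without escaping -> FAIL",
    "Auth bypass: endpoint missing an authorization check -> FAIL",
]

def build_prompt(diff, criteria=SECURITY_CRITERIA):
    rules = "\n".join(f"- {c}" for c in criteria)
    return ("Review the diff against each criterion. Answer PASS or FAIL "
            f"per criterion, nothing else.\n{rules}\n\n{diff}")

def parse_verdict(response):
    """Any FAIL fails the whole review: no room for interpretation."""
    return "FAIL" if "FAIL" in response.upper() else "PASS"
```

A free-form "looks mostly fine" response never slips through, because anything short of all-PASS is treated as a failure.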
Step 3: Isolate Concerns with Sub-agents
Don't give one agent all review concerns. Separate into security agent, performance agent, readability agent. Each gets a focused context window, uncontaminated by other concerns.
Step 4: Feed Failures Back Into the Harness
When AI review misses something:
- Add the case as a linter rule or evaluation criterion
- Incorporate as a regression test
- Guarantee the same failure never recurs
This is the core loop of harness engineering.
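The feedback step can be sketched as a registry where each review miss becomes a permanent, deterministic check. The detector below is an illustrative example, not a real incident.

```python
REGRESSIONS = []

def add_regression(name, detector):
    """Encode a missed case so the same failure can never recur unnoticed."""
    REGRESSIONS.append((name, detector))

def run_regressions(diff):
    """Run every recorded detector against a new diff."""
    return [name for name, detector in REGRESSIONS if detector(diff)]

# Example: AI review once missed a hardcoded credential in a diff,
# so the case is encoded as a detector that runs on every future diff.
add_regression("hardcoded-secret", lambda diff: "password =" in diff)
```

Over time the registry accumulates every past miss, so the harness gets strictly harder to fool.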
Summary
| Organization | Approach | Core Insight |
|---|---|---|
| OpenAI | Agent-to-agent review | Humans only on escalation |
| Anthropic | Initializer + Coding Agent | One feature at a time, incrementally |
| HumanLayer | 6 levers | Sub-agent = context firewall |
| Martin Fowler | Deterministic + LLM hybrid | Custom linters = teaching moments |
Harness engineering isn't "how to delegate work to AI." It's "how to build an environment where AI failures are safe."
You wouldn't ride a horse without reins. AI agents are the same: if you're going to let them run, design the harness first.
References
- Harness engineering (OpenAI)
- Harness Engineering (Martin Fowler)
- Effective harnesses for long-running agents (Anthropic)
- Skill Issue: Harness Engineering for Coding Agents (HumanLayer)
- Mitchell Hashimoto: My AI Adoption Journey
For more on Context Engineering and harness design, check out my book:
📕 MCP Server Security Practice Guide (Amazon Kindle)