DEV Community

Ken Imoto


Harness Engineering for AI Code Review -- How OpenAI, Anthropic, and HumanLayer Control Agent-to-Agent Review

The Problem: Code Review Can't Keep Up

AI agents now write code 10x faster than humans.

OpenAI's Codex team generated over 1 million lines of code in 5 months, with 3 engineers merging an average of 3.5 PRs/day each. Anthropic's long-running agents code continuously for 6+ hours.

New problem: code review can't keep up.

Imagine a factory line running 10x faster, but the quality inspectors are the same headcount. The inspection queue stretches out the door.

OpenAI's answer: agent-to-agent review -- AI reviews AI-written code. But "just ask AI to review" doesn't work. You need a control system. That system is the harness, and the discipline of designing it is harness engineering.

What Is Harness Engineering?

Mitchell Hashimoto (HashiCorp founder) defined it:

When an agent makes a mistake, improve the environment so it never makes the same mistake again.

HumanLayer positions this as a subset of Context Engineering.

Harness engineering = designing the configuration that manages an agent's context window. We went from tweaking prompts (prompt engineering) to designing entire environments (harness engineering).

OpenAI Codex Team's Approach

AGENTS.md as a "Table of Contents"

OpenAI initially built a massive AGENTS.md -- coding conventions, architecture decisions, project context, everything in one file. It failed.

Context is a scarce resource. A giant instruction file pushes out task details, code, and relevant documentation. When everything is "important," nothing is.

The fix: keep AGENTS.md to ~100 lines as a table of contents.

# AGENTS.md (~100 lines)

## Architecture
→ docs/architecture/overview.md

## API Conventions  
→ docs/api/conventions.md

## Testing
→ docs/testing/strategy.md

## Security
→ docs/security/guidelines.md

Details live in docs/. The agent references them only when needed. This is Progressive Disclosure applied to AI context.
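The steps above can be sketched in code. Here is a minimal, hypothetical illustration of progressive disclosure: parse the AGENTS.md table of contents, then pull in only the docs whose section is relevant to the current task (the parsing format and matching heuristic are my assumptions, not OpenAI's actual implementation):

```python
# Hypothetical sketch of progressive disclosure over an AGENTS.md
# table of contents. The "## Section" + "→ path" format mirrors the
# example above; the relevance heuristic is an illustrative stand-in.

def parse_toc(agents_md: str) -> dict[str, str]:
    """Map each '## Section' heading to the doc path on the arrow line below it."""
    toc, section = {}, None
    for line in agents_md.splitlines():
        if line.startswith("## "):
            section = line[3:].strip()
        elif line.startswith("→ ") and section:
            toc[section] = line[2:].strip()
    return toc

def docs_for_task(task: str, toc: dict[str, str]) -> list[str]:
    """Load only the docs whose section name appears in the task description."""
    return [path for name, path in toc.items() if name.lower() in task.lower()]

toc = parse_toc("""## Architecture
→ docs/architecture/overview.md

## Testing
→ docs/testing/strategy.md""")

# Only the testing doc enters the context window; the rest stays out.
print(docs_for_task("Add testing for the payments module", toc))
```

The point is the shape, not the heuristic: the table of contents is cheap to keep in context, and the expensive detail is fetched on demand.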

The Agent-to-Agent Review Loop

Here's OpenAI's actual flow:

  1. Codex generates code changes
  2. Codex runs its own local review
  3. Requests additional agent reviews (local + cloud)
  4. Responds to feedback and fixes
  5. Loops until all agent reviewers pass
  6. Humans intervene only on escalation

Humans step in for exactly 3 cases: new architecture decisions, security-sensitive changes, and product direction calls. Everything mechanical is agent-to-agent.
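The loop above can be sketched as a small control function. Everything here is a hypothetical stand-in (the reviewer callables, the escalation labels, the fix counter); it shows only the control flow: loop until all agent reviewers pass, escalate on the three human-only cases or when the loop stalls:

```python
# Sketch of the agent-to-agent loop, with hypothetical names throughout.

ESCALATION_TRIGGERS = {"new-architecture", "security-sensitive", "product-direction"}

def review_loop(change: dict, reviewers: list, max_rounds: int = 5) -> str:
    """Loop until every agent reviewer passes, or escalate to a human."""
    for _ in range(max_rounds):
        if ESCALATION_TRIGGERS & set(change.get("labels", [])):
            return "escalate-to-human"          # the three human-only cases
        feedback = [review(change) for review in reviewers]  # local + cloud agents
        failures = [f for f in feedback if f != "pass"]
        if not failures:
            return "merged"
        change["fixes"] = change.get("fixes", 0) + 1  # agent responds and fixes
    return "escalate-to-human"                   # stuck loop falls back to a human

# Toy reviewer: passes once at least one fix round has happened.
lint_agent = lambda c: "pass" if c.get("fixes", 0) >= 1 else "fail: style"
print(review_loop({"labels": []}, [lint_agent]))  # → merged
```

The `max_rounds` cap matters: an unbounded fix loop between two agents is itself a failure mode worth escalating.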

Educational Linter Design

OpenAI's custom linters embed "why" and "how to fix" in every error message:

ERROR: Module 'payments' imports from 'users' internal package.
WHY: Cross-module internal imports break module boundaries.
     See docs/architecture/module-boundaries.md
FIX: Use the public API: import { getUserById } from '@app/users'

Error messages = teaching moments. The agent doesn't need to understand the entire architecture. It just needs clear feedback when it crosses a boundary.

Anthropic's Two-Phase Approach

Anthropic tackles a different angle: the "memory gap" problem in long-running agents.

Initializer Agent + Coding Agent

Session 1 (Initializer Agent):
  → Creates init.sh
  → Creates claude-progress.txt
  → Generates 200+ feature list as JSON (all passes: false)
  → Initial git commit

Session 2+ (Coding Agent):
  → Reads claude-progress.txt + git history
  → Implements exactly 1 feature
  → Confirms tests pass
  → Updates passes: true
  → Clean git commit
  → Hands off to next session

The key is one feature at a time, incrementally. This structurally prevents the "try to do everything at once" failure mode.

JSON Over Markdown for Feature Lists

An interesting finding: Anthropic manages feature lists in JSON, not Markdown. The reason: "LLMs tend to improperly rewrite Markdown files, but JSON's strict structure makes it harder to tamper with."

{
  "category": "functional",
  "description": "New chat button creates a new conversation",
  "steps": [
    "Navigate to main interface",
    "Click new chat button",
    "Verify new conversation is created"
  ],
  "passes": false
}

They pair this with a strong instruction: "Do NOT edit or delete tests. Do NOT change passes to true without actually running the test." They call unauthorized test modification "unacceptable."

You wouldn't want a student grading their own exam and reporting "100% correct!" JSON's strict format plus firm instructions: trust but verify.
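The "verify" half can be mechanical: before accepting a `passes: true` claim, re-run the recorded steps. This is a sketch under my own assumptions (the `verify_step` executor is hypothetical; the feature shape matches the JSON above):

```python
# "Trust but verify" sketch: a passes: true claim is only accepted if
# every recorded step actually verifies. verify_step is a hypothetical
# executor for one step string.

def audit_feature(feature: dict, verify_step) -> bool:
    """Return True if the feature's passes flag is consistent with reality."""
    if not feature["passes"]:
        return True                       # nothing claimed, nothing to audit
    return all(verify_step(step) for step in feature["steps"])

feature = {
    "category": "functional",
    "description": "New chat button creates a new conversation",
    "steps": ["Navigate to main interface", "Click new chat button",
              "Verify new conversation is created"],
    "passes": True,
}
print(audit_feature(feature, verify_step=lambda s: True))  # honest claim holds
```

The firm instruction discourages the agent from cheating; the audit catches it when instruction alone isn't enough.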

HumanLayer's 6 Levers

HumanLayer organizes harness components into 6 levers:

| Lever | Role | Code Review Application |
| --- | --- | --- |
| System Prompt | Base instructions | Define review criteria |
| Tools / MCP | External tool integration | Invoke SAST/linters |
| Context | Reference information | Architecture docs |
| Sub-agents | Task isolation | Parallel review by concern |
| Hooks | Automatic triggers | Auto-review on PR creation |
| Skills | Knowledge modules | Security/performance review skills |

Sub-agents deserve special attention. HumanLayer calls them "context firewalls." Run security review and performance review as separate sub-agents, and intermediate noise never pollutes the parent thread.

Getting Started: 4 Steps

Step 1: Build Deterministic Checks First

Before involving LLMs, automate what's automatable. Type checking, import boundary validation, naming conventions, test coverage thresholds. These are deterministic: same input, same output, every time.

LLMs handle what deterministic tools can't: design review, readability assessment, security pattern recognition. Use both layers together -- deterministic as the foundation, LLM-as-Judge on top.
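As one concrete deterministic layer, here is a sketch of a naming-convention check (the snake_case rule is an illustrative choice, not a prescribed standard): same source in, same verdict out, every time.

```python
# One deterministic check as a sketch: flag function names that aren't
# snake_case. The rule itself is illustrative; the determinism is the point.

import ast
import re

SNAKE_CASE = re.compile(r"^[a-z_][a-z0-9_]*$")

def naming_violations(source: str) -> list[str]:
    """Return every function name in `source` that isn't snake_case."""
    tree = ast.parse(source)
    return [node.name for node in ast.walk(tree)
            if isinstance(node, ast.FunctionDef) and not SNAKE_CASE.match(node.name)]

print(naming_violations("def getUser():\n    pass\n\ndef get_user():\n    pass"))
# → ['getUser']
```

Checks like this run in milliseconds and never disagree with themselves, which is exactly why they belong beneath the LLM layer rather than inside it.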

Step 2: Define Review Criteria with PASS/FAIL

Not "check security" but "check for SQL injection, XSS, and auth bypass. FAIL if any are found." Explicit criteria that leave no room for interpretation.
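Explicit criteria can be encoded as data rather than prose. A minimal sketch (the criteria text and the toy findings format are my assumptions, not a real review schema):

```python
# Sketch: PASS/FAIL criteria as data. A review fails if any named criterion
# is violated; there is no room for "it's probably fine" interpretation.

CRITERIA = {
    "sql-injection": "FAIL if raw string interpolation reaches a query",
    "xss": "FAIL if user input is rendered without escaping",
    "auth-bypass": "FAIL if any endpoint skips the auth middleware",
}

def verdict(findings: dict[str, bool]) -> str:
    """findings maps criterion name -> violated? Missing keys count as clean."""
    failed = [name for name in CRITERIA if findings.get(name, False)]
    return "FAIL: " + ", ".join(failed) if failed else "PASS"

print(verdict({"sql-injection": False, "xss": True, "auth-bypass": False}))
# → FAIL: xss
```

The same structure doubles as the contract for an LLM-as-Judge prompt: the model is asked to fill in `findings`, not to invent its own rubric.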

Step 3: Isolate Concerns with Sub-agents

Don't give one agent all review concerns. Separate into security agent, performance agent, readability agent. Each gets a focused context window, uncontaminated by other concerns.
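The firewall effect can be shown in miniature. In this hypothetical sketch, each sub-agent keeps its scratch work local and returns only a verdict to the parent (the review functions are toy stand-ins for real sub-agent invocations):

```python
# Context-firewall sketch: intermediate noise stays inside the sub-agent;
# only the final verdict crosses back to the parent thread.

def run_subagent(name: str, review_fn, change: str) -> dict:
    """Isolated run: scratch notes never leave this function's scope."""
    scratch = []                          # intermediate reasoning, never surfaced
    result = review_fn(change, scratch)
    return {"agent": name, "verdict": result}

reviews = [
    run_subagent("security", lambda c, notes: "pass", "diff..."),
    run_subagent("performance", lambda c, notes: "fail: N+1 query", "diff..."),
    run_subagent("readability", lambda c, notes: "pass", "diff..."),
]
print([r["verdict"] for r in reviews])
# → ['pass', 'fail: N+1 query', 'pass']
```

The parent thread sees three short verdicts instead of three full review transcripts, so its own context window stays clean for the next decision.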

Step 4: Feed Failures Back Into the Harness

When AI review misses something:

  1. Add the case as a linter rule or evaluation criterion
  2. Incorporate as a regression test
  3. Guarantee the same failure never recurs

This is the core loop of harness engineering.
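The core loop can be made concrete with a toy harness object (entirely hypothetical structure): each miss becomes a permanent deterministic rule, so the same failure is caught mechanically forever after.

```python
# Sketch of the feedback step: a missed failure becomes a regression rule.
# The Harness class and rule format are illustrative, not a real framework.

class Harness:
    def __init__(self):
        self.rules = []                    # deterministic regression checks

    def add_rule(self, name: str, predicate):
        """Feed a missed failure back in so it can never recur unnoticed."""
        self.rules.append((name, predicate))

    def check(self, code: str) -> list[str]:
        return [name for name, pred in self.rules if pred(code)]

harness = Harness()
# Suppose AI review once missed a bare `except:`; encode it as a rule forever.
harness.add_rule("bare-except", lambda code: "except:" in code)
print(harness.check("try:\n    risky()\nexcept:\n    pass"))
# → ['bare-except']
```

Over time the rule list is the harness's memory: every entry is a mistake the system is now structurally unable to repeat silently.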

Summary

| Organization | Approach | Core Insight |
| --- | --- | --- |
| OpenAI | Agent-to-agent review | Humans only on escalation |
| Anthropic | Initializer + Coding Agent | One feature at a time, incrementally |
| HumanLayer | 6 levers | Sub-agent = context firewall |
| Martin Fowler | Deterministic + LLM hybrid | Custom linters = teaching moments |

Harness engineering isn't "how to delegate work to AI." It's "how to build an environment where AI failures are safe."

You wouldn't ride a horse without reins. AI agents are the same: if you're going to let them run, design the harness first.


References

For more on Context Engineering and harness design, check out my book:
📕 MCP Server Security Practice Guide (Amazon Kindle)
