Anthropic's $200 Experiment: How AI Success Rate Jumped From 20% to 100% With a Harness

#ai #opensource #python #devops

Summary

Anthropic ran a controlled experiment: Opus 4.5 on its own cost $9 and succeeded 20% of the time. Adding a harness with 5 subsystems raised the success rate to 100%, at a cost of $200. OpenAI reported a similar effect on a million-line repo after adding a single AGENTS.md file. Stop swapping models. Build your harness first.

The Experiment

Config	Cost	Success Rate
Opus 4.5 solo	$9	20%
Opus 4.5 + Harness	$200	100%

The $191 premium was all verification loops: compile, test, lint, type check.
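A verification loop of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the harness used in the experiment: `agent_step` is a hypothetical callable standing in for one agent turn, and the check commands are whatever your project uses for compiling, testing, linting, and type checking.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_checks(checks: dict[str, Sequence[str]]) -> list[CheckResult]:
    """Run every check command and record its exit status and output."""
    results = []
    for name, cmd in checks.items():
        proc = subprocess.run(list(cmd), capture_output=True, text=True)
        results.append(
            CheckResult(name, proc.returncode == 0, proc.stdout + proc.stderr)
        )
    return results


def verification_loop(
    agent_step: Callable[[str], None],  # hypothetical: one agent turn, given feedback
    checks: dict[str, Sequence[str]],
    max_rounds: int = 5,
) -> bool:
    """Let the agent work, verify, and feed failures back until checks pass."""
    feedback = ""
    for _ in range(max_rounds):
        agent_step(feedback)
        failed = [r for r in run_checks(checks) if not r.passed]
        if not failed:
            return True  # only now may the agent claim it is done
        # The raw tool output becomes the prompt for the next round.
        feedback = "\n".join(f"[{r.name}]\n{r.output}" for r in failed)
    return False
```

Each extra round costs tokens, which is presumably where the added spend goes. The loop only returns `True` when every check exits with status 0.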

The 5 Harness Subsystems

Subsystem	What It Prevents
Instructions	Agent doesn't know project conventions
Tools	Unauthorized operations, accidental deletes
Environment	"Works on my machine" syndrome
State	Cross-session amnesia
Feedback	Premature victory declarations

The 3 Fatal Failure Modes

Premature Victory — Agent writes 500 lines, declares "done", CI goes red.
Fix: a pre-commit hook that runs `npx tsc --noEmit`, so type errors block the commit.
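Git runs the executable at `.git/hooks/pre-commit` before each commit and aborts the commit if it exits non-zero. A hook along these lines would enforce the type check. This is a sketch: the function name `type_check` is made up for illustration, and it assumes `npx` and TypeScript are installed in the project.

```python
import shutil
import subprocess
import sys
from typing import Sequence


def type_check(cmd: Sequence[str] = ("npx", "tsc", "--noEmit")) -> int:
    """Return the checker's exit code; any non-zero value should block the commit."""
    if shutil.which(cmd[0]) is None:
        # Fail closed: a missing tool must not count as a passing check.
        print(f"pre-commit: '{cmd[0]}' not found on PATH", file=sys.stderr)
        return 1
    return subprocess.run(list(cmd)).returncode
```

To use it as a real hook, save the file as `.git/hooks/pre-commit`, add a `#!/usr/bin/env python3` first line and a final `raise SystemExit(type_check())`, and mark it executable.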

Context Amnesia — Agent adds feature but breaks existing one.
Fix: a `MEMORY.md` file that the agent reads for previous state before acting.
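One simple way to implement this is a pair of helpers that load the file at the start of a session and append a dated note at the end. The helper names and the bullet-per-note layout are assumptions for illustration; the post does not prescribe a format for `MEMORY.md`.

```python
from datetime import datetime, timezone
from pathlib import Path


def load_memory(path: Path) -> str:
    """Return earlier session notes, or an empty string on the first run."""
    return path.read_text(encoding="utf-8") if path.exists() else ""


def append_memory(path: Path, note: str) -> None:
    """Append one timestamped bullet so the next session can see what changed."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with path.open("a", encoding="utf-8") as fh:
        fh.write(f"- {stamp}: {note}\n")
```

The text returned by `load_memory` would be placed in the agent's context before it starts work, so it knows which features already exist and must keep working.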

Tool Abuse — Agent runs destructive commands without asking.
Fix: a tool whitelist, so the agent can only run commands that were approved in advance.
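A whitelist check can be as small as the sketch below. The allowed set is a hypothetical example, and the operator filter is deliberately conservative: it rejects any command containing shell chaining characters, even inside quotes.

```python
import shlex

# Hypothetical allow-list; adjust to the tools your project actually needs.
ALLOWED_COMMANDS = frozenset({"git", "npm", "npx", "ls", "cat"})

# Characters that let one command smuggle in another.
SHELL_OPERATORS = (";", "&", "|", "`", "$(")


def is_allowed(command: str, allowed: frozenset = ALLOWED_COMMANDS) -> bool:
    """Allow a command only if its program name is whitelisted and it is not chained."""
    if any(op in command for op in SHELL_OPERATORS):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: refuse rather than guess what was meant.
        return False
    return bool(tokens) and tokens[0] in allowed
```

With this set, `is_allowed("git status")` passes while `is_allowed("rm -rf /")` and `is_allowed("ls; rm -rf /")` are refused. A real harness would likely also restrict arguments and working directories.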

OpenAI's Million-Line Confirmation

OpenAI added one AGENTS.md file (under 100 lines) to a million-line repo. The resulting increase in success rate was comparable to Anthropic's findings.

Quick Wins: Build Your Harness Today

Priority	Task	Time
🥇	`AGENTS.md` in repo root	30 min
🥇	Pre-commit CI (tsc/lint/test)	1 hour
🥈	`MEMORY.md` for session state	20 min
🥈	Tool whitelist config	30 min
🥉	`setup.sh` for environment	30 min
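To see which of these pieces a repository already has, a small audit script can map each subsystem to the file that implements it. `AGENTS.md`, `MEMORY.md`, and `setup.sh` come from the table above; the pre-commit hook path and the whitelist filename `tool-whitelist.json` are assumptions chosen for illustration.

```python
from pathlib import Path

# Subsystem -> file expected in the repo. Two of these paths are assumptions.
HARNESS_FILES = {
    "Instructions": "AGENTS.md",
    "Feedback": ".git/hooks/pre-commit",  # assumed location of the check hook
    "State": "MEMORY.md",
    "Tools": "tool-whitelist.json",  # hypothetical filename
    "Environment": "setup.sh",
}


def audit_harness(repo_root: Path) -> dict[str, bool]:
    """Report, per subsystem, whether its file exists under repo_root."""
    return {
        name: (repo_root / rel).is_file() for name, rel in HARNESS_FILES.items()
    }
```

Running `audit_harness(Path("."))` from the repo root returns a dictionary such as `{"Instructions": True, "Feedback": False, ...}`, which doubles as a to-do list for the missing pieces.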

FAQ

Q: Is $200 a lot for $9 worth of work? A: The $200 run delivers working code. The $9 run fails four times out of five.

Q: Does this apply to small projects? A: Yes. Even one file benefits from AGENTS.md + verification.

Q: Does it work with any AI? A: Yes. The pattern is model-agnostic.

The model is the engine. The harness is the steering wheel, brakes, and seatbelt.

DEV Community