Harness Engineering Emerges
Part 5 of a chronological survey of the craft around large language models. Part 4 ended on a warning: in agentic systems, small errors compound catastrophically. This installment is the response — as coding agents went mainstream, attention widened from the context you feed a model to the entire system you build around it: the harness.
TL;DR — Agent = Model + Harness. If you're not the model, you're the harness — the guides and sensors (feedforward and feedback) that keep an agent on track. Four industrial accounts (OpenAI, Stripe, Google, Anthropic) converged on the same lesson: the durable work is designing the environment, not picking the model. And the harness ratchets — every mistake becomes a permanent rule.
A definition that reframes the work
By late 2025, the most useful coding tools — Claude Code, OpenAI's Codex, Cursor, Aider, Cline — were no longer "a model you prompt." They were elaborate systems in which a model sat at the center, surrounded by prompts, tools, file access, sandboxes, hooks, and feedback loops. Simon Willison gave the underlying object its durable definition in September 2025: an LLM agent is "something that runs tools in a loop to achieve a goal," and "the art of using them well is to carefully design the tools and loop for them to use." He also named the larger activity — agentic engineering: building software using coding agents whose defining feature is that they can both generate and execute code, letting them test and iterate independently of turn-by-turn human guidance.
The piece of that system other than the model got its own name, which the developer Viv Trivedy crystallized into a slogan that spread quickly:
Agent = Model + Harness. If you're not the model, you're the harness.
A raw model, on this view, is not an agent. It becomes one only when a harness gives it state, tool execution, feedback loops, and enforceable constraints. As Addy Osmani put it in his synthesis of the idea, the harness is "every piece of code, configuration, and execution logic that isn't the model itself" — system prompts, CLAUDE.md/AGENTS.md files, skills, tools, MCP servers, the sandbox and filesystem, orchestration logic, hooks, and observability.
The reframing matters because of where it locates the engineering leverage. Osmani's claim, echoing Trivedy and the team at HumanLayer, is blunt: "A decent model with a great harness beats a great model with a bad harness." The loud public debate is about the model — which is smartest, which writes the cleanest code. But the model is one input; the rest is the harness, and "increasingly the interesting engineering isn't in picking the model, it's in designing the scaffolding around it."
The most striking evidence for that claim is a benchmark data point Osmani relays from Trivedy and HumanLayer: on Terminal-Bench, a coding agent moved from roughly Top 30 to Top 5 by changing only the harness — same model. The explanation is that models are post-trained coupled to a particular harness; move a model into a better harness for your codebase, with sharper tools and tighter feedback, and you can unlock capability the original harness left on the floor. This is the opposite of the "just wait for the next model" narrative. As Osmani frames it, "the gap between what today's models can do and what you see them doing is largely a harness gap."
Two boundaries of one word
"Harness" gets used at different scopes, and it's worth separating them, because the rest of this part lives at the outermost layer. Birgitta Böckeler (writing on Martin Fowler's site) draws the distinction as concentric rings:
-
User harness (you build this) — AGENTS.md, skills, hooks, custom linters, review agents, CI sensors. Wraps:
- Builder harness (the agent vendor) — system prompt, tool set, retrieval. Wraps:
- Model — the thing being harnessed.
The model is the core. Around it, the coding-agent vendor builds an inner "builder harness" (the system prompt, the tool set, the code-retrieval mechanism). And around that, you — the user of a coding agent — build an outer "user harness" tailored to your codebase. Böckeler's articles, and most of this part, are about engineering that outermost ring. A well-built outer harness, she writes, does two things: it raises the odds the agent gets it right the first time, and it provides a feedback loop that self-corrects many issues before they ever reach a human — reducing review toil, raising quality, and wasting fewer tokens.
The mental model: guides and sensors, feedforward and feedback
Böckeler's central contribution is a vocabulary borrowed, deliberately, from cybernetics — the study of control systems. A harness acts like a governor, regulating the codebase toward a desired state using two kinds of control:
-
Guides (feedforward controls) anticipate the agent's behavior and steer it before it acts. They raise the probability of a good result on the first try. Examples: coding-convention docs,
AGENTS.md, skills, reference docs, how-to guides, codemods. - Sensors (feedback controls) observe after the agent acts and help it self-correct. Examples: type checkers, linters, tests, static analysis, AI review agents.
A harness with only feedforward "encodes rules but never finds out whether they worked." A harness with only feedback "keeps repeating the same mistakes." You need both. And each kind of control comes in two execution flavors:
- Computational — deterministic, fast, run by the CPU: tests, linters, type checkers, structural analysis. Milliseconds to seconds; reliable.
- Inferential — semantic, run by a model: AI code review, "LLM-as-judge." Slower, costlier, non-deterministic, but capable of judgment a linter can't make.
A particularly elegant idea in Böckeler's framing is that the best sensors produce feedback optimized for a model to consume — a custom linter message that doesn't just say "error" but includes instructions for how to fix it. She calls this "a positive kind of prompt injection": the sensor speaks to the agent in terms the agent can act on.
The steering loop
The human's job, in this picture, is to steer by iterating on the harness. When a problem recurs, you improve the guides and sensors so it becomes less likely or impossible next time. And because coding agents themselves make it cheap to build controls, agents can help write the structural tests, draft rules from observed patterns, scaffold custom linters, or generate how-to guides from codebase archaeology. The harness becomes self-reinforcing.
Keep quality left
Böckeler maps the controls onto the software lifecycle with a "shift left" principle: run the cheap, fast checks as early as possible (before commit, before integration), and reserve expensive ones (broad AI review, mutation testing) for later in the pipeline. The earlier an issue is caught, the cheaper it is to fix — a continuous-integration intuition, now applied to agent output.
What a harness regulates, and Ashby's Law
Böckeler distinguishes three regulation targets: a maintainability harness (internal code quality — the easiest, with mature tooling), an architecture-fitness harness (performance, observability, and other architectural characteristics — Fowler's "fitness functions"), and a behaviour harness (does the app actually do what's needed — "the elephant in the room," and still largely unsolved). She invokes Ashby's Law of Requisite Variety from cybernetics: a regulator must have at least as much variety as the system it governs. Since an LLM can produce almost anything, committing to a constrained architecture (a fixed topology, predictable structure) is a variety-reduction move that makes a comprehensive harness achievable. This reappears, concretely, in OpenAI's practice below.
Sensors in practice
Böckeler's hands-on follow-up walks through building maintainability sensors on a real TypeScript/Next.js app, and the findings sharpen the abstract model:
- Basic linting (ESLint) caught the low-hanging AI failure modes — over-long functions and files, too many arguments, high cyclomatic complexity — but these rules weren't even on by default; she had to configure them, and notes that linters may evolve presets aimed at known agent failure modes. Crucially, she rewrote lint messages into self-correction guidance (the "positive prompt injection"), even letting the agent suppress a warning with a written reason or slightly raise a threshold — keeping constraints visible and reviewable rather than forcing a binary comply-or-suppress choice.
- Dependency rules (dependency-cruiser) enforced layered module boundaries (e.g. "clients must not import from services"), with error messages that re-explain the layering so the agent self-corrects. She notes AI absorbed the tool's steep configuration cost almost entirely.
- Coupling metrics alone proved not very useful to an AI — raw import/call graphs are noisy without semantic interpretation, and the model flagged legitimate patterns (a DI factory, a shared schema) as problems.
- AI modularity review (an inferential sensor, using a powerful prompt) was the standout — it found real, valid issues a human reviewer would care about: duplicated route code, inconsistent backend-calling patterns, a parameter object that should have been introduced (a change that had once touched 40+ files), auth logic sitting in the wrong place. This is "garbage collection" as a recurring inferential sensor.
- The test suite as a regression sensor, with mutation testing (Stryker) to catch the gap coverage misses: a file showed 100% statement coverage but had 13 surviving mutants and no real unit tests — coverage proved a line ran, not that its behavior was verified. As teams let AI write most tests, Böckeler argues mutation testing becomes crucial to monitor test effectiveness.
Her honest caveats are part of the contribution: computational sensors impressed her at the file/ function level but were noisy across module boundaries; she worried about feedback overload sending an agent into over-engineered refactoring spirals; and she flagged emerging conflicts between sensors (a max-lines rule pushing complexity into ever-longer component property chains). The sensors raised her trust and improved her review experience — but did not remove the human.
Industry at scale: three accounts
What makes early 2026 the moment harness engineering "emerged" is that several organizations published detailed accounts of doing it at scale. Read together, they triangulate the same practices.
OpenAI: a million lines, zero hand-written
OpenAI's "Harness engineering: leveraging Codex" (February 2026) describes building and shipping an internal product with zero lines of manually-written code — every line written by Codex, in roughly one-tenth the time hand-coding would have taken. Starting from an empty repository in August 2025, a small team drove Codex to about a million lines of code and ~1,500 merged pull requests, at an average of 3.5 PRs per engineer per day — throughput that increased as the team grew. The governing philosophy: "Humans steer. Agents execute."
The lessons are a catalogue of harness engineering:
- The role of the engineer changes. Early progress was slow "not because Codex was incapable, but because the environment was underspecified." The job became: when the agent fails, ask "what capability is missing, and how do we make it both legible and enforceable for the agent?" — then have Codex build that capability.
- Make the application legible to the agent. They wired the Chrome DevTools Protocol and a local observability stack into the agent runtime so Codex could drive the UI, read logs (LogQL) and metrics (PromQL), reproduce bugs, and verify its own fixes. Prompts like "ensure service startup completes in under 800ms" became tractable. Single Codex runs sometimes worked for six-plus hours, often overnight.
-
Repository knowledge as the system of record — a map, not a manual. The "one big
AGENTS.md" approach failed predictably: it crowded out the task, became non-guidance ("when everything is important, nothing is"), rotted instantly, and resisted verification. The fix: treatAGENTS.mdas a ~100-line table of contents pointing into a structureddocs/directory (design docs, execution plans, product specs, a quality-grade doc), enforced by linters and CI, with a recurring "doc-gardening" agent opening fix-up PRs for stale docs. They call this progressive disclosure: start the agent with a small stable entry point, teach it where to look next. - Agent legibility is the goal. "From the agent's point of view, anything it can't access in-context while running effectively doesn't exist." Knowledge in Slack threads or people's heads is invisible; it must be encoded into the repo as versioned artifacts. They favored "boring" technologies the model could fully model, and sometimes reimplemented small utilities rather than depend on opaque packages.
- Enforce invariants, not implementations. A rigid layered architecture (Types → Config → Repo → Service → Runtime → UI, with cross-cutting concerns entering through one explicit interface) enforced mechanically by custom, Codex-written linters and structural tests. "This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it's an early prerequisite." (This is Ashby's Law in practice: constrain the variety to make the system governable.)
- Throughput changes the merge philosophy. Minimal blocking merge gates, short-lived PRs, flakes handled with re-runs rather than indefinite blocking — "in a system where agent throughput far exceeds human attention, corrections are cheap, and waiting is expensive."
- Entropy and garbage collection. Agents replicate existing patterns, including bad ones, so the codebase drifts. The team initially spent every Friday — 20% of the week — cleaning "AI slop," which didn't scale. They replaced it with encoded "golden principles" and recurring background Codex tasks that scan for deviations and open targeted refactoring PRs, most auto-merged in under a minute. "Technical debt is like a high-interest loan": pay it down continuously.
Their concluding line is the thesis of this whole part: "Our most difficult challenges now center on designing environments, feedback loops, and control systems."
Stripe: one-shot agents at PR scale
Stripe's "Minions" write-up (Alistair Gray, February 2026) describes homegrown coding agents responsible for more than a thousand merged pull requests per week — humans review the code, but minions write it start to finish, with no human-written code.
The context that makes this hard is specific. Stripe's codebase spans hundreds of millions of lines across a few large repos, mostly Ruby (not Rails) with Sorbet typing — "a relatively uncommon stack" full of homegrown libraries "natively unfamiliar to LLMs" — and it moves well over a trillion dollars of payment volume in production, under real regulatory and compliance constraints.
Stripe's stated principle is the same one Osmani and Böckeler arrive at: "if it's good for humans, it's good for LLMs, too." Minions use the same developer tooling as human engineers. Several harness pieces stand out, each mapping onto the guides/sensors model:
- Devbox — the sandbox. A minion run starts in an isolated developer environment, the same kind of machine Stripe engineers write code on, pre-warmed to spin up in about ten seconds with Stripe code and services pre-loaded. Because devboxes are isolated from production and the internet, minions run "without human permission checks," and this gives parallelization "without the overhead of something like git worktrees, which wouldn't scale at Stripe." This is the embodiment of Osmani's point that your dev environment is your agent environment.
- An opinionated orchestration loop. The core agent loop runs on a fork of Block's open-source agent goose, customized to "interleave agent loops and deterministic code — for git operations, linters, testing, and so on — so that minion runs mix the creativity of an agent with the assurance that they'll always complete Stripe-required steps like linters." In harness terms, the deterministic steps are computational guides and sensors; the agent loop is where inference happens. (Stripe's Part 2 names this structure "blueprints.")
- Rules and Toolshed — the context. Minions read the same coding-agent rule files that human-operated tools like Cursor and Claude Code do; because Stripe can't have many unconditional rules, almost all are "conditionally applied based on subdirectories" — the same scoping discipline Part 4 calls context curation. For everything not in the filesystem, minions use MCP, and Stripe built a central internal MCP server called Toolshed hosting more than 400 MCP tools across internal systems and SaaS platforms; minions and other agents connect to "configurable but curated subsets" rather than the full set. Stripe also "deterministically run[s] relevant MCP tools over likely-looking links before a minion run even starts, to better hydrate the context" — proactive context-loading as a harness step.
- The feedback loop — the sensors, "shifted left." Stripe's framing is explicit: any lint that would fail in CI should be "enforced in the IDE or on a git push, and presented to the engineer immediately." Concretely: (1) an automated local executable uses heuristics to select and run relevant lints on each git push in under five seconds; (2) if that passes, CI selectively runs tests from Stripe's battery of over three million, auto-applying fixes where they exist and sending un-autofixable failures back to the minion; (3) a hard cap of at most two CI rounds — push, fix once, and stop. Stripe is candid about why: CI runs "cost tokens, compute, and time," and there are "diminishing marginal returns for an LLM to run many rounds of a full CI loop." This is a deliberate, economically-reasoned answer to the infinite-retry failure mode.
The throughline is the same guides-and-sensors pattern as OpenAI's account — a different company, a different stack, at scale — and it doubles as a case study in shifting feedback left and curating context, the lessons of Parts 3 and 4 made industrial.
Anthropic: harnesses for long-running work
Anthropic's "Effective harnesses for long-running agents" (November 2025) — which Osmani calls the best public breakdown of designing a harness for long-running work — tackles the problem of agents working across many context windows, where each new session starts with no memory of the last (like engineers working shifts with amnesia between them). Their finding: compaction alone is insufficient. Even a frontier model in a loop fails to build a production app from a high-level prompt, in two characteristic ways — trying to one-shot the whole app (and running out of context mid-feature), or, later, looking around, seeing progress, and prematurely declaring the job done.
Their two-part solution is pure harness design.
An initializer agent runs once, on the first session: it sets up the environment — a structured JSON feature list (200+ end-to-end features, all initially marked "failing"), an init.sh script, a progress file, and an initial git commit.
A coding agent then runs every subsequent session. It reads the progress file and git log to get oriented, runs a basic end-to-end test, works on exactly one feature, verifies it with real browser automation (not just unit tests), and leaves the environment clean and committed for the next session.
The feature list uses JSON specifically because the model is less likely to inappropriately edit it than Markdown — with strongly-worded instructions that removing or editing tests is "unacceptable." It's the same "leave clear artifacts for the next shift" discipline OpenAI encoded in its docs directory: externalized memory as a harness component.
A fourth account: Google productizes the harness
By mid-2026, Google had taken the harness concept and turned it into product surface. Its agent-first development platform, Antigravity, ships a Managed Agents API whose framing is explicitly harness-shaped: in Google Cloud's own words, "the agent harness runs on our servers, and each agent has its own ephemeral sandbox provisioned with your skills, Model Context Protocol (MCP) servers, and server-side tools." A single API call provisions a sandboxed agent that can write, execute, and iterate on code; Antigravity itself can spin subagents from one prompt and run multi-agent orchestration in parallel. This is the harness — sandbox, tools, skills, subagents — sold as managed infrastructure rather than assembled by hand, roughly the "Harness-as-a-Service" direction this part's synthesis anticipates.
Google DeepMind also published a small but instructive experiment on one harness component in isolation: the agent skill. The problem it targets is universal — a model's training data goes stale, so it doesn't know about new SDK versions or shifted best practices.
They built a gemini-api-dev skill and an evaluation harness of 117 prompts generating Python/TypeScript against the Gemini SDKs. The result was a clean demonstration of how much a single guide can matter: the skill lifted pass rates dramatically from low baselines (Gemini 3.1 Pro from 28%; newer 3.0-series models from as low as 6.8%), and helped across almost all task categories — a concrete, measured version of the "feedforward guide" idea.
Notably, DeepMind also flagged the maintenance problem Fowler raises: skills have "no great update story," so stale skills could eventually "do more harm than good" — the same drift-and-versioning worry that haunts every guide.
The ratchet, and why harnesses don't shrink
Two ideas from Osmani's synthesis tie the practice together and point forward.
The ratchet: every mistake becomes a rule. The defining habit of harness engineering is treating agent mistakes as permanent signals, not one-off bad runs. The agent commits a commented-out test; the next AGENTS.md says never do that, the next pre-commit hook greps for .skip(, the next reviewer subagent flags it as a blocker.
Every line in a good
AGENTS.mdshould trace back to a specific thing that once went wrong.
You add constraints only after a real failure, and remove them only when a more capable model makes them redundant. This is also why a harness can't be downloaded: "the right harness for your codebase is shaped by your failure history." (Osmani and HumanLayer both stress the corollary: keep AGENTS.md short — HumanLayer keeps theirs under 60 lines — because every line competes for the model's attention. A pilot's checklist, not a style guide. This is the same lesson OpenAI learned the hard way with its monolithic-manual failure.)
Harnesses don't shrink, they move. The naive expectation is that better models make harnesses obsolete. Osmani, drawing on Anthropic, argues the opposite: as models improve, the useful harness work moves rather than disappearing.
When Opus 4.6 largely killed the "context-anxiety" failure mode (earlier models wrapping up prematurely as they neared a perceived context limit), a whole class of anxiety-mitigation scaffolding became dead code. But the higher ceiling brought new failure modes — multi-day memory, coordinating several specialized agents, judging design quality in generated UIs — that need new scaffolding.
As Anthropic puts it: "every component in a harness encodes an assumption about what the model can't do on its own." When the model gets better at something, that component comes out. When it unlocks something new, new scaffolding goes in to reach the new ceiling.
There's a feedback loop underneath this, which Trivedy names: today's agent products are post-trained with harnesses in the loop, so models get specifically better at the actions harness designers built around — filesystem ops, bash, planning, subagent dispatch — which is why the same model can feel different in different harnesses, and why a harness is "a living system, not a config file you set up once." Osmani also relays Trivedy's Harness-as-a-Service framing: the industry is shifting from building on LLM APIs (which return a completion) to building on harness APIs / agent SDKs (which return a runtime — the loop, tools, context management, hooks, and sandbox out of the box).
Where this leaves us
By early 2026, harness engineering had a definition (Agent = Model + Harness), a mental model (guides and sensors, feedforward and feedback, computational and inferential), hands-on sensor practice (Böckeler), and four detailed industrial accounts (OpenAI, Stripe, Google, Anthropic) all converging on the same conclusion: the hard, valuable work is designing the environment and the feedback loops, not picking the model.
Pick the best model and you've done the easy part. The work is everything you build around it.
But notice what keeps recurring in every account: the loop. Anthropic's session-to-session cycle, OpenAI's six-hour autonomous runs, Stripe's one-shot end-to-end minions, the ratchet that re-injects fixes. The harness is the static structure; the loop is what runs inside it. And as Part 6 and Part 7 show, two further things were happening in parallel — a reckoning with the cost of all those hand-written prompts and rules (prompt debt), and the elevation of the loop itself into the primary object of design (loop engineering).
Key sources for Part 5
- Simon Willison, Designing agentic loops (Sept 2025) and Agentic Engineering Patterns (2025–26) — "an LLM agent runs tools in a loop to achieve a goal"; the definition of agentic engineering (generate and execute code); agent safety / YOLO mode.
- Viv Trivedy, Anatomy of an Agent Harness / "Agent = Model + Harness"; HumanLayer, "skill issue" harness framing; Terminal-Bench Top 30 → Top 5 by harness change. [Via Osmani, with attribution and quotation.]
- Birgitta Böckeler (martinfowler.com), Harness engineering for coding agent users (Apr 2026) — guides/sensors, feedforward/feedback, computational/inferential, the steering loop, "keep quality left," regulation categories, Ashby's Law, harness templates.
- Birgitta Böckeler, Maintainability sensors for coding agents (May 2026) — ESLint, dependency-cruiser, coupling metrics, AI modularity review, mutation testing; self-correction guidance; honest caveats.
- OpenAI (Ryan Lopopolo), Harness engineering: leveraging Codex in an agent-first world (Feb 2026) — 1M LOC / 0 hand-written; AGENTS.md as table of contents; agent legibility; invariant enforcement; garbage collection; "designing environments, feedback loops, control systems."
- Stripe (Alistair Gray), Minions: Stripe's one-shot, end-to-end coding agents, Parts 1 & 2 (Feb 2026) — 1,000+ PRs/week, no human-written code; Devbox (isolated ~10s dev env, parallel without git worktrees); core loop a fork of Block's goose interleaving agent + deterministic code; subdirectory-scoped rule files shared with Cursor/Claude Code; MCP + Toolshed (400+ tools, curated subsets); shift-feedback-left with local lints <5s, selective CI from 3M+ tests, at most two CI rounds. [Part 1 verified from primary text; "blueprints" framing is from Part 2.]
- Anthropic, Effective harnesses for long-running agents (Nov 2025) — initializer vs coding agent; feature-list JSON; progress files; why compaction alone fails; end-to-end self- verification.
- Addy Osmani, Agent Harness Engineering (Apr 2026) — the synthesis: harness anatomy, the "skill issue" reframe, the ratchet, harnesses move rather than shrink, the model-harness training loop, Harness-as-a-Service.
Next up · Part 6 — Prompt Debt & the Limits of Natural Language: why the brittleness prompt engineering hid finally got a name and a bill, and the proposed cure: specify with measurements, and stop writing prompts by hand.


Top comments (0)