Oresztesz Margaritisz

Posted on Jun 29 • Edited on Jul 1

Harness Engineering - Core Principles

#ai #sre #agents #programming

Harness engineering is the discipline of designing the environments, constraints, and feedback loops around AI coding agents that make them reliable at scale. The formula is: Agent = Model + Harness. The harness is everything that isn't the model: the infrastructure that governs how the agent operates, what it can access, and how it self-corrects.

Just as Site Reliability Engineering (SRE) applies software engineering principles to operations problems, harness engineering applies infrastructure thinking to AI agents writing code. The shift-left principle runs throughout: fast, cheap controls (linting, type-checking) execute pre-commit or alongside the agent; expensive controls (mutation testing, architecture review) run post-integration in CI. As with SRE's toil elimination, the goal is not to remove human judgment but to direct it where it matters most.
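
One way to picture the shift-left split is a small routing function that sends each control to a stage based on its cost. This is a minimal sketch, not a prescribed setup: the `Control` type, the 30-second budget, and the timings are all assumptions made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical cutoff: anything slower than this is too expensive to run
# on every agent iteration and is deferred to CI.
PRE_COMMIT_BUDGET_SECONDS = 30.0


@dataclass(frozen=True)
class Control:
    name: str
    typical_seconds: float  # illustrative estimate, not a measurement


def stage_for(control: Control) -> str:
    """Place cheap controls next to the agent and expensive ones in CI."""
    if control.typical_seconds <= PRE_COMMIT_BUDGET_SECONDS:
        return "pre-commit"
    return "ci"


controls = [
    Control("lint", 2),
    Control("type-check", 10),
    Control("mutation-testing", 1800),
]
plan = {control.name: stage_for(control) for control in controls}
```

With these made-up timings, linting and type-checking stay in the agent's inner loop, while mutation testing waits for CI.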

Why Principles Matter

Principles are fundamental truths that serve as the foundations for a desired outcome. They are repeatable and generalizable. The SRE book defines them as follows: principles are the patterns, behaviors, and areas of concern that tell you what to value and why; practices tell you how to do it. Principles should describe what Harness Engineering stands for and what falls outside its scope. Additionally, they are the stable, durable rules that drive harness design decisions, while specific tools, configurations, and workflows are their practices.

Taking Gradual Steps in Creating Rules

Every new rule, configuration change, or additional control should trace to a specific past failure or hard external constraint. If it doesn't, it's noise.

Source: ETH Zurich study

Every agent mistake becomes a permanent rule. Only add constraints from real failures; only remove them when the model has made them redundant.

Source: Hashimoto

Rules may also originate from a topological or architectural commitment, not only a past failure. By Ashby's Law (the law of requisite variety), restricting the codebase to a well-defined topology narrows the solution space and makes comprehensive harness coverage achievable. These proactive architectural constraints are valid harness content even if no failure preceded them.

Source: Böckeler / Fowler
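
The traceability requirement can be made mechanical by storing each rule together with its origin. The sketch below is only an illustration under assumptions: the `Rule` and `Origin` names, the `evidence` field, and the sample rules are hypothetical and do not come from any of the cited sources.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Origin(Enum):
    PAST_FAILURE = "past_failure"
    EXTERNAL_CONSTRAINT = "external_constraint"
    ARCHITECTURAL_COMMITMENT = "architectural_commitment"


@dataclass(frozen=True)
class Rule:
    text: str
    origin: Optional[Origin]  # None means nobody recorded why the rule exists
    evidence: str = ""        # e.g. an incident link or ADR id (hypothetical)


def is_noise(rule: Rule) -> bool:
    """A rule without a recorded origin and evidence cannot be traced."""
    return rule.origin is None or not rule.evidence.strip()


def prune(rules: List[Rule]) -> List[Rule]:
    """Keep only the rules that trace to a failure or a commitment."""
    return [rule for rule in rules if not is_noise(rule)]


rules = [
    Rule("Never edit generated files", Origin.PAST_FAILURE, "incident-note-1"),
    Rule("Services talk only through the gateway",
         Origin.ARCHITECTURAL_COMMITMENT, "adr-7"),
    Rule("Prefer short functions", None),
]
kept = prune(rules)
```

Here the third rule is dropped because nothing explains why it exists.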

Shifting the Operating Model

"Humans steer. Agents execute." - humans design systems, agents write code.

Source: OpenAI / Lopopolo

Use the Same Infrastructure for Agents and Humans

Agents, coding assistants, and humans share the same linter, static analysis, CI steps, build, and release pipelines.

Source: OpenAI

Deterministic over Probabilistic

Telling an agent "follow standards" in a prompt ≠ wiring a linter that blocks the PR. Prefer enforcement over instruction.

Source: Augment Code

Probabilistic controls (guides, AGENTS.md, skills) are not a lesser substitute; they are a required complement. Distinguish between feedforward (guides that prevent mistakes before the agent acts) and feedback (sensors that catch them after). Using enforcement alone is feedback-only: repeated mistakes are never prevented at the source. Both directions are necessary; neither alone is sufficient.

Source: Böckeler / Fowler

Success is silent; failures are verbose. If the type check passes, the agent hears nothing. If it fails, the error text is injected into the loop for self-correction.

Source: HumanLayer
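
A minimal sketch of the "silent success, verbose failure" pattern, assuming the sensor is any command that signals failure through a non-zero exit code. The function names are hypothetical, and the two demo commands are stand-ins for a real type checker or linter.

```python
import subprocess
import sys
from typing import List, Optional, Sequence


def run_sensor(command: Sequence[str]) -> Optional[str]:
    """Run a deterministic check: None on success, error text on failure."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return None  # success is silent: nothing enters the agent's context
    return (result.stdout + result.stderr).strip()


def feedback_for_agent(commands: Sequence[Sequence[str]]) -> List[str]:
    """Collect only failure output, ready to be injected into the loop."""
    messages = []
    for command in commands:
        error_text = run_sensor(command)
        if error_text is not None:
            messages.append(error_text)
    return messages


# Stand-ins for real checks, so the sketch runs anywhere Python does.
passing = [sys.executable, "-c", "pass"]
failing = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('type error: x is str'); sys.exit(1)",
]
```

Calling `feedback_for_agent([passing, failing])` returns a single message, the error text of the failing check; the passing check contributes nothing.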

Quality Amplifies

The harness amplifies existing code quality in both directions. A clean, well-structured codebase with strong tests scales well under agent output. A degraded, undocumented, or flaky one degrades faster. Agents don't fix broken infrastructure; they replicate its patterns at volume.

Source: Stripe / Minions

Without mechanical encoding of quality standards, bad patterns compound exponentially. Human taste must be captured once and enforced continuously on every line of code, not left to agent judgment. Linting and style rules must be set to error, never warn; soft guidance is not enforcement.

Source: OpenAI / Lopopolo
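
The "error, never warn" rule can itself be checked mechanically. The sketch below assumes the lint configuration has already been loaded into a plain mapping from rule name to severity; the mapping format, the rule names, and both function names are hypothetical rather than any particular linter's API.

```python
from typing import Dict, List


def soft_rules(severities: Dict[str, str]) -> List[str]:
    """Return the names of rules that only warn instead of blocking."""
    return sorted(name for name, level in severities.items() if level == "warn")


def assert_enforced(severities: Dict[str, str]) -> None:
    """Fail loudly if any rule is configured as soft guidance."""
    offenders = soft_rules(severities)
    if offenders:
        raise ValueError("rules set to warn: " + ", ".join(offenders))


# Illustrative configuration, not taken from a real project.
lint_config = {
    "no-unused-vars": "error",
    "max-line-length": "warn",
    "no-implicit-any": "error",
}
```

Running `assert_enforced(lint_config)` in CI would reject this configuration until `max-line-length` is promoted to an error.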

Agent-only code without human-in-the-loop review scores 1.1/5 on maintainability versus 3.1/5 with human oversight (SIG research). Quality gates - including human review - are a structural layer of the harness, not an optional step to be traded off against delivery speed.

Source: OpenAI / Lopopolo, Böckeler / Fowler

The behavioral correctness gap remains an open problem: current harnesses reliably catch structural and stylistic failures but cannot fully substitute for human judgment on semantic correctness. Harness engineering does not license removing quality gates; it determines where human judgment is directed.

Source: Böckeler / Fowler, Stripe / Minions

Effective Context Engineering

Progressive Disclosure: Don't dump everything into context at once. Reveal instructions, tools, and knowledge only when the task calls for them.

Source: Anthropic
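
As a rough illustration of progressive disclosure, the sketch below adds a guide to the context only when the task mentions its topic. Keyword matching is a deliberately crude stand-in for whatever selection mechanism a real harness uses, and the guide texts and names are assumptions.

```python
from typing import Dict

BASE_INSTRUCTIONS = "Run the test suite before finishing."

# Hypothetical topic guides, keyed by the keyword that triggers them.
GUIDES = {
    "migration": "Database guide: write reversible migrations.",
    "frontend": "Frontend guide: reuse the shared component library.",
}


def build_context(task: str, base: str, guides: Dict[str, str]) -> str:
    """Start from the base instructions and add only the relevant guides."""
    task_lower = task.lower()
    relevant = [text for keyword, text in guides.items() if keyword in task_lower]
    return "\n\n".join([base] + relevant)


context = build_context(
    "Add a migration for the orders table", BASE_INSTRUCTIONS, GUIDES
)
```

For this task the database guide is included and the frontend guide never enters the context.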

Context Window Discipline: Agent performance degrades once more than ~40% of the context window is in use. Keep agents in the "smart zone" by managing what enters context.

Source: Alex Lavaee
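
The ~40% guideline translates into simple arithmetic on token counts. This is a sketch under assumptions: the function names are hypothetical, and the 200,000-token window in the usage note is an illustrative number, not a property of any specific model.

```python
SMART_ZONE_FRACTION = 0.40  # the ~40% utilization figure cited above


def utilization(used_tokens: int, window_tokens: int) -> float:
    """Fraction of the context window currently occupied."""
    return used_tokens / window_tokens


def fits_smart_zone(
    used_tokens: int,
    incoming_tokens: int,
    window_tokens: int,
    limit: float = SMART_ZONE_FRACTION,
) -> bool:
    """Check whether adding new material keeps the agent under the limit."""
    return utilization(used_tokens + incoming_tokens, window_tokens) <= limit
```

With a 200,000-token window and 70,000 tokens already used, adding 5,000 tokens gives 75,000 / 200,000 = 37.5% and fits; adding 20,000 gives 45% and does not, so the harness would summarize or drop material first.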

Harnesses Evolve with Models

As models improve, harness components shift to new frontiers rather than disappearing. Better models unlock harder tasks with new failure modes.

Source: Carlini / Anthropic compiler

Eliminate Toil

Engineers should not just "watch and observe" agents doing their work. Coding agents should be able to work on their own without human supervision, preferably running in a sandbox. Supervising every agent run is analogous to continuously watching log output from servers: beyond a certain scale, it becomes impossible.

However, the goal is to direct human input to where it matters most, not to eliminate it entirely. The harness should reduce, not replace, supervision. Failure modes such as misdiagnosis, overengineering, and functional incorrectness when requirements are unclear cannot be reliably caught by any sensor. Human judgment remains structurally necessary.

Source: SRE book, Böckeler / Fowler

SDD as a Prerequisite

The PEV (Plan-Execute-Verify) loop and effective context management require spec-driven development (SDD). Specifications exist on a spectrum from short prompts to multi-file descriptions; the form varies but the need is constant.

Maintain a structured docs/ directory (design-docs/, exec-plans/, product-specs/) checked into the repository. All Slack alignment, design decisions, and Google Docs specs are converted to markdown and committed.

Source: OpenAI
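
A small check can confirm that the docs/ layout described above actually exists in a repository. The directory names come from the text; the function name and the idea of running it as a CI step are assumptions.

```python
from pathlib import Path
from typing import List

# Subdirectories named in the text above.
REQUIRED_DOC_DIRS = ("design-docs", "exec-plans", "product-specs")


def missing_doc_dirs(repo_root: Path) -> List[str]:
    """List the required docs/ subdirectories that are absent."""
    docs = repo_root / "docs"
    return [name for name in REQUIRED_DOC_DIRS if not (docs / name).is_dir()]
```

A CI step could call `missing_doc_dirs(Path("."))` and fail the build when the returned list is not empty.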

A Slack thread can also be the primary specification source. The agent reads the full conversation (earlier messages, linked tickets, pasted errors) as task input. Scoped rule files in the repo (markdown, directory-scoped) document conventions and preferred patterns.

Source: Stripe / Minions

A functional specification (of varying levels of detail, from a short prompt to multi-file descriptions) is a required input for the agent.

Source: Böckeler / Fowler

Separate Thinking from Typing: Research and planning happen in controlled phases. Execution happens against a verified plan. Verification occurs via automated feedback.

Source: Lavaee / Four Pillars
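
The separation of planning, execution, and verification can be sketched as a small loop. This is a hedged illustration, not the method of any cited source: `pev_loop`, its callback signatures, and the retry limit are hypothetical, and the agent and the checks are passed in as plain functions.

```python
from typing import Callable, List, Optional


def pev_loop(
    plan: Callable[[], List[str]],
    execute: Callable[[List[str], Optional[str]], None],
    verify: Callable[[], Optional[str]],
    max_attempts: int = 3,
) -> bool:
    """Plan once, then execute and verify until the checks pass.

    verify() returns None on success or error text on failure; the error
    text is handed to the next execute() call as feedback.
    """
    steps = plan()  # thinking happens once, before any typing
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        execute(steps, feedback)
        feedback = verify()
        if feedback is None:
            return True
    return False  # attempts exhausted: escalate to a human
```

Returning False instead of looping forever keeps the escalation path explicit: when automated feedback stops converging, the task goes back to a person.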