TL;DR — Harness engineering is one thing: getting reliable judgment out of a reasoner you didn't train — given an instruction, bounded by a spec that can override that instruction, and verified before the result ships. It is not "the wrapper around the model." It's a property of code, not a place in your stack, and it lives on both sides of every tool call. The generic scaffolding — loops, dispatch, memory — melts into the model a little more every quarter. What survives is the part that stays external to the model: the specification of what the agent may do, and the verification that it did. That's the discipline. The rest is plumbing.
I've written before that Agent = Model × Harness, and that the harness is the half you actually engineer. I still believe the formula. But stop there and it oversells two things — so this is me correcting my own framing, precisely, with code.
The two things the formula gets wrong
The × oversells separability. Multiplication implies the factors are independent. They aren't. A better model doesn't leave your harness untouched — it dissolves parts of it. Chain-of-thought prompting became reasoning models. ReAct-style tool loops became native tool use. Half your RAG plumbing is being quietly absorbed by longer context and better-trained retrieval. Every model generation collapses a layer of harness the previous one needed.
"Harness absorbs software engineering" oversells the annexation. It doesn't absorb it. It partly consumes it — pulls in the slice that encodes the agent's behavior — while the deterministic spine (payments, ledgers, auth) and the dumb tools stay exactly what they were. Relocation with an annexation at the seam. Not a takeover.
Both corrections point at the same uncomfortable question: if the model eats the scaffold every quarter, isn't the harness the melting part — the thing you'd be foolish to bet a career on?
What melts, and what doesn't
Here's the resolution, and it's the whole point.
The harness mechanism melts. The loop, the dispatch, the retry, the memory-stitching — that's the part the model and the frameworks commoditize, and they should.
What doesn't melt is the part that's external to the model:
The model can get arbitrarily good at running an agent and still not know what your agent is for.
Your refund policy isn't in the weights and never will be. Your escalation thresholds, your risk tolerance, what this agent must never do regardless of how it's asked — that's private, contextual, and changing. It's a specification, and specifications live outside the thing they govern. Same for evaluation: a model can't be its own measurement layer without circularity — the thing being judged can't be trusted to judge itself.
So the durable core of harness engineering was never the machinery. It's specification and verification — defining the envelope of allowed behavior, and proving the agent stayed inside it. Both survive every model upgrade for the same reason: they're external to the model, one as constraint, one as adversary.
The boundary has no fixed address
The tempting picture is tidy: the harness wraps the model, and the tools sit outside it. That picture does not survive contact with agent-facing APIs.
Ask where the refund cap actually lives. In the agent's orchestration guardrails? Or in the refund endpoint that refuses the call? The honest answer is both — and the load-bearing copy is the one in the API, because you never trust a probabilistic reasoner to self-limit. Defense in depth pushes the hard constraints down into the tool. Which means the agent-facing endpoint is doing harness-grade work — specifying and constraining agency — from the far side of the boundary you thought separated harness from tool.
So harness engineering isn't a location in the stack. It's a property:
Does this code exist to bend a fallible reasoner toward a spec?
Run that test and the boundary stops mattering. The model-facing side — the code that talks to the model, elicits its judgment, constrains it — passes wholesale; that's its entire job. The service side passes selectively: a plain database or a human-facing endpoint is just a tool, but an endpoint purpose-built for an agent — idempotent, dry-runnable, policy-enforcing, with errors a model can parse and recover from — is harness. The Stripe-for-agents dev is doing harness engineering. The Stripe-for-humans dev whose endpoint an agent happens to call is not. Membership by function, not address.
And the model itself? Never. It's exogenous — the × you don't own.
What it looks like in code
Here's the whole discipline as one function. The comment headers are the taxonomy:
def handle_refund(instruction, ctx):
# ── MODEL-FACING (agent side): elicit judgment from the non-deterministic reasoner
decision = model.decide(
instruction=instruction, # per-task: "refund this customer"
policy=ctx.policy_summary, # standing spec, handed to the model
account=ctx.account,
) # -> { action: "refund", amount: 52000, reason: "..." }
# ── THE ENVELOPE: a spec that OUTRANKS the instruction.
# This is CODE, not a prompt. You can argue with a prompt.
# You cannot argue with an if-statement. That line is the discipline.
if decision.action == "refund" and decision.amount > ctx.policy.auto_approve_cap:
return escalate(decision, reason="over auto-approve cap")
# ── RUNTIME EVAL (inner loop): verify the judgment BEFORE the side effect commits
check = evals.verify(decision, ctx.policy)
if not check.passed:
log_failure(check) # outer loop: feeds harness re-tuning across versions
return escalate(decision, reason=check.violation)
# ── ENVIRONMENT-FACING (service side): agent-optimized tool, idempotent on purpose
return refund_api.execute(
amount=decision.amount,
idempotency_key=ctx.request_id, # because the model might retry
)
(model.decide, evals.verify, and refund_api.execute stand in for the real implementations.)
The teaching is in two lines.
The envelope if is what makes this harness engineering and not prompt engineering. If the cap lived in the prompt, the model could be cajoled past it — flattered, confused, jailbroken. In code, it can't. That single branch is "reliable judgment given an instruction, bounded by a spec that can override it" — compiled.
And evals.verify before execute is the judgment getting checked while it's still cheap to be wrong.
Now notice what's absent: the model itself (exogenous), and any human-facing refund UI (a tool, not harness). Run the membership test down every line — does this exist to bend a fallible reasoner to a spec? — and everything that remains is the harness.
The hard part isn't obedience — it's refusal
The naive read of "reliable behavior" is does the agent do what it's told. That's the easy half, and it's the half models are getting better at on their own.
The durable, hard half is the opposite:
Reliable judgment includes the agent refusing when the instruction would breach the spec.
The refund agent's hardest moment isn't executing "refund this customer." It's declining "refund $50,000" even though it was told to. That refusal doesn't come from the instruction — it comes from the envelope that outranks it. And making a probabilistic reasoner reliably disobey at exactly the right boundary, every time, is the whole safety problem in miniature. It's also the failure that doesn't surface until it's expensive — which is exactly why you verify it in code rather than hope for it in a prompt.
Evals are the feedback — and there are two of them
A discipline with a measured error signal you can drive down is engineering. Without it, it's prompt-craft. Evals are what close the loop — and there are two loops, nested at different timescales.
The inner loop is the runtime check that catches the bad judgment before it commits — the evals.verify above. It feeds the agent, mid-task. It's how the envelope gets enforced, and it saves the money now.
The outer loop is the offline suite that tells you "this agent breaches policy 3% of the time." It feeds you, across versions, and re-tunes the harness over weeks.
Same discipline, two controllers. One keeps a single decision inside the lines; the other keeps the system improving.
The whole thing, assembled
Step back and it's a control system, and every piece has a name:
- The model is the plant — powerful, non-deterministic, the thing you're controlling.
- The spec (instruction + overriding envelope) is the setpoint.
- The harness is the controller.
- The evals are the feedback.
That's harness engineering, fully factored.
The bet
Here's the part that reconciles "bet on the harness" with "the harness melts," because both are true.
Every individual harness melts. Each one has the model's shadow baked into it — its specific failure modes, its current reliability — and gets rewritten when the curve moves: fewer guardrails as the model gets dependable, different ones as its failure modes shift. The artifact is disposable.
The discipline doesn't melt. Something must always specify what an autonomous reasoner may do, and prove it stayed inside that — and that something sits external to the model, which is exactly why no model generation can absorb it.
The harness melts. Harness engineering doesn't.
So the bet isn't "build the loop" — the model ships more of the loop every quarter. The bet is: be the person who decides what the agent is allowed to want, and can prove it stayed there. Mechanism commoditizes. Specification and verification don't.
And the surface is only getting wider, because the consumer of software is changing — your users are becoming agents too. Every system that exposes itself to an autonomous reasoner now needs a face built for one. That face is harness work, wherever it sits.
The model half is being handed to all of us for free, and it gets better every quarter. Whether your agent — your product, your role — gets better is a question about the half you engineer: the judgment you specify, and the verification that holds it true.
That's where I'd place the bet. It just doesn't have a fixed address.
Top comments (0)