Saurav Bhattacharya

Posted on Jun 20

Agent = Model × Harness: Your Eval Layer Is Part of the Agent, Not a Tool Beside It

#ai #agents #observability #evaluation

There's a formula I keep coming back to when people ask why their slick demo agent falls apart in production:

Agent = Model × Harness

The model is the raw reasoning — Claude, GPT, whatever. It's swappable, and it's getting better on a curve you don't control. The harness is everything else: the goals, the loops, the tools, the scheduler, the retry logic. Most of the engineering that matters lives in the harness, not the model.

But here's the part most teams get wrong. They define the harness as the plumbing to run the model — goals + loops + tools — and then bolt evals and observability on the side as external QA. Things you point at the agent, after the fact, from outside.

That's the mistake. Your eval layer and your trace layer are inside the harness. They're not tools beside the agent; they're the half of the agent that makes it a closed loop instead of an open one.

Harness = goals + loops + tools + lens + evals. The first three let the agent act. The last two let it know whether the action was any good — which is the only thing that turns an agent that runs into an agent that improves.

Let me make that concrete, because I learned it the unglamorous way: by watching a scheduled agent crash and discovering exactly which parts of my harness saved me.

Open loop vs closed loop

An open-loop agent acts and moves on. It writes the file, hits the API, ships the commit — and nobody, including the agent itself, knows if the outcome was correct. You find out when a human notices something's broken. Most "autonomous" agents in the wild are open-loop. They're impressive right up until they silently do the wrong thing for three days.

A closed-loop agent has a feedback path:

act → observe → evaluate → correct → act better

The two pieces that close it are the lens and the evals. The lens is observability: every model call and tool step, with resolved inputs and raw outputs, so a run is never a black box. The evals are judgment: the output scored against a standard, whether by deterministic checks, contract validation, or model-as-judge. Lens tells you what the agent did; evals tell you whether it was good. You need both, wired into the harness itself, not run by hand once a week by a human who remembers to check.
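To make the loop concrete, here is a minimal Python sketch of act → observe → evaluate → correct. Everything in it is an assumption for illustration: `ClosedLoop`, `Step`, `fake_model`, and `is_json` are hypothetical names, the "model" is a stub function, and the eval is a single deterministic check. It is not how any particular framework wires this up.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One lens record: what went in, what came out, and how it scored."""
    attempt: int
    inputs: str
    output: str
    passed: bool


@dataclass
class ClosedLoop:
    act: Callable[[str], str]            # the model or tool call (hypothetical)
    evaluate: Callable[[str], bool]      # a deterministic check (hypothetical)
    correct: Callable[[str, str], str]   # builds the next input from a rejected output
    trace: List[Step] = field(default_factory=list)

    def run(self, task: str, max_attempts: int = 3) -> str:
        inputs = task
        for attempt in range(1, max_attempts + 1):
            output = self.act(inputs)                                  # act
            passed = self.evaluate(output)                             # evaluate
            self.trace.append(Step(attempt, inputs, output, passed))   # observe
            if passed:
                return output
            inputs = self.correct(task, output)                        # correct
        raise RuntimeError(f"no passing output after {max_attempts} attempts")


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call: it only returns JSON when told to.
    return '{"status": "ok"}' if "valid JSON" in prompt else "Sure! status is ok"


def is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


loop = ClosedLoop(
    act=fake_model,
    evaluate=is_json,
    correct=lambda task, bad: f"{task}\nReturn valid JSON only. Rejected output: {bad!r}",
)
result = loop.run("Report the service status.")
```

The trace is written on every attempt, pass or fail, so a rejected output is as inspectable as an accepted one. An open-loop version of the same agent would be `fake_model(task)` and nothing else.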

The incident that taught me this

I run a fleet of scheduled agents — background workers on cron, each doing a focused job on a timer, each in an isolated session the main process can't watch live.

One of them crashed mid-run.

In an open-loop setup, that's a silent disaster: the worker dies, produces nothing, leaves no trace. You discover it days later when the output that should exist doesn't. (If you've run background agents, you know this exact pain — the ones that "go dark" for a week before anyone notices.)

Here's what happened instead, because the lens and eval layers were part of the harness:

1. The lens meant the failure left a record. Every worker writes its transcript stub-first: the moment it starts, before doing any real work, it writes a record with Outcome: IN-PROGRESS. So even a worker that dies one second later has left a footprint. The crash wasn't invisible — there was a transcript sitting there, frozen mid-run, saying "I started and never finished."

2. The evals meant the failure got caught — loudly, and repeatedly. Worker transcripts are validated against a versioned contract. A stub stuck at IN-PROGRESS is fine in normal mode (it might still be running), but under the "finished" check it becomes an error:

# Normal: an unfinished stub is acceptable
agent-eval validate transcripts/

# Finished mode: a lingering IN-PROGRESS is a hard error
agent-eval validate transcripts/ --finished

So the dead run didn't just leave a record — it got scored as invalid, and kept showing up in the gate's invalid list every single run until someone dealt with it. The eval layer refused to let the failure quietly blend into the background.

That's the whole point. The lens made the failure observable; the evals made it un-ignorable. Neither is something I went and ran by hand — they're standing parts of the harness that turned a crash from a black hole into a tracked defect.
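The two mechanisms above can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not agent-eval's actual contract or implementation: the transcript layout (a text file with an `Outcome:` line), the `.txt` extension, and the functions `write_stub`, `finalize`, and `validate` are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List, Optional

IN_PROGRESS = "IN-PROGRESS"


def write_stub(path: Path, worker: str) -> None:
    """Leave a footprint before any real work happens."""
    path.write_text(f"Worker: {worker}\nOutcome: {IN_PROGRESS}\n")


def finalize(path: Path, outcome: str) -> None:
    """Replace the Outcome line once the worker knows how the run ended."""
    lines = [
        f"Outcome: {outcome}" if line.startswith("Outcome:") else line
        for line in path.read_text().splitlines()
    ]
    path.write_text("\n".join(lines) + "\n")


def read_outcome(path: Path) -> Optional[str]:
    for line in path.read_text().splitlines():
        if line.startswith("Outcome:"):
            return line.split(":", 1)[1].strip()
    return None


def validate(directory: Path, finished: bool = False) -> List[str]:
    """Return the names of invalid transcripts; an empty list means the gate passes."""
    invalid = []
    for path in sorted(directory.glob("*.txt")):
        outcome = read_outcome(path)
        if outcome is None:
            invalid.append(path.name)        # no Outcome line at all
        elif outcome == IN_PROGRESS and finished:
            invalid.append(path.name)        # lingering stub is an error in finished mode
    return invalid


with tempfile.TemporaryDirectory() as tmp:
    transcripts = Path(tmp)
    write_stub(transcripts / "worker-a.txt", "worker-a")
    finalize(transcripts / "worker-a.txt", "pass")           # a healthy run
    write_stub(transcripts / "worker-b.txt", "worker-b")     # "crashes" right after the stub
    normal_mode = validate(transcripts)
    finished_mode = validate(transcripts, finished=True)
```

In this sketch `normal_mode` is empty and `finished_mode` names only the abandoned stub. The design choice that matters is the ordering in the worker: the stub is written before the work starts, so a crash at any later point still leaves something for the gate to reject.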

Closing the correction step

Detecting the failure is observe + evaluate. The last piece is correct.

The naive fix is "add a finally block so the worker finalizes its own transcript on crash." That doesn't work, and the reason is instructive: a crashed isolated session is gone. The process that would run the finally is the thing that died. You can't ask a dead worker to clean up after itself.
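One way to see this is to hard-kill a worker that has a `finally` block and check whether the cleanup ran. The sketch below is a small experiment, not the author's setup: the child script, the `started` and `finalized` marker files, and `run_and_kill` are hypothetical. It covers only the hard-kill case. A `finally` block does still run for ordinary exceptions; the gap is when the process itself is terminated.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path

# A toy worker: writes a stub, "works", and tries to finalize in a finally block.
CHILD = """
import sys, time
from pathlib import Path
out = Path(sys.argv[1])
try:
    (out / "started").write_text("IN-PROGRESS")
    time.sleep(60)                              # the "real work"
finally:
    (out / "finalized").write_text("fail")      # not reached on a hard kill
"""


def run_and_kill(workdir: Path) -> bool:
    """Start the worker, hard-kill it mid-run, and report whether finally ran."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD, str(workdir)])
    deadline = time.monotonic() + 10
    while not (workdir / "started").exists():   # wait until the stub is on disk
        if time.monotonic() > deadline:
            proc.kill()
            proc.wait()
            raise TimeoutError("worker never started")
        time.sleep(0.05)
    proc.kill()    # SIGKILL on POSIX, TerminateProcess on Windows
    proc.wait()
    return (workdir / "finalized").exists()


with tempfile.TemporaryDirectory() as tmp:
    finally_ran = run_and_kill(Path(tmp))
```

The stub survives and the finalizer never runs, which is the same state the crashed worker left behind: a record that says the run started and nothing that says it ended.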

So the correction step has to live outside the worker, as its own small loop. I wrote a sweeper: a scheduled janitor that scans for transcripts stuck at IN-PROGRESS whose run is provably over — older than a threshold set safely past the longest possible run time, so it can never race a worker that's still legitimately going — and finalizes them to fail.

The logic is deliberately boring:

for each transcript stub still marked IN-PROGRESS:
    if age > MAX_RUN_TIME + safety_margin:   # the run is provably dead
        rewrite Outcome -> "fail (auto-finalized: abandoned mid-run)"

Idempotent, never deletes anything, runs on a timer. The first time it ran it finalized a backlog of long-dead stubs and dropped the gate's invalid count in finished mode to just the known, expected exceptions. Now it keeps the loop closed automatically: a worker can still crash, but its corpse gets cleaned up within the hour without me touching anything.
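A runnable Python version of that pseudocode might look like the sketch below. The specifics are assumptions: the `MAX_RUN_TIME` and `SAFETY_MARGIN` values, the `.txt` transcript layout with an `Outcome:` line, and the use of file modification time as the stub's age (which only holds if a worker writes the stub once at start and does not touch it again until it finalizes).

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional

MAX_RUN_TIME = 30 * 60       # assumed: longest legitimate run, in seconds
SAFETY_MARGIN = 15 * 60      # assumed: extra slack so a live worker is never raced
STUB_LINE = "Outcome: IN-PROGRESS"
ABANDONED = "fail (auto-finalized: abandoned mid-run)"


def sweep(directory: Path, now: Optional[float] = None) -> List[str]:
    """Finalize stubs whose run is provably over; return the names it rewrote."""
    now = time.time() if now is None else now
    finalized = []
    for path in sorted(directory.glob("*.txt")):
        text = path.read_text()
        if STUB_LINE not in text:
            continue                     # already finalized, so re-running is a no-op
        age = now - path.stat().st_mtime
        if age <= MAX_RUN_TIME + SAFETY_MARGIN:
            continue                     # might still be running: leave it alone
        path.write_text(text.replace(STUB_LINE, f"Outcome: {ABANDONED}", 1))
        finalized.append(path.name)
    return finalized


with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "dead.txt").write_text("Worker: dead\nOutcome: IN-PROGRESS\n")
    (d / "live.txt").write_text("Worker: live\nOutcome: IN-PROGRESS\n")
    two_hours_ago = time.time() - 2 * 60 * 60
    os.utime(d / "dead.txt", (two_hours_ago, two_hours_ago))   # age the dead stub
    first_pass = sweep(d)
    second_pass = sweep(d)
    dead_text = (d / "dead.txt").read_text()
    live_text = (d / "live.txt").read_text()
```

The first pass finalizes only the aged stub and the second pass changes nothing, which is the idempotence property. The sweeper only ever rewrites the Outcome line, so the rest of the transcript stays available for debugging.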

That's act → observe → evaluate → correct, fully wired. Observe and evaluate are the lens and eval layers doing their job directly, and correct works entirely on the records the lens wrote and the contract the evals enforce.

Self-correcting ≠ self-improving (and why that matters)

Here's where I have to be honest, because this is the part it's tempting to oversell.

What I built is a self-correcting harness. The loop closes: the system detects deviations from its own standard and repairs them without a human in the loop. That's real, and it's the floor you want under any fleet of autonomous agents.

But self-correcting is not self-improving. Self-correcting means the system holds the line — it keeps itself at the standard. Self-improving would mean the system moves the line up on its own: noticing a check has false positives and tightening its own rubric, noticing a worker keeps regressing and adjusting that worker's instructions. My harness doesn't do that. I still author every improvement. The loops run themselves; the loops were designed by a human.

And honestly — that's the right place to stop, for now. A harness that rewrites its own workers and its own eval criteria unsupervised is a different and much riskier thing, precisely because the evals would be grading work the same system authored. The moment the thing being judged and the thing doing the judging are the same closed system with no external anchor, "it passes its own evals" stops meaning very much. Self-correction within a human-set contract is the sound version. Self-modification of the contract itself is where you want hard guardrails and a human gate before you flip it on.

So the honest formula, fully expanded:

Agent = Model × Harness, where Harness = goals + loops + tools + lens + evals — and it's the lens + evals that make the agent self-correcting. Getting from there to self-improving is a real step further, and it's one you should take deliberately, with a human holding the gate.

The takeaway

If you're building agents and your evals and traces are something you run occasionally, from outside, you don't have a closed loop — you have an open-loop agent with some QA scripts nearby. The upgrade isn't more tooling. It's a reframe:

Lens and evals belong inside the harness, running on every action, not on demand.
The lens makes failures observable; the evals make them un-ignorable — that's what converts a crash from an invisible black hole into a tracked defect.
Closing the correction step often means a separate loop, because the thing that failed can't always clean up after itself.
Self-correcting is the floor; self-improving is a further, deliberate step — keep a human on the gate before the system grades its own homework.

Agent = Model × Harness. The model is improving on its own. Whether your agent improves is a question about your harness — and specifically about whether you've wired the loop closed.

The two layers in this post map to two tools I use as a single unit. agent-eval is the eval framework: it scores and gates an agent's output (contract validation, drift, hallucination, staleness), and the validate --finished check in the incident above is its work. AgentLens is the trace layer: it captures how the agent got there (every model and tool step, resolved inputs, raw outputs) so the eval signal is actually debuggable. agent-eval tells you something broke; AgentLens tells you why. They ship together on purpose, as two halves of one feedback loop, which is exactly the argument of this post.

Top comments (1)

Armorer Labs • Jun 21

This framing lands for me. The harness is where trust becomes enforceable.

The model can propose, but the harness decides what tools are available, what context is loaded, what evidence counts, what gets logged, and when the run is done. So evals that ignore the harness are evaluating a different system than the one users actually experience.

I would almost define an agent release as model + prompt/skills + tools + harness + receipt format.