DEV Community

Mehmet TURAÇ

Harness Engineering: The Code Around the Model Is the Hard Part

Everyone benchmarks the model. Almost nobody benchmarks the harness — the loop, the tool dispatch, the context manager, the retry logic that wraps a raw inference call and turns it into something that can run unattended against production. In my experience building agentic platforms, swapping the model is a config change you ship in an afternoon. The harness is where the months go, and it's where reliability is actually won or lost.

This is the part that doesn't show up in demos. A demo agent calls a tool, gets a clean result, and prints a tidy answer. A production agent calls a tool that times out, gets a 200 with a malformed body, hits a rate limit on retry, and now has to decide whether to keep going or give up — all while staying inside a token budget and not corrupting anything downstream. The model doesn't solve that. The harness does.

The harness is the product

When people say "we built an agent," they usually mean they wrote a prompt and a tool schema. That's the easy 20%. The other 80% is the scaffolding that decides when to call the model, what to put in front of it, whether to trust what comes back, and what to do when something fails. That scaffolding is the harness, and it's where your engineering judgment lives.

The useful mental model: the LLM is a single, expensive, non-deterministic function call. Everything that makes that call safe, bounded, observable, and repeatable is your code. Treat the model as a component you don't control and the harness as the system you do, and most architecture decisions get clearer.

Anatomy of a harness

Strip away the framework branding and every agent harness has the same moving parts:

  • A control loop that runs steps until the task is done, a stop condition fires, or a budget is exhausted.
  • A context manager that assembles the prompt each step — system instructions, relevant history, tool specs — and decides what to drop when it won't fit.
  • A model call wrapped in its own timeout and retry policy.
  • A parse-and-validate stage that turns model output into a typed, checked action before anything acts on it.
  • A tool dispatcher that executes the chosen action with its own timeouts, retries, and idempotency handling.
  • Guardrails that gate side effects — allow-lists, argument validation, rate limits.
  • Observability that records every step as structured data.

Frameworks give you defaults for these. The defaults are fine for prototypes and quietly wrong for production, because the right policy is domain-specific. How many steps before you bail? What's a retryable tool error versus a fatal one? What do you drop from context first? Nobody can answer those for you.
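One way to make those policy questions explicit is to pull them out of the loop and into a small config object. The sketch below is a minimal illustration, not a framework API: `HarnessPolicy`, `LoopOutcome`, `run_loop`, and every default value are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HarnessPolicy:
    """Hypothetical policy knobs; the right values are domain-specific."""
    max_steps: int = 12
    token_budget: int = 50_000
    retryable_errors: frozenset = frozenset({"timeout", "rate_limited"})

    def is_retryable(self, error_code: str) -> bool:
        # Anything not listed is treated as fatal.
        return error_code in self.retryable_errors


@dataclass
class LoopOutcome:
    reason: str        # "done", "step_ceiling", or "budget_exhausted"
    steps: int
    tokens_spent: int


def run_loop(step_fn: Callable[[int], tuple], policy: HarnessPolicy) -> LoopOutcome:
    """Run step_fn until it reports done or a policy limit fires.

    step_fn(step_no) returns (done, tokens_used) for that step.
    """
    spent = 0
    for step_no in range(1, policy.max_steps + 1):
        done, used = step_fn(step_no)
        spent += used
        if done:
            return LoopOutcome("done", step_no, spent)
        if spent >= policy.token_budget:
            return LoopOutcome("budget_exhausted", step_no, spent)
    return LoopOutcome("step_ceiling", policy.max_steps, spent)
```

Returning a reason instead of raising means the caller can tell a finished task from one that hit a ceiling, and can alert on the second case.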

Tool calls are an untrusted boundary

The single most common production failure I see is treating model output as if it were already valid. The model proposes a tool call; the harness executes it verbatim. Then one day the model emits an argument that's subtly out of range, or invents a tool name, or returns JSON with a trailing comment, and the dispatcher happily forwards garbage into a system that does real things.

A tool call from the model is a proposal, not an instruction. Validate it like input from an untrusted client, because that's exactly what it is.

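# Sketch of one loop iteration. AgentState, Tool, StepResult, ValidationError,
# call_model, dispatch, policy, and trace are assumed to be defined elsewhere.
def step(state: AgentState, tools: dict[str, Tool]) -> StepResult: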
    # 1. Assemble context within budget — drop oldest observations first
    prompt = state.context.render(token_budget=state.remaining_tokens())

    # 2. Model call is fallible: its own timeout + bounded retry
    completion = call_model(prompt, timeout_s=30, max_retries=2)
    state.spend(completion.usage)

    proposal = completion.tool_call
    if proposal is None:
        return StepResult(done=True, answer=completion.text)

    # 3. Validate the proposal BEFORE anything acts on it
    tool = tools.get(proposal.name)
    if tool is None:
        # Don't crash — feed the error back so the model can recover
        state.context.add_observation(f"error: unknown tool '{proposal.name}'")
        return StepResult(done=False)

    try:
        args = tool.schema.validate(proposal.arguments)
    except ValidationError as e:
        state.context.add_observation(f"error: invalid args: {e}")
        return StepResult(done=False)

    # 4. Guardrail: side effects must pass policy
    if tool.has_side_effects and not policy.allows(tool, args, state):
        state.context.add_observation("error: action blocked by policy")
        return StepResult(done=False)

    # 5. Dispatch with the tool's own failure handling
    observation = dispatch(tool, args, timeout_s=tool.timeout, retries=tool.retries)
    state.context.add_observation(observation)

    trace.emit(step=state.step_no, tool=tool.name, usage=completion.usage,
               latency_ms=observation.latency_ms, outcome=observation.status)
    return StepResult(done=False)

Notice what the failure paths do: they don't raise. A bad tool name, invalid arguments, or a blocked action all become observations fed back into context. The model gets to see its mistake and try again. This single pattern — turning harness-level errors into model-visible feedback — is the difference between an agent that recovers and one that dies on the first imperfect output.
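The same idea applies one layer earlier, when the model's output arrives as raw text and has to be parsed before there is any proposal to validate. Here is a minimal standard-library sketch, assuming the model emits a JSON object with `name` and `arguments` keys. `TOOL_SPECS` and `parse_proposal` are hypothetical, and a real harness would more likely use a schema library than this hand-rolled check.

```python
import json
from typing import Optional

# Hypothetical tool registry: tool name -> {argument name: (type, required)}
TOOL_SPECS = {
    "search": {"query": (str, True), "limit": (int, False)},
}


def parse_proposal(raw: str) -> tuple:
    """Return (proposal, None) if valid, else (None, error text to show the model)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"error: output was not valid JSON: {e.msg}"
    if not isinstance(data, dict):
        return None, "error: expected a JSON object"

    name = data.get("name")
    spec: Optional[dict] = TOOL_SPECS.get(name) if isinstance(name, str) else None
    if spec is None:
        return None, f"error: unknown tool {name!r}"

    args = data.get("arguments", {})
    if not isinstance(args, dict):
        return None, "error: 'arguments' must be an object"
    unknown = sorted(set(args) - set(spec))
    if unknown:
        return None, f"error: unexpected arguments {unknown}"

    for key, (typ, required) in spec.items():
        if key not in args:
            if required:
                return None, f"error: missing required argument '{key}'"
            continue
        value = args[key]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        if not isinstance(value, typ) or (typ is int and isinstance(value, bool)):
            return None, f"error: argument '{key}' must be {typ.__name__}"
    return {"name": name, "arguments": args}, None
```

Every rejection comes back as a string, so the caller can add it to context as an observation instead of letting an exception end the run.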

Context is a budget, not a buffer

The naive harness appends everything to a growing transcript and passes it back every step. This works until it doesn't: you blow the context window, latency climbs with every step, cumulative token cost grows roughly quadratically with step count because each step re-sends everything before it, and the model's attention degrades as the relevant signal drowns in old tool dumps.

Context is a budget you spend deliberately each step. That means making active decisions: which prior observations still matter, which can be summarized, which can be dropped entirely. A 40KB API response that mattered three steps ago is now dead weight — keep a one-line summary of what it told you and discard the body. The control loop's job isn't to remember everything; it's to keep the useful state in front of the model and evict the rest. Get this wrong and a task that should take eight steps either runs out of window at step twelve or costs five times what it should.
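A minimal sketch of that eviction policy, assuming each observation is stored with a one-line summary written at the time it was recorded. The `Context` and `Observation` classes are hypothetical, and `estimate_tokens` is a crude stand-in (roughly four characters per token) for a real tokenizer.

```python
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    # Crude assumption: ~4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


@dataclass
class Observation:
    body: str
    summary: str  # one-line digest kept after the body is evicted


@dataclass
class Context:
    system: str
    observations: list = field(default_factory=list)

    def render(self, token_budget: int, keep_full: int = 2) -> str:
        """Render the newest `keep_full` observations verbatim and older ones
        as summaries, then drop the oldest entries until the prompt fits."""
        cut = max(0, len(self.observations) - keep_full)
        parts = [o.summary for o in self.observations[:cut]]
        parts += [o.body for o in self.observations[cut:]]
        while parts and estimate_tokens("\n".join([self.system, *parts])) > token_budget:
            parts.pop(0)  # evict the oldest entry first
        return "\n".join([self.system, *parts])
```

Oldest-first is only one possible order. A harness that knows which observations the current subtask depends on could rank by relevance instead.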

Plan for failure, because it's the default

In a system where one step is a network call to a probabilistic model and the next is a network call to a flaky third-party API, failure isn't the exception — it's the steady state. The harness has to assume every external call can time out, return malformed data, or partially succeed.

The parts that earn their keep here are unglamorous: timeouts on every external call (model included), bounded retries with backoff, idempotency keys on any tool that mutates state so a retry doesn't double-charge or double-send, and a hard step ceiling so a confused agent can't loop forever burning tokens. None of this is novel — it's the same distributed-systems discipline we've applied for two decades. What's new is that one of the unreliable components is now the decision-maker itself, which means a retry can produce a different decision. Your harness has to be correct under that, not just under transient errors.
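A minimal sketch of bounded retry with exponential backoff and a stable idempotency key. `RetryableError` and `call_with_retry` are hypothetical names, and the sketch assumes the downstream service deduplicates on the key it receives.

```python
import time
import uuid
from typing import Callable


class RetryableError(Exception):
    """Hypothetical marker for transient failures (timeouts, rate limits)."""


def call_with_retry(fn: Callable[[str], object], max_retries: int = 3,
                    base_delay_s: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> object:
    """Call fn(idempotency_key), retrying transient failures with backoff.

    The same key is sent on every attempt, so the receiving service can
    deduplicate a mutation that succeeded but whose response was lost.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    key = str(uuid.uuid4())
    for attempt in range(max_retries + 1):
        try:
            return fn(key)
        except RetryableError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure to the caller
            sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```

The key is generated once per decided action, after the arguments are validated. Re-asking the model is a new decision and should get a new key, while retrying the same tool call should not.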

You can't fix what you can't see

A non-deterministic system you can't replay is a system you can't debug. When an agent does something wrong in production — picks the wrong tool, loops, gives up early — "it worked on my machine" is meaningless, because your machine got a different sample.

So every step has to emit structured data: the assembled context, the model's decision, the tool called, the arguments, latency, token usage, and outcome. Not log lines you grep — structured spans you can query, aggregate, and replay. With that, "the agent failed" becomes "at step 7 it called the search tool with an empty query because the previous observation got evicted from context," which is an actual bug with an actual fix. Without it, you're tuning prompts by superstition. Token and cost accounting belong in the same trace, because on a long-running agent they're a production concern, not a billing footnote.
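As a rough illustration of what one such record could look like, here is a sketch that writes each step as a single JSON line and runs a simple aggregate over the result. The `StepSpan` fields, `digest`, `emit`, and `tokens_by_run` are hypothetical, and storing a hash of the prompt (with the full text kept elsewhere) is an assumption made to keep the records small.

```python
import hashlib
import json
import time
from dataclasses import dataclass, asdict, field
from typing import IO, Iterable, Optional


@dataclass
class StepSpan:
    run_id: str
    step_no: int
    tool: Optional[str]       # None when the model answered without a tool
    arguments: dict
    outcome: str              # e.g. "ok", "invalid_args", "blocked", "timeout"
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    context_digest: str       # hash of the assembled prompt, for replay lookup
    ts: float = field(default_factory=time.time)


def digest(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]


def emit(span: StepSpan, sink: IO[str]) -> None:
    """Write one span as a single JSON line."""
    sink.write(json.dumps(asdict(span), sort_keys=True) + "\n")


def tokens_by_run(lines: Iterable[str]) -> dict:
    """Example query: total token spend per run."""
    totals: dict = {}
    for line in lines:
        rec = json.loads(line)
        spent = rec["prompt_tokens"] + rec["completion_tokens"]
        totals[rec["run_id"]] = totals.get(rec["run_id"], 0) + spent
    return totals
```

Because each record is a flat object with fixed keys, a question like "which runs called search with an empty query" becomes a filter over the records.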

The takeaway

The model gets the headlines and the harness gets the pager. As base models keep improving, the differentiator between an agent that demos well and one that survives contact with production won't be which model you picked — it'll be the engineering quality of the code wrapped around it: how it validates, how it budgets context, how it fails, and how observable it is.

So here's what I keep coming back to: if you swapped your agent's underlying model tomorrow, how much of your reliability would survive the change — and how much was the harness carrying all along?


Runnable, tested example: https://github.com/mturac/harness-demo
