TL;DR
- An agent doesn’t truly work because of the model, but because of the harness controlling it.
- Moving from demo to production requires handling errors, state, memory, and observability.
- A well-designed harness reduces model unpredictability and shifts complexity into code, making the system reliable and usable in real-world scenarios.
In the article *Harness Engineering: The Most Important Part of AI Agents* we saw a fundamental point: the problem with agents isn't (only) the model, but the system around it.
But what does it really mean to build that system?
The moment everything breaks
There's a fairly universal phase: you've built a demo and it works well. The model responds, uses a tool, maybe even completes multi-step tasks. Everything looks promising.
Then you try to use it in a real-world context, and the problems emerge:
- invalid outputs
- incorrect API calls
- infinite loops
- loss of context
It's not that the model got worse; the system's complexity increased without a harness solid enough to manage it.
The harness as a control system
It becomes clear that the harness isn't just a "container". It's closer to a control system, designed to guide the model along a precise path, reducing its freedom when necessary and allowing it when useful.
This is a delicate balance: too much control means loss of flexibility; too little control means loss of reliability.
And this is where the real design work begins.
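One way to make that balance concrete is a gate between the model's proposals and their execution. The sketch below is illustrative: `ALLOWED_ACTIONS`, `harness_step`, and the `strict` flag are all hypothetical names, not part of any real framework.

```python
# The harness, not the model, decides which actions are available.
ALLOWED_ACTIONS = {"search", "summarize"}


def harness_step(proposed_action: str, strict: bool) -> str:
    """Guide the model along a precise path: reject unknown actions when
    strict, let them through when flexibility is more useful."""
    if proposed_action in ALLOWED_ACTIONS:
        return proposed_action
    if strict:
        raise ValueError(f"action {proposed_action!r} rejected by harness")
    return proposed_action  # flexible mode: the model may try something new


harness_step("search", strict=True)  # an allowed action passes through
```

The dial here is a single boolean for clarity; in a real system it would likely be per-task policy: tight control on irreversible actions, looser control on exploratory ones.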
Error handling becomes the main case
In traditional software, errors are edge cases. Anyone with experience in agent-based systems knows that errors are the norm.
The key idea, however, is that a well-designed harness does not assume everything will go well; quite the opposite, it is built on the assumption that something will fail.
It therefore introduces mechanisms such as:
- validating outputs before using them
- retrying when something goes wrong
- falling back to alternative paths
- controlled interruption of loops
This is what makes the system usable.
State and memory: the invisible problem
Another issue that emerges very early is state management. An agent without memory is little more than a stateless function, but adding memory introduces complexity:
- what to store
- for how long
- how to update the state
- what happens when it becomes inconsistent
These decisions must be made when structuring the harness.
And it's precisely here that many subtle bugs tend to arise.
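To see where those decisions live in code, here is a deliberately small memory store. `MemoryStore` and its TTL policy are assumptions for illustration; real agents might use a vector store, a database, or conversation summaries instead.

```python
import time


class MemoryStore:
    """Illustrative agent memory: each design question gets an explicit answer.
    What to store: key/value entries. For how long: a TTL per store."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._entries: dict[str, tuple[float, object]] = {}

    def put(self, key: str, value: object) -> None:
        # How to update the state: an explicit, timestamped write.
        self._entries[key] = (time.monotonic(), value)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            # What happens when it becomes inconsistent: expired entries
            # are dropped rather than served as stale state.
            del self._entries[key]
            return None
        return value


memory = MemoryStore(ttl_seconds=60.0)
memory.put("user_goal", "book a flight")
```

The point isn't this particular structure; it's that every one of those questions gets answered somewhere, and if you don't answer them in the harness, they get answered by accident.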
Observability: knowing what's happening
When something goes wrong (and sooner or later it will), the important question is:
"Can I understand what happened?"
Without logging and tracing, working with agents becomes almost impossible, because you need to see:
- every step of the reasoning
- every tool call
- every output transformation
And not just for debugging, but to evolve the system.
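A lightweight way to capture all three is to wrap each step in a tracing decorator. The sketch below is one possible approach, using only the standard library; `traced`, `trace_log`, and the `lookup` tool are hypothetical names.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("agent")

# In-memory trace of every step, so a run can be replayed and inspected.
trace_log: list[dict] = []


def traced(step_name: str):
    """Record each step's name, inputs, and output before returning it."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            trace_log.append({"step": step_name, "args": args, "result": result})
            log.info("step=%s args=%s result=%s", step_name, args, result)
            return result
        return wrapper
    return decorator


@traced("tool:lookup")
def lookup(city: str) -> str:
    return f"weather in {city}: sunny"  # stand-in for a real tool call


lookup("Rome")
```

Once every reasoning step, tool call, and output transformation lands in the trace, "can I understand what happened?" becomes a query over data instead of guesswork.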
Moving complexity to the right place
An interesting aspect is that, as you improve the harness, the system becomes more predictable—even without changing the model.
This happens because complexity is being moved out of an "opaque" component (the model) and into code that can actually be controlled.
It's a shift in strategy:
- less blind trust in the model
- more explicit control in the system
Which, ultimately, is software engineering.
In fact, we can say that building agents today is much closer to traditional software engineering than it might seem.
There are flows, states, error handling, integrations, observability…
The only difference is that instead of deterministic functions, there's a probabilistic model.
The harness is what holds everything together, and it's what makes the difference between something that only works in a demo and something that truly works in production.