eleonorarocchi

Harness Engineering: The Most Important Part of AI Agents

TL;DR

LLMs don’t become agents because they’re more intelligent, but because we place them inside a system that makes them usable.
That system — which handles context, tools, errors, and flows — is the harness.
If an agent doesn’t work, the problem is most likely not the model, but everything you’ve built around it.

The agent isn’t broken. Your harness is.

In recent years, we’ve seen an impressive acceleration in the world of language models. Every month (every day!) something more powerful, more efficient, more “intelligent” comes out. And inevitably, the conversation always focuses there: which model to use, how many parameters it has, how well it performs on benchmarks.

But when you try to build something real, something interesting happens: the model stops being the main problem.

When you move from a demo to a system that actually has to work (with real users, messy data, unpredictable edge cases), you realize that the LLM alone isn’t enough. Not because it isn’t powerful enough, but because it isn’t designed to be reliable.

This is where what’s called harness engineering comes into play.

It’s not about the model, it’s about the system

There’s a concept that comes up often lately: agent = model + harness.

It sounds like a simplification, but it’s actually a very accurate description of what happens in practice.

The model generates text. The harness decides what that text means, what to do with it, when to trust it, and when not to.

It’s a subtle distinction, but it completely changes the way you design a system.

Because the moment you start building an agent, you are implicitly also building a way to manage context, call external tools, verify that the output makes sense, and recover when something goes wrong.

And none of that lives inside the model. It lives around the model.
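As a rough sketch, that "around the model" layer can start as small as a loop. Everything here is illustrative: `fake_model` stands in for an LLM call, and the tool registry and retry message are made-up names, not a real framework's API.

```python
import json

def fake_model(prompt: str) -> str:
    # Stand-in for an LLM call: returns a JSON "action" as plain text.
    return '{"tool": "add", "args": {"a": 2, "b": 3}}'

# A toy tool registry the harness can dispatch to.
TOOLS = {"add": lambda a, b: a + b}

def run_agent(prompt: str, max_attempts: int = 3):
    for _ in range(max_attempts):
        raw = fake_model(prompt)              # 1. the model generates text
        try:
            action = json.loads(raw)          # 2. the harness interprets it
            tool = TOOLS[action["tool"]]      # 3. the harness picks the tool
            return tool(**action["args"])     # 4. the harness executes
        except (json.JSONDecodeError, KeyError, TypeError):
            # 5. the harness recovers: feed the failure back and retry
            prompt += "\nPrevious output was invalid; reply with valid JSON."
    raise RuntimeError("agent failed after retries")
```

Context handling, tool calls, validation, and recovery all live in `run_agent`, not in `fake_model`, which is exactly the point.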

The “strange” behavior of LLMs

Anyone who has worked even a little with these systems has already seen the problem.

Same prompt, same input, two different outputs.
Or: it works perfectly for ten requests, and then fails on something trivial.

That’s not a bug. It’s the nature of the model.

LLMs are not deterministic, and they were never designed to be 100% reliable. They are excellent at generalizing, and much weaker at guaranteeing consistency.

And this is where the developer’s role changes.

You’re no longer writing code that does things.
You’re building a system that manages an unreliable component.

And that system is the harness.

From prompt engineering to system design

For a long time, we treated these problems as an extension of prompt engineering.

“Let’s write a better prompt.”

That works, up to a point.

Then you start adding automatic retries, structured parsing, output validation, memory between steps.

And without realizing it, you’re no longer working on a prompt — you’re designing a system.

This is probably the most important transition: moving from thinking in terms of input/output to thinking in terms of flows, states, and controls.
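The pieces named above, structured parsing, output validation, and memory between steps, can be sketched as one small system. The schema (`summary`/`next_step`) and the step protocol are invented for illustration; retries are omitted to keep the sketch short.

```python
import json

REQUIRED_KEYS = {"summary", "next_step"}

def parse_step(raw: str) -> dict:
    data = json.loads(raw)                    # structured parsing
    missing = REQUIRED_KEYS - data.keys()     # output validation
    if missing:
        raise ValueError(f"model output missing keys: {missing}")
    return data

def run_pipeline(model, task: str, max_steps: int = 5) -> list:
    memory = []                               # state carried between steps
    for _ in range(max_steps):
        step = parse_step(model(task, memory))
        memory.append(step["summary"])
        if step["next_step"] == "done":       # flow control, not a prompt
            break
    return memory
```

Notice that none of this is a prompt: it's flows, states, and controls, with the model reduced to one function call inside the loop.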

The harness as a translator between model and reality

A useful way to think about the harness is as a translation layer.

On one side, the model: operating in natural language, probabilistic, flexible.

On the other side, the real world: APIs that break, incomplete data, rigid formats, irreversible actions.

The harness sits in between and acts as a mediator.

It takes something “soft” (the model’s text) and turns it into something “hard” (concrete actions).
And it also does the reverse: it takes structured signals and makes them usable for the model.
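Both directions of that translation can be sketched in a few lines. The `DELETE_FILE` command format and function names here are hypothetical, chosen only to show the soft/hard boundary:

```python
import re

def text_to_action(model_text: str):
    # "Hard" direction: soft model text only becomes a concrete action
    # if it matches a strict pattern; anything else bounces back.
    m = re.fullmatch(r"DELETE_FILE\((\w+\.txt)\)", model_text.strip())
    if not m:
        return ("ask_again", "Reply exactly as DELETE_FILE(<name>.txt)")
    return ("delete", m.group(1))

def error_to_prompt(exc: Exception) -> str:
    # "Soft" direction: a structured failure becomes natural-language
    # context the model can reason about on the next turn.
    return f"The last action failed with: {exc}. Suggest a different action."
```

The strict pattern matters because the action on the other side (deleting a file) is irreversible; the harness, not the model, decides when text is allowed to become an action.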

Why two agents using the same model behave differently

Let’s say we have two applications using the exact same LLM and getting completely different results.

At first, it seems strange.

But looking closer, you realize the difference isn’t in the model. It’s in everything around it:

  • How context is managed
  • When tools are called
  • What happens when something fails

In other words: the harness.
