Paradith

Posted on Jun 8

The Harness Is Now the Product

#ai #programming #webdev #software

Has AI engineering just changed (again)?

For the past year, most of us building with AI have been focused on one thing: getting better outputs from models.

We tweaked prompts.
We experimented with temperature.
We added examples, instructions, and formatting tricks. I mean, my god, I have never seen so many copy/pasta prompting articles in my life.

And for a while, that worked. But recently, something has shifted. Across very different companies like LangChain, Anthropic, Cognition, Arena, and Hugging Face, the same pattern is emerging. The real engineering work is no longer in the prompt. It has moved up the stack into what surrounds the model. That “something” is what many are now calling “the harness”. Despite the name, it is not a sequel to The Substance (or is this an early warning sign, movie buffs?).

What the harness actually is
At its core, the harness is not a library or a framework. It’s a control loop. Instead of treating the model as a one-shot function, the harness wraps it in a system that continuously manages its behaviour. A typical harness includes:

Task planning (what are we trying to do?)
Context assembly (what does the model need to know?)
Execution (call the model or tools)
Evaluation (was the result good enough?)
Safety and validation (is it allowed and correct?)
Observability (what happened, step by step?)

This is the key difference: the output is no longer just text. It’s also traces, metrics, and outcomes. The system doesn’t just produce an answer. It also finds out whether that answer was acceptable, and decides what to do if it wasn’t.
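
To make the control loop concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `run_harness`, `RunResult`, and the four callables are hypothetical names, the "model" is a canned stand-in rather than a real API call, and task planning is left out to keep it short. It is one possible shape, not how any of the companies above implement theirs.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunResult:
    answer: Optional[str]   # accepted output, or None if every attempt failed
    accepted: bool          # did any attempt pass validation and evaluation?
    trace: list = field(default_factory=list)  # one record per attempt


def run_harness(
    task: str,
    model: Callable[[str], str],
    assemble_context: Callable[[str, str], str],
    evaluate: Callable[[str, str], float],
    is_allowed: Callable[[str], bool],
    threshold: float = 0.8,
    max_steps: int = 3,
) -> RunResult:
    """Hypothetical control loop: assemble, execute, validate, evaluate, record."""
    trace = []
    feedback = ""
    for step in range(1, max_steps + 1):
        context = assemble_context(task, feedback)           # context assembly
        output = model(context)                              # execution
        allowed = is_allowed(output)                         # safety and validation
        score = evaluate(task, output) if allowed else 0.0   # evaluation
        trace.append({"step": step, "context": context, "output": output,
                      "allowed": allowed, "score": score})   # observability
        if allowed and score >= threshold:
            return RunResult(output, True, trace)
        feedback = f"Attempt {step} scored {score:.2f}; please improve."
    return RunResult(None, False, trace)


# Stand-in "model" (an assumption): a weak reply first, then a better one.
_replies = iter(["idk", "Paris is the capital of France."])
result = run_harness(
    task="What is the capital of France?",
    model=lambda context: next(_replies),
    assemble_context=lambda task, fb: f"{task}\n{fb}".strip(),
    evaluate=lambda task, out: 1.0 if "Paris" in out else 0.0,
    is_allowed=lambda out: len(out) < 200,
)
print(result.accepted, len(result.trace))
```

Note that the return value carries the trace alongside the answer, which is the point of the paragraph above: the caller gets the outcome and the record of how it was reached.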

Why this changes how we build

What’s happening here is subtle but important.
We’re no longer building AI features. We’re building AI systems that manage uncertainty. And that requires a different mindset. In traditional software, we assume determinism. If the code is correct, it behaves consistently.

With LLMs, that assumption breaks. The same input can produce different outputs. The model can hallucinate. It can partially complete tasks. It can fail in ways that look plausible. The harness exists to deal with that reality. It doesn’t try to make the model perfect. It builds a system that makes imperfect components reliable.

The clearest way to understand this change: we are moving from generating answers to controlling outcomes.

In the prompt era, success meant “Did the model give a good answer this time?”

In the harness era, success means “Does this system consistently produce good answers across many cases?”

That’s a fundamentally different problem. And it’s one software engineers are already very good at solving, just in other domains.

What engineers need to do differently
The shift to the harness isn’t about learning new tricks. It’s about applying familiar engineering discipline to a new kind of system.
First, we need to stop thinking in single prompts and start thinking in loops. Instead of asking once and accepting whatever comes back, we design flows where the system can evaluate its own output and try again if needed. Patterns like “generate, critique, refine” or “plan, execute, evaluate” become the norm.
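
A rough sketch of “generate, critique, refine” in Python, assuming three hypothetical callables. In a real system each of them would probably wrap a model call; here they are simple string functions so the loop itself is easy to follow.

```python
from typing import Callable


def refine_loop(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], list],
    refine: Callable[[str, list], str],
    max_rounds: int = 3,
):
    """Return (final_draft, remaining_issues, rounds_used)."""
    draft = generate(task)
    for round_no in range(1, max_rounds + 1):
        issues = critique(task, draft)
        if not issues:                  # the critic is satisfied: stop early
            return draft, [], round_no
        draft = refine(draft, issues)   # feed the critique back in
    # Out of rounds: report whatever issues are still open.
    return draft, critique(task, draft), max_rounds


# Stand-ins for model calls (assumptions for illustration only).
def fake_generate(task: str) -> str:
    return "use a cache"


def fake_critique(task: str, draft: str) -> list:
    issues = []
    if "because" not in draft:
        issues.append("add a justification")
    if not draft.endswith("."):
        issues.append("end with a period")
    return issues


def fake_refine(draft: str, issues: list) -> str:
    if "add a justification" in issues:
        draft += " because repeated lookups are slow"
    if "end with a period" in issues:
        draft += "."
    return draft


draft, issues, rounds = refine_loop(
    "How do we speed up lookups?", fake_generate, fake_critique, fake_refine
)
print(rounds, draft)
```

The `max_rounds` cap matters: without it, a critic that is never satisfied would keep the loop running indefinitely.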

Second, evaluation becomes a first-class concern. In traditional systems, tests validate behaviour. In AI systems, evaluations define behaviour. Without a way to measure output quality, you don’t really know what your system is doing. This means building datasets, defining scoring rules, and running continuous checks, not as an afterthought, but as part of the core system.
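
As a small illustration of “dataset plus scoring rule plus check”, here is a sketch in Python. The dataset, the `exact_match` scorer, the `run_eval` helper, and the 0.8 threshold are all made up for the example; real evals usually need larger datasets and fuzzier scoring.

```python
from typing import Callable

# A tiny hand-written evaluation set (illustrative data only).
EVAL_SET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "10 / 2", "expected": "5"},
    {"input": "3 * 3", "expected": "9"},
]


def exact_match(output: str, expected: str) -> float:
    """Scoring rule: 1.0 if the trimmed output equals the expected string."""
    return 1.0 if output.strip() == expected else 0.0


def run_eval(
    system: Callable[[str], str],
    dataset: list,
    scorer: Callable[[str, str], float],
    min_mean_score: float = 0.8,
) -> dict:
    rows = []
    for case in dataset:
        output = system(case["input"])
        rows.append({**case, "output": output,
                     "score": scorer(output, case["expected"])})
    mean_score = sum(r["score"] for r in rows) / len(rows)
    return {
        "mean_score": mean_score,
        "passed": mean_score >= min_mean_score,   # gate for CI or a deploy step
        "failures": [r for r in rows if r["score"] < 1.0],
    }


# Stand-in system under test: canned answers, one of them in the wrong format.
canned = {"2 + 2": "4", "10 / 2": "5.0", "3 * 3": " 9 "}
report = run_eval(lambda q: canned[q], EVAL_SET, exact_match)
print(report["passed"], [f["input"] for f in report["failures"]])
```

Running something like this on every change is what turns “the output looked fine when I tried it” into a number you can track.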

Third, observability becomes essential. When something goes wrong, you need to see:

what the model was given
what it produced
what decisions the system made

Execution traces in AI systems play the same role that logs and stack traces play in distributed systems. Without them, debugging is guesswork.
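
Here is one way the three items above could be captured, sketched in Python with only the standard library. The `Trace` class and its field names are hypothetical; a production setup would more likely send these events to an existing tracing or logging backend.

```python
import json
import time
import uuid


class Trace:
    """Collects ordered events for one run so it can be replayed later."""

    def __init__(self) -> None:
        self.run_id = uuid.uuid4().hex
        self.events = []

    def record(self, step: str, **fields) -> None:
        self.events.append({
            "run_id": self.run_id,
            "seq": len(self.events),   # position within this run
            "step": step,
            "ts": time.time(),
            **fields,
        })

    def to_jsonl(self) -> str:
        """One JSON object per line, easy to append to a log file."""
        return "\n".join(json.dumps(event) for event in self.events)


trace = Trace()
# What the model was given.
trace.record("model_input", prompt="Summarise the ticket in one sentence.")
# What it produced.
trace.record("model_output", text="Customer cannot log in after password reset.")
# What decision the system made.
trace.record("decision", action="accept", reason="passed length check")
print(trace.to_jsonl())
```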

Fourth, we need to embrace non-determinism instead of fighting it.
Rather than relying on a single output, harness-based systems often generate multiple candidates and then select the best one using scoring, heuristics, or even another model. It’s less like calling a function and more like running a small experiment and choosing the best result.
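
A minimal best-of-n sketch in Python. The sampler is a seeded random choice standing in for a non-deterministic model, and `heuristic_score` is an arbitrary toy heuristic; both are assumptions, and in practice the scorer might be a rubric, a test suite, or another model.

```python
import random
from typing import Callable


def best_of_n(sample: Callable[[], str], score: Callable[[str], float], n: int = 5):
    """Draw n candidates and return (best_candidate, all_candidates)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score), candidates


# Stand-in for a non-deterministic model (assumption for illustration).
rng = random.Random(0)
options = [
    "ok",
    "It depends.",
    "Use a queue, because writes arrive in bursts.",
]


def fake_sample() -> str:
    return rng.choice(options)


def heuristic_score(text: str) -> float:
    # Toy heuristic: prefer longer answers that give a reason.
    return min(len(text), 60) / 60 + (0.5 if "because" in text else 0.0)


best, candidates = best_of_n(fake_sample, heuristic_score, n=5)
print(best)
```

The trade-off is cost: n candidates means roughly n times the model calls, so the scorer has to be good enough to justify it.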

Finally, we design for failure from the start. We assume outputs may be wrong, tools may be misused, and steps may be incomplete. So we add validation, fallbacks, retries, and sometimes human intervention.
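
Those four safeguards can be sketched in a few lines of Python. The function names, the expected JSON shape (an object with a numeric `total`), and the `"needs_human_review"` route label are all assumptions for the example; the primary and fallback callables stand in for a model call and a simpler backup path.

```python
import json
from typing import Callable


def validate(raw: str) -> dict:
    """Validation: the output must be a JSON object with a numeric 'total'."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict) or not isinstance(data.get("total"), (int, float)):
        raise ValueError("expected an object with a numeric 'total'")
    return data


def call_with_guards(
    primary: Callable[[], str],
    fallback: Callable[[], str],
    max_retries: int = 2,
):
    """Return (validated_result_or_None, route_taken)."""
    for _ in range(max_retries + 1):        # retries on the primary path
        try:
            return validate(primary()), "primary"
        except ValueError:
            continue
    try:                                    # fallback path
        return validate(fallback()), "fallback"
    except ValueError:
        return None, "needs_human_review"   # hand off to a person


# Stand-ins (assumptions): a primary that never returns valid JSON,
# and a simpler fallback that does.
calls = {"primary": 0}


def flaky_primary() -> str:
    calls["primary"] += 1
    return "total: three"


def simple_fallback() -> str:
    return '{"total": 3}'


result, route = call_with_guards(flaky_primary, simple_fallback)
print(result, route, calls["primary"])
```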

A familiar pattern in disguise
If this all feels familiar, it should. What we’re building now looks a lot like distributed systems with retries and circuit breakers, data pipelines with validation and monitoring, and search systems with ranking and evaluation.

The difference is that the core computation, the model, is probabilistic. The harness brings it back into the world of engineered reliability.

I think the reason so many companies are independently arriving at the same idea is that eventually, everyone hits the same limits. Prompts stop scaling. Complex pipelines become fragile, and the cost of unreliability becomes real. The harness is the natural response to those constraints because it gives you feedback loops, measurable quality, and controllable behaviour (they hope).

In other words, it lets you build something you can trust.

If there is one idea to hold onto, it is this: the model is not your product.
The system around the model is your product. And just like in traditional computing, the real innovation and the real differentiation happen in that layer above.

I kinda love that we’re finally moving out of the phase where AI feels like magic and into a phase where it starts to feel like engineering again. The teams that succeed won’t be the ones with the cleverest prompts. They’ll be the ones who build strong feedback loops, reliable evaluation systems, observable pipelines, and robust harnesses. Because in the end, users don’t care how impressive a single response is.

They care whether the system works.

Top comments (2)

Mallory Haigh • Jun 10

The pattern you're talking about maps almost exactly onto what platform engineers have been building in the agent infrastructure layer: identity, capability boundaries, execution, evaluation, observability, safety gates, etc. The "harness" is less a new concept, and more a name for the foundational substrate that makes agents reliable at scale.

The important next thing to think about is where that infrastructure lives organizationally. Right now, most teams are building it per-project, which means every AI system gets its own bespoke control loop, its own eval approach, its own observability setup, etc. That's the prompt-era mindset applied one level up the stack, and I continually see it produce the same fragmentation!

Paradith • Jun 15

The organisational point you raise is really interesting. The per-project “bespoke control loop” pattern feels like a direct carryover from the prompt era, and I’m seeing the same fragmentation you describe especially around evals and observability. It feels like there’s a gap right now where this should be converging into a shared platform layer, but most orgs haven’t quite made that shift yet. Curious if you’ve seen examples of teams getting that right (I have not seen any so far - and that is ok as we are all still learning :) )