$ agent run --task "reconcile yesterday invoices"
ok: 412 matched
ok: finished in 1m 12s
# next morning, finance reran the whole batch by hand
The run said ok.
The work was wrong.
If you have shipped an agent into real operations, you know this exact gap.
The logs are green. The outcome is not.
So you do what almost everyone does.
You open the system prompt and start rewording it.
You add a rule. You add an example. You tighten the instruction.
It works in your test that afternoon.
It breaks again the following Tuesday.
I want to name the reason, because it has cost teams I work with entire weeks.
The reflex that points at the wrong layer
When an agent does the wrong thing in production, the instinct is to fix the prompt.
That instinct is usually aimed one layer too high.
Prompt engineering and harness engineering are two different disciplines.
They fail in ways that look identical from the outside, and they live in completely different parts of your system.
Most teams practice one of them and quietly assume it covers both.
What each one actually is
Prompt engineering is what you say to the model.
The instruction, the role, the examples, the shape you ask for back. It is a strategy choice. It decides how the model reasons about a request.
Harness engineering is everything around the call.
What context reaches the model. What tools it is allowed to touch. What happens to its output before anything downstream acts on it. What fires when it returns nonsense. What state it carries between steps. What the loop does when a tool times out or a credential rotated overnight.
Prompts are a strategy choice.
The harness is the operating system that strategy runs on.
The sharpest prompt inside a careless harness is a sports car with no brakes.
Why the two get confused
Both failures show up as the same sentence in your incident channel.
"The agent did the wrong thing."
That sentence hides which layer broke. A prompt failure and a harness failure read identically to the person filing the ticket.
So teams reach for the layer they know how to touch, which is the prompt, and they tune it.
The tuning sometimes moves the symptom. It rarely fixes a harness bug, because the harness bug was never about what the model was told.
How to tell which layer you are actually in
You do not need my tooling to run this split. You need the questions.
Hand the model the right inputs by hand and watch what it returns.
If it produces a wrong answer from clean inputs, you are in prompt territory. The instruction or the examples are off.
If it is fine in isolation but the running system still does the wrong thing, you are in harness territory. The right answer reached the wrong place, or a wrong input reached a model that had no way to know it was wrong.
Then ask what changed between the green build and the first red incident.
If the honest answer is a load shape, a timing dependency, or a credential that rotated, the prompt was never the problem. The harness did not expect reality.
That third class is the expensive one.
It does not show up in dev. It shows up at week two of production, after the first real load passes through. That delay is exactly why teams misread it as a model problem.
What changes when you treat the harness as its own discipline
The question you ask stops being "what should I tell the model."
It becomes "what does the model need to see, what is it allowed to touch, and what checks its output before anything trusts it."
That is a different muscle. It sits closer to systems engineering than to writing.
It is also where the reliability your operations team actually feels comes from.
The prompt decides whether the answer is good.
The harness decides whether a bad answer can do damage.
In enterprise automation, the second one is what keeps the Monday dashboard honest.
What this writeup does not hand you
I run a working version of this across DACH automation deployments.
The instrumentation, the thresholds, and the runbook that maps a failure class to a first response are the work I bring to client engagements.
I am not pasting them here, and the reason is honest.
If the implementation is a copy-paste away, the next team to hit this finds the snippet, ships it, and skips the conversation that surfaces the deeper problem.
The diagnosis is the part that carries value. The code is the easy half once you know which layer you are fixing.
The question I will leave you with
If you run agents in anything that touches revenue or operations, you have hit one of these already.
So tell me which side your last bad incident lived on.
Did the model give a wrong answer, or did a right answer land in the wrong place?
Drop the shape of it in the comments and I will tell you which layer I would look at first.


Top comments (0)