There's a particular kind of dismissal I've gotten used to. It's the typical response to the idea that harness engineering is a legitimate field of engineering: the claim that it's just a matter of typing the right words.
The framing is wrong, and I want to say why.
What engineering actually is
Engineering isn't defined by its substrate. It's not what you do to silicon, or to bridges, or to pipelines. Engineering is the disciplined application of the scientific method to building things that work. You form a hypothesis about the world, you make a change to test it, you observe the result, and you keep what survived and discard what didn't. Do that long enough on the same artifact and you end up with something that embodies a stack of accepted theories about how its inputs map to its outputs.
That description fits what I do on the harness exactly.
The loop
I have a system prompt. I have a set of tools. I have a corpus of inputs, rules, skills, and so on. I have outputs.
When I make a change to the harness, I am not vibing. I am running an experiment.
The hypothesis is specific. If I rewrite this section of the system prompt to tell the model to think before answering on math problems, then on this representative set of math inputs, the outputs will be more often correct and less often confidently wrong. That's a falsifiable claim about how a particular input distribution maps to a particular output distribution under a particular intervention.
The experiment is the change itself. I make it, I run it, I look at what comes out.
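To make that concrete, here is a minimal sketch of the math-prompt experiment as code. Everything in it is an assumption for illustration: the `Case`, `Answer`, and `Tally` types, the `run_experiment` and `hypothesis_holds` helpers, and especially `fake_model`, which is a deterministic stand-in for a real model call so the example runs offline. It is not a real evaluation framework, just the shape of the comparison.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    prompt: str
    expected: str


@dataclass(frozen=True)
class Answer:
    text: str
    confident: bool  # True if the answer was stated without hedging


@dataclass(frozen=True)
class Tally:
    correct: int
    confidently_wrong: int
    total: int


def run_experiment(
    model: Callable[[str, str], Answer], system_prompt: str, cases: list[Case]
) -> Tally:
    """Run one harness variant over a fixed input set and count outcomes."""
    correct = 0
    confidently_wrong = 0
    for case in cases:
        answer = model(system_prompt, case.prompt)
        if answer.text.strip() == case.expected:
            correct += 1
        elif answer.confident:
            confidently_wrong += 1
    return Tally(correct, confidently_wrong, len(cases))


def hypothesis_holds(baseline: Tally, candidate: Tally) -> bool:
    """The claim: more often correct AND less often confidently wrong."""
    return (
        candidate.correct > baseline.correct
        and candidate.confidently_wrong < baseline.confidently_wrong
    )


def fake_model(system_prompt: str, prompt: str) -> Answer:
    """Hypothetical stand-in for a real model call (deterministic)."""
    truth = {"17 * 3": "51", "2 ** 10": "1024"}
    if "think before answering" in system_prompt:
        return Answer(truth[prompt], confident=True)
    # Without the instruction, this stub gets one problem confidently wrong.
    return Answer("51" if prompt == "17 * 3" else "1000", confident=True)


cases = [Case("17 * 3", "51"), Case("2 ** 10", "1024")]
baseline = run_experiment(fake_model, "You are a helpful assistant.", cases)
candidate = run_experiment(
    fake_model,
    "You are a helpful assistant. On math problems, think before answering.",
    cases,
)
print(baseline, candidate, hypothesis_holds(baseline, candidate))
```

The point of the sketch is that both variants see the same inputs and are scored by the same function. With a real, stochastic model you would want more cases and repeated runs before trusting the tally.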
The evaluation is where the objections show up loudest, so let me address them directly.
"But 'good' is subjective"
Yes. So is "fast enough," "clean enough," "safe enough," "maintainable enough," and every other criterion any engineer in any discipline has ever optimized against. A bridge engineer decides what acceptable deflection is. A backend engineer decides what acceptable p99 latency is. Those numbers come out of judgment, taste, and context. They aren't handed down from physics.
What makes the judgment engineering rather than vibes is that I commit to it before I run the experiment, I apply it the same way across the comparison, and I'm willing to be wrong about the change. When I look at the new outputs and they're worse by my own stated criteria, I revert. That's the falsifiability the scientific method needs. Not an objective view of correctness, but a willingness to accept results that contradict the hypothesis.
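One way to hold yourself to that is to write the criteria down as data before the run and let a function make the keep-or-revert call. The sketch below is illustrative only: `Criteria`, its two thresholds, and `decide` are hypothetical names, and the per-input pass/fail results are assumed to come from whatever grading you already do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the bar cannot be moved after the run
class Criteria:
    min_pass_rate_gain: float  # how much better the candidate must be
    max_regressions: int       # inputs allowed to go from pass to fail


def decide(
    criteria: Criteria, baseline: dict[str, bool], candidate: dict[str, bool]
) -> str:
    """Apply the same pre-committed criteria to both runs."""
    if not baseline or baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same non-empty input set")
    gain = (sum(candidate.values()) - sum(baseline.values())) / len(baseline)
    regressions = sum(1 for k in baseline if baseline[k] and not candidate[k])
    if gain >= criteria.min_pass_rate_gain and regressions <= criteria.max_regressions:
        return "keep"
    return "revert"


# Committed before looking at any candidate output (hypothetical thresholds).
criteria = Criteria(min_pass_rate_gain=0.1, max_regressions=0)

baseline = {"a": True, "b": False, "c": False, "d": True}
improved = {"a": True, "b": True, "c": False, "d": True}
traded = {"a": False, "b": True, "c": True, "d": True}

print(decide(criteria, baseline, improved))
print(decide(criteria, baseline, traded))
```

The second candidate has a higher overall pass rate but breaks an input that used to work, so under these particular criteria it gets reverted. Whether that is the right bar is a judgment call; what matters is that it was set before the results came in.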
Honestly, the evaluation here is easier than the abstract worry suggests. The harness itself codifies what good output looks like: the standards for code, the expected structure, the behavioral rules. So evaluation collapses to a clean question: did this input produce code that adheres to the standards the harness defines? I'm not reaching for a vague aesthetic. I'm checking outputs against criteria I've already committed to in writing, in the very artifact I'm modifying.
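As a rough illustration of checking output against written standards, suppose the harness told the model that generated Python must parse, that every function needs a docstring, and that bare `except:` clauses are not allowed. Those three rules are assumptions picked for the example, not the actual contents of any harness, and `violations` is a hypothetical helper built on the standard library's `ast` module.

```python
import ast


def violations(source: str) -> list[str]:
    """Return the written standards that a piece of generated code breaks."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"does not parse: {exc.msg}"]

    found: list[str] = []
    for node in ast.walk(tree):
        # Hypothetical standard: every function carries a docstring.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                found.append(f"function '{node.name}' has no docstring")
        # Hypothetical standard: no bare 'except:' clauses.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            found.append("bare 'except:' clause")
    return found


good_output = 'def add(a, b):\n    """Return a + b."""\n    return a + b\n'
bad_output = "def f():\n    try:\n        pass\n    except:\n        pass\n"

print(violations(good_output))
print(violations(bad_output))
```

Not every standard reduces to a mechanical check like this, but the ones that do turn "is this output good?" into a question with a reproducible answer.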
Codification
The deepest reason this is engineering, though, is what happens to the harness over time.
Every change that survives is a lesson made permanent. The current state of the harness is a fossil record of every hypothesis I confirmed: every failure mode I noticed, every input class I learned to handle, every interaction between tools that I figured out the hard way. It is also a fossil record of restraint. Every change I tried and threw away, every clever idea that turned out to make things worse, is not in there. That negative space is information too. A harness that only grows is a harness whose author wasn't doing science.
When someone reads my system prompt, they're reading the conclusions of hundreds of small experiments. The fact that those conclusions aren't presented as a research paper doesn't make them less real. They're load-bearing. The harness works because of them.
Where this leaves us
The substrate is new. The outputs are stochastic. The evaluation involves judgment. None of that means it isn't engineering. It just means it's engineering on an unfamiliar material, and people are pattern-matching on what engineering looks like instead of what it is.
Hypothesis. Experiment. Evaluation. Codify or discard.
That's the loop. I'm in it every day. Call it what you want. I'll keep doing the work.