In February 2026, Mitchell Hashimoto (co-founder of HashiCorp) described his habit of engineering permanent fixes into an AI agent's environment whenever it made a mistake. He called it "engineering the harness." Days later, OpenAI formalized the concept in a blog post. Around the same time, without having read either, I wrote my first enforcement hook for a production AI system. Different continent, different scale, different context. Same problem.
A few weeks later, Birgitta Böckeler formalized it on Martin Fowler's site. Red Hat published their version. LangChain. Salesforce. By April, the term was everywhere.
I didn't discover any of this until recently. I was too busy building the thing they were naming.
That's not a flex. It's something more interesting. When engineers face the same constraints (unreliable model outputs, production stakes, context that evaporates), they converge on the same solutions. Different trails, same summit. And if your messy pile of rules and scripts looks suspiciously like what OpenAI and Fowler describe, that's not coincidence. It's validation.
What Is Harness Engineering (And Why It Matters for AI Agents)
Harness engineering is the discipline of building the constraints, gates, memory systems, and feedback loops that wrap around an AI agent to make it reliable in production. The core equation, from Martin Fowler's team: Agent = Model + Harness. The harness is everything around the model that you actually control.
Red Hat puts it differently. "The AI writes better code when you design the environment it works in." Their framing is about structured workflows. Templates. Impact maps. Acceptance criteria.
Both are right. Neither is complete.
They describe the architecture. They don't describe the pain that forces you to build it.
How My Harness Grew (Without Me Realizing What It Was)
I run a production AI system as a daily driver. Not a demo. Not a proof of concept. A system that manages infrastructure, writes code, deploys to servers, interacts with APIs, and handles real stakes across real projects. I co-founded Aether Global Technology, a Salesforce consulting partner in Manila. The system runs alongside that work.
I never sat down and said "I'm going to build a harness." I just kept getting burned, and kept adding rules so I wouldn't get burned the same way twice. Looking back, every rule traces to a specific failure.
The anti-fabrication rules exist because the AI confidently stated a method existed in a file it hadn't read. I spent 45 minutes debugging code that was never there. The fix wasn't better prompting. It was a mechanical gate: before asserting any method name or file path, the system must verify via tool. No verification, no assertion. That's a feedforward control, in Fowler's language. I just called it "stop making things up."
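Here's roughly what that gate looks like as code. A minimal sketch: the claim format, the verify_claim name, and the file paths are all illustrative, and the AST check only covers Python sources (for anything else, a plain grep does the same job).

```python
import ast
from pathlib import Path

def verify_claim(kind: str, value: str, repo_root: str = ".") -> bool:
    """Mechanically confirm a claim before the agent is allowed to assert it.

    kind: "path" for a file path, "method" for a "file.py::method_name" pair.
    No verification, no assertion. Unknown claim kinds fail closed.
    """
    root = Path(repo_root)
    if kind == "path":
        return (root / value).exists()
    if kind == "method":
        file_part, _, name = value.partition("::")
        target = root / file_part
        if not target.is_file():
            return False
        tree = ast.parse(target.read_text(encoding="utf-8"))
        return any(
            isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
            and node.name == name
            for node in ast.walk(tree)
        )
    return False

# Illustrative paths. The point is that the block fires before the text
# containing the claim ever reaches a human.
claims = [("path", "src/billing.py"), ("method", "src/billing.py::apply_discount")]
unverified = [c for c in claims if not verify_claim(*c)]
if unverified:
    raise RuntimeError(f"Blocked: unverified claims {unverified}")
```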
The deploy gate exists because the system nearly pushed Salesforce metadata to the wrong sandbox. 54 files, wrong org. The fix was a target allowlist per project, checked mechanically before any deploy command executes. A hard block, not a polite suggestion. (Sound familiar? An AI agent deleted a production database in 9 seconds because nobody built one of these.)
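The gate itself is almost boring. A sketch, assuming a JSON file that maps each project to its allowed orgs; the .harness/ path and the schema are just one way to lay it out, not a standard:

```python
import json
import sys
from pathlib import Path

ALLOWLIST_FILE = Path(".harness/deploy-targets.json")  # illustrative location
# File contents might look like:
# {"client-alpha": ["alpha--uat", "alpha--dev"], "client-beta": ["beta--qa"]}

def check_deploy_target(project: str, target_org: str) -> None:
    """Hard block, not a polite suggestion: refuse any deploy whose target
    org isn't allowlisted for this project. Runs before the deploy command."""
    allowlist = json.loads(ALLOWLIST_FILE.read_text(encoding="utf-8"))
    allowed = allowlist.get(project, [])
    if target_org not in allowed:
        sys.exit(f"BLOCKED: '{target_org}' is not an allowed target for "
                 f"'{project}'. Allowed: {allowed or 'none'}")

# check_deploy_target("client-alpha", "alpha--uat")  # passes silently
# check_deploy_target("client-alpha", "beta--qa")    # exits; deploy never runs
```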
The anti-drift rules exist because after multiple tool calls, the system's mental model of a file diverges from the file's actual state. It recalls values it read 20 minutes ago, not the values that exist now. The fix: re-read the source before emitting anything external-facing. Grep at write time, not recall time.
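Mechanically, that's a write-time check: re-read, then confirm every value you're about to quote still exists. A sketch (the function name and paths are illustrative):

```python
import re
from pathlib import Path

def stale_values(source: Path, quoted: list[str]) -> list[str]:
    """Re-read the source *now* and return every quoted value that no
    longer appears in it. Grep at write time, not recall time."""
    current = source.read_text(encoding="utf-8")
    return [v for v in quoted if not re.search(re.escape(v), current)]

# Before emitting anything external-facing:
drifted = stale_values(Path("config/limits.yaml"), ["max_retries: 5", "timeout: 30"])
if drifted:
    raise RuntimeError(f"Drift detected, re-read before emitting: {drifted}")
```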
The citation requirement exists because the system generated a client proposal with a number it pulled from nowhere. In consulting, a wrong number in front of a client is a credibility hit you don't recover from. The rule is simple now: every data claim needs a source. No source, mark it as unverified. No exceptions.
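Even that can be enforced mechanically. A toy sketch, with hypothetical Claim and render_claim names: empty source field, label goes on.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    source: str | None = None  # URL, file path, or tool-call ID

def render_claim(claim: Claim) -> str:
    """A claim either carries a source or wears the label. The label is
    applied mechanically. No judgment call, no exceptions."""
    if claim.source:
        return f"{claim.text} [source: {claim.source}]"
    return f"{claim.text} [UNVERIFIED]"

print(render_claim(Claim("Q3 adoption rose 12%", source="reports/q3.csv")))
print(render_claim(Claim("Churn is around 4%")))  # -> Churn is around 4% [UNVERIFIED]
```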
None of these came from reading a framework. They came from things going wrong on a Tuesday afternoon.
What Fowler Gets Right
The dual-control model is real. You need both feedforward controls (rules that prevent bad behavior before it happens) and feedback controls (sensors that catch it after). Relying on just one creates blind spots.
My system has 40+ feedforward hooks. They fire before tool calls, checking for unauthorized domains, verifying pre-task knowledge checks happened, blocking destructive git operations, enforcing deploy targets. The same problems I wrote about in what autonomous agents actually cost in production.
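The shape of a feedforward layer is simple: a registry of small checks, each of which can veto a pending tool call before it runs. A sketch of two of the hooks above; the registry, tool names, and allowlists are all illustrative, not any particular agent framework's API:

```python
from urllib.parse import urlparse

HOOKS = []  # each hook inspects a pending tool call and may veto it

def hook(fn):
    HOOKS.append(fn)
    return fn

@hook
def block_unauthorized_domains(tool: str, args: dict) -> str | None:
    allowed = {"api.github.com", "login.salesforce.com"}  # illustrative
    if tool == "http_request":
        host = urlparse(args.get("url", "")).hostname
        if host not in allowed:
            return f"domain '{host}' is not on the allowlist"
    return None

@hook
def block_destructive_git(tool: str, args: dict) -> str | None:
    destructive = ("push --force", "reset --hard", "clean -fd")
    if tool == "shell" and any(bad in args.get("cmd", "") for bad in destructive):
        return "destructive git operation"
    return None

def run_tool(tool: str, args: dict):
    """Feedforward: every hook fires *before* the tool call executes."""
    for check in HOOKS:
        reason = check(tool, args)
        if reason:
            raise PermissionError(f"Blocked {tool}: {reason}")
    ...  # dispatch to the real tool only after every hook passes
```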
The feedback side is thinner. I have post-execution checks and monitoring, but the honest truth is that feedforward controls do most of the heavy lifting. Catching a bad action before it executes is cheaper than cleaning up after it runs.
Fowler also nails the distinction between computational and inferential controls. My deploy gate is computational. It checks a JSON allowlist. Takes milliseconds. My anti-fabrication system is inferential. It relies on the model itself to flag uncertainty. That's slower, less reliable, and more expensive. But it catches things no deterministic check can.
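An inferential control can be as simple as turning the model back on its own draft. A sketch, where llm_call stands in for whatever client you already use (this isn't any vendor's API):

```python
UNCERTAINTY_PROMPT = (
    "Review your previous answer. List every factual claim you could not "
    "verify with a tool call. Reply NONE if every claim was verified."
)

def inferential_check(llm_call, draft: str) -> tuple[bool, str]:
    """Inferential control: the model itself is the sensor. Slower, costlier,
    and less reliable than the JSON check above, but it can flag problems
    no deterministic rule can express."""
    verdict = llm_call(f"{UNCERTAINTY_PROMPT}\n\n---\n{draft}")
    return verdict.strip().upper() == "NONE", verdict
```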
What the Frameworks Miss
Harnesses are incident-driven, not architecture-driven. The literature treats harness engineering as a design discipline. It is, eventually. But every harness I've seen starts as a pile of duct tape applied after something broke. The elegance comes later.
Context survival is the real engineering problem. Nobody talks about this enough. AI agents operate in conversation windows. Those windows compress. When they compress, the agent forgets rules, loses project state, and starts making the same mistakes you fixed three hours ago. My harness has a dedicated recovery protocol: when context compresses, reload memory, re-read project state, verify the date (the agent doesn't know what day it is after compression). That's not in any of the frameworks. It should be.
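The recovery protocol itself is mundane, which is the point. A sketch, assuming a .harness/memory directory holding the rules file and project state (my layout here is illustrative, not a standard):

```python
import datetime
import json
from pathlib import Path

def recover_from_compression(memory_dir: Path = Path(".harness/memory")) -> dict:
    """Run whenever the context window compresses: reload memory, re-read
    project state, and re-anchor the date from the system clock, never
    from the model's recall."""
    return {
        "rules": (memory_dir / "rules.md").read_text(encoding="utf-8"),
        "project_state": json.loads((memory_dir / "state.json").read_text(encoding="utf-8")),
        "today": datetime.date.today().isoformat(),
    }
```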
The harness is the product, not the model. When people evaluate AI systems, they compare models. Claude vs. GPT vs. Gemini. That's the wrong comparison. The model is interchangeable. I've run the same harness across model versions, and the harness determines output quality more than the model does. A disciplined harness on a weaker model beats an unconstrained stronger model every time.
Human checkpoints aren't optional. Red Hat says "human review between planning and implementation." That's correct but undersells it. In my system, any task with three or more steps requires a plan review before execution. Single-step tasks state the intended action and wait. This isn't a nice-to-have. It's the difference between an AI agent that helps and one that creates work.
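As a gate, it's blocking, not advisory. A sketch; require_plan_review and the approve callback are illustrative names:

```python
def require_plan_review(steps: list[str], approve) -> None:
    """Blocking human checkpoint: three or more steps require an approved
    plan before anything executes; a single step states the intended
    action and waits."""
    if len(steps) >= 3:
        plan = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
        if not approve(f"Plan review required:\n{plan}\nProceed?"):
            raise RuntimeError("Plan rejected, nothing executed")
    elif not approve(f"Intended action: {steps[0]}\nProceed?"):
        raise RuntimeError("Action rejected")

# Usage with a terminal prompt as the human in the loop:
# require_plan_review(
#     ["read schema", "write migration", "deploy to sandbox"],
#     approve=lambda msg: input(msg + " [y/N] ").strip().lower() == "y",
# )
```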
Same Summit, Different Trails
Here's what I find encouraging about this whole thing.
My first hook was mid-February 2026. By March, I'd codified the principle "mechanical enforcement over behavioral commitment" because telling the model not to do something stopped working the moment context compressed. By April, I had 30+ hooks, a memory layer that survives compression, and a pre-task gate system that forces verification before every edit.
I built all of this without reading a single blog post about harness engineering. I built it because things kept breaking, and I was tired of fixing the same failures manually.
OpenAI, Fowler, Red Hat, LangChain, Salesforce. They all arrived at the same architecture from the enterprise side. I arrived from the practitioner side. A guy in Manila running one AI system across 40+ projects, duct-taping rules onto it every time something went wrong.
The fact that we converged tells you something important: this isn't a framework you adopt. It's a shape that production forces you into. If you're running an AI agent on real work and you've started writing rules, blocking certain commands, requiring verification steps before deploys, you're already doing harness engineering. You just didn't know it had a name.
The industry version is clean. Diagrams with boxes. Three regulation dimensions. Harness templates.
The practitioner's version is messier. A behavioral rules file that grew from 5 rules to 13 because the AI kept finding new ways to drift. A hook that blocks web searches because the AI was burning API calls on questions its own knowledge base could answer. A gate that forces the system to check what day it is before referencing time, because it hallucinated the date twice.
Both versions work. Both are valid. The diagram didn't exist when I needed a solution. The solution existed when the diagram caught up.
If you're building something like this and wondering whether you're doing it right, check it against Fowler's framework. If your scrappy infrastructure maps to their categories (guides, sensors, computational controls, inferential controls), you're on the right track. The problems are universal. The solutions are convergent. And you don't need permission from a blog post to keep building.
Originally published at tokita.online