Harness Engineering — The Quality Pillar of Agentic Engineering

#agents #ai #automation #codequality

How to use deterministic tooling to hard-enforce code quality

In February 2026, OpenAI published an article about building a complete product without writing a single line of code by hand. An agent did all the work. They used the term harness engineering to describe the discipline around it.

TL;DR: humans steer, agents execute. The team no longer wrote code. Their job was to build the environment around the agent so it could work reliably.

The article groups this work into roughly four sections:

Context: the repository as the single source of truth
Observability: DevTools, logs, metrics
Feedback loops: basically a Ralph loop
Guardrails: the deterministic quality pillar

Let's focus on the quality pillar

As software developers, we let agents write our code. But we still own it, and we are still accountable for it.

Given this constraint, we need to be able to trust the output. Before coding agents, and often still today, that trust came from review: we read the code, added comments, made improvements. We crafted code to establish trust.

With coding agents, code creation becomes code generation, and generated code will break at some point unless it is heavily audited during the generation phase. I strongly lean toward pre-commit hooks for this, because they bring several benefits:

The gate runs locally, in your team's workflow
The tooling is old and battle-tested, and the agent knows it well
Nothing broken reaches the branch
It lives in the repo, usable by everyone

Most of us already use pre-commit hooks. Usually one or two. A commit message linter for Conventional Commits, maybe a formatter. That is where it normally ends.

But here is the new thing: you can stack up tools and build a huge pipeline. In a human team, every added check costs developer time, so the return on each one shrinks. An agent (with auto-compaction) can work against 50+ hooks around the clock, fully autonomously, until they all pass. Code quality improves without extra human effort, and your ROI goes up.
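A stack like this can be driven by one small runner script. The following is a minimal sketch, not the actual setup of any template mentioned here: `run_hooks` is a hypothetical helper, and the gate commands in the usage comment are examples that assume those tools are installed in the project.

```shell
# Hypothetical hook runner: executes each gate in order and stops at the first failure.
# Each argument is one complete command line, run through `sh -c`.
run_hooks() {
  local gate
  for gate in "$@"; do
    echo "→ $gate"
    if ! sh -c "$gate"; then
      echo "✗ Gate failed: $gate"
      return 1
    fi
  done
  echo "✓ All gates passed"
}

# Example use inside .git/hooks/pre-commit (commands are assumptions, adapt to your stack):
#   run_hooks \
#     "bunx tsc --noEmit" \
#     "bunx jscpd src" \
#     "bunx type-coverage --strict --at-least 100" || exit 1
```

Stopping at the first failure keeps the output short, so the agent sees one problem at a time instead of fifty interleaved reports.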

Know that you don't know the tooling

Some tools everyone knows. We are used to measuring coverage. We have seen Sonar reports on cyclomatic complexity and ignored them because of the refactoring effort it would take. But when I started digging into this, I was really surprised by how broad the tooling is.

I built a TypeScript scaffolding template for greenfield projects, and while working on it I stumbled across jscpd, a copy/paste detector for source code. I had never heard of it before. (You might argue that as an IT architect I should have, and you are absolutely right.) Now it is a central part of my pre-commit hook pipeline.

Are you the "show me the code" type? Point your agent at this TypeScript template and let it explain it to you. Or use the .NET version a colleague of mine built.

For everyone else, here is a concrete example.

Feeding pressure back into the agent

The second important part is how to build the rule.

When a check fails, the agent really wants to make it green. It can either fix the code or weaken the rule: raise a threshold, extend an ignore list, add a #noqa comment. The bad news: models are trained on human-written data, in which weakening the rule looks as valid as fixing the issue. And one thing hasn't changed either: fixing the issue is almost always harder.
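The "weaken the rule" path can itself be put behind a gate. The sketch below is one possible approach, not something taken from the template: `check_no_new_suppressions` is a hypothetical function that reads a unified diff on stdin and fails when an added line contains a suppression marker. The marker list is an assumption and needs adapting to your stack.

```shell
# Hypothetical gate: reads a unified diff on stdin and fails when an added line
# contains a suppression marker. The marker list is an assumption.
find_new_suppressions() {
  # Added lines start with "+"; lines starting with "+++ " are file headers.
  grep -E '^\+' | grep -vE '^\+\+\+ ' | grep -E 'noqa|eslint-disable|@ts-ignore'
}

check_no_new_suppressions() {
  local hits
  hits=$(find_new_suppressions || true)
  if [ -n "$hits" ]; then
    echo "✗ This change adds suppression comments:"
    echo "$hits"
    echo "  Fix the underlying issue instead of silencing the check."
    return 1
  fi
}

# Example use in a pre-commit hook:
#   git diff --cached -U0 | check_no_new_suppressions || exit 1
```

Only added lines are inspected, so suppressions that already exist in the codebase do not block unrelated commits.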

So a raw tool output is not enough. I suggest this best practice:

wrap the tool in a custom script
on error emit an instructive message
the message explains how to fix the problem, both quantitatively (the target to reach) and qualitatively (what a proper fix looks like)
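The three steps above can be sketched as a generic wrapper. This is a minimal illustration under stated assumptions: `run_gate` and its argument order are hypothetical, and the guidance text is whatever you want the agent to read when the tool fails.

```shell
# Hypothetical wrapper: runs a check and, on failure, prints guidance for the agent.
# Usage: run_gate "<guidance text>" <command> [args...]
run_gate() {
  local guidance="$1"
  shift
  local status=0
  # Run the tool unchanged so its raw findings stay visible.
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    printf '\n✗ Check failed: %s\n\n%s\n' "$*" "$guidance" >&2
  fi
  # Pass the tool's exit code through so the hook still blocks the commit.
  return "$status"
}

# Example use (the command is an assumption):
#   run_gate "Remove the duplication by extracting shared code. Do not raise the threshold." \
#     bunx jscpd src || exit 1
```

The wrapper leaves the tool's own output untouched and only appends the instruction, so the deterministic pass or fail result stays exactly what the tool reported.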

If the tool needs it, give it a very small bypass window and instruct the agent to document the reasoning and tag it as a bypass. This gives a reviewer the chance to see why the decision was made.
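One way to keep such a bypass window small is to count the tagged bypasses and fail once a fixed budget is exceeded. The sketch below is an assumption about how that could look, not part of the hook shown in this article: `check_bypass_budget` is a hypothetical helper, and it relies on every bypass carrying the BYPASS-JUSTIFICATION tag.

```shell
# Hypothetical helper: fails when the number of tagged bypasses exceeds a budget.
# Usage: check_bypass_budget <max_allowed> <path>...
check_bypass_budget() {
  local max="$1"
  shift
  local count
  # One output line per tagged bypass; grep finding nothing is not an error here.
  count=$( { grep -r -- "BYPASS-JUSTIFICATION" "$@" 2>/dev/null || true; } | wc -l | tr -d ' ')
  if [ "$count" -gt "$max" ]; then
    echo "✗ $count bypasses found, the budget is $max."
    echo "  Remove a bypass by fixing the code properly. Do not raise the budget."
    return 1
  fi
  return 0
}

# Example use (budget and path are assumptions):
#   check_bypass_budget 3 src || exit 1
```

Because the tag is plain text in the source, a reviewer can list every bypass and its reasoning with the same grep.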

This is one of my hooks. It enforces type coverage at 100 percent.

if ! bunx type-coverage --strict --detail --at-least 100 2>&1; then
  echo ""
  echo "✗ Type coverage is below 100%."
  echo ""
  echo "  THE ONLY CORRECT FIX IS TO PROPERLY TYPE THE REPORTED EXPRESSION."
  echo ""
  echo "  This means: define an interface or type for the data, annotate the"
  echo "  variable, narrow with a type guard — whatever it takes. Even if it"
  echo "  is a lot of work, that work is always the right answer."
  echo ""
  echo "  Suppression (type-coverage:ignore-next-line / ignore-file) and blind"
  echo "  casts (as SomeType) are introducing technical debt and are only"
  echo "  permitted when it is 100% certain that no proper typing is possible"
  echo "  (e.g. a 3rd-party API with no published schema and no way to infer"
  echo "  one from the code). If you reach for suppression before exhausting"
  echo "  proper typing options, you are doing it wrong."
  echo ""
  echo "  When suppression truly cannot be avoided, add a BYPASS-JUSTIFICATION"
  echo "  comment on the line above explaining exactly why proper typing is"
  echo "  impossible — not just inconvenient."
  exit 1
fi

The wording is not cosmetic. An earlier version of this message gave the bypass the same weight as the fix, both in token count and in tone. On a large codebase the model took the cheap path and bypassed everywhere. It even put real effort into the justifications. After I changed the message to the one you see here, the behavior flipped from escaping to fixing.

This is the craft now: the tool creates the deterministic gate, pass or fail. The error message tells the model how to react to a failure properly instead of randomly. You need both, and the tooling and the error message are where the work is.