Discussion on: AI Harness Engineering: The Missing Layer Behind Reliable LLM Applications

"Harness engineering" is a sharper name for this layer than what I usually hear ("scaffolding," "orchestration") and I think it sticks because it implies load-bearing rather than decorative.

One thing I'd add from running production agent stacks: the harness work that pays off most isn't on the input side (prompt assembly, RAG retrieval); it's on the output-verification side. In my experience, schema-validated JSON output with retry-on-violation, a tool-call whitelist, and a cheap "did this answer the question?" classifier in front of the user together catch roughly an order of magnitude more failures than tuning the prompt further.
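
A minimal sketch of the first two checks (schema validation with retry-on-violation, plus a tool-call whitelist), using only the Python standard library. Everything named here is an assumption for illustration: the tool names, the two-field schema, and the `model` callable, which stands in for whatever provider client the harness wraps. The "did this answer the question?" classifier is left out.

```python
import json
from typing import Callable

# Hypothetical tool names and a hypothetical minimal schema, for illustration only.
ALLOWED_TOOLS = {"search_docs", "get_order_status"}
REQUIRED_FIELDS = {"tool": str, "arguments": dict}


def validate(raw: str) -> dict:
    """Parse model output; raise ValueError on any schema or whitelist violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            raise ValueError(f"field {field!r} is missing or not a {expected.__name__}")
    if data["tool"] not in ALLOWED_TOOLS:
        raise ValueError(f"tool {data['tool']!r} is not whitelisted")
    return data


def call_with_retry(model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model, feeding each violation back into the next attempt."""
    feedback = ""
    for _ in range(max_attempts):
        raw = model(prompt + feedback)
        try:
            return validate(raw)
        except ValueError as exc:
            feedback = f"\n\nYour previous reply was rejected: {exc}. Reply with JSON only."
    raise RuntimeError(f"no valid output after {max_attempts} attempts")
```

The design choice worth noting is that the violation message goes back into the next prompt, so a retry is a correction rather than a blind resample. A real harness would likely swap the hand-rolled field check for a proper schema validator.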

The dirty secret is that swapping providers (Claude ↔ GPT ↔ Gemini) on a well-built harness is mostly a config diff; swapping providers on a bare API call is a rewrite. That's the real reason model convergence makes the harness the moat — portability is the product feature.
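
One way to picture the "config diff" claim, as a hedged sketch: the harness depends only on a small adapter signature, and the provider is picked by a config value. The `HarnessConfig` and `Harness` names, the adapter signature, and the provider and model strings are all hypothetical; real adapters would wrap each vendor's SDK, which is deliberately not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical adapter signature: (model_name, prompt) -> completion text.
Adapter = Callable[[str, str], str]


@dataclass(frozen=True)
class HarnessConfig:
    provider: str  # key into the adapter registry
    model: str     # passed through to the adapter unchanged


class Harness:
    """Everything except the adapter is provider-independent."""

    def __init__(self, config: HarnessConfig, adapters: Dict[str, Adapter]):
        if config.provider not in adapters:
            raise KeyError(f"no adapter registered for {config.provider!r}")
        self.config = config
        self._adapter = adapters[config.provider]

    def ask(self, prompt: str) -> str:
        # Shared post-processing (validation, retries, logging) would live here.
        return self._adapter(self.config.model, prompt).strip()


if __name__ == "__main__":
    # Stub adapters standing in for real provider clients.
    adapters = {
        "provider_a": lambda model, prompt: f"[{model}] reply to: {prompt}",
        "provider_b": lambda model, prompt: f"[{model}] reply to: {prompt}",
    }
    # Swapping providers changes only this line.
    harness = Harness(HarnessConfig(provider="provider_a", model="model-x"), adapters)
    print(harness.ask("hello"))
```

The calling code never imports a vendor SDK directly, which is what keeps a provider swap confined to the config and one adapter.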

Question for you: how are you handling eval regressions when you upgrade the underlying model? That's the part of harness engineering I've seen teams underinvest in until it bites them.