Harness Engineering: The Next Evolution of AI Engineering
There's a quiet but significant shift happening in how engineers work with AI. Most people are still talking about prompt engineering. Some have moved on to context engineering. But the frontier right now is something deeper: harness engineering.
And it changes not just how we build software — it changes what skills actually matter.
The Three Eras
Era 1: Prompt Engineering
This is where most engineers started. You craft the right words, the right instructions, the right examples — and the model gives you a better output.
It works. But it's fundamentally a single-turn, stateless interaction. You're still doing all the orchestration in your head.
Era 2: Context Engineering
The next step was realizing the words mattered less than the information. What does the model know when it answers? What docs, history, retrieved data, and memory are in the window?
RAG pipelines, memory systems, and knowledge bases all belong here. You're no longer just crafting prompts — you're curating what the model sees.
Era 3: Harness Engineering
This is the current frontier. Instead of controlling what the model says or sees, you design the system the model operates within.
The model becomes a component — a reasoning engine inside a larger loop. You define the skills it can use, the tools it can call, the verifiers that check its work, and the conditions under which it loops, escalates, or stops.
The shift: you're no longer writing prompts. You're writing programs — but instead of functions and libraries, the primitives are skills, tools, and MCP servers.
What a Harness Actually Looks Like
The core pattern is simple:
Skill executes → produces output → verifier judges output → loop back or advance
Each step takes structured input, runs one or more skills (a model call, a tool, an API), produces output, and hands it to a verifier. The verifier decides: good enough to move forward, or retry with new context?
The orchestrator above it all manages state, tracks history across iterations, and knows when to escalate to a human instead of looping forever.
This isn't metaphorically like a program. It structurally is one — just written in skills and tools instead of code.
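As a sketch, that loop can be written directly as code. Everything here is a hypothetical stand-in — `skill` and `verifier` are whatever model calls, tools, or checks your harness plugs in:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    feedback: str = ""

def run_harness(skill, verifier, task, max_iterations=5):
    """Minimal harness loop: run a skill, verify its output,
    loop back with feedback, or escalate to a human."""
    context = {"task": task, "history": []}
    for i in range(max_iterations):
        output = skill(context)        # model call, tool, or API
        verdict = verifier(output)     # programmatic or model-based check
        context["history"].append(
            {"iteration": i + 1, "output": output, "feedback": verdict.feedback}
        )
        if verdict.passed:
            return output              # good enough: advance
        # otherwise loop back; the verifier's feedback is now in context
    raise RuntimeError("iteration budget exhausted - escalate to a human")
```

The orchestrator described above is the `for` loop plus the history list: it carries state across iterations and enforces the stop condition.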
A Real-World Example: Autonomous Microservice Debugging
Let me share something I actually built — a simple version of this loop in practice.
I was troubleshooting an ECS microservice that kept failing after deployment. The usual process: check the GitHub Actions (GHA) pipeline, look at ECS task status, dig through CloudWatch logs, try a fix, redeploy, repeat. Tedious, manual, and slow — especially when the failure only surfaces after a full deployment cycle.
So I wired up a harness scoped entirely within the microservice itself:
- GitHub MCP — check the GHA pipeline run, read failed step output, create branches, commit fixes
- AWS MCP — inspect the ECS cluster, service status, and task health
- CloudWatch — pull the ECS service logs, filter errors, surface stack traces
The loop looked like this: check GHA → check ECS → read logs → identify the issue → fix the code → commit and push → watch the next deployment → check logs again. Repeat until the service stabilized with no errors.
No manual SSH. No tab-switching between consoles. The harness held the full debug context across iterations — it knew what had already been tried — and kept tightening the loop until it was clean.
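The loop above is roughly this, as a sketch. The `mcp` object here is a hypothetical client wrapping the GitHub, AWS, and CloudWatch MCP calls — the method names are illustrative, not a real API:

```python
def debug_service(mcp, service, max_rounds=10):
    """Single-service debug loop: check pipeline, tasks, and logs;
    fix, redeploy, and repeat until the service stabilizes."""
    debug_context = []  # carried across rounds so fixes aren't repeated
    for round_num in range(1, max_rounds + 1):
        pipeline = mcp.gha_latest_run(service)      # check GHA
        tasks = mcp.ecs_task_health(service)        # check ECS
        errors = mcp.cloudwatch_errors(service)     # read logs
        if pipeline.ok and tasks.healthy and not errors:
            return debug_context                    # stabilized, no errors
        diagnosis = mcp.diagnose(pipeline, tasks, errors, debug_context)
        fix = mcp.apply_fix(diagnosis)              # edit code on the feature branch
        mcp.commit_and_push(fix)
        mcp.wait_for_deployment(service)            # watch the next deploy cycle
        debug_context.append(
            {"round": round_num, "diagnosis": diagnosis, "fix": fix}
        )
    raise TimeoutError("no stable deployment after max_rounds - hand off to a human")
```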
The narrow scope is deliberate. One service, one environment, three MCP servers. But even this simple version saved hours of back-and-forth debugging and eliminated the cognitive load of tracking state across a long troubleshooting session.
This is the entry point for harness engineering in practice. Start with one service, one loop, a few well-defined skills. The pattern scales from there.
The full technical design — including the more complete vision with code edits, multi-service topology, and deployment gates — is in the appendix at the end of this article.
The Hard Problem: AI Doesn't Have the Whole Picture
My single-service harness worked well — within its scope. But that experience made the next problem obvious.
In real production systems, a microservice is never truly isolated. Every service I've worked with has upstream callers, downstream dependencies, and a surrounding ecosystem — PostgreSQL, Redis, SQS, Lambda workers, other microservices — all of which can cause your service to fail even when your service's code is perfectly fine.
I've seen this pattern more times than I can count. The symptoms show up in service A. Everyone debugs service A. Hours later someone notices that service B stopped consuming from the SQS queue two hours ago, which caused service A's queue depth to spike, which caused the memory pressure that looked like a code bug. The root cause was three hops away.
A harness that only knows about one service will do exactly what a junior engineer does: fix symptoms confidently while the real cause sits untouched elsewhere.
The full picture requires the harness to know the topology — what lives upstream and downstream, what ecosystem components exist, and which skill to use to check each one. Before diagnosing anything, it sweeps the entire dependency graph in parallel, accumulates findings from every node, and only then reasons about root cause.
That sweep might involve:
- GitHub and GHA — did the deployment itself introduce the issue?
- ECS tasks across multiple services — is something upstream unhealthy?
- CloudWatch logs across service boundaries — where did errors first appear?
- PostgreSQL — connection pool exhaustion, slow queries, blocking locks
- Redis — memory pressure, eviction policy changes, connection refusals
- SQS — queue depth, dead-letter queue size, consumer lag
- Lambda — throttling, cold start storms, downstream retry cascades
This is what a senior engineer does instinctively when they get paged. They don't open the failing service first — they open a mental map of everything connected to it and start ruling things out. The harness needs the same instinct, but it has to be given the map explicitly.
Building that map, keeping it accurate as the system evolves, and knowing what to include — that's not a technical problem. It's a judgment problem. And it's entirely on the engineer, not the harness.
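Given an explicit topology map (shaped like the one in the appendix), the parallel sweep itself is straightforward. A minimal sketch — the per-skill check functions are assumptions standing in for real MCP calls:

```python
from concurrent.futures import ThreadPoolExecutor

def sweep_topology(topology, skills):
    """Fan health checks out across the whole dependency graph in parallel,
    then return combined findings for root-cause reasoning.
    `skills` maps a skill name (e.g. "postgres") to a check function."""
    nodes = (
        topology["upstream"]
        + [{"name": topology["service"], "skill": "self"}]
        + topology["downstream"]
    )
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = {
            node["name"]: pool.submit(skills[node["skill"]], node)
            for node in nodes
        }
        # Accumulate every node's findings before any diagnosis happens
        return {name: future.result() for name, future in futures.items()}
```

The key design point is that diagnosis only starts after every node has reported — the harness reasons over the whole map, not over the first alarming symptom.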
The Boundary That Must Stay Human
One more constraint that comes directly from experience.
The harness I built operates only in lower environments, on feature branches. It checks GHA, it inspects ECS, it reads logs, it proposes and applies fixes. But it never merges to main. It never touches production. When it's satisfied that the fix holds, it opens a PR with the full debug history attached — and stops.
A human reads it, reviews the diff, and decides whether it goes forward.
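In code, the boundary is simply the absence of a merge call. A sketch of the terminal step — `github` is a hypothetical wrapper over the GitHub MCP, and the entry fields match the debug-context records shown in the appendix:

```python
def finish(github, branch, debug_context):
    """Terminal step of the harness: open a PR carrying the full debug
    history, then stop. Merge and promotion stay with a human."""
    summary = "\n".join(
        f"- iteration {e['iteration']}: {e['diagnosis']} -> {e['result']}"
        for e in debug_context
    )
    pr = github.open_pull_request(
        head=branch,
        base="main",
        title="Harness fix: see attached debug history",
        body=summary,
    )
    # Deliberately no github.merge(pr) here: the harness's authority
    # ends at the PR boundary. A human reviews the diff and decides.
    return pr
```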
This isn't just a safety rule. It reflects something real about where AI judgment currently breaks down. The harness is excellent at iteration within a defined scope — it holds state, tries things systematically, doesn't get tired. But it has no awareness of the things that make a production decision hard: what other teams are deploying this week, whether there's a compliance review pending, what the blast radius looks like at 3am on a Friday, whether the business can absorb a rollback if something goes wrong.
Those calls require context that lives outside the codebase. That context lives with people.
The machine does the iteration. The human makes the promotion decision. That division is the right design — not a temporary limitation to be engineered away.
Coding is Cheap. Engineering is Not.
AI is replacing most of the coding work. It is not replacing the engineering judgment.
Understanding infrastructure as a whole system — how failure propagates, where the real blast radius sits, what the topology actually looks like versus what the documentation says — that knowledge is becoming the scarce resource. Not syntax. Not boilerplate. Not even algorithms.
The engineers who thrive in this era are the ones who can hand a well-designed harness a well-defined problem, watch what it does, and know exactly when its confidence is outrunning its understanding. That's a harder skill to develop than writing code. And it's much harder to automate.
Appendix: Technical Design of the Microservice Debugging Harness
Skills
| Skill | MCP / Tools | Purpose |
|---|---|---|
| GitHub Skill | GitHub MCP | Branch management, PR creation, GHA pipeline monitoring |
| AWS Skill | AWS MCP | ECS cluster, service, and task health verification |
| CloudWatch Skill | AWS MCP | Log retrieval, error filtering, stack trace parsing |
| PostgreSQL Skill | Postgres MCP | Slow query analysis, connection pool status, schema verification |
| SQS Skill | AWS MCP | Queue depth, DLQ size, consumer lag |
| Redis Skill | AWS MCP | Memory usage, eviction rate, connection count |
| Lambda Skill | AWS MCP | Error rate, throttle count, duration, cold starts |
| HTTP Health Skill | HTTP tool | Upstream and downstream service health endpoints |
| Code Reader Skill | GitHub MCP | Fetch source files relevant to the error |
| Code Editor Skill | File edit + GitHub MCP | Apply fix to source code |
| Commit/Push Skill | GitHub MCP | Version the change on feature branch |
| GHA Watcher Skill | GHA MCP | Poll pipeline run, read failure logs |
| Deploy Waiter Skill | AWS MCP | Wait for ECS task stabilization after rollout |
| Load Test Skill | HTTP / Playwright | Trigger load and UI click flows against lower env |
Execution Flow
```
Main Orchestrator (loop until healthy)
│
├── Phase 1: Full topology sweep (parallel)
│   ├── upstream health check
│   ├── self: ECS tasks, CloudWatch errors, GHA deploy status
│   └── downstream: Postgres, Redis, SQS, Lambda, HTTP health
│
├── Reasoning: model reads combined findings, identifies root cause
│   └── decides: infra fix OR code fix
│
├── Action
│   ├── infra fix: AWS Skill → update ECS task definition, env vars
│   └── code fix: read source → edit → commit → push → watch GHA → wait for ECS
│
├── Verify Phase
│   ├── ECS tasks stable?
│   ├── CloudWatch: error rate below threshold?
│   └── Postgres: no blocking queries?
│
├── Test Phase
│   └── Load test + UI test against lower env
│       ├── pass → open PR with debug summary → DONE
│       └── fail → append findings to debug context → loop back
│
└── Escalation condition
    └── if iterations > N → surface findings, open PR, pause for human
```
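The flow above condenses into a short orchestrator sketch. Every `harness.*` method is a hypothetical stand-in for the phase it names:

```python
def orchestrate(harness, max_iterations=5):
    """Condensed orchestrator: sweep, reason, act, verify, test,
    and escalate to a human when the iteration budget runs out."""
    debug_context = []
    for i in range(1, max_iterations + 1):
        findings = harness.topology_sweep()              # Phase 1 (parallel)
        plan = harness.reason(findings, debug_context)   # identify root cause
        if plan["kind"] == "infra":
            harness.apply_infra_fix(plan)                # e.g. task definition
        else:
            harness.apply_code_fix(plan)                 # edit → commit → deploy
        if harness.verify() and harness.run_tests():     # Verify + Test phases
            return harness.open_pr(debug_context)        # pass → PR → DONE
        debug_context.append({"iteration": i, "plan": plan})
    return harness.escalate(debug_context)               # pause for human
```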
Shared Debug Context Object
Each iteration appends a full record so the model never repeats a fix that already failed:
```json
{
  "service": "payments-api",
  "iteration": 3,
  "history": [
    {
      "iteration": 1,
      "diagnosis": "OOMKilled - exit code 137",
      "action": "infra fix - increased ECS memory to 2048",
      "result": "fail - still OOMKilled at 2048"
    },
    {
      "iteration": 2,
      "diagnosis": "memory leak in batch processor",
      "action": "code fix - reduced batch size 1000 → 100",
      "commit": "a3f9c12",
      "gha": "pass",
      "result": "fail - new error: DB connection timeout"
    },
    {
      "iteration": 3,
      "diagnosis": "connection pool exhausted after batch fix",
      "action": "code fix - added pg connection pool limit",
      "commit": "b7e2d45",
      "gha": "pass",
      "result": "pending"
    }
  ]
}
```
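The "never repeat a failed fix" guarantee can be enforced with a simple guard over this record (a sketch, assuming entries shaped like the object above):

```python
def already_tried(debug_context, proposed_action):
    """Return True if this exact action was already attempted and failed,
    so the harness can reject it before wasting a deploy cycle."""
    return any(
        entry["action"] == proposed_action and entry["result"].startswith("fail")
        for entry in debug_context["history"]
    )
```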
Service Topology Map
```json
{
  "service": "payments-api",
  "upstream": [
    {"name": "api-gateway", "type": "http", "skill": "http-health"},
    {"name": "frontend-app", "type": "http", "skill": "http-health"}
  ],
  "downstream": [
    {"name": "postgresql", "type": "db", "skill": "postgres"},
    {"name": "redis", "type": "cache", "skill": "aws-elasticache"},
    {"name": "sqs-payments", "type": "queue", "skill": "aws-sqs"},
    {"name": "lambda-worker", "type": "compute", "skill": "aws-lambda"},
    {"name": "notification-svc", "type": "http", "skill": "http-health"}
  ]
}
```
Human Gates
| Gate | Condition |
|---|---|
| PR review and merge | Always — harness opens PR, human approves |
| Production deployment | Always — human driven |
| DB schema changes | Require explicit approval before harness proceeds |
| Iteration escalation | If harness exceeds N iterations with no progress |
Rex Zhen is a Senior Site Reliability Engineer specializing in Cloud Infrastructure & AI/ML. Follow him on LinkedIn for more on cloud architecture, SRE, and the evolving role of AI in engineering.