Harness Engineering: The Next Evolution of AI Engineering
There's a quiet but significant shift happening in how engineers work with AI. Most people are still talking about prompt engineering. Some have moved on to context engineering. But the frontier right now is something deeper: harness engineering.
And it changes not just how we build software — it changes what skills actually matter.
The Three Eras
Era 1: Prompt Engineering
This is where most engineers started. You craft the right words, the right instructions, the right examples — and the model gives you a better output.
It works. But it's fundamentally a single-turn, stateless interaction. You're still doing all the orchestration in your head.
Era 2: Context Engineering
The next step was realizing the words mattered less than the information. What does the model know when it answers? What docs, history, retrieved data, and memory are in the window?
RAG pipelines, memory systems, and knowledge bases all belong here. You're no longer just crafting prompts — you're curating what the model sees.
Era 3: Harness Engineering
This is the current frontier. Instead of controlling what the model says or sees, you design the system the model operates within.
The model becomes a component — a reasoning engine inside a larger loop. You define the skills it can use, the tools it can call, the verifiers that check its work, and the conditions under which it loops, escalates, or stops.
The shift: you're no longer writing prompts. You're writing programs — but instead of functions and libraries, the primitives are skills, tools, and MCP servers.
What a Harness Actually Looks Like
The core pattern is simple:
Skill executes → produces output → verifier judges output → loop back or advance
Each step takes structured input, runs one or more skills (a model call, a tool, an API), produces output, and hands it to a verifier. The verifier decides: good enough to move forward, or retry with new context?
The orchestrator above it all manages state, tracks history across iterations, and knows when to escalate to a human instead of looping forever.
This isn't metaphorically like a program. It structurally is one — just written in skills and tools instead of code.
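As a sketch, that loop can be written directly as code. Everything here is a hypothetical stand-in — `skill` and `verifier` are whatever model calls, tools, or checks your harness plugs in:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    feedback: str = ""

def run_harness(skill, verifier, task, max_iterations=5):
    """Minimal harness loop: run a skill, verify its output,
    loop back with feedback, or escalate to a human."""
    context = {"task": task, "history": []}
    for i in range(max_iterations):
        output = skill(context)        # model call, tool, or API
        verdict = verifier(output)     # programmatic or model-based check
        context["history"].append(
            {"iteration": i + 1, "output": output, "feedback": verdict.feedback}
        )
        if verdict.passed:
            return output              # good enough: advance
        # otherwise loop back; the verifier's feedback is now in context
    raise RuntimeError("iteration budget exhausted - escalate to a human")
```

The orchestrator described above is the `for` loop plus the history list: it carries state across iterations and enforces the stop condition.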
A Real-World Example: Autonomous Microservice Debugging
Let me share something I actually built — a simple version of this loop in practice.
I was troubleshooting an ECS microservice that kept failing after deployment. The usual process: check the GitHub Actions (GHA) pipeline, look at ECS task status, dig through CloudWatch logs, try a fix, redeploy, repeat. Tedious, manual, and slow — especially when the failure only surfaces after a full deployment cycle.
So I wired up a harness scoped entirely within the microservice itself:
- GitHub MCP — check the GHA pipeline run, read failed step output, create branches, commit fixes
- AWS MCP — inspect the ECS cluster, service status, and task health
- CloudWatch — pull the ECS service logs, filter errors, surface stack traces
The loop looked like this: check GHA → check ECS → read logs → identify the issue → fix the code → commit and push → watch the next deployment → check logs again. Repeat until the service stabilized with no errors.
No manual SSH. No tab-switching between consoles. The harness held the full debug context across iterations — it knew what had already been tried — and kept tightening the loop until it was clean.
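The loop above is roughly this, as a sketch. The `mcp` object here is a hypothetical client wrapping the GitHub, AWS, and CloudWatch MCP calls — the method names are illustrative, not a real API:

```python
def debug_service(mcp, service, max_rounds=10):
    """Single-service debug loop: check pipeline, tasks, and logs;
    fix, redeploy, and repeat until the service stabilizes."""
    debug_context = []  # carried across rounds so fixes aren't repeated
    for round_num in range(1, max_rounds + 1):
        pipeline = mcp.gha_latest_run(service)      # check GHA
        tasks = mcp.ecs_task_health(service)        # check ECS
        errors = mcp.cloudwatch_errors(service)     # read logs
        if pipeline.ok and tasks.healthy and not errors:
            return debug_context                    # stabilized, no errors
        diagnosis = mcp.diagnose(pipeline, tasks, errors, debug_context)
        fix = mcp.apply_fix(diagnosis)              # edit code on the feature branch
        mcp.commit_and_push(fix)
        mcp.wait_for_deployment(service)            # watch the next deploy cycle
        debug_context.append(
            {"round": round_num, "diagnosis": diagnosis, "fix": fix}
        )
    raise TimeoutError("no stable deployment after max_rounds - hand off to a human")
```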
The narrow scope is deliberate. One service, one environment, three MCP servers. But even this simple version saved hours of back-and-forth debugging and eliminated the cognitive load of tracking state across a long troubleshooting session.
This is the entry point for harness engineering in practice. Start with one service, one loop, a few well-defined skills. The pattern scales from there.
The full technical design — including the more complete vision with code edits, multi-service topology, and deployment gates — is in the appendix at the end of this article.
The Hard Problem: AI Doesn't Have the Whole Picture
My single-service harness worked well — within its scope. But that experience made the next problem obvious.
In real production systems, a microservice is never truly isolated. Every service I've worked with has upstream callers, downstream dependencies, and a surrounding ecosystem — PostgreSQL, Redis, SQS, Lambda workers, other microservices — all of which can cause your service to fail even when your service's code is perfectly fine.
I've seen this pattern more times than I can count. The symptoms show up in service A. Everyone debugs service A. Hours later someone notices that service B stopped consuming from the SQS queue two hours ago, which caused service A's queue depth to spike, which caused the memory pressure that looked like a code bug. The root cause was three hops away.
A harness that only knows about one service will do exactly what a junior engineer does: fix symptoms confidently while the real cause sits untouched elsewhere.
The full picture requires the harness to know the topology — what lives upstream and downstream, what ecosystem components exist, and which skill to use to check each one. Before diagnosing anything, it sweeps the entire dependency graph in parallel, accumulates findings from every node, and only then reasons about root cause.
That sweep might involve:
- GitHub and GHA — did the deployment itself introduce the issue?
- ECS tasks across multiple services — is something upstream unhealthy?
- CloudWatch logs across service boundaries — where did errors first appear?
- PostgreSQL — connection pool exhaustion, slow queries, blocking locks
- Redis — memory pressure, eviction policy changes, connection refusals
- SQS — queue depth, dead-letter queue size, consumer lag
- Lambda — throttling, cold start storms, downstream retry cascades
This is what a senior engineer does instinctively when they get paged. They don't open the failing service first — they open a mental map of everything connected to it and start ruling things out. The harness needs the same instinct, but it has to be given the map explicitly.
Building that map, keeping it accurate as the system evolves, and knowing what to include — that's not a technical problem. It's a judgment problem. And it's entirely on the engineer, not the harness.
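Given an explicit topology map (shaped like the one in the appendix), the parallel sweep itself is straightforward. A minimal sketch — the per-skill check functions are assumptions standing in for real MCP calls:

```python
from concurrent.futures import ThreadPoolExecutor

def sweep_topology(topology, skills):
    """Fan health checks out across the whole dependency graph in parallel,
    then return combined findings for root-cause reasoning.
    `skills` maps a skill name (e.g. "postgres") to a check function."""
    nodes = (
        topology["upstream"]
        + [{"name": topology["service"], "skill": "self"}]
        + topology["downstream"]
    )
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = {
            node["name"]: pool.submit(skills[node["skill"]], node)
            for node in nodes
        }
        # Accumulate every node's findings before any diagnosis happens
        return {name: future.result() for name, future in futures.items()}
```

The key design point is that diagnosis only starts after every node has reported — the harness reasons over the whole map, not over the first alarming symptom.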
The Boundary That Must Stay Human
One more constraint that comes directly from experience.
The harness I built operates only in lower environments, on feature branches. It checks GHA, it inspects ECS, it reads logs, it proposes and applies fixes. But it never merges to main. It never touches production. When it's satisfied that the fix holds, it opens a PR with the full debug history attached — and stops.
A human reads it, reviews the diff, and decides whether it goes forward.
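In code, the boundary is simply the absence of a merge call. A sketch of the terminal step — `github` is a hypothetical wrapper over the GitHub MCP, and the entry fields match the debug-context records shown in the appendix:

```python
def finish(github, branch, debug_context):
    """Terminal step of the harness: open a PR carrying the full debug
    history, then stop. Merge and promotion stay with a human."""
    summary = "\n".join(
        f"- iteration {e['iteration']}: {e['diagnosis']} -> {e['result']}"
        for e in debug_context
    )
    pr = github.open_pull_request(
        head=branch,
        base="main",
        title="Harness fix: see attached debug history",
        body=summary,
    )
    # Deliberately no github.merge(pr) here: the harness's authority
    # ends at the PR boundary. A human reviews the diff and decides.
    return pr
```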
This isn't just a safety rule. It reflects something real about where AI judgment currently breaks down. The harness is excellent at iteration within a defined scope — it holds state, tries things systematically, doesn't get tired. But it has no awareness of the things that make a production decision hard: what other teams are deploying this week, whether there's a compliance review pending, what the blast radius looks like at 3am on a Friday, whether the business can absorb a rollback if something goes wrong.
Those calls require context that lives outside the codebase. That context lives with people.
The machine does the iteration. The human makes the promotion decision. That division is the right design — not a temporary limitation to be engineered away.
Coding is Cheap. Engineering is Not.
AI is replacing most of the coding work. It is not replacing the engineering judgment.
Understanding infrastructure as a whole system — how failure propagates, where the real blast radius sits, what the topology actually looks like versus what the documentation says — that knowledge is becoming the scarce resource. Not syntax. Not boilerplate. Not even algorithms.
The engineers who thrive in this era are the ones who can hand a well-designed harness a well-defined problem, watch what it does, and know exactly when its confidence is outrunning its understanding. That's a harder skill to develop than writing code. And it's much harder to automate.
Appendix: Technical Design of the Microservice Debugging Harness
Skills
| Skill | MCP / Tools | Purpose |
|---|---|---|
| GitHub Skill | GitHub MCP | Branch management, PR creation, GHA pipeline monitoring |
| AWS Skill | AWS MCP | ECS cluster, service, and task health verification |
| CloudWatch Skill | AWS MCP | Log retrieval, error filtering, stack trace parsing |
| PostgreSQL Skill | Postgres MCP | Slow query analysis, connection pool status, schema verification |
| SQS Skill | AWS MCP | Queue depth, DLQ size, consumer lag |
| Redis Skill | AWS MCP | Memory usage, eviction rate, connection count |
| Lambda Skill | AWS MCP | Error rate, throttle count, duration, cold starts |
| HTTP Health Skill | HTTP tool | Upstream and downstream service health endpoints |
| Code Reader Skill | GitHub MCP | Fetch source files relevant to the error |
| Code Editor Skill | File edit + GitHub MCP | Apply fix to source code |
| Commit/Push Skill | GitHub MCP | Version the change on feature branch |
| GHA Watcher Skill | GHA MCP | Poll pipeline run, read failure logs |
| Deploy Waiter Skill | AWS MCP | Wait for ECS task stabilization after rollout |
| Load Test Skill | HTTP / Playwright | Trigger load and UI click flows against lower env |
Execution Flow
```
Main Orchestrator (loop until healthy)
│
├── Phase 1: Full topology sweep (parallel)
│   ├── upstream health check
│   ├── self: ECS tasks, CloudWatch errors, GHA deploy status
│   └── downstream: Postgres, Redis, SQS, Lambda, HTTP health
│
├── Reasoning: model reads combined findings, identifies root cause
│   └── decides: infra fix OR code fix
│
├── Action
│   ├── infra fix: AWS Skill → update ECS task definition, env vars
│   └── code fix: read source → edit → commit → push → watch GHA → wait for ECS
│
├── Verify Phase
│   ├── ECS tasks stable?
│   ├── CloudWatch: error rate below threshold?
│   └── Postgres: no blocking queries?
│
├── Test Phase
│   └── Load test + UI test against lower env
│       ├── pass → open PR with debug summary → DONE
│       └── fail → append findings to debug context → loop back
│
└── Escalation condition
    └── if iterations > N → surface findings, open PR, pause for human
```
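The flow above condenses into a short orchestrator sketch. Every `harness.*` method is a hypothetical stand-in for the phase it names:

```python
def orchestrate(harness, max_iterations=5):
    """Condensed orchestrator: sweep, reason, act, verify, test,
    and escalate to a human when the iteration budget runs out."""
    debug_context = []
    for i in range(1, max_iterations + 1):
        findings = harness.topology_sweep()              # Phase 1 (parallel)
        plan = harness.reason(findings, debug_context)   # identify root cause
        if plan["kind"] == "infra":
            harness.apply_infra_fix(plan)                # e.g. task definition
        else:
            harness.apply_code_fix(plan)                 # edit → commit → deploy
        if harness.verify() and harness.run_tests():     # Verify + Test phases
            return harness.open_pr(debug_context)        # pass → PR → DONE
        debug_context.append({"iteration": i, "plan": plan})
    return harness.escalate(debug_context)               # pause for human
```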
Shared Debug Context Object
Each iteration appends a full record so the model never repeats a fix that already failed:
```json
{
  "service": "payments-api",
  "iteration": 3,
  "history": [
    {
      "iteration": 1,
      "diagnosis": "OOMKilled - exit code 137",
      "action": "infra fix - increased ECS memory to 2048",
      "result": "fail - still OOMKilled at 2048"
    },
    {
      "iteration": 2,
      "diagnosis": "memory leak in batch processor",
      "action": "code fix - reduced batch size 1000 → 100",
      "commit": "a3f9c12",
      "gha": "pass",
      "result": "fail - new error: DB connection timeout"
    },
    {
      "iteration": 3,
      "diagnosis": "connection pool exhausted after batch fix",
      "action": "code fix - added pg connection pool limit",
      "commit": "b7e2d45",
      "gha": "pass",
      "result": "pending"
    }
  ]
}
```
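The "never repeat a failed fix" guarantee can be enforced with a simple guard over this record (a sketch, assuming entries shaped like the object above):

```python
def already_tried(debug_context, proposed_action):
    """Return True if this exact action was already attempted and failed,
    so the harness can reject it before wasting a deploy cycle."""
    return any(
        entry["action"] == proposed_action and entry["result"].startswith("fail")
        for entry in debug_context["history"]
    )
```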
Service Topology Map
```json
{
  "service": "payments-api",
  "upstream": [
    {"name": "api-gateway", "type": "http", "skill": "http-health"},
    {"name": "frontend-app", "type": "http", "skill": "http-health"}
  ],
  "downstream": [
    {"name": "postgresql", "type": "db", "skill": "postgres"},
    {"name": "redis", "type": "cache", "skill": "aws-elasticache"},
    {"name": "sqs-payments", "type": "queue", "skill": "aws-sqs"},
    {"name": "lambda-worker", "type": "compute", "skill": "aws-lambda"},
    {"name": "notification-svc", "type": "http", "skill": "http-health"}
  ]
}
```
Human Gates
| Gate | Condition |
|---|---|
| PR review and merge | Always — harness opens PR, human approves |
| Production deployment | Always — human driven |
| DB schema changes | Require explicit approval before harness proceeds |
| Iteration escalation | If harness exceeds N iterations with no progress |
Rex Zhen is a Senior Site Reliability Engineer specializing in Cloud Infrastructure & AI/ML. Follow him on LinkedIn for more on cloud architecture, SRE, and the evolving role of AI in engineering.