DEV Community

Mateen Anjum

Stop Building CI Pipelines For Humans. Your AI Agents Need A Harness.

TL;DR: Your CI pipeline was designed for a human reading red text on GitHub. AI agents need a verification harness: deterministic infra, ephemeral preview environments, OPA blast-radius limits, replay traffic, and a machine-readable verdict. Here is the one I shipped, with code.

A few weeks ago I let a coding agent loose on a real platform team's repo. Terraform, EKS, around 40 microservices. The agent was good. It opened clean PRs, the diffs looked fine, the tests it added were reasonable. By the end of the week it had merged six PRs and we'd rolled back four of them.

The model wasn't the problem. The problem was that everything around the model assumed a human would catch the bad ones. The CI gates were written for someone who could squint at a Grafana panel, remember last Tuesday's outage, and feel uneasy. The agent has no scar tissue. Someone on r/devops put it perfectly back in March: "LLMs optimize for resolve the immediate error without understanding blast radius. A human would've paused after the first networking change went sideways. The agent doesn't have that instinct."

The fix isn't a smarter model. It's what people are starting to call an agent harness: the runtime layer wrapped around the model that gives it deterministic infra to play in, hard limits on what it can break, and a structured signal telling it whether the change worked. The term itself only hit mainstream usage in early 2026, per a recent industry write-up, and most teams I talk to haven't built one yet.

Here is the harness I ended up shipping. It costs roughly $180/month per agent slot on AWS, takes about a day to wire up if you already have Terraform and a GitOps controller, and it has cut bad-merge rollbacks from four a week to zero in the last 17 days.

Five Failure Modes That Hurt Every Time

Same five things keep biting teams. They sound obvious individually. Together they make agents look incompetent.

  1. Flaky preview environments. Same PR, two runs, different results. The agent's last change "worked" because Redis happened to come up first. Next run it doesn't.
  2. No rollback signal. Agent merges. Prod p99 quietly drifts from 180ms to 410ms. Nothing alerts because nothing watches the right thing in a way the agent can read.
  3. Non-deterministic Terraform. Plan looked clean. Apply diverged because a data source resolved differently in the second run. Common with aws_ami lookups, IAM role ARNs, and anything pulling from the registry.
  4. No blast-radius limit. Agent decides the cleanest fix is to delete the VPC. Technically it has permission, because the CI role is admin. Yes, this happened.
  5. No agent-readable test reports. The Cypress run failed. The reason is buried in 4MB of stdout with ANSI color codes. The agent reads 200 lines, gives up, says "tests pass" in the PR comment.

Northflank wrote up the broader category in their March 2026 piece on ephemeral execution environments for AI agents and most of it tracks. The interesting bit is the gap between "we run agent code in a sandbox" and "the sandbox actually verifies the change."

The Harness

Six components. None of them are new on their own. The trick is wiring them so the agent gets a verdict, not a wall of logs.

1. Lock Terraform Until The Plan Is Reproducible

Every drift complaint I have ever heard starts with a non-pinned provider or an implicit data source. Fix it once:

terraform {
  required_version = "= 1.9.8"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "= 5.74.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "= 2.33.0"
    }
  }

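  backend "s3" {
    bucket = "agent-harness-tfstate"
    # Backend blocks cannot interpolate terraform.workspace. The S3 backend
    # already stores each non-default workspace under
    # <workspace_key_prefix>/<workspace>/<key>.
    key                  = "terraform.tfstate"
    workspace_key_prefix = "preview"
    region               = "us-east-1"
    dynamodb_table       = "agent-harness-locks"
    encrypt              = true
  }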
}

# Pin the AMI by ID so the lookup resolves to the same image on every plan.
data "aws_ami" "eks_node" {
  most_recent = false
  owners      = ["602401143452"]

  filter {
    name   = "image-id"
    values = ["ami-0c2f3d8a17b7d4f91"]
  }
}

The = 1.9.8 style is exact-pin, not ~> 1.9. Agents try to "fix" version constraints; they shouldn't. Run terraform plan -refresh=false -lock-timeout=120s in the harness so a stale data source can't sneak in.
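One way to check that a plan really is reproducible is to run it twice and compare a fingerprint of the two results. The following is a minimal Python sketch, assuming each plan has been exported with terraform show -json. The function names and the choice of fields to hash are illustrative assumptions; resource_changes, address, change.actions, and change.after are fields of Terraform's JSON plan format.

```python
import hashlib
import json


def plan_fingerprint(plan: dict) -> str:
    """Hash the parts of a Terraform JSON plan that describe what will change."""
    changes = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if change.get("actions") == ["no-op"]:
            continue  # untouched resources do not affect the fingerprint
        changes.append({
            "address": rc.get("address"),
            "actions": change.get("actions"),
            "after": change.get("after"),
        })
    # Sort by address so ordering differences between runs do not matter.
    changes.sort(key=lambda c: c["address"] or "")
    payload = json.dumps(changes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plans_match(first: dict, second: dict) -> bool:
    """True when two plan runs would apply the same set of changes."""
    return plan_fingerprint(first) == plan_fingerprint(second)
```

If plans_match returns False for two runs of the same commit, something in the configuration is still resolving at plan time, and the harness can fail the PR before anything is applied.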

I also run every preview in a workspace named after the PR number, so state is isolated and teardown is a terraform destroy followed by terraform workspace delete pr-1247.

2. Give The Agent Its Own Ephemeral EKS Namespace, Not Its Own Cluster

Spinning up a fresh EKS cluster per PR is what some Northflank docs suggest. In practice it takes 12 to 15 minutes and burns $0.10/hour just for the control plane. For agent workflows where you want a verdict in under 4 minutes, namespace-per-PR on a warm cluster wins.

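# kustomize/preview/kustomization.yaml
# Note: kustomize does not expand ${PR_NUMBER} itself; render this file
# with envsubst (or similar) before running kustomize build.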
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: pr-${PR_NUMBER}

resources:
  - ../base

patches:
  - target:
      kind: Deployment
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 1
      - op: add
        path: /metadata/labels/preview
        value: "true"
      - op: add
        path: /spec/template/spec/priorityClassName
        value: preview-low

commonAnnotations:
  agent-harness/ttl: "3600"
  agent-harness/pr: "${PR_NUMBER}"

A small janitor controller deletes namespaces older than the TTL annotation. Costs me about $14/month for a 5-node m6i.large pool that holds 30 concurrent preview namespaces.
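The core decision inside a janitor like that is small. Here is a minimal Python sketch of it, assuming the namespaces arrive as dicts shaped like Kubernetes Namespace objects (for example from kubectl get ns -o json) and that the Namespace object itself carries the agent-harness/ttl annotation, with the TTL in seconds. The function name is hypothetical, and the actual delete call is left out.

```python
from datetime import datetime, timedelta, timezone

TTL_ANNOTATION = "agent-harness/ttl"


def expired_namespaces(namespaces: list[dict], now: datetime) -> list[str]:
    """Return names of preview namespaces whose TTL (in seconds) has elapsed."""
    expired = []
    for ns in namespaces:
        meta = ns.get("metadata", {})
        ttl = (meta.get("annotations") or {}).get(TTL_ANNOTATION)
        if ttl is None:
            continue  # never touch namespaces the harness did not annotate
        # Kubernetes serializes creationTimestamp as RFC 3339 in UTC.
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > timedelta(seconds=int(ttl)):
            expired.append(meta["name"])
    return expired
```

Skipping anything without the annotation is the important design choice: the janitor can only ever delete namespaces the harness created.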

3. Put Hard Limits On What The Agent Can Change Via OPA

This is the one nobody wants to write and everybody needs. The agent is going to try to widen its own permissions. Block it at the policy layer, not the IAM layer, because the IAM layer is too coarse:

package terraform.blastradius

import future.keywords.in

deny[msg] {
  some change in input.resource_changes
  change.type == "aws_iam_role"
  change.change.actions[_] != "no-op"
  msg := sprintf("agent cannot modify IAM roles: %v", [change.address])
}

deny[msg] {
  some change in input.resource_changes
  change.type in {"aws_vpc", "aws_subnet", "aws_route_table"}
  "delete" in change.change.actions
  msg := sprintf("agent cannot delete network primitives: %v", [change.address])
}

deny[msg] {
  some change in input.resource_changes
  change.type == "aws_security_group_rule"
  rule := change.change.after
  rule.cidr_blocks[_] == "0.0.0.0/0"
  rule.from_port <= 22
  rule.to_port >= 22
  msg := "agent cannot open SSH to the world"
}

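# Cap the number of resources actually changed in a single plan
# (resource_changes also lists no-op entries, so filter those out first)
deny[msg] {
  touched := [c | some c in input.resource_changes; c.change.actions != ["no-op"]]
  count(touched) > 50
  msg := sprintf("plan touches %d resources, max 50", [count(touched)])
}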

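Run it with conftest test plan.json -p policies/ --namespace terraform.blastradius. Conftest only evaluates the main package by default, so without the flag these rules are silently skipped. The conftest exit code becomes the PR check. Total cost: about 80 lines of Rego I wrote on a Sunday morning.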
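The exit code is enough for the PR check, but the agent also needs to know which rule fired. A minimal Python sketch of that translation follows, assuming conftest was run with --output json and that its output is a list of per-file results, each with a failures list of objects carrying a msg field. The function name and the violations key are hypothetical.

```python
def opa_check(conftest_results: list[dict], plan: dict) -> dict:
    """Collapse conftest JSON results into one verdict check entry."""
    messages = [
        failure.get("msg", "")
        for result in conftest_results
        for failure in result.get("failures") or []
    ]
    # Count only resources the plan would actually change.
    touched = [
        rc for rc in plan.get("resource_changes", [])
        if rc.get("change", {}).get("actions") != ["no-op"]
    ]
    check = {
        "name": "opa.blast_radius",
        "status": "fail" if messages else "pass",
        "resources_changed": len(touched),
    }
    if messages:
        # Hypothetical field: the deny messages, verbatim, for the agent to read.
        check["violations"] = messages
    return check
```

Passing the deny messages through unchanged means the agent sees the exact resource address the policy objected to.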

4. Argo Rollouts For An Automatic Rollback Signal The Agent Can Read

Argo Rollouts has analysis templates that compare canary metrics to the stable baseline. The output is structured. That is the whole point.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: preview-slo-gate
spec:
  args:
    - name: service
  metrics:
    - name: error-rate
      interval: 30s
      count: 6
      # Prometheus returns a vector, so index the first element.
      successCondition: result[0] < 0.001
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service}}",
              status=~"5..",
              preview="true"
            }[1m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service}}",
              preview="true"
            }[1m]))
    - name: p99-latency
      interval: 30s
      count: 6
      successCondition: result[0] < 0.300
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service}}",
                preview="true"
              }[1m])) by (le)
            )

When this template fails, Argo records which metric failed and the measured value in the status of the resulting AnalysisRun, and aborts the Rollout. My harness scrapes that status and turns it into the structured verdict the agent reads next.

5. Replay Real Traffic, Not Synthetic Probes

The lie every preview environment tells is that a curl /health loop is "verification." It isn't. Mirror 1 to 5 percent of real prod traffic to the preview namespace. GoReplay is the path of least resistance:

# On a prod ingress node, sample 2% and shadow to preview
# (GoReplay takes the sampling rate as a "|2%" suffix on the output URL,
# and --http-allow-method is passed once per method)
gor --input-raw :443 \
    --input-raw-track-response \
    --output-http "https://preview-pr-1247.harness.internal|2%" \
    --output-http-tracking-headers \
    --http-allow-method GET \
    --http-allow-method POST \
    --http-disallow-url "/admin|/internal" \
    --output-http-stats

PII rule: never replay request bodies for auth endpoints, never replay anything carrying card data. The --http-disallow-url flag is the line you do not skip. I add a second filter in a small Go pre-processor that strips Authorization, Cookie, and any header matching *-token.
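The header filter is only a few lines. The following minimal sketch shows the same idea in Python rather than Go; the function and constant names are hypothetical, and matching is done on lower-cased header names because HTTP header names are case-insensitive.

```python
from fnmatch import fnmatchcase

BLOCKED_HEADERS = {"authorization", "cookie"}
BLOCKED_PATTERNS = ["*-token"]


def scrub_headers(headers: dict[str, str]) -> dict[str, str]:
    """Drop credential-bearing headers before a request is replayed."""
    clean = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered in BLOCKED_HEADERS:
            continue
        if any(fnmatchcase(lowered, pattern) for pattern in BLOCKED_PATTERNS):
            continue
        clean[name] = value
    return clean
```

A request carrying Authorization, Cookie, and X-Api-Token comes out with all three removed, while a header such as X-Token-Count survives because the pattern only matches names that end in -token.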

Five minutes of shadowed prod traffic against the preview surfaces the kind of bug that synthetic tests will never find: a corner case where the agent's "optimization" doubled DB calls for users with more than 50 saved items. We caught that on PR 1183.
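To turn shadowed traffic into a signal, the harness has to compare what prod and preview returned for the same endpoints. Below is a minimal Python sketch of one such comparison, assuming the replay tooling can produce (path, status_code) pairs for each side. The function names, the tolerance default, and the output fields are illustrative assumptions.

```python
from collections import defaultdict


def error_rates(samples):
    """Map each path to its share of 5xx responses."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for path, status in samples:
        totals[path] += 1
        if status >= 500:
            errors[path] += 1
    return {path: errors[path] / totals[path] for path in totals}


def divergent_paths(prod, preview, tolerance=0.01):
    """List paths whose 5xx rate in preview exceeds prod by more than tolerance."""
    prod_rates = error_rates(prod)
    preview_rates = error_rates(preview)
    findings = []
    for path, rate in sorted(preview_rates.items()):
        baseline = prod_rates.get(path, 0.0)
        if rate - baseline > tolerance:
            findings.append({"path": path, "preview_5xx": rate, "prod_5xx": baseline})
    return findings
```

This only compares status codes. Catching a regression like doubled DB calls would need a second comparison on latency or query counts, which is left out of the sketch.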

6. Write An Agent-Readable Verdict, Not A Log Tail

The whole loop is wasted if the agent can't parse the result. Generate a JSON file and stash it in S3 with a stable key the agent can fetch from a tool call:

{
  "pr": 1247,
  "verdict": "fail",
  "duration_seconds": 218,
  "checks": [
    {
      "name": "opa.blast_radius",
      "status": "pass",
      "resources_changed": 11
    },
    {
      "name": "terraform.plan",
      "status": "pass",
      "drift": false
    },
    {
      "name": "argo.error_rate",
      "status": "fail",
      "value": 0.0043,
      "threshold": 0.001,
      "trace_url": "https://tempo.harness.internal/trace/abc123"
    },
    {
      "name": "argo.p99_latency_seconds",
      "status": "pass",
      "value": 0.187,
      "threshold": 0.300
    },
    {
      "name": "replay.divergence",
      "status": "fail",
      "diffs_url": "s3://harness-verdicts/pr-1247/replay-diff.json",
      "notes": "23% of /api/items requests returned 500 in preview, 0% in prod baseline"
    }
  ]
}

The agent reads this and knows exactly which step to fix. No log scraping. No ANSI codes. No vibes-based debugging.
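Assembling and reading the verdict takes little code. Here is a minimal Python sketch, assuming each check is a dict with at least name and status, as in the JSON above. The function names are hypothetical.

```python
def build_verdict(pr: int, duration_seconds: int, checks: list[dict]) -> dict:
    """Combine individual checks into one verdict document."""
    failed = [c for c in checks if c.get("status") != "pass"]
    return {
        "pr": pr,
        # No checks at all means no evidence, which is not a pass.
        "verdict": "pass" if checks and not failed else "fail",
        "duration_seconds": duration_seconds,
        "checks": checks,
    }


def failing_checks(verdict: dict) -> list[dict]:
    """Return only the checks the agent needs to act on."""
    return [c for c in verdict.get("checks", []) if c.get("status") != "pass"]
```

On the agent side, a tool call can return failing_checks(verdict) directly, so the model sees two or three small objects instead of the whole run.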

Results From The First 17 Days

| Metric | Before harness | After harness |
| --- | --- | --- |
| Bad merges/week | 4 | 0 |
| Agent PRs reviewed by humans | 100% | 23% |
| Mean time from PR open to verdict | 38 min | 3 min 41 sec |
| Cost per agent PR run | $0.42 | $0.11 |
| Engineer time on agent oversight | 6 hours/week | 45 minutes/week |

The cost drop is mostly from killing per-PR EKS clusters and using a warm shared pool with namespace isolation. The time drop is the OPA gate failing fast on the 30 percent of plans that were going to be rejected anyway.

What I'd Do Differently

A few mistakes that took longer than they should have:

Don't let the agent edit OPA policies. I learned this on day three. The agent will helpfully "fix the failing policy" by deleting the rule. Put policies in a separate repo with branch protection or, simpler, add policies/ to CODEOWNERS so that any change there requires a human review.
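The harness itself can also refuse early, before review is even requested. A minimal Python sketch of that guard follows, assuming the harness already has the list of changed file paths for the PR (for example from git diff --name-only). The function name and the PROTECTED_DIRS tuple are hypothetical.

```python
from pathlib import PurePosixPath

# Top-level directories an agent PR must never modify (assumed layout).
PROTECTED_DIRS = ("policies",)


def protected_changes(changed_files: list[str]) -> list[str]:
    """Return the changed paths that fall under a protected directory."""
    hits = []
    for changed in changed_files:
        parts = PurePosixPath(changed).parts
        if parts and parts[0] in PROTECTED_DIRS:
            hits.append(changed)
    return hits
```

A non-empty result can be written into the verdict as its own failing check, which tells the agent plainly that the policy is not its problem to fix.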

Trace IDs in the verdict, not in the logs. I had it returning a logs_url for two weeks. The agent never opened it. Switched to embedding the top three trace IDs with a one-line summary each, and suddenly fix quality went up.

Replay only GETs for the first month. I tried POST replay early and corrupted preview DBs three times. Get the read path verifiably working, then add writes with a request rewriter that targets a synthetic-tenant ID.
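What such a rewriter looks like depends entirely on how the service models tenancy. The following minimal Python sketch assumes JSON request bodies with a tenant_id field and an X-Tenant-Id header; both names, the SYNTHETIC_TENANT value, and the function name are hypothetical.

```python
import json

SYNTHETIC_TENANT = "synthetic-preview"  # hypothetical tenant reserved for replay
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def rewrite_for_replay(method: str, headers: dict, body: bytes):
    """Return (headers, body) to replay, or None if the request must be dropped."""
    if method.upper() in SAFE_METHODS:
        return headers, body  # reads pass through untouched
    try:
        payload = json.loads(body)
    except ValueError:
        return None  # a write we cannot parse is a write we cannot make safe
    if not isinstance(payload, dict):
        return None
    payload["tenant_id"] = SYNTHETIC_TENANT
    new_headers = dict(headers)
    new_headers["X-Tenant-Id"] = SYNTHETIC_TENANT
    return new_headers, json.dumps(payload).encode("utf-8")
```

Dropping any write that cannot be parsed is the safe default. A real rewriter would also need to recompute Content-Length after changing the body.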

The harness is the product, the agent is interchangeable. I started with one model and swapped to another after two weeks, and the results were almost identical. The harness does the work. Pick whichever model is cheapest this quarter.

Try It Yourself

The full Terraform module, OPA policies, kustomize overlays, and the verdict-builder Lambda are at github.com/mateenali66/agent-harness (going public next week, ping me if you want early access).

Closing thought. Harness, the CI/CD vendor, named their product before the term "agent harness" existed. That collision is going to confuse people for the next year. The concept is bigger than any vendor. If you are letting AI agents touch production infra without a verification layer that returns structured verdicts, you are running an open-loop control system and hoping the model is calibrated. It isn't.

Build the harness. Then let the agents work.

