Eyal Bukchin for MetalBear

Posted on Jun 11 • Originally published at metalbear.com

Auto-verifying your AI-SRE's fixes against your real cluster, with mirrord

#sre #devops #ai #kubernetes

Part I of two. Part II will be published next week, and will show the same loop end-to-end against HolmesGPT on a real cluster.

You're rolling out an AI-SRE (your own homegrown in-house AI loop, or one of the new products: Resolve AI, incident.io's Investigator, Datadog Bits). When an alert fires, it proposes a fix: a timeout here, a retry there, a config bump. The suggestion looks plausible. Do you merge it?

Today, you read the diff, eyeball it against your mental model of the service, and decide, or you deploy to staging and wait. The bugs an AI-SRE gets paged on (an unhandled exception under a rare request shape, a downstream that times out for some requests, a webhook that races with itself, a slow query that only surfaces at scale) live in the gap between local and production. Unit tests and mocks don't reach there. Staging does, but staging is slow (a deploy on every iteration) and shared (you're competing with everyone else trying to reproduce their own bugs).

This is where mirrord can help. mirrord runs a separate copy of your service as if it were a pod in your cluster, wired into the same real downstreams and upstreams without a deploy. This guide shows how to wire that into your AI-SRE so every suggested fix is verified against the real cluster before it reaches a human: automatically, within seconds, and with no SDK or code to import.

What's an AI-SRE?

A new category of tools that take the first investigative pass when an alert fires, before the on-call human is even awake. The AI-SRE pulls together the context a human would have spent the first ten minutes assembling (recent deploys, related logs, metric anomalies, runbooks, similar past incidents) and proposes a remediation. Most just produce suggestions; a human is still the one acting on them. A few are starting to auto-execute low-risk actions (pod restarts, config flag flips, even small auto-merged code patches).

AI-SREs come in three flavors: standalone (Resolve AI, NeuBird, Cleric), AI bolted onto existing incident or observability tools (incident.io's Investigator, Datadog Bits, Rootly's Copilot, etc.), and homegrown LLM loops, which are the most common shape.

Below are some examples of what an AI-SRE actually outputs. Every one of these is a fix that can't be verified without real cluster state, which is why this guide exists:

| Alert | AI-SRE's suggestion | What verification has to prove |
| --- | --- | --- |
| Spike of `KeyError: 'address'` in checkout handler | Default the field to `None` and guard the deref | The previously-failing requests now return 2xx, no new errors introduced |
| Checkout p99 latency above SLO | Add `timeout=300ms` to the pricing-service call | The slow outliers get cut off; p99 actually drops |
| 5xx burst after deploy | Roll back commit `abc123`, or guard the new code path with a feature flag | The error rate stops climbing on real production traffic |
| Webhook handler creating duplicate orders | Check idempotency key in Redis before insert | A retried webhook produces one row, not two |
| Kafka consumer lag growing | Bump `max.poll.records`, add a backoff on a flaky downstream | Lag actually drains, downstream stays healthy |

None of these can be verified with unit tests or mocks.

Meet `mirrord exec`

mirrord exec runs a local process as if it were a pod in your cluster. Its outbound calls resolve through the target pod's network context, so when your patched code calls payments or inventory or your database, it hits the real in-cluster service. mirrord also routes incoming traffic to the local process when you ask it to. Environment variables, mounted ConfigMaps and Secrets, and the target's filesystem are inherited by default. This is what allows your AI-SRE to test the code fix it came up with instantly and without having to deploy to staging or production. The deployed pod is never touched.

Where mirrord fits

What your AI-SRE already owns

This guide assumes three things about your AI-SRE: it (a) ingests an alert, (b) produces a candidate fix (a code diff, or a recommendation that needs a small bridge step to actually turn it into code, covered in Part II), and (c) has some check for whether the bug is fixed (a repro script, a load test, the alerting query, or a known-good test case).

mirrord doesn't replace any of these. It runs your patched service under mirrord exec, so (c) can hit the real cluster, without a deploy.

How to set it up

Prerequisites

Two things to install:

The mirrord operator in your cluster (install docs), one helm command.
The mirrord CLI wherever the AI-SRE will run the verification step: typically the AI-SRE's own worker, or a sidecar container alongside it (install docs).

Automated verification with `mirrord exec`

When the AI-SRE produces a candidate fix, run a short verification block from inside its own hook:

Check out a working copy of the service code (git clone)
Apply the AI-SRE's patch to it (git apply patch.diff, or whatever your AI-SRE delivers)
Run that patched copy under mirrord exec.
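The first two steps can be sketched as a small shell helper. This is a minimal sketch, not part of mirrord: the function name `apply_candidate_patch` and its arguments are hypothetical, and it assumes the AI-SRE delivers a unified diff that `git apply` understands. It dry-runs the patch with `git apply --check` first, so a diff that doesn't apply cleanly is rejected before anything is started under mirrord.

```shell
# Hypothetical helper: apply the AI-SRE's diff to a checked-out working copy.
#   $1  path to the working copy (e.g. the directory produced by git clone)
#   $2  absolute path to the patch file (git -C changes directory first)
apply_candidate_patch() {
  workdir=$1
  patch_file=$2
  # Dry run: fail early, leaving the tree untouched, if the diff does not apply.
  git -C "$workdir" apply --check "$patch_file" || return 1
  # Real run: modify the working copy in place.
  git -C "$workdir" apply "$patch_file"
}
```

A non-zero exit here is already a useful signal to send back to the AI-SRE: the suggestion was written against code that doesn't match the current checkout.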

mirrord makes the local process behave as if it were the target pod (same network identity, same downstreams, same env), but the deployed service is untouched.

Run it twice, once with the patch, once without. Both runs see the same live downstreams and upstreams, so the only variable is the patch.

You'll want mirrord in steal mode with a header filter, so the verifier diverts only its own tagged requests to the candidate and leaves production traffic alone. The mirrord config (write once, reuse per run):

// .mirrord/verifier.json
{
  "feature": {
    "network": {
      "incoming": {
        "mode": "steal",
        "http_filter": {
          "header_filter": "^X-Verifier-Run: {{ key }}$",
          "ports": [8080]
        }
      }
    }
  }
}

{{ key }} is a mirrord template variable that resolves to whatever you pass via mirrord exec --key, so each verification run can tag itself uniquely without templating the config file.
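To make the filter's effect concrete, here is a local approximation of what the rendered pattern accepts. It is only an illustration under stated assumptions: `render_filter` and `header_is_stolen` are hypothetical names, and `grep -E` stands in for whatever regex engine mirrord uses, so edge cases such as case sensitivity may differ. It also assumes the key has no regex metacharacters, which holds for a UUID.

```shell
# Render the header_filter pattern for one run, mimicking the {{ key }} substitution.
render_filter() {
  printf '^X-Verifier-Run: %s$' "$1"
}

# Exit 0 when a "Name: value" header line matches the rendered filter for this run.
#   $1  the run key (e.g. the output of uuidgen)
#   $2  the header line to test
header_is_stolen() {
  run_id=$1
  header_line=$2
  printf '%s\n' "$header_line" | grep -Eq "$(render_filter "$run_id")"
}
```

Because the pattern is anchored at both ends, a request tagged with a different run's key, or with no tag at all, doesn't match and keeps going to the deployed pods.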

If your check is a repro script that exits non-zero when the bug is present:

RUN_ID=$(uuidgen)

# Start the candidate; mirrord will steal only requests tagged with $RUN_ID.
mirrord exec --config-file .mirrord/verifier.json --key "$RUN_ID" \
  --target deploy/<your-service> -- ./service/run-server.sh &

# Drive the repro at the real service URL, tagged with the header.
# Other traffic to <your-service> keeps hitting the deployed pods.
./repro.sh https://<your-service>.svc.cluster.local --header "X-Verifier-Run: $RUN_ID"
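One thing the block above leaves open is timing: the candidate starts in the background, so the repro can fire before the local server is listening, and those requests would most likely be served by the deployed pods instead of the candidate. A minimal sketch of a guard, assuming a POSIX shell. `wait_until` is a hypothetical helper, and the port and `/healthz` path in the usage comment are placeholders for whatever readiness signal your service exposes.

```shell
# Poll a command once per second until it succeeds or the timeout runs out.
#   $1      timeout in seconds
#   $2...   the command to poll
wait_until() {
  timeout_s=$1
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Usage sketch (not executed here; the health endpoint is an assumption):
#   mirrord exec ... -- ./service/run-server.sh &
#   candidate_pid=$!
#   trap 'kill "$candidate_pid" 2>/dev/null' EXIT   # stop the candidate when done
#   wait_until 30 curl -fsS http://localhost:8080/healthz || exit 1
#   ./repro.sh ...
```

The `trap` line in the usage sketch also covers cleanup, so a finished or failed verification run doesn't leave a candidate process behind.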

If your check is metric-based (latency or error rate, the common case for AI-SRE alerts), the shape is the same; swap the repro for a load test:

RUN_ID=$(uuidgen)

mirrord exec --config-file .mirrord/verifier.json --key "$RUN_ID" \
  --target deploy/<your-service> -- ./service/run-server.sh &

./loadtest.sh https://<your-service>.svc.cluster.local --header "X-Verifier-Run: $RUN_ID" > patched-metrics.json

Run that block twice: once on the unpatched code, redirecting the output to baseline-metrics.json, and once on the patched code, writing patched-metrics.json. Diff the two files and you have a verdict. The comparison is shaped by the alert: is the condition that fired the alert still true on the patched run? Add a sanity check that the other signals didn't regress past tolerance. For an error_rate > 5% alert, check that the patched error rate is now under 5% and that p50/p99 didn't grow past your tolerance. For a p99 > 300ms alert, check that p99 is now under 300ms and that the error rate didn't climb. (If your AI-SRE outputs a written report rather than ready-to-apply diffs, you'll also want a small bridge step; Part II walks through one.)
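The comparison itself can be sketched in a few lines. This is an illustration, not a prescribed format: `verdict` is a hypothetical helper that takes the numbers as arguments, because the layout of the metrics files depends on your load-test tool. Pulling the values out of baseline-metrics.json and patched-metrics.json is left to whatever JSON tooling you already use.

```shell
# Print PASS when the patched run clears the alert threshold and the guard
# metric has not regressed past tolerance; print REJECT otherwise.
#   $1  patched value of the alerting metric (e.g. p99 in ms)
#   $2  alert threshold for that metric (e.g. 300)
#   $3  baseline value of the guard metric (e.g. error rate in %)
#   $4  patched value of the guard metric
#   $5  allowed absolute regression of the guard metric
verdict() {
  # awk handles the floating-point comparison that plain sh arithmetic cannot.
  awk -v m="$1" -v slo="$2" -v gb="$3" -v gp="$4" -v tol="$5" 'BEGIN {
    if (m < slo && gp <= gb + tol) print "PASS"; else print "REJECT"
  }'
}
```

For a p99 > 300ms alert, `verdict 240 300 0.5 0.6 0.5` passes, while a run that brings p99 down but pushes the error rate from 0.5% to 4% is rejected.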

An immediate verdict, against your real cluster. The verifier can also reject. Sometimes the AI's fix moves the metric the alert fires on without moving it enough to clear the SLO, or improves one signal and regresses another. You want to find that out here, not in production.

Wiring it into your AI-SRE

The integration is light. Most AI-SREs already run a hook when they finalize a suggestion (the part that posts the PR comment or the Slack message). In that hook:

Apply the generated patch to a checked-out copy of the service.
Run the verification block above. If it doesn't pass, post the comparison back to the AI-SRE so it can try again (per the diagram). The on-call human keeps working the original alert; they just don't review a fix that didn't survive verification.
If it passes, post the verdict and the diff. A human (or your auto-merge policy) takes it from there.

One caveat on the REJECT-and-retry loop. Today's off-the-shelf AI-SREs (HolmesGPT, Resolve AI, incident.io's Investigator) are single-shot: they produce one suggestion and don't iterate on a failed verdict. To actually close the loop, put the retry in the small wrapper that turns the AI-SRE's recommendation into a code patch (covered in Part II): feed it the failed run's verdict and ask for another approach. Homegrown AI-SREs can put the retry wherever it fits best.
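The retry wrapper could look roughly like this. It is a sketch under assumptions: `propose_patch` and `verify_patch` are hypothetical hooks you would implement yourself (the first asks the AI-SRE or bridge step for a candidate, the second runs the verification block), and `verdict.txt` is an invented file name for the failed run's comparison.

```shell
# Hypothetical hooks this loop expects you to define elsewhere:
#   propose_patch <attempt> <feedback-file>   produce a candidate patch
#   verify_patch                              exit 0 on PASS, non-zero on REJECT
verify_with_retries() {
  max_attempts=$1
  attempt=1
  feedback=/dev/null              # no feedback exists before the first attempt
  while [ "$attempt" -le "$max_attempts" ]; do
    propose_patch "$attempt" "$feedback" || return 2
    if verify_patch; then
      echo "PASS on attempt $attempt"
      return 0
    fi
    feedback=verdict.txt          # hand the failed verdict to the next proposal
    attempt=$((attempt + 1))
  done
  echo "REJECT after $max_attempts attempts"
  return 1
}
```

Capping the attempts keeps a stubborn incident from looping forever; once the cap is hit, the alert stays with the on-call human as before.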

In a noisy incident where your AI proposes ten patches, you spend ten quick verification runs and only the survivors reach a human.

Bonus: hand humans a live preview (mirrord Enterprise)

When verification passes and a human still wants to click around before merging, mirrord preview environments (an Enterprise-tier operator feature) give them an isolated pod from the patched image, reachable at the same service URL with a session header.

mirrord preview start --key fix-$INCIDENT_ID \
  --image registry.example.com/<your-service>:fix-$INCIDENT_ID \
  --target deployment/<your-service> --ttl 30

curl -H "baggage: mirrord-session=fix-$INCIDENT_ID" \
  https://<your-service>.prod.svc/...

Other traffic keeps hitting the unpatched pods; the header is what routes you to the candidate. After the TTL the operator tears it down. See the preview environments docs for details.

That's the recipe. Part II (coming next week) shows the loop running end-to-end against HolmesGPT, an OSS AI-SRE that ships standalone. Two planted bugs in a real GKE cluster: one the loop signs off on, one the loop correctly rejects.

Why not just stage a copy of the service?

You could run a copy of checkout in a separate namespace, point it at the real pricing, and call that verification. The reasons mirrord exec is a better fit specifically for AI-SRE auto-verification:

Identity matches prod. mirrord exec inherits the target pod's ServiceAccount, OIDC token, and service-mesh identity. A staged copy gets its own. The bugs AI-SREs handle often involve auth, mTLS, or NetworkPolicy interactions where identity is the variable; if your verification runs as a different principal than prod, you can pass on a fix that prod would still reject.
Env, secrets, mounts inherited automatically. mirrord pulls the target pod's environment variables, mounted ConfigMaps and Secrets, and filesystem. Staged copies need a parallel set kept in sync, and any drift produces a false verdict. AI-SRE patches frequently touch env-shaped behavior (a missing flag, a stale config), which is exactly the drift class you want verification to catch.
Seconds, not minutes. mirrord exec runs from a checked-out working copy in a few seconds. The staged-namespace approach means build image, push, kubectl apply, wait-ready, port-forward: several minutes per try. In a noisy incident where the AI tries ten patches, that's the difference between "all ten verified in under five minutes" and "one verification per coffee."
Parallel runs are free. Each mirrord exec is its own process. Ten candidate patches can verify concurrently. The staged-copy approach shares a Deployment, Service, and Ingress, so concurrent runs collide unless you also templatize per-run namespaces.
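Fanning out over several candidates can be sketched with plain background jobs. This assumes a POSIX shell and a hypothetical `verify_one` hook that runs one patch through the verification block (with its own run key) and exits 0 on PASS. Patch paths are assumed to contain no spaces.

```shell
# Verify every patch concurrently and print only the ones that passed.
#   $@  paths to candidate patch files (no spaces in the paths)
verify_all() {
  pids=""
  for patch in "$@"; do
    verify_one "$patch" &          # each candidate gets its own process
    pids="$pids $!:$patch"         # remember which pid belongs to which patch
  done
  for entry in $pids; do
    pid=${entry%%:*}
    patch=${entry#*:}
    if wait "$pid"; then           # wait returns that job's exit status
      echo "$patch"
    fi
  done
}
```

The survivors are printed in the order the patches were given, so the list can go straight into the message that reaches a human.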

Try it

Setup against any cluster you can kubectl into:

# install the CLI
brew install metalbear-co/mirrord/mirrord

# install the operator in your cluster (Prerequisites above)

# pick any deployment and confirm the loop works
mirrord exec --target deploy/<your-service> -- printenv | head -10

If that prints env vars from the target pod (not your local shell's), the verification primitive is live. Wire your AI-SRE's post-suggestion hook to the pattern in Automated verification above and let the next alert prove itself. Then read Part II (coming next week) for the full walkthrough against HolmesGPT.

mirrord: https://metalbear.com/mirrord
Questions: hi@metalbear.com · https://metalbear.com/slack

Top comments (5)

Lazypl82 • Jun 15

This is the cleanest framing of pre-merge verification for an AI-SRE loop I've read. The part I keep thinking about is what happens after the patch passes the mirrord exec check and ships. The same loop seems to repeat in a different shape, where the AI-SRE still has to read the first few minutes of real production after the deploy and decide whether to walk away or roll back. mirrord exec gives you a strong "this fix doesn't break the cluster in isolation," but the post-deploy window is where the patched behavior meets actual user traffic patterns the verifier didn't run against. Curious if Part II touches on that handoff, or if you treat verified-pre-merge as the strong-enough cutoff in practice.

Eyal Bukchin MetalBear • Jun 24

Thanks for the comment! First of all, Part II is out now: dev.to/metalbear/auto-verifying-yo...

You've got a good point, but running mirrord inside a production cluster introduces compliance/security concerns we're just not interested in getting into at the moment. In the context of this post though, if the fix doesn't work, the AI-SRE would flag it in production and start the reproduction-verification loop on staging again.

Lazypl82 • Jul 14

Late reply, but glad Part II is out. Fair call on compliance. The production flag step is the part that interests me, whatever decides the fix didn't work is doing its own verification pass there, and that check rarely gets built with the same care as the staging loop.

Luis Cruz • Jun 11

This is an excellent practical guide on auto-verifying AI-SRE fixes in a live Kubernetes cluster. I really appreciate how you explain why traditional staging can miss auth/mTLS/network interactions, and how mirrord exec provides an isolated yet real-cluster environment for testing candidate patches safely. The step-by-step setup, including header-based request stealing and parallel verification runs, is immediately actionable for anyone building or integrating AI-SRE workflows.

I’m curious—have you found any edge cases where mirrord exec couldn’t fully replicate production behavior, or where certain fixes still needed human validation? I’d be happy to help brainstorm additional verification strategies or integrations for other observability tools.