Shadow Deployments for AI Agents: Canary Your Prompt Changes Before They Burn Production

#ai #agents #observability #evaluation

You shipped your agent. Evals were green. A week later you tweak the system prompt to fix one annoying edge case, the CI eval suite passes, you merge, and the next morning your support queue is on fire because the agent now refuses half the legitimate requests it used to handle.

This is the part nobody talks about: passing a pre-merge eval is not the same as knowing a change is safe in production. Your eval suite grades the cases you thought to write down. Production has cases you didn't. The gap between those two sets is exactly where agent changes go to die.

The fix is not "write more tests." It's borrowing something web infra has had for fifteen years and almost no agent team uses: shadow deployments and canary evals.

The deploy model agent teams skipped

When you deploy a normal service, you don't flip 100% of traffic to the new version and pray. You run a canary — 1%, then 5%, then 25% — and you watch error rates, latency, and saturation at each step. If the new version regresses, you halt and roll back before most users ever touch it.
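For readers who haven't built one, the staged split described above can be sketched in a few lines. This is a minimal sketch, not any particular load balancer's behavior. The names `fnv1a`, `bucketFor`, and `routeVersion` are hypothetical, and the sketch assumes you have a stable key per user or session to hash on.

```typescript
// Hypothetical helpers for illustration; not taken from any library.

/** FNV-1a style 32-bit hash over the string's UTF-16 code units. */
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

/** Maps a stable key (e.g. a user or session ID) to a bucket in [0, 100). */
function bucketFor(key: string): number {
  return fnv1a(key) % 100;
}

/** Sends a key to the new version when its bucket is below the stage percentage. */
function routeVersion(
  key: string,
  stagePercent: number,
): "champion" | "challenger" {
  return bucketFor(key) < stagePercent ? "challenger" : "champion";
}
```

Because a key's bucket never changes, raising the stage from 1 to 5 to 25 only adds keys to the new version. Nobody flips back and forth between versions mid-rollout.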

Agent teams skipped this entirely. The typical agent "deploy" is: edit prompt, run offline evals, merge, full rollout. There's no canary because there's no obvious metric to canary on. HTTP 500s are easy. "The agent's answers got subtly worse" is not.

But it is measurable — if you have two things wired together: a way to score output quality continuously, and a way to see exactly how each answer was produced so a bad score is debuggable instead of mysterious. That's the entire reason agent-eval and AgentLens ship as a unit and not as two separate tools you bolt on later.

agent-eval scores and gates the agent's output: deterministic checks, drift detection against a baseline, hallucination/grounding checks. It answers "is this answer good?"
AgentLens captures the trace of how that answer was produced — every model call and tool step, the resolved inputs, the raw outputs. It answers "why did this answer come out the way it did?"

A canary score with no trace tells you the new version is worse but not where. A trace with no score tells you what happened but not whether it mattered. You need both, on the same request, or the canary is just a vibe with a percentage sign.

Shadow mode: grade the new version on real traffic before it serves anyone

The cheapest, safest first step is shadow deployment. Take live production requests, run them through both the current (champion) agent and the new (challenger) agent, but only return the champion's answer to the user. The challenger runs in the dark. You score both with agent-eval, trace both with AgentLens, and compare — on real traffic, with zero user risk.

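import { evaluate } from "agent-eval";
import { trace } from "agentlens";

// AgentRequest and Agent are your application's own types (not shown here).
// This snippet assumes Agent exposes handle(request) and that AgentRequest
// carries an id and a baseline.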

interface ShadowResult {
  requestId: string;
  championScore: number;
  challengerScore: number;
  championTraceId: string;
  challengerTraceId: string;
}

async function shadowCompare(
  request: AgentRequest,
  champion: Agent,
  challenger: Agent,
): Promise<ShadowResult> {
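  // Champion run: this is the answer the user receives. Tracing it gives a
  // reference trace to diff against when debugging.
  const champRun = await trace("champion", () => champion.handle(request));

  // Challenger run in the shadow: same input, output is scored but never
  // returned to the user. In a live request path, respond with the champion's
  // output first so this call adds no user-facing latency.
  const challRun = await trace("challenger", () => challenger.handle(request));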

  const [championScore, challengerScore] = await Promise.all([
    evaluate(champRun.output, {
      checks: ["grounding", "format", "drift"],
      baseline: request.baseline,
    }),
    evaluate(challRun.output, {
      checks: ["grounding", "format", "drift"],
      baseline: request.baseline,
    }),
  ]);

  return {
    requestId: request.id,
    championScore: championScore.value,
    challengerScore: challengerScore.value,
    championTraceId: champRun.traceId,
    challengerTraceId: challRun.traceId,
  };
}

Run this for a day across a few thousand real requests and you get something an offline suite can never give you: the challenger's score distribution on your actual traffic, not your imagination of it. When the challenger underperforms on some slice — say, multi-step tool requests — you don't argue about it. You pull the AgentLens trace pair for those request IDs and look at exactly where the two runs diverged: which tool got different inputs, which model step produced the regression.
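To make the "some slice" part concrete, here is a minimal sketch of grouping paired scores by slice. It assumes each result has been tagged with a `slice` label, which is a hypothetical field that `ShadowResult` above does not carry. `PairedScore`, `SliceSummary`, and `summarizeBySlice` are illustrative names too. Because both agents saw the same input, the per-request difference is the quantity worth averaging, and the standard error gives a rough sense of how much of it could be noise.

```typescript
interface PairedScore {
  requestId: string;
  slice: string; // hypothetical label, e.g. "multi-step-tool"
  championScore: number;
  challengerScore: number;
}

interface SliceSummary {
  slice: string;
  n: number;
  meanDelta: number; // mean of (challenger - champion); negative = challenger worse
  stdError: number;  // standard error of the mean delta; NaN when n < 2
}

function summarizeBySlice(rows: PairedScore[]): SliceSummary[] {
  // Collect the per-request score differences for each slice.
  const deltasBySlice = new Map<string, number[]>();
  for (const row of rows) {
    const deltas = deltasBySlice.get(row.slice) || [];
    deltas.push(row.challengerScore - row.championScore);
    deltasBySlice.set(row.slice, deltas);
  }

  const summaries: SliceSummary[] = [];
  deltasBySlice.forEach((deltas, slice) => {
    const n = deltas.length;
    const meanDelta = deltas.reduce((a, b) => a + b, 0) / n;
    // Sample variance needs at least two points.
    const variance =
      n > 1
        ? deltas.reduce((acc, d) => acc + (d - meanDelta) * (d - meanDelta), 0) / (n - 1)
        : NaN;
    summaries.push({ slice, n, meanDelta, stdError: Math.sqrt(variance / n) });
  });

  // Most-regressed slices first.
  return summaries.sort((a, b) => a.meanDelta - b.meanDelta);
}
```

A slice whose mean delta is negative by several standard errors is a good candidate for pulling trace pairs. A slice with a handful of samples and a wide standard error probably just needs more traffic.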

Canary: promote on a score gate, not a calendar

Once shadow mode says the challenger is at least as good, you let it serve a small slice of real traffic and gate promotion on the live score — not on "it's been a week and nothing exploded."

interface CanaryGate {
  stage: number;          // current traffic percentage
  minScore: number;       // challenger must hold this
  maxDelta: number;       // and not regress vs champion by more than this
  minSamples: number;     // before any decision
}

function decideCanary(
  results: ShadowResult[],
  gate: CanaryGate,
): "promote" | "hold" | "rollback" {
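  // Never decide on an empty or undersized sample. Averaging zero results
  // yields NaN, which would slip past both checks below and promote.
  if (results.length === 0 || results.length < gate.minSamples) return "hold";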

  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const challengerAvg = avg(results.map((r) => r.challengerScore));
  const championAvg = avg(results.map((r) => r.championScore));
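  // Positive delta means the challenger scores lower than the champion.
  const delta = championAvg - challengerAvg;

  if (challengerAvg < gate.minScore) return "rollback"; // below the absolute floor
  if (delta > gate.maxDelta) return "rollback";         // regressed too far vs champion
  return "promote";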
}

The important detail: every rollback decision should come attached to the set of failing challengerTraceIds, which the simplified gate above does not yet return. The gate shouldn't just say no; it should hand you the exact traces that caused the no. That is the difference between "the canary failed, somebody look into it eventually" and "the canary failed on these 14 requests, here are the traces, the regression is in the retrieval tool call." One of those gets fixed today.
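One way to get a verdict that carries its evidence is sketched below. This is an illustration under assumptions, not how agent-eval itself reports failures. `ScoredRun`, `GateVerdict`, and `decideWithEvidence` are hypothetical names, and "failing" is assumed to mean a single request that breaks the same thresholds the averages are held to.

```typescript
interface ScoredRun {
  requestId: string;
  championScore: number;
  challengerScore: number;
  challengerTraceId: string;
}

interface GateVerdict {
  decision: "promote" | "hold" | "rollback";
  failingTraceIds: string[]; // challenger traces to inspect; empty unless rollback
}

function decideWithEvidence(
  results: ScoredRun[],
  gate: { minScore: number; maxDelta: number; minSamples: number },
): GateVerdict {
  // Not enough data to decide either way.
  if (results.length === 0 || results.length < gate.minSamples) {
    return { decision: "hold", failingTraceIds: [] };
  }

  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const challengerAvg = avg(results.map((r) => r.challengerScore));
  const championAvg = avg(results.map((r) => r.championScore));

  const regressed =
    challengerAvg < gate.minScore ||
    championAvg - challengerAvg > gate.maxDelta;
  if (!regressed) return { decision: "promote", failingTraceIds: [] };

  // Assumption: a request "fails" when it alone would violate the gate.
  const failingTraceIds = results
    .filter(
      (r) =>
        r.challengerScore < gate.minScore ||
        r.championScore - r.challengerScore > gate.maxDelta,
    )
    .map((r) => r.challengerTraceId);

  return { decision: "rollback", failingTraceIds };
}
```

The returned IDs are what you paste into the trace viewer. The per-request threshold is a design choice, and a stricter or looser definition of "failing" would work just as well as long as it is applied consistently.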

Why the two halves have to be one workflow

You can technically buy a scoring tool and a tracing tool separately and duct-tape them. Most teams that do this end up with scores in one dashboard and traces in another, joined by nothing, and when a canary regresses someone spends an afternoon trying to line up timestamps to figure out which trace goes with which bad score.

The reason agent-eval and AgentLens are designed as a unit is that the eval signal is only as useful as it is debuggable. A score without its trace is a number you can't act on. A trace without a score is a haystack with no needle marked. Wire them together — same request ID, score and trace produced in the same step — and your canary stops being a guess. The eval tells you that the challenger regressed; the trace tells you exactly where, so the rollback comes with a root cause instead of a shrug.
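As a small illustration of the "same request ID" point, here is a sketch of joining scores to traces by key instead of by timestamp. The record shapes and `joinScoresToTraces` are hypothetical and not part of either tool. The point is only that a shared ID makes the join a lookup, and makes any score without a trace show up as an explicit gap.

```typescript
interface ScoreRecord {
  requestId: string;
  score: number;
}

interface TraceRecord {
  requestId: string;
  traceId: string;
}

interface ScoredTrace {
  requestId: string;
  score: number;
  traceId: string;
}

function joinScoresToTraces(
  scores: ScoreRecord[],
  traces: TraceRecord[],
): { joined: ScoredTrace[]; unmatchedScores: string[] } {
  // Index traces by request ID so each score is matched with one lookup.
  const traceIdByRequest = new Map<string, string>();
  for (const t of traces) traceIdByRequest.set(t.requestId, t.traceId);

  const joined: ScoredTrace[] = [];
  const unmatchedScores: string[] = [];
  for (const s of scores) {
    const traceId = traceIdByRequest.get(s.requestId);
    if (traceId === undefined) {
      // A score with no trace is a number nobody can debug; surface it.
      unmatchedScores.push(s.requestId);
    } else {
      joined.push({ requestId: s.requestId, score: s.score, traceId });
    }
  }
  return { joined, unmatchedScores };
}
```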

The takeaway

Stop treating agent changes as merge-and-pray. You already accept canaries and shadow traffic for every other service you run; your agent deserves the same discipline, and it needs it more, because its failure mode is quiet degradation, not a loud 500.

Shadow new versions against real traffic. Gate promotion on a live score, not the calendar. And make sure every score comes welded to the trace that produced it — agent-eval to tell you whether the change is safe, AgentLens to tell you why — because a canary you can't debug is just a slower way to ship the same regression.

Top comments (1)

Max Quimby • Jun 28

This nails the gap I keep running into — the eval suite grades the cases you remembered, prod is the cases you didn't. One thing I'd add from running agent changes in production: the canary math is harder than the web-infra version because the metric is noisy. With HTTP 500s, 1% traffic gives you a clean signal fast. With "answers got subtly worse," LLM non-determinism means a regression at small sample sizes can just be variance — you can talk yourself into thinking a fine prompt is bad, or wave through a real regression because the canary slice happened to draw easy inputs. What's worked better for us is replaying the same inputs through old and new and diffing scores pairwise, instead of comparing aggregate quality across two different traffic slices. The trace half is the unsung hero here — a bad score you can't reproduce is just anxiety. Curious how you decide a canary has seen enough traffic to trust the verdict?