Saurav Bhattacharya

Posted on Jun 30

Your Model Upgrade Broke Three Workflows and the Tests Still Passed

#ai #testing #agents #evaluation

Every team that runs agent evals eventually hits the same wall: your suite was green on Friday, you bumped the model from gpt-4.1 to gpt-4.1-2026-05, and on Monday three workflows behave differently. Not broken in any way an exception catches. Just... different. A tool call that used to fire now doesn't. A summary that used to cite the doc now paraphrases it. Your pass rate is still 94%. Nobody knows what changed.

This is the regression problem, and it is not the same as drift.

Drift is decay. Regression is a step function.

Drift is what happens to a fixed configuration over time — the world moves, your data moves, and the agent's behavior slowly decays against a baseline it was never re-anchored to. I've written about catching that before.

Regression is different. Regression is the discontinuity you introduce yourself: a model version bump, a prompt edit, a new tool in the registry, a temperature change someone made "to see what happens." The config changed at a known timestamp. The behavior changed with it. And because most agent output is non-deterministic prose, the change hides in the 6% you weren't looking at.

The instinct is to throw a model-as-judge at it: "compare the old answer to the new answer, tell me if it got worse." Resist that instinct. A judge comparing two outputs gives you an opinion about which one reads better. It does not give you a fact about what structurally changed. And if you're judging the agent's reasoning trace, you've made it circular — the judge and the judged share a substrate, so there is no independent ground truth. You're asking a model to grade a model with no anchor.

What you actually want is a golden trace: a pinned, known-good trajectory that the new version has to be compared against on axes the agent can't talk its way out of.

Rank your regression checks by independence, not by cost

The mistake people make is ranking evidence cheap-to-expensive. The axis that matters is independent → corruptible — how much the agent can forge the signal.

Tier 1 — proof the agent can't fake. Did the new version still call the fetch_invoice tool when the task required it? Did it still produce valid JSON against the schema? Did it finish inside the timeout? Did the file it claims to have written actually exist on disk? These are externally observable facts. The agent's prose cannot argue with them. If the golden trace called two tools and the new run calls one, that's a regression — full stop, no judgment required.

Tier 2 — statistical signal vs a baseline the agent didn't author. Embedding similarity between the new output and the golden output. Did the diff actually change something meaningful, or is it cosmetic? Length and repetition deltas. None of this needs a model's opinion; it needs the old artifact as a fixed reference point. The agent didn't write the golden trace, so it can't game the comparison.

Tier 3 — model-as-judge. Yes, there's a subjective tail: "is this summary still faithful to the source?" can't always be reduced to a number. But Tier 3 is a signal, never a verdict, and it has two hard constraints. It's offline only — it's metered, slow, and non-deterministic, so it can never sit in the hot path gating a deploy. And it may only inspect artifacts the judged agent didn't get to write — the final output against the source document, never the agent's own chain-of-thought, because grading reasoning with another model is the circular trap.

Tier 1 + Tier 2 are your real-time gate: deterministic, effectively free, fast enough to block a CI run. They catch the overwhelming majority of regressions — the missing tool call, the broken schema, the answer that went from grounded to hand-wavy. Reserve the judge for the ~20% subjective tail, and label its output "opinion, not evidence" so nobody mistakes a 7/10 for a green light.

What this looks like in code

Here's a regression gate built on this split. agent-eval scores the new run against the golden trace; the Tier 1+2 checks run synchronously and can fail the build, while the judge is fired off-line.

import { evaluate, tier1, tier2 } from "agent-eval";
import { loadTrace } from "agentlens";

interface RegressionResult {
  passed: boolean;
  blocking: string[];   // Tier 1+2 failures — these fail CI
  advisory: string[];   // Tier 3 opinions — logged, never blocking
}

async function checkRegression(
  golden: string,        // pinned trace id, known-good
  candidate: string,     // new run after the model/prompt bump
): Promise<RegressionResult> {
  // AgentLens gives us the full trace: every model + tool step,
  // resolved inputs, raw outputs — data the agent never got to rewrite.
  const g = await loadTrace(golden);
  const c = await loadTrace(candidate);

  const blocking: string[] = [];

  // Tier 1: externally observable proof. No model in the loop.
  if (g.toolCalls.length !== c.toolCalls.length) {
    blocking.push(
      `tool-call count changed: ${g.toolCalls.length} -> ${c.toolCalls.length}`,
    );
  }
  if (!tier1.validJson(c.finalOutput, g.schema)) {
    blocking.push("output no longer matches schema");
  }
  if (c.durationMs > g.durationMs * 1.5) {
    blocking.push(`latency regressed ${g.durationMs}ms -> ${c.durationMs}ms`);
  }

  // Tier 2: statistical signal vs the golden artifact the agent didn't author.
  const sim = tier2.embeddingSimilarity(c.finalOutput, g.finalOutput);
  if (sim < 0.82) {
    blocking.push(`output diverged from golden (cosine ${sim.toFixed(2)})`);
  }

  // Tier 3: judge — OFFLINE, advisory only. Inspects final output vs the
  // SOURCE doc, never the candidate's own reasoning trace (that'd be circular).
  const opinion = await evaluate.judge({
    artifact: c.finalOutput,
    groundTruth: c.sourceDoc,   // not g.reasoning — independence matters
    rubric: "faithfulness-to-source",
    mode: "offline",
  });

  return {
    passed: blocking.length === 0,
    blocking,
    advisory: opinion.score < 7 ? [`judge faithfulness ${opinion.score}/10`] : [],
  };
}

Notice what's load-bearing here. The Tier 1+2 checks only work because agentlens captured the whole trace — the actual tool calls, the resolved inputs, the raw outputs — and not a post-hoc summary the agent narrated about itself. If all you logged was the final string, you have nothing unforgeable to diff against. The trace is what makes the regression debuggable: when the gate fails on "tool-call count changed," you can open both trajectories side by side and see exactly which step the new model skipped.

That's the actual division of labor. AgentLens captures how the agent got there — every model and tool step, every resolved input, every raw output. agent-eval scores what it produced — the tiered gate above. One without the other is half a system: a trace with no scoring is just expensive logging, and a score with no trace is a red light you can't diagnose.

The discipline

Pin a golden trace for every workflow you care about. Re-run it on every model bump and prompt change. Fail the build on Tier 1+2 — the structural facts the agent can't forge. Let the judge whisper its opinion about the subjective tail, offline, clearly labeled as a signal and not a verdict.

The teams that get burned by a model upgrade aren't the ones whose evals failed. They're the ones whose evals passed because they were grading vibes instead of pinning facts. Pin the facts. The model will change under you whether you're watching or not.

Top comments (4)

Tae Kim • Jul 1

The independence-over-cost axis is exactly right, and it's the frame I use for LangGraph agent regression suites too. The point about the golden trace being its own liability is real though — in production, I pin the golden trace to a tagged config version and re-certify it whenever I change the tool registry or the system prompt, so the reference point doesn't silently rot. The tier where I've seen the most day-one regressions is exactly your Tier 1: tool-call presence and schema validity, because those fail in ways prose never will and they're trivially automated. Model-as-judge is useful for the subjective tail, but if your Tier 1 checks are already failing, running a judge on top is noise.

TxDesk • Jul 1

the independence ranking is the right axis, and tier 1 being "facts the agent can't forge" is the load-bearing move. the thing i'd add is one level up: the golden trace is itself a proxy, and it rots the same way the stale doc does.

tier 1 catches change rigorously, tool-call count went 2 to 1, schema broke, latency regressed. but it can't tell regression from improvement, because that distinction lives in intent, not in the trace. the new model calling one tool instead of two might be the fix, not the regression, and a golden pinned six months ago will flag it red with full confidence. the anchor is unforgeable, but it's an unforgeable snapshot of what "good" meant on the day you pinned it.

which doesn't break the system, it just names its floor. the tiers make change detectable and debuggable, which is the hard 90%. but the last call, is this diff a regression or did the golden go stale, is a human looking at the flagged trace and deciding whether the baseline still represents what they want. same shape as everything else here: the machine catches what's checkable, and the one thing it can't check is whether the thing it's checking against still deserves to be the reference. curious how you handle golden staleness in practice, do you re-pin on a cadence, or only when a flagged diff turns out to be an intended improvement?

Kartik N V J K • Jul 1

Separating regression as a step function from drift as slow decay is a distinction most eval writeups collapse, and it matters because you catch them with completely different tooling. Your point about a judge sharing a substrate with the thing it scores is the trap I see teams fall into right after a model bump, when they let the new model vote on its own behavior. When you build the golden trace, do you pin it at the tool-call and structured-field level, or also anchor the free text so a paraphrase does not read as a false regression?

Pon • Jun 30

I'll steal the independence-over-cost reframe -- 'how much can the agent forge this' sorts the signals better than cheap-to-expensive, since the cheap ones are often the forgeable ones. Where I'd push is one level up: the golden trace itself. Tier 1 and 2 are only as trustworthy as the pin, and the pin is an artifact someone blessed at a timestamp. If the golden was captured while the agent was already subtly wrong -- tool fired but against too broad a scope, summary already paraphrasing -- you've pinned a regression as known-good, and the gate enforces that bug on every later run. The day someone fixes it, Tier 2 cosine divergence fails the build for getting better, and the gate turns into a ratchet defending the first trace instead of the right behavior. So the question I'd add to the discipline: what proves the golden is good, not just first? One concrete spot in the code: count equality on tool calls catches a dropped call but not a swap -- same count, fetch_invoice replaced by a broader fetch, Tier 1 stays green. Independence is the right axis; count is the soft part of it.