Saurav Bhattacharya

Posted on Jun 14

The Five Agent Failure Modes Nobody Catches in Staging

#ai #agents #observability #testing

Treats agents as distributed systems

Every agent failure I have ever debugged in production had the same property: it passed staging. Not because staging was badly written, but because the failure mode simply does not exist until you have real traffic, real latency, real tool flakiness, and a real distribution of inputs you never thought to enumerate.

We keep talking about agents like the open question is "is the model good enough." It usually is. The open question is whether the system you wrapped around the model degrades gracefully when reality stops cooperating. Below are the five failure modes I see most often, why none of them show up in a clean test suite, and what to actually instrument.

I am going to be opinionated here because I think the industry is still treating agents like prompts instead of like distributed systems. They are distributed systems. Act accordingly.

1. The silent tool downgrade

Your agent calls a search tool. The search tool times out. The agent, being a helpful language model, does not surface the timeout — it confidently answers from parametric memory instead. The user gets a fluent, plausible, stale answer. No error is thrown. No alert fires. Your latency dashboard is green.

This is the single most dangerous failure mode because it looks exactly like success. In staging your tools never time out, so you never see the agent's behavior when a tool returns nothing useful. The model has been trained to be helpful, and "I could not retrieve that" feels unhelpful to it, so it papers over the gap.

The fix is not a better prompt. The fix is to make tool degradation a first-class signal you can detect after the fact. You need the resolved tool input, the raw tool output (including the empty or errored one), and the final answer in the same trace, so an eval can ask: did the model cite a tool result that did not actually exist?
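A minimal sketch of that after-the-fact check, assuming a hypothetical trace shape. The `ToolCall` and `AgentTrace` interfaces and the `admittedFailure` flag are illustrative names, not part of any real library; in practice the flag would come from a separate check (a string match or a judge model) on the final answer.

```typescript
// Hypothetical trace shape; field names are assumptions for illustration.
interface ToolCall {
  tool: string;
  status: "ok" | "error" | "timeout";
  resultCount: number; // number of usable items the tool returned
}

interface AgentTrace {
  toolCalls: ToolCall[];
  finalAnswer: string;
  admittedFailure: boolean; // true if the answer tells the user retrieval failed
}

// Flags a run where no tool call produced anything usable,
// yet the final answer did not say so.
function detectSilentDowngrade(trace: AgentTrace): {
  downgraded: boolean;
  failedTools: string[];
} {
  const failedTools = trace.toolCalls
    .filter((c) => c.status !== "ok" || c.resultCount === 0)
    .map((c) => c.tool);
  const anyUsable = trace.toolCalls.some(
    (c) => c.status === "ok" && c.resultCount > 0
  );
  const downgraded =
    failedTools.length > 0 && !anyUsable && !trace.admittedFailure;
  return { downgraded, failedTools };
}
```

The check is deliberately conservative: it only fires when every tool call came back empty or errored, so a run where one retry eventually succeeded is not flagged.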

2. The loop that technically terminates

Agents that can call tools in a loop will, eventually, find an input that makes them call the same tool with slightly different arguments forty times before giving up. It terminates — so your tests pass — but it burns tokens, blows your latency budget, and produces a degraded answer at the end.

Staging never hits it because the adversarial input that triggers the loop is some malformed customer query you would never write by hand. Production writes it for you on day one.

interface AgentStep {
  stepType: "model" | "tool";
  tool?: string;
  argsHash: string;      // hash of resolved tool arguments
  durationMs: number;
}

// Flags the first tool + argument combination that repeats 3 or more times.
// Matching is exact on argsHash, so near-duplicate arguments only collide
// if the arguments are normalized before hashing.
function detectPathologicalLoop(steps: AgentStep[]): {
  looping: boolean;
  repeatedTool?: string;
  repeats: number;
} {
  const counts = new Map<string, number>();
  for (const step of steps) {
    if (step.stepType !== "tool" || !step.tool) continue;
    const key = `${step.tool}:${step.argsHash}`;
    const next = (counts.get(key) ?? 0) + 1;
    counts.set(key, next);
    if (next >= 3) {
      return { looping: true, repeatedTool: step.tool, repeats: next };
    }
  }
  return { looping: false, repeats: 0 };
}

The point of the code is not the threshold. The point is that you cannot write this check at all unless every step, model and tool alike, is captured with its resolved arguments. If your logs only show "agent called search 40 times" without the argument hashes, you cannot distinguish a healthy retry from a doom loop.

3. Distribution drift that never throws

Your agent was evaluated on a golden set in March. It is June. Your users now ask about a product that did not exist in March, in a phrasing your few-shot examples never anticipated. The agent does not crash. It just gets quietly worse — answer quality drops three percent a week and nobody notices until support tickets spike.

This is not a bug you can catch with a unit test, because the code did not change. The world changed underneath a system you froze. The only defense is continuous scoring of production outputs against a rubric, with the score trended over time so the slope is visible before the cliff.
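One way to make the slope visible is an ordinary least-squares fit over the per-period scores. This is a sketch under assumptions: `scoreSlope`, `isDrifting`, and the default threshold of 0.01 per period are hypothetical, and the scores are assumed to be evenly spaced (for example, one aggregate rubric score per week).

```typescript
// Least-squares slope of score against period index (0, 1, 2, ...).
function scoreSlope(scores: number[]): number {
  const n = scores.length;
  if (n < 2) return 0; // not enough points to define a trend
  const meanX = (n - 1) / 2;
  const meanY = scores.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  scores.forEach((y, i) => {
    num += (i - meanX) * (y - meanY);
    den += (i - meanX) ** 2;
  });
  return num / den;
}

// True when the fitted trend loses more than maxDropPerPeriod per period.
function isDrifting(scores: number[], maxDropPerPeriod = 0.01): boolean {
  return scoreSlope(scores) < -maxDropPerPeriod;
}
```

Alerting on the fitted slope rather than on any single week's score keeps one noisy sample from paging anyone, while a steady decline still trips the check.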

4. The non-deterministic regression

You change the system prompt to fix one annoying behavior. You eyeball ten outputs. They look great. You ship. Three days later a different, rarer behavior has gotten worse, and because the model is non-deterministic you cannot even reproduce the original good output to compare against.

The mistake here is treating a prompt change like a config tweak instead of like a code change that needs a regression suite. Every prompt edit is a deploy. It deserves the same gate a deploy gets: run it against a held-out scored set, and block the merge if aggregate quality regresses, even if your one cherry-picked example improved.
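The gate itself can be small. The sketch below assumes each example in the held-out set has already been scored in the range 0 to 1 under both the old and the new prompt; `ScoredExample`, `promptGate`, and the `tolerance` default are hypothetical names and values. Because outputs are non-deterministic, averaging several samples per example before calling the gate is a reasonable precaution.

```typescript
// Score in [0, 1] for one held-out example, from whatever rubric is in use.
interface ScoredExample {
  id: string;
  score: number;
}

function mean(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Passes only if the candidate prompt's mean score has not dropped
// by more than `tolerance` relative to the baseline prompt.
function promptGate(
  baseline: ScoredExample[],
  candidate: ScoredExample[],
  tolerance = 0.01
): { pass: boolean; baseMean: number; candMean: number; delta: number } {
  const baseMean = mean(baseline.map((e) => e.score));
  const candMean = mean(candidate.map((e) => e.score));
  const delta = candMean - baseMean;
  return { pass: delta >= -tolerance, baseMean, candMean, delta };
}
```

Comparing aggregates is what stops the cherry-picked example from deciding the merge: one example going from 0.8 to 1.0 does not pass the gate if two others fall from 0.8 to 0.6.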

5. The correct answer at the wrong time

The agent eventually produces the right answer — after eleven seconds and four tool calls, by which point the user has already left. Correctness and usefulness are not the same metric, and most eval harnesses only measure the first one. In production, a right answer that arrives outside the latency SLO is a failure, full stop.
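To make the gap between the two metrics concrete, a harness can report them side by side. This is a sketch with hypothetical names (`RunResult`, `summarizeRuns`) and an SLO value supplied by the caller; it assumes correctness has already been judged per run.

```typescript
interface RunResult {
  correct: boolean;  // did the answer pass the correctness check?
  latencyMs: number; // end-to-end time the user waited
}

// correctRate ignores latency; usefulRate counts a run only if it was
// both correct and delivered within the latency SLO.
function summarizeRuns(
  runs: RunResult[],
  sloMs: number
): { correctRate: number; usefulRate: number } {
  if (runs.length === 0) return { correctRate: 0, usefulRate: 0 };
  const correct = runs.filter((r) => r.correct).length;
  const useful = runs.filter((r) => r.correct && r.latencyMs <= sloMs).length;
  return {
    correctRate: correct / runs.length,
    usefulRate: useful / runs.length,
  };
}
```

A wide gap between `correctRate` and `usefulRate` points at slow but accurate runs, which a correctness-only harness would count as wins.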

The two halves you actually need

Here is the through-line across all five: every one of them is invisible if you only look at the final output, and every one is trivial to catch if you can see both the score and the trace that produced it. That split is why I run two tools as a single workflow rather than picking one.

agent-eval is the gate on the output. It scores answers against a rubric, runs deterministic checks where it can and model-as-judge where it must, tracks drift over time, and flags hallucinations. Critically, it can fail a build or block a release when aggregate quality regresses. It answers "is this answer good enough, right now, across the distribution?"

AgentLens is the trace of how the agent got there. It captures every model and tool step, the resolved inputs to each call, and the raw outputs, including the errored tool call, the empty search result, the forty-times-repeated argument. It answers "why did the agent produce this?"

You need both because a score without a trace is a number you cannot act on. agent-eval tells you answer quality dropped four percent this week; AgentLens tells you it is because the retrieval tool started timing out and the model started answering from memory — failure mode number one, now visible instead of silent. The eval gives you the alarm; the trace gives you the root cause in the same view. Run them apart and you are stuck staring at a red dashboard with no idea which of the five modes you are looking at.

// Assumes agentLens and agentEval clients are initialized elsewhere,
// and that detectPathologicalLoop from the earlier example is in scope.
async function gateRelease(traceId: string): Promise<boolean> {
  const trace = await agentLens.getTrace(traceId);     // every step, resolved I/O
  const result = await agentEval.score(trace.output, {
    rubric: "support-quality-v3",
    checks: ["no-uncited-claims", "within-latency-slo"],
    judge: "model-as-judge",
  });

  if (!result.passed) {
    // The score told us it failed; the trace tells us why.
    const loop = detectPathologicalLoop(trace.steps);
    console.error("release gate failed", {
      score: result.score,
      reasons: result.failedChecks,
      looping: loop.looping ? loop.repeatedTool : null,
    });
  }
  return result.passed;
}

What to do Monday

You do not need to solve all five at once. You need to stop pretending staging covers them. Pick the one that scares you most — for most teams it is the silent tool downgrade — and make it observable: capture the full trace, write the eval that detects it, and wire that eval into something that can actually block a bad release.

The agents are good enough. The systems around them are what fail. Build the systems like you mean it.

Top comments (10)

Mykola Kondratiuk • Jun 18

one i'd add: context window depletion mid-workflow. it looks like tool flakiness or a weird loop. it's actually a capacity error. staging inputs are always shorter than the prod scenarios that hit it.

Andy Leo (AndyLeo) • Jun 14

Really liked the framing here that the model usually isn’t the main problem, the surrounding system is. The “silent tool downgrade” point feels especially real because it can look exactly like success unless you capture the raw tool result next to the final answer. Also appreciated the reminder that “correct but too late” is still a failure in production. The resolved-argument trace example was a good concrete detail too, because that’s what lets you tell a healthy retry from a hidden loop.

Theo Valmis • Jun 15

Failure mode 1 is the one of the five you can prevent, not just detect, and it's worth treating differently from the other four. Your fix, reconstruct the trace and have an eval ask whether the model cited a tool result that didn't exist, is detection after the fact. That's the right call for 2 through 5, because those failures live in input distributions you can't enumerate ahead of time. But "the search tool returned empty" is a knowable state at call time, not something you should have to recover from a trace later. The reason it slips through is that the model treats an empty result as weak evidence to reason around instead of a hard stop. So the structural fix for 1 is to make "tool returned nothing useful" a control-flow branch the model can't talk past, the same way you'd never let a function silently proceed on a null it was meant to handle. Detect the failures you can't enumerate. Refuse the one you can.

TxDesk • Jun 15

Number one is the one that bit me this week, and in a form a bit nastier than the model papering over a failure. In my case the tool call succeeded. It returned a clean, well-formed, empty result. The degradation happened a layer below the model: a full-history data scan silently fell back to a bounded recent-window scan when its primary source failed, and the bounded scan found nothing, so it returned "nothing found" with total confidence. The model never papered over anything. It faithfully relayed a result that was itself a false-clean.

What makes this strain dangerous is that your fix for classic #1 ("did the model cite a tool result that did not exist?") does not catch it. The tool result existed. It was just incomplete, and nothing in the result said so. The empty-success and the genuine-zero are byte-identical at the model's boundary.

The only thing that would have caught it earlier is the tool itself distinguishing "I completed the scan and found zero" from "I could not complete the scan," and refusing to collapse the second into the first. Which is really your two-halves point pushed down one level: the trace has to carry not just what each tool returned, but whether the tool was operating in a degraded mode when it returned it. A green tool call is not the same as a complete one.

Found it, for what it's worth, by manually creating the exact state the tool should have flagged and watching it report all-clear. Staging would never have produced that input. Production, or a deliberate test, does.

Andrii Krugliak • Jun 15

The silent tool downgrade is the one that burned us worst. We stopped letting agents paper over a dead tool with parametric memory and made them surface the failure instead, because a confident stale answer looks exactly like success on a green dashboard. Now "the tool returned nothing useful" is a state the agent has to handle out loud, not an edge case we hope never happens.

Deepak Satyam • Jun 15

What ties all five together: none of them throw. Staging is built to catch exceptions, and not one of these raises one: the silent downgrade, the loop that "succeeds," drift that never errors. Exit 0, green logs, quietly wrong output. That's the whole reason staging waves them through.

The one I'd push on is the fix. Output scoring (agent-eval) only helps when you have a ground truth to score against, and for the open-ended tasks where #1 and #3 hurt most, you usually don't. So the trace capture ends up carrying it, with eval reserved for the slice you can actually pin down. How are you handling the no-reference-answer case? That's where I keep getting stuck.

Marcus Chen • Jun 16

Good list. The category I would add from the voice side: failures that only appear under real audio conditions, which staging almost never reproduces. End-of-turn detection that works on clean test audio falls apart with background noise and people who pause mid-sentence. Barge-in that is fine one-at-a-time breaks under real latency. ASR confidence that looks great on scripted prompts drops on accents and numbers. None show up in a text-based staging harness, because the failure is in the audio layer, not the logic. We only caught them once we replayed real recorded calls through the agent instead of synthetic text turns.

Mehmet Can Farsak • Jun 14

I've seen a variant of this with ideation prompts. You ask an agent to explore ideas, and it silently downgrades from divergent thinking straight to tool calls and code generation — no error, just execution drift. The model 'helpfully' acts on the first idea instead of expanding on it.

I put together Brainstorm-Mode (mehmetcanfarsak/Brainstorm-Mode on GitHub) to address this at the hook level. PreToolUse hooks block tool calls during ideation phases, and three modes (divergent, actionable, academic) let you control the thinking style. Keeps agents from the 'silent downgrade' into premature execution.

Mallory Haigh • Jun 16

Every one of these failure modes is a platform engineering problem wearing an agentic costume. None of them are fixed by a better prompt or a smarter model - they are fixed by an execution layer that makes tool degradation, step capture, and scoring first-class concerns, instead of afterthoughts bolted onto individual agent implementations.

As a practice, this is exactly what platform engineering gives you. The harness handles what happens within a turn; the governance plane handles identity, observability, and the signals you need to catch these failures across every agent on the platform, not just the one you happened to instrument last week 😇. If you are building that infra per-agent, you are rebuilding the same fire suppression system in every room of the building, rather than wiring it all together into a centralized, standardized system.

Theo • Jun 18

Good App