DEV Community

Saurav Bhattacharya
The Five Agent Failure Modes Nobody Catches in Staging

Every agent failure I have ever debugged in production had the same property: it passed staging. Not because staging was badly written, but because the failure mode simply does not exist until you have real traffic, real latency, real tool flakiness, and a real distribution of inputs you never thought to enumerate.

We keep talking about agents like the open question is "is the model good enough." It usually is. The open question is whether the system you wrapped around the model degrades gracefully when reality stops cooperating. Below are the five failure modes I see most often, why none of them show up in a clean test suite, and what to actually instrument.

I am going to be opinionated here because I think the industry is still treating agents like prompts instead of like distributed systems. They are distributed systems. Act accordingly.

1. The silent tool downgrade

Your agent calls a search tool. The search tool times out. The agent, being a helpful language model, does not surface the timeout — it confidently answers from parametric memory instead. The user gets a fluent, plausible, stale answer. No error is thrown. No alert fires. Your latency dashboard is green.

This is the single most dangerous failure mode because it looks exactly like success. In staging your tools never time out, so you never see the agent's behavior when a tool returns nothing useful. The model has been trained to be helpful, and "I could not retrieve that" feels unhelpful to it, so it papers over the gap.

The fix is not a better prompt. The fix is to make tool degradation a first-class signal you can detect after the fact. You need the resolved tool input, the raw tool output (including the empty or errored one), and the final answer in the same trace, so an eval can ask: did the model cite a tool result that did not actually exist?
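
As a rough illustration of that eval, here is a minimal sketch. The trace shape, the field names, and the `acknowledgedFailure` flag (assumed to be set by an upstream check or judge) are all hypothetical and do not come from any particular tracing library.

```typescript
// Hypothetical trace shape; field names are assumptions for illustration.
interface ToolCall {
  tool: string;
  input: unknown;          // resolved arguments actually sent to the tool
  output: string | null;   // raw output; null when the call errored or timed out
  error?: string;          // e.g. "timeout"
}

interface AnswerTrace {
  toolCalls: ToolCall[];
  finalAnswer: string;
  // Assumed to come from an upstream check or judge:
  // did the final answer admit that retrieval failed?
  acknowledgedFailure: boolean;
}

// A call is degraded if it errored, returned nothing, or returned only whitespace.
function isDegraded(call: ToolCall): boolean {
  return call.error !== undefined || call.output === null || call.output.trim() === "";
}

// Flags traces where at least one tool call degraded and the answer did not say so.
function detectSilentDowngrade(trace: AnswerTrace): {
  silentDowngrade: boolean;
  degradedTools: string[];
} {
  const degradedTools = trace.toolCalls.filter(isDegraded).map((c) => c.tool);
  return {
    silentDowngrade: degradedTools.length > 0 && !trace.acknowledgedFailure,
    degradedTools,
  };
}
```

The check is only possible because the errored call and the final answer sit in the same record. Deciding whether the answer acknowledged the gap is the hard part, and this sketch deliberately leaves it to whatever judge you already trust.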

2. The loop that technically terminates

Agents that can call tools in a loop will, eventually, find an input that makes them call the same tool with slightly different arguments forty times before giving up. It terminates — so your tests pass — but it burns tokens, blows your latency budget, and produces a degraded answer at the end.

Staging never hits it because the adversarial input that triggers the loop is some malformed customer query you would never write by hand. Production writes it for you on day one.

// One captured step of an agent run.
interface AgentStep {
  stepType: "model" | "tool";
  tool?: string;
  argsHash: string;      // hash of resolved tool arguments
  durationMs: number;
}

// Flags a run as looping once the same tool has been called three times
// with identical resolved arguments. Near-duplicate arguments hash
// differently, so this check only catches exact repeats.
function detectPathologicalLoop(steps: AgentStep[]): {
  looping: boolean;
  repeatedTool?: string;
  repeats: number;
} {
  const counts = new Map<string, number>();
  for (const step of steps) {
    // Model steps and tool steps without a tool name are ignored.
    if (step.stepType !== "tool" || !step.tool) continue;
    const key = `${step.tool}:${step.argsHash}`;
    const next = (counts.get(key) ?? 0) + 1;
    counts.set(key, next);
    if (next >= 3) {
      return { looping: true, repeatedTool: step.tool, repeats: next };
    }
  }
  return { looping: false, repeats: 0 };
}

The point of the code is not the threshold. The point is that you cannot write this check at all unless every step, model and tool alike, is captured with its resolved arguments. If your logs only show "agent called search 40 times" without the argument hashes, you cannot distinguish a healthy retry from a doom loop.
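
The same captured hashes also support a complementary check for the "slightly different arguments" case, where no single argument hash repeats. The following is a minimal sketch under that assumption; `ToolStep`, `findToolBudgetViolations`, and the default budget of 10 are hypothetical names and values chosen for illustration.

```typescript
// Hypothetical step shape: only the fields this check needs.
interface ToolStep {
  tool: string;
  argsHash: string; // hash of resolved tool arguments
}

interface ToolBudgetViolation {
  tool: string;
  calls: number;        // total calls to this tool in the run
  distinctArgs: number; // how many different argument hashes were seen
}

// Reports every tool called more often than the per-run budget,
// regardless of whether the arguments were identical.
// maxCallsPerTool is an illustrative default, not a recommended value.
function findToolBudgetViolations(
  steps: ToolStep[],
  maxCallsPerTool = 10,
): ToolBudgetViolation[] {
  const usage = new Map<string, { calls: number; hashes: Set<string> }>();
  for (const step of steps) {
    const entry = usage.get(step.tool) ?? { calls: 0, hashes: new Set<string>() };
    entry.calls += 1;
    entry.hashes.add(step.argsHash);
    usage.set(step.tool, entry);
  }

  const violations: ToolBudgetViolation[] = [];
  usage.forEach((entry, tool) => {
    if (entry.calls > maxCallsPerTool) {
      violations.push({ tool, calls: entry.calls, distinctArgs: entry.hashes.size });
    }
  });
  return violations;
}
```

A high `calls` count with an equally high `distinctArgs` count suggests the agent is wandering through near-duplicate queries, while a high `calls` count with a low `distinctArgs` count looks more like exact repetition.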

3. Distribution drift that never throws

Your agent was evaluated on a golden set in March. It is June. Your users now ask about a product that did not exist in March, in a phrasing your few-shot examples never anticipated. The agent does not crash. It just gets quietly worse — answer quality drops three percent a week and nobody notices until support tickets spike.

This is not a bug you can catch with a unit test, because the code did not change. The world changed underneath a system you froze. The only defense is continuous scoring of production outputs against a rubric, with the score trended over time so the slope is visible before the cliff.
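
To make "the slope is visible" concrete, here is a minimal sketch that fits a least-squares line through equally spaced weekly mean scores. The function names and the `maxWeeklyDrop` threshold are assumptions for illustration, and scores are assumed to lie in [0, 1].

```typescript
// Least-squares slope of scores over equally spaced periods (e.g. weeks).
// Returns the average change in score per period; negative means decline.
function scoreSlope(weeklyScores: number[]): number {
  const n = weeklyScores.length;
  if (n < 2) return 0; // not enough points to define a trend

  const meanX = (n - 1) / 2; // mean of the indices 0..n-1
  const meanY = weeklyScores.reduce((a, b) => a + b, 0) / n;

  let numerator = 0;
  let denominator = 0;
  weeklyScores.forEach((y, i) => {
    numerator += (i - meanX) * (y - meanY);
    denominator += (i - meanX) ** 2;
  });
  return numerator / denominator;
}

// Flags drift when the fitted decline per week exceeds maxWeeklyDrop.
// The default threshold is a hypothetical value, not a recommendation.
function isDrifting(weeklyScores: number[], maxWeeklyDrop = 0.01): boolean {
  return scoreSlope(weeklyScores) < -maxWeeklyDrop;
}
```

Fitting a line rather than comparing the last two weeks smooths out single noisy weeks, which matters when the per-week change is small relative to normal variation.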

4. The non-deterministic regression

You change the system prompt to fix one annoying behavior. You eyeball ten outputs. They look great. You ship. Three days later a different, rarer behavior has gotten worse, and because the model is non-deterministic you cannot even reproduce the original good output to compare against.

The mistake here is treating a prompt change like a config tweak instead of like a code change that needs a regression suite. Every prompt edit is a deploy. It deserves the same gate a deploy gets: run it against a held-out scored set, and block the merge if aggregate quality regresses, even if your one cherry-picked example improved.
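
A gate of that kind might look like the sketch below. It assumes both prompt versions were scored on the same held-out examples; `ScoredExample`, `gatePromptChange`, and the `tolerance` default are hypothetical.

```typescript
interface ScoredExample {
  id: string;
  score: number; // rubric score, assumed to lie in [0, 1]
}

interface GateDecision {
  allow: boolean;
  baselineMean: number;
  candidateMean: number;
  regressedIds: string[]; // examples that got worse by more than the tolerance
}

// Compares a candidate prompt against the baseline on examples present in both
// runs and blocks the change if the aggregate score drops beyond the tolerance.
function gatePromptChange(
  baseline: ScoredExample[],
  candidate: ScoredExample[],
  tolerance = 0.01, // illustrative default
): GateDecision {
  const baselineById = new Map<string, number>();
  for (const ex of baseline) baselineById.set(ex.id, ex.score);

  let baselineSum = 0;
  let candidateSum = 0;
  let paired = 0;
  const regressedIds: string[] = [];

  for (const ex of candidate) {
    const before = baselineById.get(ex.id);
    if (before === undefined) continue; // skip examples with no baseline score
    baselineSum += before;
    candidateSum += ex.score;
    paired += 1;
    if (ex.score < before - tolerance) regressedIds.push(ex.id);
  }

  // With nothing to compare, fail closed rather than wave the change through.
  if (paired === 0) {
    return { allow: false, baselineMean: 0, candidateMean: 0, regressedIds };
  }

  const baselineMean = baselineSum / paired;
  const candidateMean = candidateSum / paired;
  return {
    allow: candidateMean >= baselineMean - tolerance,
    baselineMean,
    candidateMean,
    regressedIds,
  };
}
```

Returning `regressedIds` alongside the aggregate keeps the cherry-picked improvement from hiding the examples that got worse.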

5. The correct answer at the wrong time

The agent eventually produces the right answer — after eleven seconds and four tool calls, by which point the user has already left. Correctness and usefulness are not the same metric, and most eval harnesses only measure the first one. In production, a right answer that arrives outside the latency SLO is a failure, full stop.
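
One way to keep the two metrics apart is to report them side by side. This is a minimal sketch; `TimedResult`, `scoreWithSlo`, and the SLO value in the usage example are assumptions.

```typescript
interface TimedResult {
  correct: boolean;  // did the answer pass the correctness check?
  latencyMs: number; // end-to-end time until the answer was delivered
}

// Reports plain correctness next to "usefulness", where an answer only
// counts as useful if it is correct and arrived within the latency SLO.
function scoreWithSlo(
  results: TimedResult[],
  sloMs: number,
): { correctRate: number; usefulRate: number } {
  if (results.length === 0) return { correctRate: 0, usefulRate: 0 };
  const correct = results.filter((r) => r.correct).length;
  const useful = results.filter((r) => r.correct && r.latencyMs <= sloMs).length;
  return {
    correctRate: correct / results.length,
    usefulRate: useful / results.length,
  };
}

// Usage: the eleven-second answer is correct but misses a hypothetical 5 s SLO.
const rates = scoreWithSlo(
  [
    { correct: true, latencyMs: 11000 },
    { correct: true, latencyMs: 2000 },
    { correct: false, latencyMs: 1000 },
    { correct: true, latencyMs: 3000 },
  ],
  5000,
);
console.log(rates); // { correctRate: 0.75, usefulRate: 0.5 }
```

The gap between the two rates is the share of answers that were right but too late.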

The two halves you actually need

Here is the through-line across all five: every one of them is invisible if you only look at the final output, and every one is trivial to catch if you can see both the score and the trace that produced it. That split is why I run two tools as a single workflow rather than picking one.

agent-eval is the gate on the output. It scores answers against a rubric, runs deterministic checks where it can and model-as-judge where it must, tracks drift over time, and flags hallucinations. Critically, it can fail a build or block a release when aggregate quality regresses. It answers "is this answer good enough, right now, across the distribution?"

AgentLens is the trace of how the agent got there. It captures every model and tool step, the resolved inputs to each call, and the raw outputs, including the errored tool call, the empty search result, and the forty-times-repeated argument. It answers "why did the agent produce this?"

You need both because a score without a trace is a number you cannot act on. agent-eval tells you answer quality dropped four percent this week; AgentLens tells you it is because the retrieval tool started timing out and the model started answering from memory — failure mode number one, now visible instead of silent. The eval gives you the alarm; the trace gives you the root cause in the same view. Run them apart and you are stuck staring at a red dashboard with no idea which of the five modes you are looking at.

// agentLens and agentEval are client instances assumed to be in scope;
// detectPathologicalLoop is the function defined earlier.
async function gateRelease(traceId: string): Promise<boolean> {
  const trace = await agentLens.getTrace(traceId);     // every step, resolved I/O
  const result = await agentEval.score(trace.output, {
    rubric: "support-quality-v3",
    checks: ["no-uncited-claims", "within-latency-slo"],
    judge: "model-as-judge",
  });

  if (!result.passed) {
    // The score told us it failed; the trace tells us why.
    const loop = detectPathologicalLoop(trace.steps);
    console.error("release gate failed", {
      score: result.score,
      reasons: result.failedChecks,
      looping: loop.looping ? loop.repeatedTool : null,
    });
  }
  return result.passed;
}

What to do Monday

You do not need to solve all five at once. You need to stop pretending staging covers them. Pick the one that scares you most — for most teams it is the silent tool downgrade — and make it observable: capture the full trace, write the eval that detects it, and wire that eval into something that can actually block a bad release.

The agents are good enough. The systems around them are what fail. Build the systems like you mean it.

Top comments (1)

Mehmet Can Farsak

I've seen a variant of this with ideation prompts. You ask an agent to explore ideas, and it silently downgrades from divergent thinking straight to tool calls and code generation — no error, just execution drift. The model 'helpfully' acts on the first idea instead of expanding on it.

I put together Brainstorm-Mode (mehmetcanfarsak/Brainstorm-Mode on GitHub) to address this at the hook level. PreToolUse hooks block tool calls during ideation phases, and three modes (divergent, actionable, academic) let you control the thinking style. Keeps agents from the 'silent downgrade' into premature execution.