My Agent Said the Tool Call Succeeded. It Had 404'd.

#ai #agents #claudecode #debugging

I was debugging a GitHub issue in an open-source orchestrator repo: "subagent MODEL_CALL_FAILED gets swallowed." The reporter's repro was simple — point a subagent at an unresolvable model slug (a typo'd OpenRouter id), run the workflow, and watch the parent orchestrator report the subagent finished successfully. Empty output, green checkmark, no error anywhere in the logs.

That's the scary part. Not that a model call failed — calls fail all the time, 404s, invalid keys, rate limits. The scary part is that failure got translated into a success by the time it reached the part of the system that decides whether to trust the result.

Finding where the lie started

The orchestrator's tool loop classifies the outcome of every model call into one of a few buckets: success, retryable, terminal. A terminal classification means "this isn't coming back, stop retrying." Here's roughly the shape of the code before the fix:

// tool-loop.ts
if (classification === "terminal") {
  return { done: true, output: "" };
}

if (classification === "retryable" && attempt >= maxRetries) {
  if (config.mode === "task") {
    return { done: true, output: "", isError: true };
  }
  return { done: true, output: "" };
}

See the asymmetry? The retryable-exhausted branch checks config.mode === "task" and sets isError: true when this is a subagent task, not a plain chat turn. The terminal branch — the one hit by a 401, a 403, a 404, an invalid model id — does neither. It always returns success-shaped output, no matter what mode you're in.

Downstream, workflow-entry.ts's finalizeDone takes that { done: true, output: "" } and rolls it up into the parent as a normal subagent-result. Nothing in the shape of that object says "this failed." It looks exactly like a subagent that ran, did nothing, and returned nothing — which is a legitimate outcome for some tasks. The orchestrator has no way to tell the difference between "nothing to do" and "the model never responded."

The fix was one line, finding it wasn't

if (classification === "terminal") {
  if (config.mode === "task") {
    return { done: true, output: "", isError: true };
  }
  return { done: true, output: "" };
}

Same check, applied to the branch that was missing it. A terminal failure in task mode now surfaces as isError: true, which the parent orchestrator already knows how to handle — it shows up as a failed subagent instead of a silent empty success.

The actual patch took ten minutes. Confirming it was the bug took longer: I wrote a regression test first, watched it fail red against the original code (confirming the swallow was real and reproducible, not a red herring), applied the fix, watched it go green, then ran the full tool-loop.test.ts suite plus the broader package suite to make sure nothing else depended on the old behavior. Nothing did — the missing check had no test coverage on the terminal path at all, which is exactly how it slipped in.

Why this generalizes past this one repo

Any orchestrator with a "terminal vs retryable" distinction has this exact class of bug available to it: someone adds the isError flag when handling retry-exhaustion, forgets to also add it to the sibling terminal-failure branch, and now your system has two ways to fail — one loud, one silent. The silent one is worse than a crash, because a crash gets noticed. A silent success gets trusted, composed into other decisions, and shipped.

This is also why I don't let an agent's own summary of what it did stand as the sole record. When I dispatch a subagent for a task, "I finished the migration" is a claim, not evidence. The claim describes what the subagent intended to do; it doesn't prove the tool calls underneath it actually returned what it thinks they returned. If a model call inside that subagent had 404'd on a terminal path with the swallow bug still present, its final summary would confidently say "done" — because from its own vantage point, nothing told it otherwise.

The practical habit I've settled on:

Treat "terminal" and "retry-exhausted" as the same severity when it comes to error propagation. If one path sets an error flag, the sibling path handling the other failure mode needs the identical check — write the test for both, not just the one that prompted the fix.
Before trusting any multi-step agent result, ask what the underlying tool calls returned, not what the agent said about them. A repro-first regression test (red before the fix, green after) is the only thing that actually confirms the mechanism, versus a plausible-sounding story about what probably went wrong.
If your system has more than one "this failed" code path, grep for every place that checks isError or its equivalent, and verify each failure classification reaches it. It's cheap to check once and expensive to find out in production that one branch never did.

Silent success is the most dangerous failure mode an agent can have, because nothing downstream knows to distrust it.

Top comments (1)

Skillselion • Jul 6

The asymmetry you found is the version of this bug that keeps coming back: two exit branches, only one carries the error flag. What makes it so easy to reintroduce is that the empty-success shape is indistinguishable from a legitimate no-op, so nothing downstream asserts on it. The structural fix that has held up for me is to make done-with-no-error carry a reason enum rather than an empty string, so the terminal and retry-exhausted paths cannot both collapse to the same object. Then the parent can branch on reason, and nothing-to-do stops being the same value as model-never-answered. Your instinct to write the red test against the terminal path first is the part most people skip, and it is exactly why that branch had no coverage to begin with.