
zvone187

Posted on • Originally published at blog.pazi.ai

5 silent failure modes in production AI agents (and how we instrument for them)

AI agents fail differently than apps. The failure rarely lives in the work itself. It lives in the seams: the delivery step, the tool call, the inbound routing decision, the bootstrap that ate the budget. None of those surface as exceptions, so APM dashboards say "green" while the user sees nothing. Here are five failure modes that show up that way, and how we instrument or defend against each.

[Diagram: where each silent failure lives in the agent pipeline]

1. Crons that "succeed" but never deliver

The cron framework doesn't know what the user received; it knows what the agent reported per run. Our runtime persists lastDeliveryStatus as a three-state field ("delivered" | "not-delivered" | "not-requested"), but those states are the agent's self-report. A run that creates its side effects and then runs out of budget before the announce step still serializes the side effects as done. We saw this concretely with one of our bug-triage crons: on a 300-second timeout with about 75 seconds eaten by bootstrap, the agent successfully created a GitHub issue, opened a Jira ticket, and updated a sheet, then hit waitedMs=298401 and got cut before the Slack announce step ever ran. The framework recorded the run as delivered, but no Slack message was ever sent.
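
For concreteness, the per-run record looks roughly like this. The field name and its three states are real; the surrounding shape is a sketch, not our actual schema:

```typescript
type DeliveryStatus = "delivered" | "not-delivered" | "not-requested";

interface CronRunRecord {
  jobId: string;
  runId: string;
  // Self-reported by the agent, not confirmed by the channel. A run that
  // finishes its side effects and gets cut before the announce step can
  // still serialize this as "delivered".
  lastDeliveryStatus: DeliveryStatus;
}
```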

Our Sentry log transport matches cron: ... ERROR lines from the cron service and forwards them as Sentry events tagged with the component, the job id, the cron name, the run id, and the error count, so a timeout that used to be a buried log line becomes an event you can search by job. The deeper fix from that same incident isn't observability, though. It's sequencing: user-facing announces have to run before cleanup so the announcement gets budget priority. We come back to that in failure 5.
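
The forwarding path, as a minimal sketch against @sentry/node. The line format and context fields here are illustrative; our actual transport differs in the details:

```typescript
import * as Sentry from "@sentry/node";

// Match ERROR lines from the cron service and promote them to searchable
// Sentry events instead of buried log lines.
const CRON_ERROR = /^cron:.*ERROR/;

function forwardCronLog(
  line: string,
  ctx: { jobId: string; cronName: string; runId: string; errorCount: number }
) {
  if (!CRON_ERROR.test(line)) return;
  Sentry.withScope((scope) => {
    scope.setTags({
      component: "cron",
      job_id: ctx.jobId,
      cron_name: ctx.cronName,
      run_id: ctx.runId,
      error_count: String(ctx.errorCount),
    });
    Sentry.captureMessage(line, "error");
  });
}
```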

2. Tool calls that 4xx silently

When a tool wrapper returns an empty string on a non-2xx, or a soft "operation could not be completed" message that doesn't look like a real error, the model has no signal to act on. It reads the response, treats it as a valid no-op, and moves on to the next step. The end-of-run summary says everything succeeded, the audit trail shows the tool was called, and nothing in either trace says the call actually failed.

What we changed is making those failures loud at the runtime layer. The runtime now matches [tools] NAME failed: reason … ERROR patterns from the tool adapter and forwards them to Sentry as captureException events tagged with the tool name, which means a tool's failure rate becomes a metric you can see at a glance instead of something you'd have to grep debug logs for. We also promoted the Slack send-side silent-drop logs from verbose to info, which catches the boring-but-important half of this: rate limits and permission errors at the send boundary that used to be invisible unless you knew to look.
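
The tool-side match is the same idea with captureException, so failures group by tool. Again a sketch, with the regex inferred from the pattern above:

```typescript
import * as Sentry from "@sentry/node";

// "[tools] NAME failed: reason" -> a Sentry event tagged with the tool name,
// so per-tool failure rate is visible at a glance instead of grep-only.
const TOOL_FAILED = /^\[tools\] (\S+) failed: (.+)/;

function forwardToolLog(line: string) {
  const match = TOOL_FAILED.exec(line);
  if (!match) return;
  const [, toolName, reason] = match;
  Sentry.withScope((scope) => {
    scope.setTag("tool", toolName);
    Sentry.captureException(new Error(`tool ${toolName} failed: ${reason}`));
  });
}
```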

3. Channels that suppress inbound messages without telling anyone

When an agent stops responding to DMs, the bug is almost never in the agent. The inbound handler sits between the platform and the agent and decides for every event whether to route it or drop it, and most of its drops are the right call: a bot self-mention shouldn't loop, a message edit shouldn't trigger a fresh run, a thread the agent isn't configured for shouldn't get a reply. The problem is that the handler doesn't tell anyone when it drops something, so a wrong drop and a successful "agent is healthy" look identical from the outside. The user waits, the runtime stays green, and there's nothing in the trace that says a message was even seen.

We changed that with two pieces of structured visibility on the inbound side. The fifteen silent-drop log lines that used to live at verbose now emit at info, so suppressions show up in normal runtime logs without anyone flipping a debug flag. On top of that, every suppression goes through a structured event transport that tags it with one of fourteen canonical reason codes (no-mention, channel-not-allowed, dm-not-authorized, and the rest) alongside the original log line. When someone says "the agent missed my DM," the answer is in those events: grep the message id, get a definitive routing reason, stop guessing.
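
The shape of those events, sketched in TypeScript. The three reason codes shown are the ones named above; the field names around them are illustrative:

```typescript
// Three of the fourteen canonical codes; the rest are omitted here.
type SuppressionReason =
  | "no-mention"
  | "channel-not-allowed"
  | "dm-not-authorized";

interface SuppressionEvent {
  reason: SuppressionReason;
  messageId: string;  // the thing you grep when someone says "missed my DM"
  rawLogLine: string; // the original drop line, carried alongside the code
}

function emitSuppression(event: SuppressionEvent) {
  // info, not verbose: suppressions show up in normal runtime logs
  console.info(JSON.stringify({ type: "inbound-suppression", ...event }));
}
```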

4. Reasoning leakage in Slack threads

A different shape of silent failure: the agent posts a message in a Slack thread that reads "Now calling message(action=send, channel=#alerts) to post the alert", and the alert never goes anywhere because the model narrated the tool call as plain text instead of issuing it. From the user's side it looks like the agent did the work; from the runtime's side no side-effecting tool ran in that turn; and the model itself treated the narration as the work and continued.

This one isn't catchable as an exception, so two layers defend against it at different points in the pipeline. At the prompt layer, the operating contract includes a rule that says "Default: do not narrate routine, low-risk tool calls (just call the tool)." Between tool calls the agent should either deliver a polished user-facing message or stay silent, never narrate the call. At the delivery layer, the channel dispatchers strip reasoning tags before any message lands in Slack, Discord, or Telegram, so even when the model narrates internally the surface the user sees stays clean. We don't yet have an observability backstop for this. No runtime check scans assistant turns for tool-call syntax and fires an event when the pattern hits, and that's a future hardening item we know we owe.
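
The dispatcher-side scrub is small. A minimal sketch, assuming reasoning arrives wrapped in tags like <thinking>; the actual tag names in a given runtime will differ:

```typescript
// Strip reasoning tags before anything lands in Slack, Discord, or Telegram,
// so internal narration never reaches the user-facing surface.
const REASONING_TAGS = /<(thinking|reasoning)>[\s\S]*?<\/\1>/g;

function scrubOutbound(text: string): string {
  return text.replace(REASONING_TAGS, "").trim();
}
```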

5. Bootstrap latency eating the timeout

Bootstrap is real wall-clock time on the same budget as the work, and treating it as free overhead is how a 300-second timeout quietly becomes a 225-second one. In the bug-triage cron incident from failure 1, memory load, credential resolution, and skill scan together took about 75 seconds before the agent did anything productive. The work expanded to fill what was left, and the user-facing announcement, always last in the chain, lost the race when the runtime cut the run at 298 seconds. Nothing threw, so the completion handler logged success normally. Nothing logged that the final tool call had been cut off mid-emission.

This one is a lesson with concrete numbers, not a feature we shipped. Measure your bootstrap from log timestamps the first time it bites, then size your timeout as observed-bootstrap plus observed-work plus a buffer instead of as one flat number. Sequence user-facing announces before cleanup steps so the side effect that actually matters to the user gets the budget priority. The only Sentry visibility on this today is downstream, after the fact: when a cron times out, the resulting error log goes to Sentry tagged with the job id, so the timeout itself is searchable. We don't have any of the standard observability scaffolding for the bootstrap-vs-work split yet. No histogram of bootstrap times, no per-run milestone event, nothing that alerts on a starved work phase. That's how this stays a sizing rule for now, not a tracked metric.
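
The arithmetic is worth writing down explicitly. A sketch with the incident's numbers; the 30-second buffer is a placeholder, not a recommendation:

```typescript
// Size the timeout from observed components, not as one flat number.
function sizeTimeoutMs(
  observedBootstrapMs: number,
  observedWorkMs: number,
  bufferMs = 30_000
): number {
  return observedBootstrapMs + observedWorkMs + bufferMs;
}

// Bug-triage cron: ~75s of bootstrap plus ~225s of work already filled the
// flat 300s budget, which is why the announce step lost the race.
sizeTimeoutMs(75_000, 225_000); // 330_000ms -> the budget this run needed
```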

What this adds up to

The five failures share a fault line: the agent's internal trace says everything is fine, the user's experience says nothing happened, and the dashboard sides with the agent. APM was built for a world where exceptions are the failure signal and a green run means the user got served, but agents don't fit that world. Agent observability has to assume the user is absent, and the instrumentation has to ask whether the side effect actually happened as a first-class question instead of deriving it from whether the run finished.

[Diagram: the same run from two views, internal trace showing success while the user's Slack stays empty]

Our runtime forwards cron and tool errors into Sentry as tagged events, and our Slack inbound handler emits structured suppression events on the same path, so the cron, tool, and suppression failures all become things you can search by name. The other two are harder. Reasoning leakage needs runtime hygiene at the prompt and dispatcher layers, not exception capture. Bootstrap-eats-the-timeout needs budget arithmetic and sequencing, not a Sentry tag. Some silent failures don't resolve to "add another event," and that's part of operating agents in production. There are more failure modes we haven't written up yet.

Pazi is an agent platform that takes production observability seriously: every cron, tool call, and runtime error is captured. Build your first agent at pazi.ai.
