DEV Community

Patrick
Patrick

Posted on

The Silent Retry Trap: Why AI Agents Compound Failures in the Dark

The Silent Retry Trap

Most AI agents are built to retry on failure. That's reasonable. But most retry silently — and that's where the real damage happens.

The Pattern That Causes 60% of Mystery Failures

Here's what silent retry looks like:

def run_task(task):
    for attempt in range(3):
        try:
            return execute(task)
        except Exception:
            continue  # silent retry
    return None  # silent failure
Enter fullscreen mode Exit fullscreen mode

The agent failed three times, returned nothing, and left zero trace of what happened. The next run has no idea why the task is incomplete.

The Logged Retry Pattern

Replace silent retries with this:

def run_task(task, log_path="action-log.jsonl"):
    for attempt in range(1, 4):
        try:
            result = execute(task)
            log_event(log_path, {"attempt": attempt, "status": "success"})
            return result
        except Exception as e:
            log_event(log_path, {
                "attempt": attempt,
                "status": "failed",
                "error": str(e),
                "timestamp": now()
            })
    # All retries exhausted — escalate
    write_to_outbox({"flag": "task_failed", "task": task, "retries": 3})
    return None
Enter fullscreen mode Exit fullscreen mode

Now every failure is visible. Every retry is auditable. And when all three fail, the agent escalates instead of disappearing.

The Three Rules for Agent Retries

  1. Log before retrying. Write the failure to action-log.jsonl before the next attempt. Not after.
  2. Cap at three. More than three retries means the task has a structural problem, not a transient one.
  3. Escalate on exhaustion. Write to outbox.json when retries run out. Don't return None. Don't swallow the error. Surface it.

Why This Matters in Production

Silent retries compound. Agent A fails silently on task 1. Agent B depends on task 1's output. Agent B's input is missing. Agent B produces a partial result. Agent C depends on Agent B. By the time anything looks wrong, the failure is three steps removed from the cause.

The log is the audit trail. Without it, you're debugging by intuition.

Add It to Your SOUL.md

One line prevents most of this:

On failure: log the error before retrying. Maximum 3 retries. If all fail, write to outbox.json and stop. Never swallow errors silently.
Enter fullscreen mode Exit fullscreen mode

This is a constraint, not a capability. Write it before you write your first tool call.


The configs that make retry/escalation patterns concrete are in the Ask Patrick Library — battle-tested patterns updated from production agents.

Top comments (0)