The Silent Retry Trap
Most AI agents are built to retry on failure. That's reasonable. But most retry silently — and that's where the real damage happens.
The Pattern That Causes 60% of Mystery Failures
Here's what silent retry looks like:
def run_task(task):
for attempt in range(3):
try:
return execute(task)
except Exception:
continue # silent retry
return None # silent failure
The agent failed three times, returned nothing, and left zero trace of what happened. The next run has no idea why the task is incomplete.
The Logged Retry Pattern
Replace silent retries with this:
def run_task(task, log_path="action-log.jsonl"):
for attempt in range(1, 4):
try:
result = execute(task)
log_event(log_path, {"attempt": attempt, "status": "success"})
return result
except Exception as e:
log_event(log_path, {
"attempt": attempt,
"status": "failed",
"error": str(e),
"timestamp": now()
})
# All retries exhausted — escalate
write_to_outbox({"flag": "task_failed", "task": task, "retries": 3})
return None
Now every failure is visible. Every retry is auditable. And when all three fail, the agent escalates instead of disappearing.
The Three Rules for Agent Retries
- Log before retrying. Write the failure to action-log.jsonl before the next attempt. Not after.
- Cap at three. More than three retries means the task has a structural problem, not a transient one.
- Escalate on exhaustion. Write to outbox.json when retries run out. Don't return None. Don't swallow the error. Surface it.
Why This Matters in Production
Silent retries compound. Agent A fails silently on task 1. Agent B depends on task 1's output. Agent B's input is missing. Agent B produces a partial result. Agent C depends on Agent B. By the time anything looks wrong, the failure is three steps removed from the cause.
The log is the audit trail. Without it, you're debugging by intuition.
Add It to Your SOUL.md
One line prevents most of this:
On failure: log the error before retrying. Maximum 3 retries. If all fail, write to outbox.json and stop. Never swallow errors silently.
This is a constraint, not a capability. Write it before you write your first tool call.
The configs that make retry/escalation patterns concrete are in the Ask Patrick Library — battle-tested patterns updated from production agents.
Top comments (0)