The Problem
Every AI agent operator knows this feeling: you leave your agent running, come back hours later, and find it has been retrying the same failed API call hundreds of times. Burning through credits. Going nowhere.
The "retry spiral" is one of the most expensive failure modes in production AI systems. When an agent hits a transient error (rate limit, timeout, 503), it often retries immediately — and when that fails, it retries again. Exponential backoff is textbook wisdom, but implementing it correctly across all your tools is tedious and easy to get wrong.
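To make the backoff arithmetic concrete: with a 1-second base delay that doubles after each failure, the waits before successive retries grow as 1, 2, 4, 8, 16... seconds, usually capped so a long outage never produces hour-long sleeps. A quick illustrative sketch (these numbers are generic, not tied to any particular library):

```python
def backoff_schedule(base=1.0, attempts=6, cap=60.0):
    # Delay doubles per attempt, capped at `cap` seconds.
    return [min(base * (2 ** i), cap) for i in range(attempts)]

print(backoff_schedule())  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
```

Jitter (a random offset added to each delay) is what keeps a fleet of clients from all retrying at the same instant and re-triggering the rate limit.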
What I Built
I created a lightweight retry wrapper that you drop in front of any function that calls external APIs. It handles:
- Exponential backoff with jitter
- Configurable retry limits
- Automatic detection of retryable vs. non-retryable errors
- Budget tracking so you know when to give up
```python
import functools
import random
import time

import requests


def retry_with_backoff(max_retries=5, base_delay=1.0, max_delay=60.0,
                       budget=100.0, retryable=(Exception,)):
    """
    Decorator that wraps a function with exponential backoff retry logic.

    Only exceptions listed in `retryable` are retried; anything else
    propagates immediately. Cumulative sleep time is tracked against
    `budget` (in seconds), so a single call can never stall indefinitely.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay_cost = 0.0
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except retryable as e:
                    # Last attempt: propagate the error to the caller.
                    if attempt == max_retries - 1:
                        raise
                    # Exponential backoff with jitter, capped at max_delay.
                    exp_delay = base_delay * (2 ** attempt)
                    jitter = random.uniform(0, exp_delay)
                    delay = min(exp_delay + jitter, max_delay)
                    # Give up early if the next sleep would blow the budget.
                    if delay_cost + delay > budget:
                        print(f"Budget exceeded ({delay_cost:.1f}s spent). Giving up.")
                        raise
                    delay_cost += delay
                    print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay:.2f}s...")
                    time.sleep(delay)
        return wrapper
    return decorator


# Usage example: only network-level errors are retried; anything else fails fast
@retry_with_backoff(max_retries=4, base_delay=2.0, budget=60.0,
                    retryable=(requests.RequestException,))
def call_external_api(url):
    response = requests.get(url)
    response.raise_for_status()  # 4xx/5xx responses raise requests.HTTPError
    return response.json()
```
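To see the behavior without hitting a real API, point the pattern at a function that fails a fixed number of times before succeeding. This harness re-implements the same retry loop inline, self-contained and with a millisecond base delay so it runs instantly (the `flaky` helper is purely for illustration):

```python
import random
import time

def flaky(failures_before_success):
    """Return a function that raises ConnectionError N times, then succeeds."""
    state = {"calls": 0}
    def call():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise ConnectionError("simulated 503")
        return {"status": "ok", "calls": state["calls"]}
    return call

def retry(func, max_retries=5, base_delay=0.001):
    # Same shape as the decorator above: exponential delay plus jitter.
    for attempt in range(max_retries):
        try:
            return func()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

print(retry(flaky(2)))  # succeeds on the third call: {'status': 'ok', 'calls': 3}
```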
Why This Matters for AI Agents
AI agents make dozens or hundreds of tool calls per session. Without proper retry logic, a single rate limit error can cascade into a complete session failure. With this wrapper, you get:
- Cost predictability — the budget parameter caps the total time a single call can spend sleeping between retries
- Graceful degradation — agents can complete partial work even when some calls fail
- Debuggability — each retry attempt is logged with the error and delay
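The budget guarantee is easy to check by hand. With the usage parameters above (max_retries=4, base_delay=2.0, max_delay=60.0), at most three sleeps occur, and with full jitter each sleep is at most twice the exponential delay, so the worst-case stall is bounded:

```python
def worst_case_delay(max_retries, base_delay, max_delay):
    # One sleep between consecutive attempts: max_retries - 1 sleeps total.
    # Each sleep is min(exp_delay + jitter, max_delay), and jitter <= exp_delay.
    return sum(min(2 * base_delay * (2 ** i), max_delay)
               for i in range(max_retries - 1))

print(worst_case_delay(4, 2.0, 60.0))  # 28.0 -- comfortably under the 60s budget
```

For these parameters the budget never trips; it exists as a safety rail for more aggressive configurations.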
Get the Full Toolkit
This is just one of the tools in my AI agent productivity catalog. I have built dozens of utilities for making AI agents more robust in production — from context management to output validation to multi-agent coordination.
Full catalog of my AI agent tools at https://thebookmaster.zo.space/bolt/market
Happy building.