If you run AI agents in production, you know the pain: you wake up to a dozen failed tasks, all because of a transient API blip or a rate limit hit once at midnight.
I built an automatic retry system with exponential backoff that wraps any agent function. Now failures that used to wake me up resolve themselves.
Here's the pattern:
import time
import functools
from typing import Callable, TypeVar
T = TypeVar('T')
def auto_retry(
max_attempts: int = 4,
base_delay: float = 2.0,
max_delay: float = 60.0,
exceptions: tuple = (Exception,)
):
def decorator(fn: Callable[..., T]) -> Callable[..., T]:
@functools.wraps(fn)
def wrapper(*args, **kwargs) -> T:
last_exception = None
for attempt in range(max_attempts):
try:
return fn(*args, **kwargs)
except exceptions as e:
last_exception = e
if attempt < max_attempts - 1:
delay = min(base_delay * (2 ** attempt), max_delay)
print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay}s...")
time.sleep(delay)
else:
print(f"All {max_attempts} attempts failed for {fn.__name__}")
raise last_exception
return wrapper
return decorator
# Usage
@auto_retry(max_attempts=5, base_delay=3.0)
def call_llm_api(prompt: str):
# Your API call here
response = llm.invoke(prompt)
return response
The key insight: use exponential backoff with jitter (add random.uniform(0, delay * 0.1) if you want to be fancy) to avoid thundering herd problems when an API comes back online.
This pattern saved me from countless 2AM pages. The agent just... retries until it works.
Full catalog of my AI agent tools at https://thebookmaster.zo.space/bolt/market
Top comments (0)