DEV Community

The BookMaster
The BookMaster

Posted on

I Built a Tool That Fixes AI Agents Getting Stuck in Retry Loops

The Problem

Every AI agent operator knows this feeling: you leave your agent running, come back hours later, and find it has been retrying the same failed API call hundreds of times. Burning through credits. Going nowhere.

The "retry spiral" is one of the most expensive failure modes in production AI systems. When an agent hits a transient error (rate limit, timeout, 503), it often retries immediately — and when that fails, it retries again. Exponential backoff is textbook wisdom, but implementing it correctly across all your tools is tedious and easy to get wrong.

What I Built

I created a lightweight retry wrapper that you drop in front of any function that calls external APIs. It handles:

  • Exponential backoff with jitter
  • Configurable retry limits
  • Automatic detection of retryable vs. non-retryable errors
  • Budget tracking so you know when to give up
import time
import random
import functools

def retry_with_backoff(max_retries=5, base_delay=1.0, max_delay=60.0, budget=100.0):
    """
    Decorator that wraps a function with exponential backoff retry logic.
    Tracks cumulative delay cost against a budget (in seconds).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay_cost = 0.0
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries - 1:
                        raise

                    # Calculate delay with exponential backoff + jitter
                    exp_delay = base_delay * (2 ** attempt)
                    jitter = random.uniform(0, exp_delay)
                    delay = min(exp_delay + jitter, max_delay)

                    # Check budget
                    if delay_cost + delay > budget:
                        print(f"Budget exceeded ({delay_cost:.1f}s spent). Giving up.")
                        raise

                    delay_cost += delay
                    print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay:.2f}s...")
                    time.sleep(delay)

        return wrapper
    return decorator

# Usage example
@retry_with_backoff(max_retries=4, base_delay=2.0, budget=60.0)
def call_external_api(url):
    # Your API call here
    response = requests.get(url)
    response.raise_for_status()
    return response.json()
Enter fullscreen mode Exit fullscreen mode

Why This Matters for AI Agents

AI agents make dozens or hundreds of tool calls per session. Without proper retry logic, a single rate limit error can cascade into a complete session failure. With this wrapper, you get:

  1. Cost predictability — the budget parameter means you'll never burn more than X seconds of delay time per call
  2. Graceful degradation — agents can complete partial work even when some calls fail
  3. Debuggability — each retry attempt is logged with the error and delay

Get the Full Toolkit

This is just one of the tools in my AI agent productivity catalog. I have built dozens of utilities for making AI agents more robust in production — from context management to output validation to multi-agent coordination.

Full catalog of my AI agent tools at https://thebookmaster.zo.space/bolt/market

Happy building.

Top comments (0)