DEV Community

zhongqiyue

How I Stopped Hitting AI API Rate Limits (And Kept My Sanity)

The First Time I Wanted to Throw My Laptop Out the Window

It was 2 AM. I had just finished wiring up an AI text generation feature into my side project. The demo worked beautifully—three prompts in a row, perfect responses. I smiled, closed my laptop, and went to bed.

The next day, I ran a batch of 100 prompts for a blog post generator. Within 30 seconds, everything broke. Rate limit errors flooded my console. Some calls returned 429, others 503. A few just hung forever. The entire pipeline ground to a halt, and I spent the next three hours manually retrying requests, rewriting error handlers, and questioning my life choices.

Sound familiar? If you’ve ever integrated any AI API—OpenAI, Anthropic, or even a smaller provider—you’ve probably hit this wall. The APIs are incredible, but they’re not bulletproof. They have rate limits, they have transient errors, and they have timeout quirks.

Today I want to share the approach that finally saved me: a resilient, generic retry-and-backoff pattern that works with almost any AI API. No magic, just solid code.

What I Tried That Didn’t Work

1. Simple time.sleep() after errors

import time

import requests

def call_ai_api(prompt):
    url = "https://api.example.com/generate"
    response = requests.post(url, json={"prompt": prompt})
    if response.status_code == 429:
        time.sleep(10)  # Wait 10 seconds and hope
        response = requests.post(url, json={"prompt": prompt})  # Same call again, only once
    return response.json()

This worked... sometimes. But it didn’t handle multiple consecutive failures, and waiting a fixed 10 seconds wasted time when the API recovered quickly.
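One way to stop guessing at the wait is to read the Retry-After header, which many APIs (though not all) send along with a 429 or 503. Here is a minimal sketch, assuming the header holds either a number of seconds or an HTTP date. The function name `retry_after_seconds` and its `default` and `cap` parameters are my own placeholders, not part of any provider's SDK:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, default=10.0, cap=60.0, now=None):
    """Return how long to wait, using a Retry-After header when one is present."""
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    try:
        # Form 1: a delay in seconds, e.g. "Retry-After: 3"
        delay = float(value)
    except ValueError:
        try:
            # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default  # Unparseable header: fall back to the default wait
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        delay = (when - now).total_seconds()
    # Never wait a negative time, and never longer than the cap
    return max(0.0, min(delay, cap))
```

With requests you could pass `response.headers` straight in, since that mapping is case-insensitive. A plain dict needs the exact key "Retry-After".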

2. Manual exponential backoff with a for loop

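import time

import requests

def call_with_backoff(prompt, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = requests.post("https://api.example.com/generate", json={"prompt": prompt})
            response.raise_for_status()  # Turn 4xx/5xx responses into exceptions
            return response.json()
        except requests.exceptions.RequestException:
            wait = 2 ** attempt  # Exponential: 1s, 2s, 4s, ...
            print(f"Retry {attempt+1}, waiting {wait}s")
            time.sleep(wait)
    raise Exception("All retries failed")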

Better, but still fragile. It didn’t distinguish between different error types (e.g., 401 vs 429), and it didn’t add jitter—so if multiple clients retried at the same time, they’d all hammer the API in sync. Also, handling timeouts correctly required extra logic.
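The missing jitter is easy to picture in code. Below is a minimal sketch of "full jitter" backoff, where each wait is drawn uniformly between zero and the exponential ceiling. The names `backoff_delay`, `base`, and `cap` are illustrative and were not in my original script:

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full-jitter delay in seconds for a zero-based attempt number."""
    # Exponential ceiling: base, 2*base, 4*base, ... capped at `cap`
    ceiling = min(cap, base * (2 ** attempt))
    # Pick a random point below the ceiling so clients do not retry in lockstep
    return rng.uniform(0, ceiling)


# Two clients that fail at the same moment now wait different amounts
client_a = random.Random(1)
client_b = random.Random(2)
for attempt in range(4):
    print(attempt, backoff_delay(attempt, rng=client_a), backoff_delay(attempt, rng=client_b))
```

The trade-off is that any single wait can be much shorter than the ceiling, which is fine as long as the ceiling keeps growing across attempts.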

I needed something production-grade without reinventing the wheel.

What Eventually Worked: The tenacity Library

I discovered tenacity, a Python library that handles retries with elegance. It lets you define retry policies with minimal code, including exponential backoff, jitter, stop conditions, and even custom retry callbacks.

Here’s the pattern I now use for any AI API call:

import requests
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    wait_random,
    retry_if_exception_type,
    before_sleep_log,
)
import logging

logger = logging.getLogger(__name__)

# Define which errors are retryable
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}

class RetryableAPIError(Exception):
    pass

def make_request(url, payload, headers):
    response = requests.post(url, json=payload, headers=headers, timeout=30)
    if response.status_code in RETRYABLE_STATUSES:
        raise RetryableAPIError(f"API returned {response.status_code}: {response.text[:100]}")
    response.raise_for_status()  # Raise for non-retryable errors (4xx except 429)
    return response.json()

@retry(
    stop=stop_after_attempt(5),
    # wait_exponential has no jitter of its own, so add up to 1s of randomness
    wait=wait_exponential(multiplier=1, min=2, max=60) + wait_random(0, 1),
    # Timeouts and dropped connections are usually transient, so retry those too
    retry=retry_if_exception_type(
        (RetryableAPIError, requests.exceptions.Timeout, requests.exceptions.ConnectionError)
    ),
    before_sleep=before_sleep_log(logger, logging.WARNING),
    reraise=True
)
def call_ai_api_with_retry(prompt, api_key, endpoint_url):
    # Example using a generic AI API — adapt to your provider
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    payload = {
        "prompt": prompt,
        "max_tokens": 200,
        "temperature": 0.7
    }
    return make_request(endpoint_url, payload, headers)

Why this works:

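  • Exponential backoff spreads out retries, and random jitter on top keeps clients from retrying in lockstep (the thundering herd). Note that tenacity's wait_exponential adds no jitter by itself; combine it with wait_random or use wait_random_exponential.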
  • Only retries specific errors (like 429 or 5xx). Permanent failures (e.g., 401 bad key) fail fast.
  • Logging between retries tells me what’s happening without spamming.
  • Timeout on the request (30s) prevents hanging on unresponsive endpoints.
  • Configurable max attempts – 5 is usually enough. If it fails after 5, something is fundamentally wrong.
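If the decorator feels like a black box, here is roughly what a retry policy like this boils down to, written with only the standard library. This is a simplified sketch for illustration, not tenacity's actual implementation, and `call_with_retry`, `RetryableError`, and `PermanentError` are made-up names:

```python
import time


class RetryableError(Exception):
    """Stand-in for failures worth retrying, such as a 429 or 503."""


class PermanentError(Exception):
    """Stand-in for failures that will not fix themselves, such as a 401."""


def call_with_retry(func, max_attempts=5, base=2.0, cap=60.0, sleep=time.sleep):
    """Call func(), retrying only RetryableError with capped doubling waits."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RetryableError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller
            # Wait base, 2*base, 4*base, ... but never longer than cap
            sleep(min(cap, base * 2 ** (attempt - 1)))
        # PermanentError is deliberately not caught, so it fails on the first try
```

Passing `sleep` as a parameter is a small design choice that pays off in tests: you can hand in a list's `append` method and check the waits without actually waiting.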

I also added a context manager to handle different APIs:

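from contextlib import contextmanager

@contextmanager
def resilient_ai_session(api_key):
    # One Session per provider: reuses connections and carries the auth header
    session = requests.Session()
    session.headers.update({"Authorization": f"Bearer {api_key}"})
    try:
        yield session
    finally:
        session.close()

# call_with_session is a hypothetical helper: same retry idea, sent through the session
@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=2, max=60),
       retry=retry_if_exception_type(RetryableAPIError), reraise=True)
def call_with_session(session, url, prompt, timeout=30):
    response = session.post(url, json={"prompt": prompt}, timeout=timeout)
    if response.status_code in RETRYABLE_STATUSES:
        raise RetryableAPIError(f"API returned {response.status_code}")
    response.raise_for_status()
    return response.json()

# Usage:
with resilient_ai_session(API_KEY) as session:
    result = call_with_session(session, "https://ai.interwestinfo.com/v1", prompt)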

Yes, the URL above is a real example—I’ve used a similar setup for a self-hosted AI endpoint. But the technique works for any provider.

Lessons Learned

  1. Don’t ignore jitter. A few milliseconds of randomness can save your entire backend from collapsing under synchronized retries.
  2. Log everything retry-related. You’ll thank yourself when debugging a silent failure at 3 AM.
  3. Separate retryable vs. non-retryable errors. A 401 or 403 means your credentials are wrong or lack permission, so retrying is pointless and could get you banned.
  4. Consider a circuit breaker for heavy load scenarios. For batch processing, tenacity’s retry is fine, but if you’re making millions of calls, look into pybreaker or something similar.
  5. Test with a mock server. Use responses or requests-mock to simulate 429s before you hit production.
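To make lesson 4 concrete, here is a bare-bones circuit breaker using only the standard library. It is a sketch of the general idea, not pybreaker's API. The class name, `failure_threshold`, and `reset_timeout` are assumptions I picked for illustration:

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is refusing calls."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0      # Consecutive failures seen so far
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed: let one trial call through (half-open state)
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # Open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

The breaker wraps the retrying call, not the other way around: once it opens, new calls fail immediately instead of each burning through five retries against an API that is already down.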

What I’d Do Differently Next Time

If I were starting from scratch, I'd build this retry logic into a custom client class from day one, rather than sprinkling try/except blocks everywhere. Also, I'd use asyncio for concurrent calls (tenacity's @retry decorator also works on coroutine functions), but that's another article.

One more thing: rate limiting is not just about retries. You also want to throttle your own requests to stay under the API's limits. I eventually added a client-side limiter (using the ratelimit library) so I never sent more than, say, 60 requests per minute. Pairing rate limiting with retry logic is the real secret sauce.
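A token bucket is one common way to do this kind of client-side throttling. The sketch below is a hand-rolled version using only the standard library, so it is not the ratelimit library's API. The `TokenBucket` name and its parameters are my own, and it is not thread-safe as written:

```python
import time


class TokenBucket:
    def __init__(self, rate_per_minute=60, capacity=1, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate_per_minute / 60.0  # Tokens added per second
        self.capacity = float(capacity)     # Maximum burst size
        self.tokens = self.capacity
        self.clock = clock
        self.sleep = sleep
        self.updated = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last check, up to capacity
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def acquire(self):
        """Block until one token is available, then consume it."""
        self._refill()
        while self.tokens < 1:
            # Sleep just long enough for the missing fraction of a token to accrue
            self.sleep((1 - self.tokens) / self.rate)
            self._refill()
        self.tokens -= 1
```

Calling `bucket.acquire()` right before each API request spaces calls about one second apart at the default settings. A larger `capacity` allows short bursts, at the cost of briefly exceeding the per-minute average.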

Your Turn

The next time you integrate any AI API, don’t write your own retry logic from scratch. Use a battle-tested library, define your retryable errors, add exponential backoff with jitter, and log the retries. Your future self will thank you.

What’s your go-to pattern for handling flaky APIs? Have you found a better approach? Let me know in the comments—I’m always looking to improve this setup.
