zhongqiyue

Posted on Jun 4

How I Stopped Losing API Calls to Rate Limits (And You Can Too)

#ai #python #webdev #api

I spent a weekend debugging why my app kept silently dropping user requests. The logs showed a pattern: 429 errors flooding in, then my retry logic making things worse, then eventually the whole pipeline grinding to a halt.

I was building a service that analyzes user-submitted text through an external AI API. Every few seconds, the API would return a 429 Too Many Requests. A naive retry-with-delay only made the problem worse—when my code eventually retried, it hit the same burst limit, creating a cascade of failures.

This post is the story of how I fixed that mess. I'll share the approach I landed on, which works with any rate-limited API, not just AI ones. I'll also include the code I now drop into every project that talks to external services.

The Problem (My Problem, Not Yours)

My setup was simple: a Python worker process consumes messages from a Redis queue, calls an AI API (let's call it https://api.interwestinfo.com/v1/analyze—purely as an example), and stores the result. The API had a sliding-window rate limit of 50 requests per minute per IP.

When traffic spiked, I got hit with 429s. My initial fix was a simple time.sleep(1) before retrying. That caused two problems:

The blocking sleep stalled the whole worker, delaying other tasks.
Burst retries often arrived together after the sleep, triggering the next 429 instantly.

I needed something smarter.

What Didn't Work

I tried a few dead ends first:

ThreadPoolExecutor with random wait: Better burst handling, but still no backpressure. If the API stayed down for 10 seconds, I'd burn through all retry attempts.
Using the tenacity library with wait_exponential: Great for individual function calls, but didn't handle global rate limits—each call retried independently, so 20 concurrent calls could all retry at the same time.
Manual token bucket algorithm: I implemented a simple token bucket, but each worker kept its own bucket. Without shared state or a distributed lock (my workers were behind a load balancer), they collectively overconsumed tokens.
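
For reference, a per-process token bucket along the lines of that third attempt might look like the sketch below. The class name, parameters, and injectable clock are my assumptions for illustration, not the original code. Because each worker holds its own bucket, N workers can collectively send N times the quota, which is the overconsumption described above.

```python
import time
from typing import Callable


class LocalTokenBucket:
    """In-process token bucket (hypothetical sketch; state is NOT shared across workers)."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.last_refill = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Two workers, each with its own bucket sized for the full 50/min quota.
fake_now = [0.0]
workers = [LocalTokenBucket(50, 50 / 60, clock=lambda: fake_now[0]) for _ in range(2)]
granted = sum(bucket.try_acquire() for bucket in workers for _ in range(60))
print(granted)  # 100: double the intended global limit
```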

What Eventually Worked: Distributed Rate Limiting with Exponential Backoff

Here's the approach I settled on. It uses Redis as a central coordinator and asyncio to avoid blocking. The key insight: decouple the rate limiter from the retry logic. The rate limiter enforces a global quota; the retry logic handles transient failures individually.

Step 1: A Redis-based Sliding Window Rate Limiter

import time
import uuid
from typing import Optional
import redis.asyncio as aioredis

class SlidingWindowRateLimiter:
    def __init__(self, redis_client: aioredis.Redis, max_requests: int, window_seconds: int):
        self.redis = redis_client
        self.max_requests = max_requests
        self.window_seconds = window_seconds

    async def acquire(self, key: str = "ratelimit:default") -> bool:
        now = time.time()
        window_start = now - self.window_seconds
        # Unique member so two requests with the same timestamp don't overwrite each other
        member = f"{now}:{uuid.uuid4().hex}"
        async with self.redis.pipeline(transaction=True) as pipe:
            # Remove entries that have fallen out of the window
            await pipe.zremrangebyscore(key, 0, window_start)
            # Count requests still in the window (before adding this one)
            await pipe.zcard(key)
            # Tentatively record the current request
            await pipe.zadd(key, {member: now})
            # Set TTL so idle keys don't linger
            await pipe.expire(key, self.window_seconds)
            results = await pipe.execute()
        count = results[1]  # zcard result
        if count < self.max_requests:
            return True
        # Over the limit: roll back the entry we just added
        await self.redis.zrem(key, member)
        return False

This uses a sorted set to track request timestamps. If we're over the limit, we remove the entry we just added and return False. The caller then knows to back off.
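
To make the sorted-set logic easier to follow, here is the same sliding-window check as an in-memory sketch. It is single-process only and uses a deque where the Redis version uses a sorted set; the class name and the injectable clock are assumptions added for illustration.

```python
import time
from collections import deque
from typing import Callable, Deque


class InMemorySlidingWindow:
    """Single-process analogue of the Redis limiter (illustration only)."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.timestamps: Deque[float] = deque()

    def acquire(self) -> bool:
        now = self.clock()
        window_start = now - self.window_seconds
        # Drop timestamps that fell out of the window (ZREMRANGEBYSCORE).
        while self.timestamps and self.timestamps[0] <= window_start:
            self.timestamps.popleft()
        # Count what is left (ZCARD) and admit only if under the limit.
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)  # record this request (ZADD)
            return True
        return False


fake_now = [0.0]
limiter = InMemorySlidingWindow(max_requests=3, window_seconds=60, clock=lambda: fake_now[0])
print([limiter.acquire() for _ in range(4)])  # [True, True, True, False]
fake_now[0] = 61.0
print(limiter.acquire())  # True: the earlier entries have expired
```

Unlike the Redis version, this one checks before recording, so there is nothing to roll back when a request is rejected.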

Step 2: An Exponential Backoff Retrier with Jitter

import asyncio
import random
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar('T')

async def retry_with_backoff(
    coro_factory: Callable[[], Awaitable[T]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    rate_limiter: Optional[SlidingWindowRateLimiter] = None,
    limiter_key: str = "ratelimit:default",
) -> T:
    for attempt in range(max_retries):
        # Check rate limiter before each attempt
        if rate_limiter:
            while not await rate_limiter.acquire(limiter_key):
                # Rate limited: wait before checking again
                await asyncio.sleep(0.5)

        try:
            return await coro_factory()
        except Exception:
            # Note: this catches everything; in production, re-raise non-retryable errors (see Lessons Learned)
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay * 0.1)
            await asyncio.sleep(delay + jitter)
    # Only reachable if max_retries <= 0, since the loop otherwise returns or raises
    raise RuntimeError("Unexpected retry loop exit")
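
To see what delays that formula produces, the sketch below pulls the calculation into a standalone helper. `backoff_delay` is a hypothetical name; the 10% jitter fraction mirrors the code above.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Delay before the retry that follows a failed attempt (0-based)."""
    rng = rng if rng is not None else random.Random()
    delay = min(base_delay * (2 ** attempt), max_delay)
    # Up to 10% extra, so concurrent callers do not wake at the same instant.
    return delay + rng.uniform(0, delay * 0.1)


# Un-jittered part of the schedule for the defaults: doubles, then caps at 60.
print([min(1.0 * 2 ** n, 60.0) for n in range(8)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With only 10% jitter, callers that failed together still retry within a fairly narrow band. Drawing the whole delay from `uniform(0, delay)` spreads them out more, at the cost of some very short waits.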

Putting It Together

Here's how I used it in my worker:

import asyncio
import aiohttp

# Initialize Redis client (your connection details here)
redis_client = aioredis.from_url("redis://localhost:6379/0")
rate_limiter = SlidingWindowRateLimiter(redis_client, max_requests=50, window_seconds=60)

async def call_ai_api(session: aiohttp.ClientSession, text: str) -> dict:
    # Example: using InterWestInfo's AI API endpoint (replace with your own)
    async with session.post(
        "https://api.interwestinfo.com/v1/analyze",
        json={"text": text},
        headers={"Authorization": "Bearer YOUR_TOKEN"},
    ) as response:
        if response.status == 429:
            raise Exception("Rate limited")
        response.raise_for_status()
        return await response.json()

async def process_message(text: str):
    async with aiohttp.ClientSession() as session:
        result = await retry_with_backoff(
            lambda: call_ai_api(session, text),
            max_retries=3,
            base_delay=2.0,
            rate_limiter=rate_limiter,
            limiter_key="api:interwestinfo:analyze",
        )
        # Store result...
        return result

Lessons Learned & Trade-offs

Centralized rate limiter adds latency: Every request incurs a Redis round trip. In my case, that added ~1ms per call, which was acceptable. If you need sub-millisecond checks, consider a local token bucket with periodic synchronization.
Backoff + rate limiter can be too conservative: With both mechanisms, your concurrency drops quickly under heavy load. That's by design—it's better to queue work than to fail constantly.
Not all errors should be retried: I only retry 429 and 5xx. Any other 4xx means the request itself is bad, so retrying it just wastes quota.
Log every retry: I now log each retry attempt with attempt number, delay, and reason. This saved me during debugging.
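
The last two points can be sketched as a small pair of helpers. `ApiError`, `is_retryable`, and `log_retry` are hypothetical names, not part of aiohttp or the code above; in a real worker you would read the status from whatever exception your HTTP client raises.

```python
import logging

logger = logging.getLogger("api_retry")


class ApiError(Exception):
    """Hypothetical exception carrying the HTTP status of a failed call."""

    def __init__(self, status: int, message: str = ""):
        super().__init__(message or f"HTTP {status}")
        self.status = status


def is_retryable(exc: Exception) -> bool:
    """Retry 429 and 5xx; treat every other HTTP status as permanent."""
    if isinstance(exc, ApiError):
        return exc.status == 429 or 500 <= exc.status < 600
    # Assumption: non-HTTP failures (timeouts, connection resets) are transient.
    return isinstance(exc, (TimeoutError, ConnectionError))


def log_retry(attempt: int, delay: float, exc: Exception) -> None:
    """Record attempt number, delay, and reason for each retry."""
    logger.warning("retry attempt=%d delay=%.2fs reason=%r", attempt, delay, exc)


print(is_retryable(ApiError(429)), is_retryable(ApiError(503)), is_retryable(ApiError(400)))
# True True False
```

Inside the retry loop, the idea would be to re-raise immediately when `is_retryable` returns False and to call `log_retry` just before sleeping.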

What I'd Do Differently Next Time

If I built this again from scratch, I'd skip the custom retry logic and lean on battle-tested libraries: tenacity for retries and, where a single process is enough, aiolimiter for rate limiting. Keep in mind that aiolimiter limits within one process, so multiple workers would still need a shared limiter like the Redis one above. But building my own taught me the nuances, especially the importance of jitter and the need to fail fast on non-retryable errors.

Also, I'd add circuit breaker logic. If the API returns 429 ten times in a minute, stop trying entirely for 30 seconds. That's a future improvement.
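
A minimal sketch of that circuit breaker, using the thresholds from the paragraph above (ten 429s within a minute opens the circuit for 30 seconds). The class and method names are my own, and this version is per-process; sharing the state across workers would need Redis or something similar.

```python
import time
from collections import deque
from typing import Callable, Deque


class RateLimitCircuitBreaker:
    def __init__(self, threshold: int = 10, window_seconds: float = 60.0,
                 cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures: Deque[float] = deque()
        self.open_until = 0.0

    def allow_request(self) -> bool:
        """False while the circuit is open (cooling down)."""
        return self.clock() >= self.open_until

    def record_rate_limited(self) -> None:
        """Call this each time the API answers 429."""
        now = self.clock()
        self.failures.append(now)
        # Forget 429s older than the observation window.
        while self.failures and self.failures[0] <= now - self.window_seconds:
            self.failures.popleft()
        if len(self.failures) >= self.threshold:
            self.open_until = now + self.cooldown_seconds
            self.failures.clear()


fake_now = [100.0]
breaker = RateLimitCircuitBreaker(clock=lambda: fake_now[0])
for _ in range(10):
    breaker.record_rate_limited()
print(breaker.allow_request())  # False: circuit just opened
fake_now[0] += 30.0
print(breaker.allow_request())  # True: cooldown elapsed
```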

Your Turn

Rate limits are a fact of life when working with external APIs. The combo of a distributed rate limiter and exponential backoff with jitter has saved me countless hours. But I'm sure there are even better patterns out there.

What's your setup look like? Do you use token buckets, sliding windows, or something else? I'd love to hear how you handle this in production.
