zhongqiyue

Posted on Jun 26

Why my AI app kept failing (and how I fixed rate limits & retries)

#python #ai #api #tutorial

I'm a solo developer who likes to build AI-powered tools on the side. A few months ago I was working on a content analysis service that called multiple language models to extract topics, sentiment, and summaries from user-submitted text. It worked beautifully in my local tests. But as soon as I deployed it and actual users started hitting it, everything fell apart.

Requests returned 429s. The app would hang for minutes. Some results came back empty. I spent two weekends debugging what I thought was a bug in my code, but the real problem was how I was talking to the AI APIs.

Here's what I tried, what failed, and the pattern I eventually landed on that actually works under real traffic.

The naive approach that burned me

My first version was embarrassingly simple: sync requests inside a for loop, with a simple time.sleep(1) between calls. It looked something like this:

import requests
import time

def analyze(texts):
    results = []
    for t in texts:
        resp = requests.post(
            AI_API_URL,    # Back then I was using a generic LLM endpoint
            json={"prompt": t},
            headers={"Authorization": "Bearer " + API_KEY}
        )
        if resp.ok:
            results.append(resp.json())
        time.sleep(1)  # polite? not really
    return results

This worked for 5 texts. With 500, the one-second sleeps alone added 500 seconds (over eight minutes), so a run took about 10 minutes and eventually started timing out. Also, the API had a strict concurrency limit that I wasn't respecting.

What I tried next (and why it wasn't enough)

Just use threading

I wrapped the loop with concurrent.futures.ThreadPoolExecutor. Suddenly I was sending 10 requests at once. The API let me do a few, then blocked my IP for an hour. Threading without rate limiting is like pouring gasoline on a fire.

Add a simple retry with a fixed delay

I wrote a decorator that catches requests.exceptions.RequestException and retries after a fixed delay. But all retries would fire at the same second, so if I had 10 concurrent failures, they'd all retry simultaneously — same problem.
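To make the lockstep problem concrete, here is a minimal sketch comparing a fixed delay with a delay drawn at random below an exponentially growing ceiling (often called "full jitter"). The function names, the base of 1 second, and the 60 second cap are assumptions for illustration, not the original decorator.

```python
import random
from typing import Optional


def fixed_delay(attempt: int, delay: float = 2.0) -> float:
    """Every caller waits the same time, so their retries stay synchronized."""
    return delay


def jittered_backoff(attempt: int, base: float = 1.0, cap: float = 60.0,
                     rng: Optional[random.Random] = None) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


# Ten requests that failed at the same moment, each on its third retry.
rng = random.Random(0)
fixed = [fixed_delay(3) for _ in range(10)]
spread = [jittered_backoff(3, rng=rng) for _ in range(10)]

print(len(set(fixed)))               # 1: all ten retry at the same instant
print(sorted(round(d, 2) for d in spread))  # delays scattered across 0..8 s
```

With the fixed delay, ten failures become ten simultaneous retries. With jitter, the same ten retries are spread over the whole interval.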

Store partial results

I started saving successful calls and skipping failures. That helped avoid total data loss, but it didn't fix the root cause: I was hammering the API without respecting its limits.
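One way to store partial results is an append-only JSON Lines file that is consulted before each call. This is a sketch under assumptions: the file layout, `run_with_checkpoint`, and the `call` parameter are hypothetical, and the demo uses a fake function in place of the real API.

```python
import json
import os
import tempfile


def load_done(path):
    """Read previously saved results, keyed by the index of the input text."""
    done = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                rec = json.loads(line)
                done[rec["index"]] = rec["result"]
    return done


def run_with_checkpoint(texts, call, path):
    """Call `call(text)` only for texts that have no saved result yet."""
    done = load_done(path)
    with open(path, "a", encoding="utf-8") as fh:
        for i, text in enumerate(texts):
            if i in done:
                continue
            try:
                result = call(text)
            except Exception:
                continue  # not written to the file, so the next run retries it
            fh.write(json.dumps({"index": i, "result": result}) + "\n")
            fh.flush()
            done[i] = result
    return done


# Demo with a fake call that fails once for "b".
calls = []

def flaky(text):
    calls.append(text)
    if text == "b" and calls.count("b") == 1:
        raise RuntimeError("simulated failure")
    return text.upper()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.jsonl")
    first = run_with_checkpoint(["a", "b", "c"], flaky, path)
    second = run_with_checkpoint(["a", "b", "c"], flaky, path)

print(first)   # "b" is missing after the first run
print(calls)   # the second run only calls "b" again
```

As the paragraph above says, this limits data loss but does nothing about the request rate itself.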

The approach that finally worked

I needed three things:

Async I/O – so I didn't waste time waiting for responses.
Exponential backoff with jitter – to spread out retries.
A semaphore – to cap concurrency exactly to the API's limit.

I also added a simple circuit breaker – if we get too many 429s, stop trying for a while.
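A minimal sketch of such a breaker, assuming it opens after 3 rate-limit responses inside a 10 second sliding window and stays open for 30 seconds. The class name, all three numbers, and the injectable clock are assumptions for illustration.

```python
import time
from collections import deque


class RateLimitBreaker:
    """Opens after `threshold` 429s within `window` seconds, for `cooldown` seconds."""

    def __init__(self, threshold=3, window=10.0, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.cooldown = cooldown
        self.clock = clock          # injectable so tests can control time
        self._hits = deque()        # timestamps of recent 429s
        self._opened_at = None      # None means the circuit is closed

    def record_429(self):
        now = self.clock()
        self._hits.append(now)
        # Drop 429s that have fallen out of the sliding window.
        while self._hits and now - self._hits[0] > self.window:
            self._hits.popleft()
        if len(self._hits) >= self.threshold:
            self._opened_at = now

    def record_success(self):
        # A successful call breaks the streak of rate-limit responses.
        self._hits.clear()

    @property
    def is_open(self):
        if self._opened_at is None:
            return False
        if self.clock() - self._opened_at >= self.cooldown:
            # Cooldown elapsed: close the circuit and start counting afresh.
            self._opened_at = None
            self._hits.clear()
            return False
        return True
```

A client would call `record_429()` whenever it sees a 429, `record_success()` after a good response, and check `is_open` before sending anything.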

Here's the core pattern I landed on, using aiohttp, asyncio, and the excellent tenacity library for retries.

import asyncio
import aiohttp
from tenacity import retry, stop_after_attempt, wait_random_exponential, retry_if_exception_type

class AIClient:
    def __init__(self, api_key, max_concurrency=5):
        self.api_key = api_key
        self.semaphore = asyncio.Semaphore(max_concurrency)
        self.session = None
        self.circuit_open = False

    async def __aenter__(self):
        self.session = aiohttp.ClientSession(
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=aiohttp.ClientTimeout(total=30)
        )
        return self

    async def __aexit__(self, *args):
        await self.session.close()

    @retry(
        stop=stop_after_attempt(5),
        # Random wait below an exponentially growing ceiling (capped at 60 s),
        # so concurrent failures do not all retry at the same instant.
        wait=wait_random_exponential(multiplier=1, max=60),
        retry=retry_if_exception_type((aiohttp.ClientError, asyncio.TimeoutError)),
        reraise=True
    )
    async def _call_api(self, endpoint, payload):
        if self.circuit_open:
            raise RuntimeError("Circuit breaker open")
        async with self.semaphore:
            async with self.session.post(endpoint, json=payload) as resp:
                if resp.status == 429:
                    # Trigger circuit breaker after 3 consecutive 429s
                    # (implemented separately via a sliding window)
                    raise aiohttp.ClientResponseError(
                        resp.request_info, resp.history,
                        status=429, message="Rate limited"
                    )
                resp.raise_for_status()
                return await resp.json()

    async def analyze_many(self, texts, endpoint):
        tasks = [
            asyncio.create_task(self._call_api(endpoint, {"text": t}))
            for t in texts
        ]
        # Use return_exceptions=True to collect failures
        results = await asyncio.gather(*tasks, return_exceptions=True)
        # Separate successful results from errors
        successes = [r for r in results if not isinstance(r, Exception)]
        failures = [(i, r) for i, r in enumerate(results) if isinstance(r, Exception)]
        return successes, failures

Some notes:

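The tenacity decorator handles the retry loop and the exponential backoff. Jitter is not on by default: plain wait_exponential is deterministic, and the randomness comes from wait_random_exponential, which draws each delay at random below the exponentially growing ceiling.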
The semaphore ensures we never send more than max_concurrency requests simultaneously.
I moved the ClientSession into an async context manager so the connection pool is reused.
return_exceptions=True prevents one bad request from killing the whole batch. Then I can log failures and optionally retry them later.
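The semaphore note above can be checked without touching a real API. This is a small sketch in which `asyncio.sleep` stands in for the HTTP round trip; `fake_call` and `measure_peak` are hypothetical helpers, not part of the client.

```python
import asyncio


async def fake_call(sem, state, delay=0.01):
    async with sem:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(delay)  # stands in for the HTTP round trip
        state["active"] -= 1


async def measure_peak(n_tasks=50, max_concurrency=5):
    """Run n_tasks fake calls and report the highest number in flight at once."""
    sem = asyncio.Semaphore(max_concurrency)
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(fake_call(sem, state) for _ in range(n_tasks)))
    return state["peak"]


peak = asyncio.run(measure_peak())
print(peak)  # never exceeds max_concurrency
```

Note that the semaphore caps requests in flight, not requests per second. An API that also enforces a per-minute quota needs a separate throttle.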

Putting it together

async def main():
    # Example endpoint (this was a third-party LLM service I was using)
    ai_endpoint = "https://api.example.com/v1/analyze"

    texts = ["text1", "text2", ...]  # your 500 texts

    async with AIClient(api_key=API_KEY, max_concurrency=5) as client:
        successes, failures = await client.analyze_many(texts, ai_endpoint)
        print(f"Success: {len(successes)}, Failed: {len(failures)}")
        for idx, err in failures:
            print(f"  Index {idx}: {err}")

asyncio.run(main())

Lessons learned

Always respect the API's rate limits from day one. Even if you're just prototyping, don't assume you'll fix it later. The fix is hard to bolt on.
Exponential backoff with jitter is not optional — it's the difference between a stable system and a thundering herd.
Use asyncio for I/O-bound work, especially when you have many similar requests. A single event loop can keep many requests in flight while they wait on the network, without paying for a thread per request.
Design for partial failure. Your pipeline should gracefully handle some calls failing. Gather results, log errors, and decide whether to retry offline.

What I'd do differently next time

I'd start with a proper message queue (like Redis + RQ or Celery) to decouple request ingestion from processing. That way I could control the inflow of API calls independently of user traffic. I'd also monitor rate limit headers (e.g., X-RateLimit-Remaining) and dynamically adjust concurrency.
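Reading those headers might look like the sketch below. Header names vary between providers, so `X-RateLimit-Remaining` and a numeric `Retry-After` are assumptions here, and `read_limits`, `next_concurrency`, and the `low_water` threshold are hypothetical.

```python
def read_limits(headers):
    """Return (remaining, retry_after_seconds); either may be None.

    `headers` is any mapping with .get(). A plain dict is case-sensitive;
    aiohttp's response headers are case-insensitive.
    """
    def as_number(name, cast):
        raw = headers.get(name)
        try:
            return cast(raw) if raw is not None else None
        except ValueError:
            return None  # e.g. Retry-After sent as an HTTP date, not seconds

    return as_number("X-RateLimit-Remaining", int), as_number("Retry-After", float)


def next_concurrency(current, remaining, low_water=10, floor=1, ceiling=5):
    """Halve concurrency when the quota is nearly gone, else creep back up by one."""
    if remaining is None:
        return current
    if remaining < low_water:
        return max(floor, current // 2)
    return min(ceiling, current + 1)


remaining, retry_after = read_limits({"X-RateLimit-Remaining": "3", "Retry-After": "1.5"})
print(remaining, retry_after)
print(next_concurrency(4, remaining))
```

asyncio.Semaphore has no public way to change its limit after creation, so applying the new value would need a different gate, for example a counter guarded by an asyncio.Condition.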

One more thing: I'd build a mock API server for local testing of rate limits. It's too easy to exhaust your real quota while debugging.
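A mock like that fits in the standard library. The sketch below assumes a simple fixed-window limit (the first `limit` POSTs per `window` seconds succeed, the rest get a 429). `make_mock_server` and the response shape are hypothetical and not modelled on any real provider.

```python
import json
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_mock_server(limit=3, window=1.0, port=0):
    """Start a local server on a background thread; port=0 picks a free port."""
    lock = threading.Lock()
    state = {"window_start": time.monotonic(), "count": 0}

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read and discard the request body.
            length = int(self.headers.get("Content-Length", 0))
            self.rfile.read(length)
            with lock:
                now = time.monotonic()
                if now - state["window_start"] >= window:
                    state["window_start"], state["count"] = now, 0
                state["count"] += 1
                allowed = state["count"] <= limit
            body = json.dumps({"ok": allowed}).encode()
            self.send_response(200 if allowed else 429)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            if not allowed:
                self.send_header("Retry-After", str(max(1, round(window))))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the console quiet

    server = ThreadingHTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Five sequential requests against a limit of three per minute.
server = make_mock_server(limit=3, window=60.0)
url = f"http://127.0.0.1:{server.server_address[1]}/v1/analyze"
statuses = []
for _ in range(5):
    req = urllib.request.Request(url, data=b'{"text": "hi"}', method="POST")
    try:
        with urllib.request.urlopen(req) as resp:
            statuses.append(resp.status)
    except urllib.error.HTTPError as err:
        statuses.append(err.code)
server.shutdown()
server.server_close()
print(statuses)  # [200, 200, 200, 429, 429]
```

Pointing the client at this URL instead of the real endpoint lets the retry and backoff paths run without spending quota.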

The tool I used (just one example)

The approach above is generic. For a recent project I had to call a custom LLM endpoint hosted at https://ai.interwestinfo.com/. Their API had a strict 5 concurrent request limit. Without the semaphore and backoff, I would have been blocked constantly. The pattern I described works for any HTTP API with rate limits.

When not to use this pattern

If your API calls are idempotent and you can afford to lose some data, a simpler fire-and-forget with a dead letter queue might be enough.
If you need strict ordering and sequencing, async with concurrency can get tricky. You might want a single-threaded producer-consumer.
If your API is incredibly stable with no limits (rare), over-engineering with exponential retry can add latency.
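For the strict-ordering case in the second item, a single consumer reading from a bounded asyncio.Queue is one option. This is a sketch only: `run_in_order`, the `None` sentinel, and the queue size are assumptions, and items themselves must not be `None`.

```python
import asyncio


async def producer(queue, items):
    for item in items:
        await queue.put(item)   # blocks when the queue is full (backpressure)
    await queue.put(None)       # sentinel: no more work


async def consumer(queue, handle):
    results = []
    while True:
        item = await queue.get()
        if item is None:
            break
        # One item at a time, so results come out in submission order.
        results.append(await handle(item))
    return results


async def run_in_order(items, handle, maxsize=10):
    queue = asyncio.Queue(maxsize=maxsize)
    _, results = await asyncio.gather(producer(queue, items), consumer(queue, handle))
    return results


async def shout(text):
    # The first item is the slowest, yet it still finishes first.
    await asyncio.sleep(0.01 if text == "a" else 0)
    return text.upper()


ordered = asyncio.run(run_in_order(["a", "b", "c"], shout))
print(ordered)  # ['A', 'B', 'C']
```

The price is throughput: with one consumer there is never more than one request in flight.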

Your turn

Rate limiting and retries are a universal pain when integrating any external API. This pattern has been a lifesaver for me, but I know there are many other strategies out there — circuit breakers, bulkhead isolation, client-side throttling with token buckets.
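As one example from that list, a client-side token bucket can be sketched in a few lines. The class below is illustrative, with a hypothetical name and an injectable clock. It is not safe to share across threads, and it is not part of the client shown earlier.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so tests can control time
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def try_acquire(self, tokens=1.0):
        """Take `tokens` if available and return True, otherwise return False."""
        now = self.clock()
        # Refill according to the time elapsed since the last call.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

A caller that gets `False` would wait and try again rather than send the request.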

How do you handle rate limits in your projects? I'd love to hear what patterns you've used (or regretted).
