zhongqiyue

How I Stopped Worrying and Learned to Love AI API Retries

I remember the day my AI-powered feature went down in production. It was a Friday afternoon. Users were complaining that the "smart search" was returning empty results. The root cause? A flaky AI API that occasionally returned 500 errors — and my code did absolutely nothing about it. It just crashed.

I'd spent two weeks building a neat little pipeline that sent user queries to an AI model, parsed the output, and displayed structured results. It worked beautifully in my local tests. But in production, with real traffic, the API would time out or throw errors maybe 5% of the time. That 5% ruined the experience for a chunk of users.

This article is the story of how I went from naive single-request calls to a resilient, retry-based approach that saved my weekend. If you're integrating any external AI API, you've probably hit this wall too.

The Naive Approach (What I Tried First)

Here's what my original code looked like. I'm using Python with requests, and a placeholder endpoint stands in for a real provider such as OpenAI's API, but the pattern is universal:

import requests

def extract_entities(query):
    response = requests.post(
        "https://api.example.com/v1/extract",
        json={"text": query},
        timeout=10
    )
    return response.json()

Clean, simple, and wrong. The first time I got a ConnectionError or a 504 Gateway Timeout, my entire app crashed. My first "fix" was to wrap it in a try/except:

def extract_entities(query):
    try:
        response = requests.post(...)
        response.raise_for_status()
        return response.json()
    except:
        return None

Now it didn't crash, but it silently returned None — users saw empty results. Worse, the error could be transient (a temporary network blip), but my code gave up instantly.

The Dead End: Naive Retries

I then added a simple retry loop with a fixed delay:

import time

def extract_entities(query):
    for attempt in range(3):
        try:
            response = requests.post(...)
            response.raise_for_status()
            return response.json()
        except:
            time.sleep(1)  # wait a second, then try again
    return None

Better, but flawed. If the API was overloaded, sleeping for exactly 1 second and retrying three times still resulted in failure. I was hammering the API at the same pace. Also, I was blocking the entire thread — no async, no queue. My app became slow and unresponsive.

What Actually Worked: Exponential Backoff + Async Queue

I realized I needed three things:

  1. Exponential backoff — wait longer between retries, so the API has time to recover.
  2. Jitter — add randomness to avoid the thundering herd problem.
  3. Async background queue — don't block the main request thread. Let users get an answer later if the API is slow.
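
To make the first two points concrete, here is a minimal sketch of a delay schedule. The helper name backoff_delay and its base, cap, and jitter parameters are illustrative choices of mine, not part of any library, and the cap is an extra assumption so that a long outage never produces multi-minute sleeps.

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0, jitter=1.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    The exponential part doubles on every attempt and is clamped to `cap`;
    a uniform random jitter between 0 and `jitter` is added on top so that
    many clients failing together do not all retry at the same instant.
    """
    exponential = min(cap, base * (2 ** attempt))
    return exponential + rng.uniform(0, jitter)

if __name__ == "__main__":
    # Print one possible schedule; the exact values differ on every run.
    for attempt in range(5):
        print(f"attempt {attempt}: sleep {backoff_delay(attempt):.2f}s")
```

With the defaults, the base delays are 1, 2, 4, 8, 16 seconds and then 30 seconds from the sixth retry on, each with up to one second of jitter added.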

Here's the final pattern that saved my production system. I'll use asyncio and aiohttp for the async part, but the same idea works with threads or task queues like Celery.

import asyncio
import aiohttp
import random

async def fetch_with_retry(session, url, json_data, max_retries=5):
    for attempt in range(max_retries):
        try:
            async with session.post(url, json=json_data, timeout=aiohttp.ClientTimeout(total=10)) as response:
                response.raise_for_status()
                return await response.json()
        except (aiohttp.ClientError, asyncio.TimeoutError) as e:
            if attempt == max_retries - 1:
                raise  # last attempt failed, let caller handle
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Retry {attempt+1} in {wait_time:.2f}s due to {e}")
            await asyncio.sleep(wait_time)

# Finished results keyed by result id. An in-process dict is lost on restart
# and is not shared between workers, so treat it as a prototype-only store.
results = {}

async def extract_entities(query, result_id):
    async with aiohttp.ClientSession() as session:
        try:
            results[result_id] = await fetch_with_retry(
                session,
                "https://api.example.com/v1/extract",
                {"text": query}
            )
        except Exception as e:
            results[result_id] = {"error": str(e)}

# Usage in your web handler (FastAPI example)
import uuid
from fastapi import FastAPI, BackgroundTasks

app = FastAPI()

@app.post("/search")
async def search(query: str, background_tasks: BackgroundTasks):
    result_id = str(uuid.uuid4())
    background_tasks.add_task(extract_entities, query, result_id)
    # Return a placeholder immediately; the client polls for the result by id
    return {"status": "processing", "result_id": result_id}

@app.get("/search/{result_id}")
async def get_result(result_id: str):
    # Still "processing" until the background task has stored something
    return results.get(result_id, {"status": "processing"})
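
One gap worth knowing about: raise_for_status() raises aiohttp.ClientResponseError, which is a subclass of aiohttp.ClientError, so the loop in fetch_with_retry also retries a 400 or a 401 that will never succeed. A small filter helps. The sketch below is an assumption on my part, both the helper name is_retryable and the exact set of status codes, so adjust it to what your provider documents.

```python
# Status codes that usually signal a transient condition. This set is a
# judgment call for illustration, not a rule taken from any specification.
RETRYABLE_STATUSES = frozenset({408, 429, 500, 502, 503, 504})

def is_retryable(status):
    """Return True when a response with this HTTP status is worth retrying."""
    return status in RETRYABLE_STATUSES

if __name__ == "__main__":
    for code in (400, 401, 429, 503):
        print(code, "retry" if is_retryable(code) else "give up")
```

Inside the retry loop, one way to use it is to catch aiohttp.ClientResponseError first and re-raise immediately when is_retryable(e.status) is false, leaving the backoff path for everything else.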

Caching: The Unsung Hero

Another thing I learned the hard way: if two users ask the same question within 5 minutes, you're paying twice and waiting twice. I added a simple in-memory cache with TTL:

import time

cache = {}
def get_cached_or_fetch(query):
    now = time.time()
    if query in cache and now - cache[query]["timestamp"] < 300:
        return cache[query]["data"]
    # ... fetch with retry ...
    cache[query] = {"data": result, "timestamp": now}
    return result

This cut my API costs by 30% and improved latency for repeated queries.
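
The snippet above leaves the fetch step elided, so here is a self-contained sketch of the same idea that you can actually run. The TTLCache class, its get_or_fetch method, and the injectable clock are hypothetical names I picked for illustration; the fetch argument stands in for whatever function performs the real API call.

```python
import time

class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl=300.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock      # injectable so tests can fake the passage of time
        self._entries = {}      # key -> (stored_at, value)

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]     # fresh hit: skip the API call entirely
        value = fetch(key)      # miss or expired: call the API
        # Only reached when fetch succeeded, so failures are never cached.
        self._entries[key] = (now, value)
        return value

if __name__ == "__main__":
    cache = TTLCache(ttl=300)
    print(cache.get_or_fetch("best pizza", lambda q: f"result for {q}"))
    print(cache.get_or_fetch("best pizza", lambda q: "never called"))
```

This sketch never evicts old entries, so a long-running process needs a size bound or a periodic sweep. Normalizing the query (trimming whitespace, lowercasing) before using it as a key would also raise the hit rate.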

Lessons Learned

  • Retries alone aren't enough — you need backoff, jitter, and a way to isolate retries from the user's request lifecycle.
  • Async is a game-changer — blocking the event loop during a retry chain will kill your throughput.
  • Cache aggressively — AI APIs are expensive and often rate-limited. Caching identical requests saves money and time.
  • Monitor your retries — log each retry attempt and failure. I added structured logging so I could see when the API was unhealthy.
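
For the monitoring point, a minimal sketch of what one structured retry log line could look like, using only the standard library. The logger name, the log_retry helper, and the field names are all assumptions for illustration, not the exact setup I run in production.

```python
import json
import logging

logger = logging.getLogger("ai_api.retry")

def log_retry(attempt, max_retries, wait_time, error, log=logger):
    """Emit one JSON line per retry so the logs can be counted and grouped."""
    payload = {
        "event": "ai_api_retry",
        "attempt": attempt + 1,              # 1-based for human readers
        "max_retries": max_retries,
        "wait_seconds": round(wait_time, 2),
        "error_type": type(error).__name__,
        "error": str(error),
    }
    log.warning(json.dumps(payload))
    return payload

if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    log_retry(0, 5, 1.37, TimeoutError("upstream took too long"))
```

Calling this in place of the print inside a retry loop turns "the API feels flaky today" into something you can chart: retries per minute, grouped by error_type.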

When Not to Use This Approach

This pattern works well for idempotent requests, where sending the same request twice has no extra side effects. Entity extraction qualifies: a retry just recomputes the answer. If your API call is not idempotent (e.g., it creates a resource each time), retrying blindly can cause duplicates. In that case, you need idempotency keys or exactly-once semantics.
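
A rough sketch of the idempotency-key idea: generate one key per logical request and send the same key on every retry, so the server can recognize duplicates. The "Idempotency-Key" header is a common convention rather than a universal standard, so whether your provider honors it is an assumption to verify. The send callable and both function names are hypothetical, and backoff is left out to keep the focus on the key.

```python
import uuid

def new_idempotency_headers():
    """Create headers carrying a fresh key for one logical request."""
    return {"Idempotency-Key": str(uuid.uuid4())}

def send_with_retries(send, payload, max_retries=3):
    """Call send(payload, headers), reusing one key across all attempts."""
    headers = new_idempotency_headers()   # created once, outside the loop
    last_error = None
    for _ in range(max_retries):
        try:
            return send(payload, headers)
        except ConnectionError as e:
            last_error = e                # transient failure: try again
    raise last_error
```

The important detail is where the key is created. Generating it inside the loop would give every attempt a new key and defeat the purpose.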

Also, if your API has very strict rate limits (e.g., 1 request per second), even exponential backoff might not help. You'd need to implement a token bucket or a proper queue with scheduled jobs.
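
For reference, a token bucket fits in a few lines. This is a single-threaded sketch with an injectable clock; the class and method names are my own, and a real deployment with several workers would need a shared store and locking on top of it.

```python
import time

class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)   # start full
        self.updated = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last update, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self):
        """Take one token if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self):
        """Seconds until the next token becomes available (0 if one is ready)."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.rate)
```

For a limit of 1 request per second, TokenBucket(rate=1.0, capacity=1) lets a caller check try_acquire() before each request and sleep for wait_time() when it returns False, which keeps the retries themselves inside the limit.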

What I'd Do Differently Next Time

I'd start with a proper async task queue (like Celery or RQ) instead of building ad-hoc background tasks. That would give me retry management, dead letter queues, and monitoring out of the box. For a quick prototype, the async pattern above is fine, but for production, you need durability.

Also, I'd use a library like tenacity for retry logic instead of writing my own loop.


Building resilient AI integrations is a journey. My code now gracefully handles transient failures, and users barely notice when the API sneezes. If you've battled flaky AI APIs, I'd love to hear your war stories. What does your retry strategy look like?