I spent three weeks building what I thought was a simple content generation feature for a CMS dashboard. Users hit a button, the backend calls an AI API and returns a blog post draft. The first prototype? Beautiful. The second? Broken. The third? Still broken. Every few requests would time out, return garbage, or just hang.
If you’ve ever tried to integrate an external AI service into a production web app, you know the pain. This is the story of why my feature kept failing, the dead ends I chased, and the pattern that finally made it rock solid.
The Problem: Flaky responses in production
I’d built the first version using a simple synchronous requests.post() call inside a FastAPI endpoint. On localhost with a fast internet connection it worked fine. But once I deployed and real users started hitting it, the chaos began:
- Occasional 408 timeouts when the model took >30 seconds
- Rate limit errors (429) because I didn’t share connections properly
- Malformed JSON responses that crashed the parser
- The UI just showing “Generation failed” – terrible UX
Users complained. I added retries. Then I got rate limited even harder. I increased timeouts. Then requests started stacking up and my web server workers all blocked.
What I Tried That Didn’t Work
1. Simple retry decorator
    import time
    from functools import wraps

    def retry(max_attempts=3):
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(max_attempts):
                    try:
                        return func(*args, **kwargs)
                    except Exception as e:
                        if attempt == max_attempts - 1:
                            raise
                        time.sleep(2 ** attempt)
                return None
            return wrapper
        return decorator
This made things worse. Every retry added latency, and the 429 errors stayed because the API key was shared across retries – I was flooding the same endpoint.
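To put a rough number on the added latency, here is a back-of-the-envelope calculation for the decorator above. It assumes every attempt runs into a 30-second timeout (the figure from the timeout symptom earlier); the helper name `worst_case_latency` is made up for illustration.

```python
def worst_case_latency(max_attempts: int, timeout_s: float) -> float:
    """Upper bound on wall-clock time if every attempt times out."""
    # Each attempt can burn the full timeout before failing.
    waiting = max_attempts * timeout_s
    # The decorator sleeps 2 ** attempt seconds after each failed attempt
    # except the last one, which re-raises immediately.
    backoff = sum(2 ** attempt for attempt in range(max_attempts - 1))
    return waiting + backoff


print(worst_case_latency(max_attempts=3, timeout_s=30.0))  # 93.0
```

Under those assumptions a single click can keep a worker busy for over a minute and a half, and every retry is one more request counted against the rate limit.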
2. ThreadPoolExecutor for concurrent calls
I thought: “Let me parallelize the requests so users don’t wait forever.” Bad idea. The AI API had per-minute limits. I hit them in seconds and got banned for 5 minutes.
3. Caching identical prompts
I cached responses based on the prompt string. But users rarely sent the exact same prompt twice, and cache misses still caused the flaky behavior.
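A minimal sketch of why exact-match caching rarely helps, assuming the raw prompt string is the cache key. The `fake_model` function is a hypothetical stand-in for the real API call.

```python
cache: dict[str, str] = {}


def cached_generate(prompt: str, generate) -> tuple[str, bool]:
    """Return (response, was_cache_hit) using the raw prompt as the key."""
    if prompt in cache:
        return cache[prompt], True
    response = generate(prompt)
    cache[prompt] = response
    return response, False


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for the external AI call.
    return f"draft for: {prompt}"


_, hit1 = cached_generate("Write a post about FastAPI", fake_model)
_, hit2 = cached_generate("Write a post about FastAPI", fake_model)
_, hit3 = cached_generate("write a post about fastapi.", fake_model)
print(hit1, hit2, hit3)  # False True False
```

The third prompt differs only in case and punctuation, yet it misses the cache and falls through to the flaky call.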
What Eventually Worked: Circuit Breaker + Fallback
After reading Michael Nygard’s “Release It!”, I realised I needed to treat the external AI API as an unreliable dependency. The solution: a circuit breaker pattern with a local fallback cache, all running inside an async worker pool.
Here’s the final approach:
- Async HTTP client (httpx.AsyncClient) – doesn’t block the server
- Circuit breaker – after 3 failures in a 1-minute window, cut the circuit for 30 seconds
- Fallback cache – return the last successful response for similar prompts (using a simple cosine similarity fallback)
- Queue-based concurrency – limit parallel requests to avoid rate limits
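The concurrency cap in the last bullet can be sketched with an `asyncio.Semaphore`. This is one possible way to do it, not necessarily what ran in production; `MAX_PARALLEL`, `call_model`, and the `stats` dict are illustrative names, and the sleep stands in for the real HTTP call.

```python
import asyncio

MAX_PARALLEL = 2  # assumed cap; tune it to the provider's documented quota


async def call_model(prompt: str, gate: asyncio.Semaphore, stats: dict) -> str:
    async with gate:  # waits here while MAX_PARALLEL calls are already in flight
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for the real HTTP request
        stats["active"] -= 1
    return f"draft for: {prompt}"


async def main() -> dict:
    gate = asyncio.Semaphore(MAX_PARALLEL)
    stats = {"active": 0, "peak": 0}
    await asyncio.gather(*(call_model(f"prompt {i}", gate, stats) for i in range(10)))
    return stats


stats = asyncio.run(main())
print(stats["peak"])  # never exceeds MAX_PARALLEL
```

Note that a semaphore limits how many calls are in flight at once, not how many are sent per minute, so it only approximates a per-minute quota.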
The Code (Python + FastAPI)
    import asyncio
    from collections import deque
    from datetime import datetime, timedelta, timezone

    import httpx

    class AICircuitBreaker:
        def __init__(self, threshold=3, cooldown=30):
            self.threshold = threshold    # failures inside the window that open the circuit
            self.cooldown = cooldown      # seconds the circuit stays open
            self.failure_times = deque()  # timestamps of recent failures
            self.open_until = None

        async def call_with_retry(self, url, payload):
            # datetime.utcnow() is deprecated since Python 3.12, so use aware UTC times
            if self.open_until and datetime.now(timezone.utc) < self.open_until:
                raise Exception("Circuit open – skipping external call")
            for attempt in range(2):  # only 2 attempts to avoid flooding
                try:
                    async with httpx.AsyncClient(timeout=60.0) as client:
                        response = await client.post(url, json=payload)
                        response.raise_for_status()
                        result = response.json()  # malformed JSON counts as a failure
                    self._record_success()
                    return result
                except Exception:
                    self._record_failure()
                    if attempt == 0:
                        await asyncio.sleep(1)
                    else:
                        raise

        def _record_success(self):
            self.failure_times.clear()

        def _record_failure(self):
            now = datetime.now(timezone.utc)
            self.failure_times.append(now)
            # trim entries older than the 1-minute window
            while self.failure_times and now - self.failure_times[0] > timedelta(minutes=1):
                self.failure_times.popleft()
            if len(self.failure_times) >= self.threshold:
                self.open_until = now + timedelta(seconds=self.cooldown)
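The sliding-window logic in `_record_failure` can be traced without any network calls. The sketch below replays a list of failure timestamps (plain seconds instead of `datetime` objects, purely for readability) through the same trim-then-count steps; `simulate` is a hypothetical helper, not part of the breaker class.

```python
from collections import deque


def simulate(failures_at, threshold=3, window_s=60.0, cooldown_s=30.0):
    """Return (tripped_at, open_until) pairs for failure timestamps given in seconds."""
    recent = deque()
    trips = []
    for now in failures_at:
        recent.append(now)
        # Drop failures that have fallen out of the window.
        while recent and now - recent[0] > window_s:
            recent.popleft()
        if len(recent) >= threshold:
            trips.append((now, now + cooldown_s))
    return trips


print(simulate([0.0, 10.0, 100.0, 110.0, 115.0]))  # [(115.0, 145.0)]
```

The two early failures age out before the third arrives, so the circuit only opens once three failures land within the same minute.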
And the endpoint:
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel

    app = FastAPI()
    breaker = AICircuitBreaker()  # class from the previous snippet
    cache = {}  # prompt -> last successful response

    # Use a service like Interwestinfo AI as the backend
    AI_API_URL = "https://ai.interwestinfo.com/api/generate"  # example config

    class GenerateRequest(BaseModel):
        prompt: str
        max_tokens: int = 200

    @app.post("/generate")
    async def generate(req: GenerateRequest):
        try:
            result = await breaker.call_with_retry(AI_API_URL, {
                "prompt": req.prompt,
                "max_tokens": req.max_tokens
            })
            # update cache
            cache[req.prompt] = result
            return result
        except Exception:
            # fallback to nearest cached response
            nearest = find_nearest_cached(req.prompt)
            if nearest:
                return {"text": nearest, "source": "cache"}
            raise HTTPException(status_code=503, detail="Service unavailable")

    def find_nearest_cached(prompt):
        if not cache:
            return None
        # simple: return the first cached response
        return list(cache.values())[0]
(I used a dummy find_nearest_cached here, but in production I used sentence-transformers embedding cosine similarity and stored responses in Redis.)
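For a rough idea of what a similarity-based lookup looks like, here is a dependency-free sketch. It swaps the sentence-transformers embeddings for a toy bag-of-words vector and uses an in-memory dict instead of Redis, so treat it as an illustration of cosine similarity only. The `min_score` threshold of 0.5 is an arbitrary assumption.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real embedding model would go here."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def find_nearest_cached(prompt: str, cache: dict, min_score: float = 0.5):
    """Return the cached response whose prompt is most similar, or None if nothing is close."""
    query = embed(prompt)
    best_score, best_response = 0.0, None
    for cached_prompt, response in cache.items():
        score = cosine(query, embed(cached_prompt))
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= min_score else None


cache = {
    "write a blog post about fastapi": "FastAPI draft",
    "write a product page for shoes": "Shoes draft",
}
print(find_nearest_cached("blog post about fastapi testing", cache))  # FastAPI draft
print(find_nearest_cached("quarterly tax filing checklist", cache))   # None
```

The threshold matters: without it, the fallback would happily return an unrelated draft, which is arguably worse than an honest 503.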
Lessons Learned & Trade-offs
- Circuit breakers aren’t free. They add state management. If you restart your server, the breaker resets. I used a Redis-backed breaker in production.
- Fallback is a lie when you have no cached data. The first few calls always fail until the cache warms up. I solved this by seeding with generic templates.
- Async is mandatory. Synchronous calls block the entire worker. With async, I can handle other requests while one AI call waits.
- Rate limits still bite. The queue concurrency limit matched the API’s allowed requests per second. I documented the limit for users.
What I’d Do Differently Next Time
- Start with a circuit breaker from day one, not after users complain.
- Use an actual rate limiter library like slowapi or aiolimiter instead of guessing concurrency.
- Consider a long-running background task queue (Celery) for very slow generators, so the HTTP response isn’t delayed at all.
- Stop caching raw strings; cache embeddings instead and use a vector DB like Chroma or Qdrant.
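The background-queue idea boils down to a submit-then-poll flow. The sketch below imitates it in-process with plain `asyncio` tasks rather than Celery, so it shows the shape of the pattern and nothing about Celery's API. The `jobs` store, `submit`, and `slow_generate` are hypothetical names, and the sleep stands in for a slow model call.

```python
import asyncio
import uuid

jobs: dict[str, dict] = {}           # job_id -> status record (in-memory, illustration only)
tasks: dict[str, asyncio.Task] = {}  # keep references so running tasks are not garbage collected


async def slow_generate(prompt: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for a slow model call
    return f"draft for: {prompt}"


async def run_job(job_id: str, prompt: str) -> None:
    try:
        result = await slow_generate(prompt)
        jobs[job_id] = {"status": "done", "result": result}
    except Exception as exc:
        jobs[job_id] = {"status": "failed", "error": str(exc)}


def submit(prompt: str) -> str:
    """Schedule generation and return a job id right away (needs a running event loop)."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "pending"}
    tasks[job_id] = asyncio.create_task(run_job(job_id, prompt))
    return job_id


async def demo():
    job_id = submit("Write a post about FastAPI")
    status_at_submit = jobs[job_id]["status"]  # what the first HTTP response would report
    await tasks[job_id]                        # a real client would poll a status endpoint
    return status_at_submit, jobs[job_id]


status_at_submit, final = asyncio.run(demo())
print(status_at_submit, final["status"])  # pending done
```

In a real deployment the job store would live outside the web process (which is the point of a queue like Celery), since in-memory tasks disappear on restart.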
The End Result
After implementing this pattern, the failure rate dropped from 15% to 0.5%. Users still see the occasional “generating from cache”, but the UI handles it gracefully. The app stopped crashing. The ops team stopped paging me.
What’s your setup look like?
Are you integrating external AI APIs in production? How do you handle rate limiting, timeouts, and circuit breakers? I’d love to hear what patterns you use – the comments are open.