Why My AI Feature Kept Failing (And How I Fixed It)

#python #tutorial #webdev #api

I spent three weeks building what I thought was a simple content generation feature for a CMS dashboard. Users hit a button, the backend calls an AI API and returns a blog post draft. The first prototype? Beautiful. The second? Broken. The third? Still broken. Every few requests would time out, return garbage, or just hang.

If you’ve ever tried to integrate an external AI service into a production web app, you know the pain. This is the story of why my feature kept failing, the dead ends I chased, and the pattern that finally made it rock solid.

The Problem: Flaky responses in production

I’d built the first version using a simple synchronous requests.post() call inside a FastAPI endpoint. On localhost with a fast internet connection it worked fine. But once I deployed and real users started hitting it, the chaos began:

Occasional 408 timeouts when the model took >30 seconds
Rate limit errors (429) because nothing on my side capped how many requests went out at once
Malformed JSON responses that crashed the parser
The UI just showing “Generation failed” – terrible UX

Users complained. I added retries. Then I got rate limited even harder. I increased timeouts. Then requests started stacking up and my web server workers all blocked.

What I Tried That Didn’t Work

1. Simple `retry` decorator

import time
from functools import wraps

def retry(max_attempts=3):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    # Retries every error, 429s included, on a fixed schedule
                    if attempt == max_attempts - 1:
                        raise
                    time.sleep(2 ** attempt)  # blocks the worker: 1s, then 2s
        return wrapper
    return decorator

This made things worse. Every retry added latency (up to 3 seconds of sleeping before the final attempt), and the 429 errors kept coming because every retry counted against the same API key's quota: I was flooding an endpoint that was already throttling me.
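For comparison, here is a minimal sketch of a gentler retry policy. It is not the code from my app. It assumes the API sends a `Retry-After` header in its seconds form (HTTP also allows a date form, which this sketch does not handle). The names `backoff_delay`, `should_retry`, and `RETRYABLE_STATUS` are made up for illustration. The idea is to retry only errors that might succeed later, wait at least as long as the server asks, and add random jitter so that many clients do not retry in lockstep.

```python
import random

# Status codes that are plausibly transient; a 400 or 401 will not fix itself.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def should_retry(status_code):
    """Return True if a response with this status is worth retrying."""
    return status_code in RETRYABLE_STATUS


def backoff_delay(attempt, retry_after=None, base=1.0, cap=30.0, rand=random.random):
    """Seconds to wait before retrying after failed attempt number `attempt` (0-based).

    `retry_after` is the raw Retry-After header value in seconds, if present.
    `rand` is injectable so the function can be tested deterministically.
    """
    if retry_after is not None:
        # The server said how long to wait; never retry sooner than that.
        return max(0.0, float(retry_after))
    # Exponential backoff with full jitter: a random point below the ceiling.
    ceiling = min(cap, base * (2 ** attempt))
    return rand() * ceiling
```

In a retry loop, the delay would be passed to `time.sleep` (or `asyncio.sleep` in async code) only when `should_retry` returns True.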

2. ThreadPoolExecutor for concurrent calls

I thought: “Let me parallelize the requests so users don’t wait forever.” Bad idea. The AI API had per-minute limits. I hit them in seconds and got banned for 5 minutes.

3. Caching identical prompts

I cached responses based on the prompt string. But users rarely sent the exact same prompt twice, and cache misses still caused the flaky behavior.

What Eventually Worked: Circuit Breaker + Fallback

After reading Michael Nygard’s “Release It!”, I realised I needed to treat the external AI API as an unreliable dependency. The solution: a circuit breaker pattern with a local fallback cache, all running inside an async worker pool.

Here’s the final approach:

Async HTTP client (httpx.AsyncClient) – doesn’t block the server
Circuit breaker – after 3 failures in a 1-minute window, cut the circuit for 30 seconds
Fallback cache – return the last successful response for similar prompts (using a simple cosine similarity fallback)
Queue-based concurrency – limit parallel requests to avoid rate limits
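The concurrency limit from the last item can be sketched with nothing more than `asyncio.Semaphore`. This is a simplified illustration, not my production queue: `MAX_CONCURRENT`, `limited`, and `fake_ai_call` are hypothetical names, and the fake call just sleeps instead of hitting a real API.

```python
import asyncio

MAX_CONCURRENT = 2  # hypothetical limit; pick one that fits your API quota


async def limited(sem, coro_fn, *args):
    """Run coro_fn(*args) while holding a slot in the semaphore."""
    async with sem:
        return await coro_fn(*args)


async def fake_ai_call(i, tracker):
    """Stand-in for the real API call; records how many calls overlap."""
    tracker["active"] += 1
    tracker["peak"] = max(tracker["peak"], tracker["active"])
    await asyncio.sleep(0.01)  # pretend to wait for the model
    tracker["active"] -= 1
    return f"draft-{i}"


async def main():
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    tracker = {"active": 0, "peak": 0}
    calls = (limited(sem, fake_ai_call, i, tracker) for i in range(6))
    results = await asyncio.gather(*calls)
    return results, tracker["peak"]


results, peak = asyncio.run(main())
```

Note that a semaphore caps how many calls are in flight, not how many start per minute. If the API enforces a per-minute quota, a real rate limiter is still needed on top.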

The Code (Python + FastAPI)

import asyncio
import time
from collections import deque

import httpx


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the external call is skipped."""


class AICircuitBreaker:
    def __init__(self, threshold=3, cooldown=30, window=60):
        self.threshold = threshold  # failures needed to open the circuit
        self.cooldown = cooldown    # seconds the circuit stays open
        self.window = window        # seconds over which failures are counted
        self.failure_times = deque()
        self.open_until = None      # time.monotonic() value, or None if closed

    async def call_with_retry(self, url, payload):
        if self.open_until is not None and time.monotonic() < self.open_until:
            raise CircuitOpenError("Circuit open, skipping external call")

        # One client per call, so both attempts share a connection pool
        async with httpx.AsyncClient(timeout=60.0) as client:
            for attempt in range(2):  # only 2 attempts to avoid flooding
                try:
                    response = await client.post(url, json=payload)
                    response.raise_for_status()
                    # A malformed body raises ValueError and counts as a failure
                    data = response.json()
                except (httpx.HTTPError, ValueError):
                    self._record_failure()
                    if attempt == 1:
                        raise
                    await asyncio.sleep(1)
                else:
                    self._record_success()
                    return data

    def _record_success(self):
        self.failure_times.clear()

    def _record_failure(self):
        # Monotonic time is immune to wall-clock adjustments
        now = time.monotonic()
        self.failure_times.append(now)
        # trim entries that fell out of the counting window
        while self.failure_times and now - self.failure_times[0] > self.window:
            self.failure_times.popleft()
        if len(self.failure_times) >= self.threshold:
            self.open_until = now + self.cooldown
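
The windowing logic is easier to check when it is separated from the HTTP call. Below is a stripped-down sketch of the same rule (3 failures within 60 seconds opens the circuit for 30 seconds) with an injectable clock so it can be exercised without waiting. `SlidingWindowBreaker` and its `clock` parameter are illustrative names, not part of the class above.

```python
import time
from collections import deque


class SlidingWindowBreaker:
    """Opens after `threshold` failures inside `window` seconds."""

    def __init__(self, threshold=3, window=60.0, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.cooldown = cooldown
        self.clock = clock          # injectable so tests can control time
        self.failures = deque()     # timestamps of recent failures
        self.open_until = None      # None means the circuit is closed

    def allow(self):
        """Return True if a call may go out right now."""
        return self.open_until is None or self.clock() >= self.open_until

    def record_success(self):
        self.failures.clear()

    def record_failure(self):
        now = self.clock()
        self.failures.append(now)
        # Drop failures that are older than the counting window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        if len(self.failures) >= self.threshold:
            self.open_until = now + self.cooldown


# Walk through a scenario with a fake clock.
now = [0.0]
breaker_demo = SlidingWindowBreaker(clock=lambda: now[0])
for t in (0.0, 10.0, 20.0):
    now[0] = t
    breaker_demo.record_failure()
open_at_20 = not breaker_demo.allow()   # third failure opened the circuit
now[0] = 50.0
closed_at_50 = breaker_demo.allow()     # cooldown of 30s has elapsed
```

One thing this makes visible: old failures are not cleared when the circuit opens, so a single new failure shortly after the cooldown can open it again. Whether that is desirable depends on how aggressively you want to protect the upstream API.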

And the endpoint:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
breaker = AICircuitBreaker()
cache = {}  # prompt -> last successful API response (unbounded in this demo)

# Use a service like Interwestinfo AI as the backend
AI_API_URL = "https://ai.interwestinfo.com/api/generate"  # example config

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 200

def find_nearest_cached(prompt):
    # Placeholder: ignores the prompt and returns the oldest cached response
    if not cache:
        return None
    return next(iter(cache.values()))

@app.post("/generate")
async def generate(req: GenerateRequest):
    try:
        result = await breaker.call_with_retry(AI_API_URL, {
            "prompt": req.prompt,
            "max_tokens": req.max_tokens
        })
    except Exception:
        # Circuit open or both attempts failed: fall back to a cached response
        nearest = find_nearest_cached(req.prompt)
        if nearest is None:
            raise HTTPException(status_code=503, detail="Service unavailable")
        # The cached value is the full API response, wrapped under "text"
        return {"text": nearest, "source": "cache"}
    # update cache only after a successful call
    cache[req.prompt] = result
    return result

(The find_nearest_cached here is a placeholder that ignores the prompt. In production I computed sentence-transformers embeddings, ranked cached prompts by cosine similarity, and stored the responses in Redis.)
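
To show the shape of a similarity-based lookup without pulling in a model, here is a rough sketch that uses bag-of-words counts as a stand-in for real embeddings. It is not my production code: `embed`, `cosine`, `nearest_cached_response`, and the `min_score` threshold are all illustrative assumptions, and word counts are a far cruder signal than sentence embeddings.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts. A real setup would use a model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_cached_response(prompt, cache, min_score=0.3):
    """Return the cached response whose prompt is most similar, or None.

    `cache` maps prompt strings to responses. Matches scoring below
    `min_score` are rejected so unrelated drafts are never served.
    """
    query = embed(prompt)
    best_score, best_response = 0.0, None
    for cached_prompt, response in cache.items():
        score = cosine(query, embed(cached_prompt))
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= min_score else None


demo_cache = {
    "write a blog post about python testing": {"text": "Testing draft"},
    "summarize quarterly sales figures": {"text": "Sales summary"},
}
hit = nearest_cached_response("blog post about python", demo_cache)
miss = nearest_cached_response("recipe for pancakes", demo_cache)
```

The threshold matters more than it looks: without it, the fallback happily returns a sales summary to someone who asked for a recipe.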

Lessons Learned & Trade-offs

Circuit breakers aren’t free. They add state management. If you restart your server, the breaker resets. I used a Redis-backed breaker in production.
Fallback is a lie when you have no cached data. The first few calls always fail until the cache warms up. I solved this by seeding with generic templates.
Async is mandatory. Synchronous calls block the entire worker. With async, I can handle other requests while one AI call waits.
Rate limits still bite. I sized the queue's concurrency limit to stay within the API's allowed requests per second, and I documented the limit for users.

What I’d Do Differently Next Time

Start with a circuit breaker from day one, not after users complain.
Use an actual rate limiter library like slowapi or aiolimiter instead of guessing concurrency.
Consider a long-running background task queue (Celery) for very slow generators, so the HTTP response isn’t delayed at all.
Cache embeddings rather than raw prompt strings, and use a vector DB like Chroma or Qdrant for the lookup.

The End Result

After implementing this pattern, the failure rate dropped from 15% to 0.5%. Users still see the occasional “generating from cache”, but the UI handles it gracefully. The app stopped crashing. The ops team stopped paging me.

What’s your setup look like?

Are you integrating external AI APIs in production? How do you handle rate limiting, timeouts, and circuit breakers? I’d love to hear what patterns you use – the comments are open.