
zhongqiyue


When AI APIs Let You Down: Building a Resilient Proxy Layer

It started with a side project: a simple content summarizer for my team's internal newsletters. I wanted to pipe in a bunch of articles and get crisp, one-paragraph summaries. Straightforward, right? Just call an AI API, parse the JSON, and done.

Except it wasn't done. The first few weeks were fine, but as usage grew, I hit the wall: random 429 rate limits, sporadic 503s, and worst of all – silent timeouts that killed the whole batch. My users (okay, my two teammates) started complaining that summaries were missing or slow.

I needed a way to make the API calls reliable without rewriting my whole app. Here's what I tried, what failed, and what eventually worked.

What Didn't Work (The Dead Ends)

Retry with Exponential Backoff

Sure, tenacity or simple time.sleep() loops help with temporary blips, but they don't solve systemic overload. One day the AI provider's backend was just slow for everyone – retrying didn't help, it just added to the queue.
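As a rough illustration of that kind of loop, here is a minimal sketch using only the standard library (it is not tenacity's API). The names backoff_delay, retry_with_backoff, base_delay, and max_delay are made up for the example, and the "full jitter" variant is my assumption about what a reasonable loop looks like, not what the original project used.

```python
import asyncio
import random


def backoff_delay(attempt, base_delay=1.0, max_delay=30.0):
    """Full-jitter delay: a random wait in [0, min(max_delay, base_delay * 2**attempt)]."""
    ceiling = min(max_delay, base_delay * (2 ** attempt))
    return random.uniform(0, ceiling)


async def retry_with_backoff(call, max_retries=3, base_delay=1.0, max_delay=30.0):
    """Await call() up to max_retries times, sleeping between failed attempts.

    `call` is any zero-argument coroutine function (hypothetical here).
    """
    for attempt in range(max_retries):
        try:
            return await call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(backoff_delay(attempt, base_delay, max_delay))
```

When the provider is slow for everyone, every client is running some version of this loop at once, which is why retries alone only add load.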

Switching Providers

I tried a couple of alternatives. Each had its own API quirks, different pricing, and different failure patterns. One was cheaper but had abysmal latency on long texts. I ended up maintaining two separate integration paths and still had no unified reliability strategy.

Request Batching

Some APIs support batching, but batch limits are low (like 20 items) and a single slow item can hold up the entire batch. Plus, batching requires pre‑collecting all requests, which doesn't work well for a streaming or near‑real‑time scenario.

The Approach That Worked: A Lightweight Proxy + Queue Layer

Instead of fighting each API directly, I built a small Python asyncio proxy that sits between my app and the AI service. It does three things:

  1. Queues requests with a priority heap and timeouts.
  2. Manages rate limits per API endpoint (token bucket approach).
  3. Provides a fallback – if the primary API fails after retries, it tries a secondary model (e.g., a smaller local model or a different provider).
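To make the token bucket idea in point 2 concrete, here is a minimal sketch, assuming one bucket per endpoint. TokenBucket and its parameters (rate, capacity, clock) are illustrative names, not part of any library, and this is not necessarily how the original proxy implemented it.

```python
import asyncio
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._tokens = float(capacity)
        self._clock = clock  # injectable so the bucket can be tested without sleeping
        self._updated = clock()

    def _refill(self):
        now = self._clock()
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now

    def try_acquire(self):
        """Take one token if available; return True on success."""
        self._refill()
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False

    async def acquire(self):
        """Wait until a token is available, then take it."""
        while not self.try_acquire():
            # Sleep roughly until the next token has accrued.
            await asyncio.sleep((1 - self._tokens) / self.rate)
```

Per-endpoint limits can then be a plain dict of buckets keyed by endpoint URL, with each request awaiting acquire() on its own bucket before it is sent.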

This isn't revolutionary – it's a standard circuit breaker pattern combined with a message queue. But for a small team, implementing it from scratch took a weekend and solved most of our headaches.
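For readers who haven't met the circuit breaker pattern, a bare-bones version might look like the sketch below. CircuitBreaker, max_failures, and reset_after are hypothetical names chosen for the example. It is deliberately simplified: after the cool-down it lets any number of trial requests through rather than exactly one.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; retries after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self):
        """Return True if a call to the primary backend should be attempted."""
        if self._opened_at is None:
            return True
        # Half-open: after the cool-down, let trial requests through.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()
```

The caller checks allow_request() before hitting the primary API, goes straight to the fallback when it returns False, and reports each outcome with record_success() or record_failure().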

Code: The Core Proxy

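Here's the simplified heart of it, using asyncio and aiohttp. This version handles a single AI API, but you can extend it with multiple backends. To keep it short, it paces requests at a fixed minimum interval rather than using a full token bucket, and it leaves out the priority queue.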

import asyncio
import aiohttp
import time

class AIProxy:
    def __init__(self, api_key, base_url, max_rpm=10, timeout_s=30):
        self.api_key = api_key
        self.base_url = base_url
        self.max_rpm = max_rpm  # requests per minute
        self._min_interval = 60.0 / max_rpm
        self._last_call = 0.0
        # Serialize the pacing check so concurrent callers can't all fire together.
        self._lock = asyncio.Lock()
        # An explicit total timeout so a hung request fails instead of stalling the batch.
        self._timeout = aiohttp.ClientTimeout(total=timeout_s)

    async def _rate_limited_request(self, session, payload):
        async with self._lock:
            wait = self._min_interval - (time.monotonic() - self._last_call)
            if wait > 0:
                await asyncio.sleep(wait)
            self._last_call = time.monotonic()

        async with session.post(
            f"{self.base_url}/v1/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json=payload,
        ) as resp:
            if resp.status == 429:
                # Respect Retry-After when it is a number of seconds;
                # fall back to 1s if it is missing or given as an HTTP date.
                try:
                    retry_after = float(resp.headers.get("Retry-After", "1"))
                except ValueError:
                    retry_after = 1.0
                await asyncio.sleep(retry_after)
                raise aiohttp.ClientError("Rate limited")
            resp.raise_for_status()
            return await resp.json()

    async def query(self, prompt, model="gpt-4", max_retries=3):
        payload = {"model": model, "prompt": prompt, "max_tokens": 500}
        # One session for all attempts, so connections are reused across retries.
        async with aiohttp.ClientSession(timeout=self._timeout) as session:
            for attempt in range(max_retries):
                try:
                    return await self._rate_limited_request(session, payload)
                except (aiohttp.ClientError, asyncio.TimeoutError):
                    if attempt == max_retries - 1:
                        raise
                    # Exponential backoff plus a small linear term: 1s, 2.5s, 5s, ...
                    wait = 2 ** attempt + (attempt * 0.5)
                    await asyncio.sleep(wait)

Usage (from inside a coroutine, since query is async):

proxy = AIProxy(api_key="sk-...", base_url="https://ai.interwestinfo.com")
result = await proxy.query("Summarize this article: ...")

Adding a Fallback Model

If the primary API consistently fails or returns a low‑quality response, you can integrate a local model via transformers. Here's a simple fallback that runs a smaller model (e.g., DistilBART) when the external API is down.

from transformers import pipeline

class FallbackSummarizer:
    def __init__(self):
        self.pipe = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

    def summarize(self, text):
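        # truncation=True clips inputs longer than the model's context window
        # instead of failing on long articles.
        result = self.pipe(text, max_length=130, min_length=30, do_sample=False, truncation=True)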
        return result[0]["summary_text"]

Then in your main handler:

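import asyncio
import logging

logger = logging.getLogger(__name__)
fallback = FallbackSummarizer()  # load the model once, not on every failure

async def get_summary(text):
    try:
        result = await proxy.query(f"Summarize: {text}")
        return result["choices"][0]["text"]
    except Exception:
        logger.warning("API down, using local fallback", exc_info=True)
        # The pipeline is blocking, CPU-bound code; run it in a worker thread
        # (asyncio.to_thread, Python 3.9+) so it doesn't stall the event loop.
        return await asyncio.to_thread(fallback.summarize, text)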

Trade-offs and Lessons Learned

  • Latency vs. Reliability: The proxy adds ~50ms overhead per call, but that's negligible compared to the 2‑5s API round trip. The fallback model runs 3‑10x slower locally, but at least the job doesn't fail completely.
  • Cost: Keeping a fallback model loaded on CPU costs nothing extra beyond the RAM it occupies, but GPU usage adds up. I used a CPU‑only server, so summaries from the local model are slower but cheaper.
  • When NOT to do this: If you only have a single API call per user request and downtime is acceptable, a simple retry loop is fine. The proxy layer is overkill for a prototype. But once you have >100 calls/day or multiple users, it pays off.

What I'd Do Differently Next Time

I'd skip the manual queue implementation and use an existing message broker like Redis Streams, or even just a plain asyncio.Queue. I'd also add structured logging from day one, because debugging a chain of async proxy calls without logs is painful.
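The in-process route could look something like the sketch below, which uses asyncio.PriorityQueue plus a per-item timeout. The names worker, run_jobs, and handler are invented for the example, and a single worker is assumed so that priority order is easy to see. Treat it as a starting point rather than a drop-in replacement.

```python
import asyncio
import itertools


async def worker(queue, handler, timeout_s):
    """Pull (priority, seq, item, future) entries and resolve each future."""
    while True:
        _priority, _seq, item, future = await queue.get()
        try:
            result = await asyncio.wait_for(handler(item), timeout=timeout_s)
            future.set_result(result)
        except Exception as exc:  # includes asyncio.TimeoutError
            future.set_exception(exc)
        finally:
            queue.task_done()


async def run_jobs(jobs, handler, timeout_s=5.0):
    """jobs: list of (priority, item); lower priority numbers run first."""
    queue = asyncio.PriorityQueue()
    seq = itertools.count()  # tie-breaker so the items themselves are never compared
    loop = asyncio.get_running_loop()
    futures = []
    for priority, item in jobs:
        future = loop.create_future()
        futures.append(future)
        queue.put_nowait((priority, next(seq), item, future))
    task = asyncio.create_task(worker(queue, handler, timeout_s))
    await queue.join()  # wait until every queued item has been processed
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    # Results come back in submission order; failures appear as exception objects.
    return await asyncio.gather(*futures, return_exceptions=True)
```

Because each item has its own timeout, one slow request fails on its own instead of holding up everything queued behind it.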

For the fallback, I'd consider using a cheaper, faster API as the secondary instead of a local model. Local models are great for offline resilience but the quality gap is still noticeable.
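Whichever secondary you pick, the selection logic can stay generic. Here is a minimal sketch, assuming each backend is wrapped as an async function that takes the text and returns a summary. first_successful and the (name, function) pairs are hypothetical names for the example.

```python
import asyncio


async def first_successful(backends, text):
    """Try each (name, coroutine function) in order.

    Returns (name, result) from the first backend that succeeds, or raises
    RuntimeError listing every failure if none do.
    """
    errors = []
    for name, summarize in backends:
        try:
            return name, await summarize(text)
        except Exception as exc:
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

Returning the backend name alongside the result makes it easy to log how often the secondary is actually being used.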

The Takeaway

Building a resilient proxy layer is a weekend project that saved me from constant fire‑fighting. It doesn't need to be complex – a few hundred lines of Python with asyncio and a fallback can turn a flaky API into a dependable backend.

The exact tool you use (OpenAI, Anthropic, or something like Interwest AI) matters less than the pattern. Start simple, add rate limiting, then fallbacks, and you'll sleep better.

What's your setup for handling API failures? Do you rely on the provider's native retries, or have you built your own layer?
