When Your AI Provider Fails: Building a Resilient Fallback System

#ai #python #webdev #api

I was showing off my new side project at a virtual meetup when it happened. The demo froze. I refreshed—nothing. The API calls to my single AI provider were all returning 503s. I had built an app that relied entirely on one service, and it was down. My carefully polished feature turned into a blank loading spinner. That night, I decided I'd never let that happen again.

The Problem with Single-Provider Dependency

We all love the convenience of a single AI API. You pick one, integrate it, and move on. But here's the thing: outages aren't rare. Rate limits creep up on you during a demo. Costs can spike without warning. And if that one provider changes their pricing or deprecates a model endpoint, you're stuck rewriting code.

I started with OpenAI. It worked great for months. Until it didn't. First a rate limit during my demo, then a pricing change that doubled my monthly bill. I tried switching to another provider manually, but that meant changing every API call across my codebase. Not scalable.

What I Tried That Didn't Work

Round-robin switching – I wrote a simple config that alternated providers on each request. Problem: if provider A was down, half my requests still failed. No good.

Manual fallback in code – I added try/except blocks around every API call and caught exceptions to retry with another provider. It worked, but my code turned into a tangled mess of nested conditions. Plus, I had to hardcode credentials for each provider everywhere.

Using a third-party aggregation service – I explored services that offer a unified API and fallback. They worked, but I was still dependent on a single aggregator. Also, most charged extra for the fallback routing.
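The round-robin failure mode is easy to see in a toy simulation. This is only a sketch: the provider names and the `is_up` map are made up for illustration, not the config I actually ran.

```python
from itertools import cycle

# Hypothetical provider status: A is down, B is up.
is_up = {"provider_a": False, "provider_b": True}


def round_robin_failures(num_requests: int) -> int:
    """Count failed requests when alternating providers blindly."""
    rotation = cycle(is_up)  # iterates over the dict keys in insertion order
    failures = 0
    for _ in range(num_requests):
        provider = next(rotation)
        if not is_up[provider]:
            failures += 1
    return failures


print(round_robin_failures(10))  # 5: every other request hits the dead provider
```

Alternating spreads load, but it has no notion of failure, so a dead provider keeps receiving its full share of traffic.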

What I really wanted was a lightweight, provider-agnostic way to define fallback logic without heavy dependencies. Something I could control fully.

What Eventually Worked: A Multi-Provider Fallback Layer

I built a small Python module that treats every AI provider as a pluggable component. The core idea: an abstract interface, a list of providers in priority order, and a retry/fallback loop that skips unhealthy providers and tries the rest in that order.

Here's the core interface (the full code is available in a gist):

from abc import ABC, abstractmethod
import asyncio
from typing import List, Optional

class AIProvider(ABC):
    """Abstract base for any AI provider."""
    @abstractmethod
    async def complete(self, prompt: str, **kwargs) -> str:
        pass

    @abstractmethod
    async def health_check(self) -> bool:
        pass

Then I implemented a few providers. For OpenAI:

import openai

class OpenAIProvider(AIProvider):
    def __init__(self, api_key: str, model: str = "gpt-3.5-turbo"):
        self.client = openai.AsyncOpenAI(api_key=api_key)
        self.model = model

    async def complete(self, prompt: str, **kwargs) -> str:
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        )
        return response.choices[0].message.content

    async def health_check(self) -> bool:
        try:
            await self.client.models.retrieve(self.model)
            return True
        except Exception:
            # A bare except would also swallow task cancellation
            return False

And for another provider, say a hypothetical service at ai.interwestinfo.com:

import httpx

class InterwestProvider(AIProvider):
    def __init__(self, api_key: str, base_url: str = "https://ai.interwestinfo.com/v1"):
        self.client = httpx.AsyncClient(base_url=base_url, headers={"Authorization": f"Bearer {api_key}"})

    async def complete(self, prompt: str, **kwargs) -> str:
        resp = await self.client.post("/completions", json={"prompt": prompt, **kwargs})
        resp.raise_for_status()
        return resp.json()["choices"][0]["text"]

    async def health_check(self) -> bool:
        try:
            resp = await self.client.get("/health")
            return resp.status_code == 200
        except Exception:
            # A bare except would also swallow task cancellation
            return False

The fallback orchestrator:

class FallbackRouter:
    def __init__(self, providers: List[AIProvider], max_retries: int = 3):
        self.providers = providers
        self.max_retries = max_retries

    async def complete_with_fallback(self, prompt: str, **kwargs) -> str:
        last_exception = None
        for attempt in range(self.max_retries):
            for provider in self.providers:
                # Skip providers that report unhealthy. Note that this check
                # costs one extra round-trip per provider per attempt.
                if not await provider.health_check():
                    continue
                try:
                    return await provider.complete(prompt, **kwargs)
                except Exception as e:
                    last_exception = e
                    print(f"Provider {provider.__class__.__name__} failed: {e}")
            # Back off before the next full pass, but not after the last one
            if attempt < self.max_retries - 1:
                await asyncio.sleep(min(2 ** attempt, 10))
        raise RuntimeError("All providers failed") from last_exception

Usage:

router = FallbackRouter([
    OpenAIProvider(api_key="sk-..."),
    InterwestProvider(api_key="iw-..."),
    # Add more providers here
])

async def main():
    result = await router.complete_with_fallback("Explain quantum computing in one sentence.")
    print(result)

asyncio.run(main())
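To watch the fallback happen without real API keys, you can swap in stub providers. The sketch below is self-contained, so it repeats a trimmed-down interface and router (no health checks or backoff). `FlakyProvider`, `EchoProvider`, and `MiniRouter` are names invented for this example.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import List


class AIProvider(ABC):
    @abstractmethod
    async def complete(self, prompt: str, **kwargs) -> str: ...


class FlakyProvider(AIProvider):
    """Stub that always fails, standing in for a provider that is down."""

    async def complete(self, prompt: str, **kwargs) -> str:
        raise ConnectionError("simulated 503")


class EchoProvider(AIProvider):
    """Stub that always succeeds."""

    async def complete(self, prompt: str, **kwargs) -> str:
        return f"echo: {prompt}"


class MiniRouter:
    """Trimmed-down router: try providers in order, return the first success."""

    def __init__(self, providers: List[AIProvider]):
        self.providers = providers

    async def complete_with_fallback(self, prompt: str, **kwargs) -> str:
        last_exception = None
        for provider in self.providers:
            try:
                return await provider.complete(prompt, **kwargs)
            except Exception as e:
                last_exception = e
        raise RuntimeError("All providers failed") from last_exception


async def demo() -> str:
    router = MiniRouter([FlakyProvider(), EchoProvider()])
    return await router.complete_with_fallback("hello")


print(asyncio.run(demo()))  # echo: hello
```

The first stub raises, the router records the exception, and the second stub answers. If every provider fails, the `RuntimeError` carries the last underlying exception as its cause.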

Lessons Learned & Trade-offs

This approach would have saved my demo. Now if a provider goes down, the router switches to the next one without the user noticing. I can even run health checks on a schedule and deprioritize unhealthy providers.
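Scheduled health checks could look roughly like this. It's a minimal sketch, not part of the module above: `HealthMonitor`, its `interval` parameter, and the `ordered` helper are hypothetical names, and a health check here is just any zero-argument coroutine function that returns a bool (for example a provider's bound `health_check` method).

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# A health check is any zero-argument coroutine function returning bool.
HealthCheck = Callable[[], Awaitable[bool]]


class HealthMonitor:
    """Polls health checks in the background and caches the latest result."""

    def __init__(self, checks: Dict[str, HealthCheck], interval: float = 30.0):
        self.checks = checks
        self.interval = interval
        # Assume healthy until the first poll says otherwise.
        self.status: Dict[str, bool] = {name: True for name in checks}

    async def poll_once(self) -> None:
        for name, check in self.checks.items():
            try:
                self.status[name] = await check()
            except Exception:
                # A check that raises counts as unhealthy.
                self.status[name] = False

    async def run_forever(self) -> None:
        while True:
            await self.poll_once()
            await asyncio.sleep(self.interval)

    def ordered(self, names: List[str]) -> List[str]:
        """Keep priority order, but move unhealthy providers to the back."""
        # sorted() is stable, so healthy providers keep their relative order.
        return sorted(names, key=lambda n: not self.status.get(n, False))
```

At startup you would launch it with `asyncio.create_task(monitor.run_forever())` and have the router read the cached status instead of calling `health_check()` on every request.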

But it's not free.

Latency: Each failed attempt adds a network round-trip, and so does the health check before each call. For real-time apps, you'll want tighter timeouts and maybe a background task that runs health checks off the request path.
Cost: If you race providers in parallel and two of them succeed, you pay for both. You need to decide whether you want a conservative (try one at a time) or greedy (try all in parallel) strategy.
Complexity: You now have multiple API keys to manage, multiple rate limits to monitor, and different response formats. The abstract interface helps, but edge cases (streaming, embeddings, etc.) require more work.
Idempotency: If a provider fails after receiving a request but before returning a response, the fallback may process the same prompt twice. For chat-based apps that's usually okay, but for billing or transactional use cases you need deduplication.
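The router shown earlier is the conservative strategy. For comparison, here is a rough sketch of the greedy one using plain asyncio. `race` is a hypothetical helper, not part of my module: it starts every call at once, returns the first success, and cancels whatever is still running. Cancelling limits double billing but can't rule it out, since a provider may already have processed the request.

```python
import asyncio
from typing import Any, Callable, Coroutine, List

Call = Callable[[], Coroutine[Any, Any, str]]


async def race(calls: List[Call], timeout: float = 10.0) -> str:
    """Greedy strategy: start all calls, return the first successful result."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    pending = {asyncio.create_task(call()) for call in calls}
    try:
        while pending:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            done, pending = await asyncio.wait(
                pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                # Failed tasks are skipped; the first clean result wins.
                if task.exception() is None:
                    return task.result()
        raise RuntimeError("No provider succeeded before the deadline")
    finally:
        # Cancel anything still in flight so we stop waiting on it.
        for task in pending:
            task.cancel()
```

Each entry in `calls` would be something like `lambda: provider.complete(prompt)`. The single `timeout` also bounds worst-case latency for the whole request, which the sequential loop does not.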

When NOT to do this:

If you're prototyping and cost is the only concern, a single provider is fine.
If your app can tolerate periodic outages (e.g., batch processing where you retry later), you don't need fallback.
If you depend on a specific model that only one provider offers (like GPT-4 Vision), you can't fall back to a different model without adjusting your prompt.

What I'd Do Differently Next Time

I'd build a circuit breaker pattern instead of a simple health check. For example, if a provider fails three times in a row, stop calling it for the next minute. I'd also add better logging and metrics: track per-provider latency, error rates, and cost per call. That data would help me reorder priorities automatically.
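A minimal sketch of that circuit breaker, using the numbers from the paragraph above (three consecutive failures, one minute of cooldown). The class and method names are my own for illustration, and the `clock` parameter exists only so the behavior can be tested without sleeping.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop calling a provider for `cooldown` seconds after repeated failures."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so tests do not have to sleep
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None  # None means the breaker is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, allow trial requests again:
        # a failure re-opens the breaker, a success closes it.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

In the router, each provider would get its own breaker: check `allow_request()` instead of calling `health_check()`, then call `record_success()` or `record_failure()` depending on how `complete()` went. That removes the extra health-check round-trip from the request path.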

Also, I'd consider making the fallback async-friendly from day one. My first attempt was synchronous and blocked the event loop. Use aiohttp or httpx from the beginning.
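If you're stuck with a synchronous SDK, one way to keep it from blocking the event loop is `asyncio.to_thread` (Python 3.9+). In this sketch `blocking_complete` is a made-up stand-in for a synchronous client call, with `time.sleep` simulating network latency.

```python
import asyncio
import time
from typing import List


def blocking_complete(prompt: str) -> str:
    """Stand-in for a synchronous SDK call (hypothetical)."""
    time.sleep(0.1)  # simulates network latency; blocks the calling thread
    return f"done: {prompt}"


async def complete_async(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_complete, prompt)


async def main() -> List[str]:
    # The two calls run in separate worker threads and can overlap.
    return await asyncio.gather(complete_async("a"), complete_async("b"))
```

This is a stopgap rather than a replacement for a native async client: threads add overhead and a blocked call can't be cancelled midway, but it keeps other coroutines responsive in the meantime.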

Final Thoughts

You don't need to control the AI—you just need to make sure you're not held hostage by a single pipe. The pattern of abstracting AI providers behind a common interface and layering fallback logic is simple but incredibly liberating. Every project I start now gets a providers/ folder with at least two implementations.

What about you? Have you ever been burned by an AI outage? How do you handle provider failures in your stack? I'm curious to hear what patterns others use.