Last month, my side project hit a wall. The AI summarization API I depended on returned a 503 error for three hours. My app – a simple tool that translates meeting transcripts into action items – stopped working entirely. Users noticed. I got emails. It was embarrassing.
I had built everything around a single provider. One point of failure. Classic mistake.
The Problem
I was using a popular AI API to generate summaries. It worked beautifully... until it didn't. The first time it happened, I panicked and scrambled to find an alternative. I ended up rewriting chunks of code while the outage continued. Not fun.
What I needed was a system that could gracefully degrade – try a primary model, and if that fails, automatically switch to a secondary one. Ideally without losing context or having to restart the process.
What I Tried That Didn't Work
Naïve Retry Loop
My first attempt was just adding a retry with backoff. That helped with transient errors, but it did nothing for sustained outages. The API was down for hours; retrying just wasted tokens and time.
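```python
# Bad idea: retrying the same dead endpoint
import time

response = None
for attempt in range(5):
    try:
        response = call_primary_api(prompt)  # the single provider call
        break
    except Exception:
        time.sleep(2 ** attempt)  # exponential backoff: 1, 2, 4, 8, 16 seconds
```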
Hardcoded Fallback
Then I tried manually switching between two providers with a config flag. But I had to redeploy every time one provider went down. Also not scalable.
What Eventually Worked: The Fallback Router Pattern
I built a lightweight router that wraps multiple AI clients and tries them in order. If one fails (via exception or bad status), it moves to the next. It also logs failures so I can adjust my configuration later.
Here's the core idea:
```python
class AIRouter:
    def __init__(self, clients: list):
        """
        clients: list of (name, callable) tuples
        each callable takes a prompt and returns text or raises
        """
        self.clients = clients

    def generate(self, prompt: str) -> str:
        for name, call_fn in self.clients:
            try:
                result = call_fn(prompt)
                return result
            except Exception as e:
                print(f"{name} failed: {e}. Trying next...")
                continue
        raise RuntimeError("All AI clients failed")
```
I then defined my clients. For the primary, I used a wrapper around OpenAI's API. For the secondary, I used a local model via Ollama. (Note: you can plug in any provider – even a service like ai.interwestinfo.com if it exposes a compatible interface.)
```python
# Example client wrappers
def openai_client(prompt: str) -> str:
    # your OpenAI call here
    return response_text

def ollama_client(prompt: str) -> str:
    # your local model call
    return response_text

router = AIRouter([
    ("openai", openai_client),
    ("ollama", ollama_client),
    # could add more, e.g. ai.interwestinfo.com
])

summary = router.generate("Summarize this transcript: ...")
```
Making It Production-Ready
That simple router worked for basic cases, but I soon discovered edge cases:
- Non-fatal errors: Sometimes the API returns a 200 with a junk response (empty or nonsensical). I added a validation step.
- Rate limits: I didn’t want to blast all providers at once. I added a delay between attempts.
- Context loss: If a model fails mid-stream, the next model shouldn't start from scratch. I now cache the prompt and any partial results.
- Logging & metrics: I log which provider succeeded and how long it took. This helps me decide if I should demote a slow provider.
Here's an improved version:
```python
import time


class RobustAIRouter:
    def __init__(self, clients, validator=None, delay=1):
        self.clients = clients
        # Default validator: accept any non-empty output
        self.validator = validator or (lambda x: len(x) > 0)
        self.delay = delay  # seconds to wait before trying the next client

    def generate(self, prompt):
        for i, (name, call_fn) in enumerate(self.clients):
            try:
                result = call_fn(prompt)
                if not self.validator(result):
                    raise ValueError(f"Invalid output from {name}")
                print(f"{name} succeeded")
                return result
            except Exception as e:
                print(f"{name} failed: {e}")
                # Pause only if there is another client left to try
                if i < len(self.clients) - 1:
                    time.sleep(self.delay)
        raise RuntimeError("All clients failed")
```
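The validation and metrics bullets above are only partly visible in that class, so here is a minimal sketch of how they might look. The names `looks_like_summary` and `timed` are hypothetical, and the 20-character threshold is an arbitrary assumption, not a value from my app.

```python
import time
from typing import Callable


def looks_like_summary(text, min_chars: int = 20) -> bool:
    """Hypothetical validator: reject None, non-strings, and near-empty output."""
    return isinstance(text, str) and len(text.strip()) >= min_chars


def timed(name: str, call_fn: Callable[[str], str], log: list) -> Callable[[str], str]:
    """Wrap a client so every call appends (name, succeeded, seconds) to `log`."""
    def wrapper(prompt: str) -> str:
        start = time.monotonic()
        try:
            result = call_fn(prompt)
        except Exception:
            log.append((name, False, time.monotonic() - start))
            raise  # let the router see the failure and move on
        log.append((name, True, time.monotonic() - start))
        return result
    return wrapper
```

You would pass `validator=looks_like_summary` to `RobustAIRouter` and wrap each client with `timed(...)` before adding it to the list. The `log` list then holds enough data to spot a provider that is consistently slow.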
Lessons Learned / Trade-offs
- Latency: The fallback approach increases worst-case latency. If the first provider takes 5 seconds and fails, you add another 5+ seconds. Consider setting a timeout per client.
- Cost: You might burn tokens on failed requests. I now cancel pending requests when the first succeeds, but that requires async design.
- Consistency: Different models produce different outputs. Your downstream code needs to handle variation. I added a post-processing normalization step.
- Complexity: The router is simple, but testing all failure scenarios is hard. I wrote integration tests with mocked clients.
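For the per-client timeout mentioned under Latency, one possible approach is a thread-based wrapper. This is a sketch under assumptions: `with_timeout` is a hypothetical helper, and it only stops waiting. The underlying request keeps running in the background, so if your HTTP client has its own timeout setting, that is likely the better choice.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable


def with_timeout(call_fn: Callable[[str], str], seconds: float) -> Callable[[str], str]:
    """Return a client that raises TimeoutError if call_fn takes too long."""
    def wrapper(prompt: str) -> str:
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(call_fn, prompt)
        try:
            return future.result(timeout=seconds)
        except FutureTimeout:
            raise TimeoutError(f"client exceeded {seconds}s") from None
        finally:
            # Don't block on a stuck worker; it finishes in the background.
            pool.shutdown(wait=False)
    return wrapper
```

Because the wrapper raises like any other failing client, the router needs no changes: an entry such as `("openai", with_timeout(openai_client, 5.0))` simply falls through to the next provider when the limit is hit.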
What I'd Do Differently Next Time
I'd start with an async design from day one. Python's asyncio would let me try multiple providers concurrently and take the first successful result. That reduces latency but increases cost. It's a trade-off.
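A rough sketch of that concurrent idea, assuming each client is an `async` function that takes a prompt and returns text. The name `first_success` is hypothetical, and this is not code I run in production.

```python
import asyncio
from typing import Awaitable, Callable

AsyncClient = Callable[[str], Awaitable[str]]


async def first_success(clients: list[tuple[str, AsyncClient]], prompt: str) -> tuple[str, str]:
    """Run all clients concurrently; return (name, text) from the first to succeed."""
    tasks = {asyncio.ensure_future(fn(prompt)): name for name, fn in clients}
    pending = set(tasks)
    errors = []
    try:
        while pending:
            done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                if task.exception() is None:
                    return tasks[task], task.result()
                errors.append(f"{tasks[task]}: {task.exception()}")
        raise RuntimeError("All AI clients failed: " + "; ".join(errors))
    finally:
        for task in pending:
            task.cancel()  # stop the losers so they don't keep running
```

Cancelling the remaining tasks stops waiting on them locally, but a provider may still bill for a request it has already started processing, which is where the extra cost comes from.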
Also, I'd build a health check endpoint for each provider (e.g., ping them with a simple request) so the router can skip known-dead clients.
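One way the health-check idea could look, as a sketch only: `HealthAwareRouter`, the `probes` mapping, and the 30-second `ttl` are all hypothetical choices. Each probe is a zero-argument function that returns True when the provider looks alive, and results are cached so the probe itself doesn't add a request to every call.

```python
import time
from typing import Callable


class HealthAwareRouter:
    """Sketch: skip clients whose last health probe (or real call) failed recently."""

    def __init__(self, clients, probes, ttl: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.clients = clients  # list of (name, callable) tuples
        self.probes = probes    # name -> zero-argument callable returning bool
        self.ttl = ttl          # seconds a health result stays cached
        self.clock = clock
        self._cache = {}        # name -> (checked_at, healthy)

    def is_healthy(self, name: str) -> bool:
        cached = self._cache.get(name)
        if cached and self.clock() - cached[0] < self.ttl:
            return cached[1]
        try:
            healthy = bool(self.probes[name]())
        except Exception:
            healthy = False  # a probe that raises counts as down
        self._cache[name] = (self.clock(), healthy)
        return healthy

    def generate(self, prompt: str) -> str:
        for name, call_fn in self.clients:
            if not self.is_healthy(name):
                continue  # known-dead: don't spend a request on it
            try:
                return call_fn(prompt)
            except Exception:
                # Mark the client down until the ttl expires
                self._cache[name] = (self.clock(), False)
        raise RuntimeError("All AI clients failed or were skipped")
```

A failed real call also marks the provider as down, so during a sustained outage the router stops hitting the dead endpoint after the first failure instead of retrying it on every request.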
The Real Takeaway
The technique here isn't about any specific tool. It's about acknowledging that external dependencies fail and planning for it. You can apply this fallback pattern to databases, CDNs, or any service.
I still use a primary AI API most of the time, but now I sleep better knowing my app won't die if it goes down. The router lets me add new providers as easily as adding a new entry to a list.
What does your fallback strategy look like? Have you ever been caught off guard by an API outage?