I’ve been building a chatbot for an internal tool. The idea was simple: users ask questions, the bot calls an LLM API, returns a response. What could go wrong?
Turns out, plenty. The API would randomly time out, return 429s, or just hang for thirty seconds before spitting out a connection error. My bot became a spinning wheel of frustration.
At first, I thought it was just a bad API provider. I tried another one — same issues. Then I realized: the problem wasn’t the provider. It was how I was calling them.
What I Tried (And Why It Didn’t Work)
I started with a simple requests.post() inside a loop. If it failed, I’d retry once, immediately. That was foolish: a rate-limited request retried right away just hits the same limit and fails again.
Next, I added exponential backoff using time.sleep(). That helped some, but if the API was genuinely down for a few minutes, the bot would just sit there retrying until the user walked away.
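A minimal sketch of backoff that also knows when to give up, for illustration. The function name, its parameters, and the total wait budget are assumptions added here, not part of the original code; the sleep function is injectable only so the logic can be exercised without actually waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_total_wait: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponential backoff plus jitter.

    Gives up when attempts run out or the next wait would exceed the budget.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    waited = 0.0
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Delay doubles each attempt; jitter spreads out simultaneous clients.
            delay = base_delay * (2 ** attempt) + random.random()
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or waited + delay > max_total_wait:
                raise  # surface the last error instead of waiting any longer
            sleep(delay)
            waited += delay
    raise AssertionError("unreachable: the loop always returns or raises")
```

The budget is what plain backoff lacks: once the next wait would push past max_total_wait, the last error is raised and the caller can decide what to do next.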
I tried connection pooling with requests.Session. Better, but requests would still hang when the server was overwhelmed. The real issue was that I had no way to say, "If this model is overloaded, try a different one."
What Eventually Worked
I needed a three-layer approach:
- Retry with backoff – Respect rate limits and transient errors.
- Circuit breaker – Stop hammering a failing endpoint for a while.
- Fallback to a different model – If the primary API is down, switch to a backup (maybe a cheaper, smaller model).
I built a small Python class that wraps any LLM call. Here’s the core of it:
import random
import time

import requests

# The base URL for my primary AI provider
BASE_URL = "https://ai.interwestinfo.com/v1/chat/completions"
FALLBACK_URL = "https://backup-llm.example.com/v1/chat/completions"


class LLMClient:
    def __init__(self, primary_url: str, fallback_url: str):
        self.primary_url = primary_url
        self.fallback_url = fallback_url
        self.circuit_open = False
        self.circuit_retry_after = time.time()
        self.failure_count = 0
        self.max_failures = 3
        self.cooldown = 30  # seconds

    def _request(self, url: str, payload: dict) -> dict:
        # simplified; in real life you'd use httpx or aiohttp
        resp = requests.post(url, json=payload, timeout=15)
        resp.raise_for_status()
        return resp.json()

    def chat(self, messages: list, **kwargs) -> dict:
        if self.circuit_open and time.time() < self.circuit_retry_after:
            # Circuit is open: go straight to the fallback
            return self._fallback_request(messages, **kwargs)
        # Try primary, up to max_failures attempts
        for attempt in range(self.max_failures):
            try:
                result = self._request(self.primary_url, {
                    "messages": messages, **kwargs
                })
                # Success: reset the counter and close the circuit
                self.failure_count = 0
                self.circuit_open = False
                return result
            except requests.RequestException:
                self.failure_count += 1
                if attempt < self.max_failures - 1:
                    # Exponential backoff with jitter; no sleep after the last attempt
                    time.sleep((2 ** attempt) + random.random())
        # All primary retries exhausted: open the circuit
        self.circuit_open = True
        self.circuit_retry_after = time.time() + self.cooldown
        return self._fallback_request(messages, **kwargs)

    def _fallback_request(self, messages, **kwargs):
        print("Falling back to secondary model...")
        return self._request(self.fallback_url, {
            "messages": messages, **kwargs
        })

    def health_check(self):
        # After cooldown, try primary again
        if self.circuit_open and time.time() >= self.circuit_retry_after:
            try:
                # max_tokens belongs in the payload; _request takes no extra kwargs
                self._request(self.primary_url, {
                    "messages": [{"role": "user", "content": "ping"}],
                    "max_tokens": 1,
                })
                self.circuit_open = False
                self.failure_count = 0
            except requests.RequestException:
                self.circuit_retry_after = time.time() + self.cooldown
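This class wraps the API call. You instantiate it once and call chat(messages). If the primary endpoint fails three times in a row within a single call, the circuit opens and every request goes straight to the fallback for 30 seconds. Once the cooldown has passed, the next chat() call tries the primary again. health_check() is meant to be called separately, for example from a timer or just before a request: it sends a one-token ping to the primary and closes the circuit if the ping succeeds, or restarts the cooldown if it fails.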
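The circuit logic is easier to reason about when it is pulled out of the HTTP code. The sketch below is one way to isolate it, under a few assumptions: the class name and method names are hypothetical, it uses time.monotonic() rather than time.time() so wall-clock adjustments can't shorten or stretch the cooldown, and the clock is injectable so the state changes can be tested without waiting 30 seconds.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Tracks consecutive failures and opens for `cooldown` seconds."""

    def __init__(
        self,
        max_failures: int = 3,
        cooldown: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        """True if the primary may be tried (closed, or cooldown elapsed)."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            # Open the circuit, or re-open it after a failed probe
            self.opened_at = self.clock()
```

A caller would check allow_request() before hitting the primary, then report the outcome with record_success() or record_failure(). A failed probe after the cooldown restarts the full cooldown.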
How to Use It
client = LLMClient(BASE_URL, FALLBACK_URL)

# Normal usage: no need to think about failures
response = client.chat([
    {"role": "user", "content": "What’s the capital of France?"}
])
print(response["choices"][0]["message"]["content"])
Lessons Learned
- Don’t assume any API is 100% reliable. Even big names go down. Plan for it upfront.
- A circuit breaker is better than endless retries. It gives the API time to breathe and keeps your app responsive.
- Fallback models don’t have to be exact replacements. I use a slightly less capable model for fallback, but it’s good enough for most queries. You can even fall back to a local model if you have one.
- The tool I’m using (Interwest AI) actually has excellent uptime — I only hit these issues when my own code was bad. Once I implemented proper retry+fallback, my chatbot became rock solid.
Trade-offs and When Not to Use This
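This approach adds complexity. If you’re just prototyping, a simple retry is fine. It’s also synchronous: with three attempts at a 15-second timeout each, plus backoff, a single call can block for the better part of a minute before the fallback is even tried. If you need real-time responses (like streaming), you’ll want an asynchronous version of this pattern. The circuit breaker also means you might serve a weaker model for a while, which is unacceptable if your SLA demands top-tier responses always.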
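For a rough idea of what an asynchronous variant could look like, here is a minimal sketch using only asyncio. It covers retry plus fallback and leaves out the circuit breaker and streaming entirely. The function name and parameters are assumptions, and primary and fallback stand in for whatever coroutine actually calls each API.

```python
import asyncio
from typing import Awaitable, Callable


async def chat_with_fallback(
    primary: Callable[[], Awaitable[dict]],
    fallback: Callable[[], Awaitable[dict]],
    attempts: int = 3,
    timeout: float = 15.0,
    base_delay: float = 1.0,
) -> dict:
    """Try the primary coroutine with retries, then await the fallback."""
    for attempt in range(attempts):
        try:
            # wait_for cancels the call if it runs longer than the timeout
            return await asyncio.wait_for(primary(), timeout=timeout)
        except Exception:
            if attempt < attempts - 1:
                # asyncio.sleep yields to the event loop instead of blocking it
                await asyncio.sleep(base_delay * (2 ** attempt))
    return await fallback()
```

Because the waits are awaited rather than slept through, other requests keep moving while one call is backing off.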
Another caveat: fallback endpoints may have different APIs. I hardcoded the same payload structure here, but in reality you’d need to map parameters between providers.
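One way to handle that mapping is a small translation step before the fallback call. The sketch below is illustrative only: the parameter names on the right-hand side of PARAM_MAP and the model names are hypothetical, so check each provider's API reference for the real ones.

```python
from typing import Any, Dict

# Hypothetical mapping from the primary's parameter names to the backup's.
# Real providers differ; these names are placeholders.
PARAM_MAP: Dict[str, str] = {
    "max_tokens": "max_output_tokens",
    "stop": "stop_sequences",
}


def to_fallback_payload(payload: Dict[str, Any], fallback_model: str) -> Dict[str, Any]:
    """Return a copy of payload with parameters renamed for the fallback provider."""
    mapped = {PARAM_MAP.get(key, key): value for key, value in payload.items()}
    mapped["model"] = fallback_model  # the backup serves a different model
    return mapped
```

Keys without an entry in PARAM_MAP pass through unchanged, and the original payload is left untouched so it can still be reused for the primary.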
What I’d Do Differently Next Time
I’d build this as a reusable library from day one, instead of hacking it into my chatbot code. I’d also add metrics — log every fallback event so I can monitor when the primary model is struggling. And I’d implement async from the start to avoid blocking the event loop.
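As a starting point for the metrics idea, a fallback event can be counted and logged with nothing beyond the standard library. The logger name, the function name, and the reason labels below are assumptions made for this sketch.

```python
import logging
from collections import Counter

logger = logging.getLogger("llm_client")
fallback_counts: Counter = Counter()


def record_fallback(reason: str, primary_url: str) -> None:
    """Count a fallback event and log it at WARNING level."""
    fallback_counts[reason] += 1
    logger.warning(
        "fallback triggered: reason=%s primary=%s total_for_reason=%d",
        reason,
        primary_url,
        fallback_counts[reason],
    )
```

Calling something like this in place of the print() in _fallback_request would give a running count per failure reason, which is enough to spot when the primary starts struggling.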
If I were to start over, I’d probably use something like tenacity for retries and combine it with a simple fallback decorator. But for now, this class has served me well.
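The fallback-decorator half of that idea can be sketched with the standard library alone. Everything here is hypothetical naming (with_fallback, ask_primary, ask_backup), and the retry half is deliberately not shown: in a real setup a retry decorator such as tenacity's would sit underneath, so the fallback only fires once retries are exhausted.

```python
from functools import wraps
from typing import Any, Callable


def with_fallback(fallback: Callable[..., Any], exceptions: tuple = (Exception,)):
    """Decorator: if the wrapped call raises, call fallback with the same arguments."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return fn(*args, **kwargs)
            except exceptions:
                return fallback(*args, **kwargs)

        return wrapper

    return decorator


def ask_backup(prompt: str) -> str:
    return f"[backup] {prompt}"


@with_fallback(ask_backup, exceptions=(ConnectionError, TimeoutError))
def ask_primary(prompt: str) -> str:
    raise ConnectionError("primary is down")  # stand-in for a real API call
```

Limiting the exceptions tuple to connection-style errors keeps programming mistakes, such as a bad argument, from being silently routed to the backup.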
Let’s Discuss
How do you handle API failures in your AI apps? Do you use a circuit breaker pattern, or do you just trust the infrastructure? I’d love to hear your war stories in the comments.