Your agent calls gpt-4o. OpenAI returns a 429. Your agent crashes, your user sees nothing.
LLM APIs fail more than you think -- rate limits, outages, content-policy refusals. A single-provider agent is a single point of failure. The fix takes five minutes: a fallback chain that tries the next model automatically.
## The Code
```python
import os

from openai import OpenAI, APIError, RateLimitError, APITimeoutError

# Each entry: (base_url, api_key_env, model_name)
MODEL_CHAIN = [
    ("https://api.openai.com/v1", "OPENAI_API_KEY", "gpt-4o"),
    ("https://api.anthropic.com/v1", "ANTHROPIC_API_KEY", "claude-3-5-sonnet-20241022"),
    ("https://openrouter.ai/api/v1", "OPENROUTER_API_KEY", "meta-llama/llama-3-70b-instruct"),
]


def chat_with_fallback(messages: list[dict], temperature: float = 0.7) -> str:
    """Try each model in MODEL_CHAIN until one succeeds."""
    errors = []
    for base_url, key_env, model in MODEL_CHAIN:
        try:
            client = OpenAI(
                base_url=base_url,
                api_key=os.environ[key_env],  # raises KeyError if the key is unset
            )
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=temperature,
                timeout=10,
            )
            print(f"[OK] Response from {model}")
            return response.choices[0].message.content
        except (APIError, RateLimitError, APITimeoutError, KeyError) as e:
            print(f"[FAIL] {model}: {e}")
            errors.append((model, str(e)))
            continue
    raise RuntimeError(f"All models failed: {errors}")


# --- Run it ---
if __name__ == "__main__":
    result = chat_with_fallback([
        {"role": "user", "content": "Explain DNS in one sentence."}
    ])
    print(f"\nAnswer: {result}")
```
That is the entire implementation. No framework, no library -- just the `openai` SDK and a loop.
## How It Works
`MODEL_CHAIN` is an ordered list of providers. Each entry has a `base_url` (most LLM providers expose an OpenAI-compatible endpoint), the environment variable holding the API key, and the model name.
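Extending the chain is just appending tuples. The entries below are illustrative -- `EXTRA_PROVIDERS` is my name, and you should verify the base URLs and model names against each provider's documentation before using them:

```python
# Hypothetical extra entries for the chain; base URLs and model names
# are assumptions -- confirm them against each provider's docs.
EXTRA_PROVIDERS = [
    # Groq exposes its OpenAI-compatible endpoint under /openai/v1
    ("https://api.groq.com/openai/v1", "GROQ_API_KEY", "llama-3.1-70b-versatile"),
    # Mistral's API is also OpenAI-compatible
    ("https://api.mistral.ai/v1", "MISTRAL_API_KEY", "mistral-large-latest"),
]

# Order matters: earlier entries are tried first.
# MODEL_CHAIN.extend(EXTRA_PROVIDERS)
for base_url, key_env, model in EXTRA_PROVIDERS:
    print(f"{model} via {base_url} (key from ${key_env})")
```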
`chat_with_fallback` iterates through the chain. It builds a fresh `OpenAI` client per provider, fires the request, and returns the first successful response. If a call raises `RateLimitError`, `APITimeoutError`, or any other `APIError` (the first two are subclasses of it, so they are listed here mostly for readability), it logs the failure and moves on to the next model.
The `KeyError` catch handles the case where you haven't set an API key for a provider -- it skips that model instead of crashing.
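If you prefer, you can also filter the chain up front to providers whose keys are actually set, so unconfigured entries never appear in the loop at all. A small self-contained sketch (`configured_models` is a helper I'm introducing here, not part of the code above; the env manipulation is demo-only):

```python
import os

MODEL_CHAIN = [
    ("https://api.openai.com/v1", "OPENAI_API_KEY", "gpt-4o"),
    ("https://api.anthropic.com/v1", "ANTHROPIC_API_KEY", "claude-3-5-sonnet-20241022"),
    ("https://openrouter.ai/api/v1", "OPENROUTER_API_KEY", "meta-llama/llama-3-70b-instruct"),
]


def configured_models(chain):
    """Keep only entries whose API-key env var is set and non-empty."""
    return [entry for entry in chain if os.environ.get(entry[1])]


# Demo only: pretend only the OpenRouter key is configured.
os.environ.pop("OPENAI_API_KEY", None)
os.environ.pop("ANTHROPIC_API_KEY", None)
os.environ["OPENROUTER_API_KEY"] = "sk-or-demo"

print([model for _, _, model in configured_models(MODEL_CHAIN)])
# -> ['meta-llama/llama-3-70b-instruct']
```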
The `timeout=10` on each call prevents one slow provider from blocking the entire chain. If every model fails, the function raises a `RuntimeError` with a summary of every error.
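At the call site you still have to decide what happens when the whole chain is exhausted. One option is to catch the `RuntimeError` and degrade gracefully instead of crashing the agent. A minimal sketch with a stubbed-out chain so it runs offline (`safe_chat` and `failing_chain` are names I'm introducing for illustration):

```python
def failing_chain(messages):
    """Stand-in for chat_with_fallback when every provider is down."""
    raise RuntimeError("All models failed: [('gpt-4o', '429'), ('claude-3-5-sonnet-20241022', 'timeout')]")


def safe_chat(messages):
    """Serve a canned reply instead of crashing the agent."""
    try:
        return failing_chain(messages)
    except RuntimeError as e:
        # In production, send e to your structured logs / alerting here.
        return "Sorry, I'm having trouble reaching my language model. Please try again."


print(safe_chat([{"role": "user", "content": "hi"}]))
# -> Sorry, I'm having trouble reaching my language model. Please try again.
```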
## Expected Output
When your primary model is rate-limited:
```
[FAIL] gpt-4o: Error code: 429 - Rate limit exceeded
[OK] Response from claude-3-5-sonnet-20241022

Answer: DNS translates human-readable domain names into IP addresses so your browser knows which server to contact.
```
The user never knows anything went wrong. Your agent stays alive.
## Tips for Production
- **Order by cost.** Put your cheapest model first and your most expensive last. Fallbacks should escalate cost, not the other way around.
- **Add jitter between retries.** If the first model failed due to a rate limit, hitting it again immediately will fail again. A short randomized pause between attempts -- `time.sleep(random.uniform(0, 0.5))` rather than a fixed delay -- helps keep retries from piling up at the same instant.
- **Log every fallback.** That `print` statement should be a structured log in production. If you are falling back 40% of the time, your primary provider has a problem.
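Putting the last two tips together, here is one way to sketch jittered backoff plus structured logging. `backoff_delay` is a helper I'm introducing (exponential backoff with "full jitter"); the `logging` setup is illustrative:

```python
import logging
import random
import time

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("fallback")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 4.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


# Inside the fallback loop you would sleep before each retry:
for attempt in range(3):
    pause = backoff_delay(attempt, base=0.05)  # tiny base so this demo runs fast
    log.info("fallback attempt=%d sleeping=%.3fs", attempt, pause)
    time.sleep(pause)
```

Swapping `log.info` for your structured logger of choice gives you the fallback-rate metric mentioned above almost for free.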
This is part of the AI Agent Quick Tips series -- short, copy-paste patterns for building production agents. Previous tips cover retry logic, streaming responses, and guardrails.
If managing provider keys and fallback chains sounds like plumbing you would rather skip, Nebula handles model routing for you -- but the pattern above works anywhere.