Python LLM API Error Handling: A Complete Guide to 429 Rate Limits, Retries, and Failover
If you're building AI-powered applications in Python, you've probably hit this wall: your LLM provider returns a 429 (rate limit), a 502 (bad gateway), or just hangs until timeout. The first time it happens, you add a time.sleep(). The second time, you write a retry loop. By the tenth time, you're wondering if there's a better way to handle LLM API errors in production.
This guide covers the three layers of LLM API error handling every Python developer needs to know: retry logic, multi-provider failover, and fallback strategies.
Layer 1: Handle 429 Rate Limits and Transient Errors
The most common LLM API error is the 429 Rate Limit. Every provider has them — OpenAI, Anthropic, DeepSeek. The naive fix is:
import time
import openai

def call_with_retry(prompt, max_retries=3):
    for i in range(max_retries):
        try:
            return openai.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}]
            )
        except openai.RateLimitError:
            time.sleep(2 ** i)  # exponential backoff
    raise Exception("All retries exhausted")
This works, but it has problems: it ignores the Retry-After header, it catches only RateLimitError (so a transient 502 or timeout fails immediately), and once retries are exhausted your app still fails.
The Right Way: Exponential Backoff with Jitter
Production retry logic needs:
- Exponential backoff — double the wait between each attempt
- Jitter — randomize the wait to avoid thundering herd
- Retry-After respect — honor the provider's specified wait time
- Error classification — treat 429 (retryable) differently from 401 (not retryable)
A Python implementation:
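import asyncio
import random
from openai import APIConnectionError, InternalServerError, RateLimitError

async def smart_retry(make_call, max_retries=5, base_delay=1.0):
    # make_call is a zero-argument callable that returns a fresh coroutine.
    # A coroutine object can only be awaited once, so it cannot be retried.
    for attempt in range(max_retries):
        try:
            return await make_call()
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            try:
                retry_after = float(e.response.headers.get("Retry-After", 0))
            except ValueError:
                retry_after = 0.0  # header held an HTTP date; fall back to backoff
            wait = retry_after or (base_delay * (2 ** attempt) + random.uniform(0, 0.5))
            print(f"429 rate limit, retrying in {wait:.1f}s...")
            await asyncio.sleep(wait)
        except (APIConnectionError, InternalServerError):
            # APITimeoutError subclasses APIConnectionError. Non-retryable errors
            # such as a 401 AuthenticationError are not caught and propagate at once.
            if attempt == max_retries - 1:
                raise
            wait = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(wait)
    # Only reached when max_retries is 0.
    raise Exception("Max retries exceeded")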
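Retry-After can carry either a number of seconds or an HTTP date, so parsing needs to handle both forms. The following is a minimal, standard-library-only sketch of that parsing plus a capped backoff with full jitter. The helper names `parse_retry_after` and `backoff_delay` and the 60-second cap are illustrative assumptions, not part of any SDK.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return the wait in seconds from a Retry-After header value, or None."""
    if not value:
        return None
    try:
        return max(0.0, float(value))  # delay-seconds form, e.g. "7"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # neither a number nor a parseable date
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def backoff_delay(attempt, base_delay=1.0, cap=60.0, retry_after=None):
    """Honor Retry-After when present, else use capped backoff with full jitter."""
    if retry_after is not None:
        return retry_after
    # Full jitter: pick uniformly between 0 and the capped exponential delay.
    return random.uniform(0, min(cap, base_delay * (2 ** attempt)))
```

Full jitter spreads clients across the whole backoff window, which is one common way to avoid the thundering herd mentioned above. Adding a small random offset to a fixed delay is a valid alternative.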
But this only handles transient errors. What if your provider is down for 30 minutes?
Layer 2: Multi-Provider Failover
A single-provider retry loop can't help when the provider itself is unavailable. The Claude outage of June 2026 took Anthropic offline for 3 hours. OpenAI has had multi-hour partial outages. DeepSeek experiences periodic congestion.
Multi-provider failover means your application automatically switches to a backup provider when the primary is unavailable.
Manual Approach
providers = [
    ("openai", "sk-..."),
    ("anthropic", "sk-ant-..."),
    ("deepseek", "sk-..."),
]

async def call_with_failover(prompt):
    # call_provider is a placeholder for your own per-provider client wrapper.
    for name, key in providers:
        try:
            return await call_provider(name, key, prompt)
        except Exception as e:
            print(f"{name} failed: {e}, trying next...")
            continue
    raise Exception("All providers failed")
This is better, but still naive:
- It doesn't test provider health before calling
- It switches providers on any error, even retryable ones
- No validation that the fallback output is actually correct
- Latency adds up as you try each provider in sequence
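One way to keep sequential failover from stacking up latency is to give every attempt a timeout and share a single overall time budget. The sketch below assumes each provider is wrapped in a zero-argument async callable. The function name `failover_with_budget` and both default timeouts are hypothetical choices for illustration.

```python
import asyncio
import time


async def failover_with_budget(attempts, total_budget=10.0, per_call_timeout=5.0):
    """Try each (name, make_call) pair in order under one shared time budget."""
    deadline = time.monotonic() + total_budget
    errors = []
    for name, make_call in attempts:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # budget spent; stop trying further providers
        timeout = min(per_call_timeout, remaining)
        try:
            result = await asyncio.wait_for(make_call(), timeout=timeout)
            return name, result
        except Exception as exc:  # includes the timeout raised by wait_for
            errors.append((name, repr(exc)))
    raise RuntimeError(f"All providers failed or budget exhausted: {errors}")
```

A hung provider then costs at most `per_call_timeout` seconds before the next one is tried, and the caller never waits longer than `total_budget` overall.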
The Production Pattern: Health Monitoring + Smart Routing
A production failover system should:
- Track per-provider error rates and latency (P50/P95/P99)
- Route to the healthiest provider, not just the first
- Distinguish retryable errors (retry on the same provider first) from non-retryable ones (fail over immediately)
- Validate output after failover — model responses differ between providers
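The tracking and routing points above can be sketched with a rolling window per provider. This is a simplified illustration under stated assumptions: the `ProviderHealth` class, the `rank_providers` function, the window size of 50, and the 0.5 error-rate threshold are all hypothetical, and a real system would likely also decay old samples over time.

```python
import math
from collections import deque


class ProviderHealth:
    """Rolling window of recent call outcomes for one provider."""

    def __init__(self, window=50):
        self.outcomes = deque(maxlen=window)   # True for success, False for failure
        self.latencies = deque(maxlen=window)  # seconds, successful calls only

    def record(self, ok, latency=None):
        self.outcomes.append(ok)
        if ok and latency is not None:
            self.latencies.append(latency)

    def error_rate(self):
        if not self.outcomes:
            return 0.0  # no data yet: treat the provider as healthy
        return self.outcomes.count(False) / len(self.outcomes)

    def p95_latency(self):
        if not self.latencies:
            return 0.0
        ordered = sorted(self.latencies)
        # Nearest-rank percentile over the current window.
        index = max(0, math.ceil(0.95 * len(ordered)) - 1)
        return ordered[index]


def rank_providers(health, max_error_rate=0.5):
    """Return provider names healthiest first, dropping those above the threshold."""
    usable = [name for name, h in health.items() if h.error_rate() <= max_error_rate]
    return sorted(usable, key=lambda n: (health[n].error_rate(), health[n].p95_latency()))
```

Sorting by error rate first and P95 latency second is one possible policy. Weighting the two into a single score would work as well.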
Layer 3: LLM Fallback Strategy — Graceful Degradation
The most sophisticated error handling strategy is a cascading fallback:
Request → Retry (transient errors)
        → Model Degrade (switch to cheaper model in same provider)
        → Provider Failover (switch to different provider)
        → Flywheel Learning (record patterns for faster diagnosis)
This means your application degrades gracefully instead of failing on the first error:
- L1 — Smart Retry: 429 rate limit? Wait and retry with exponential backoff. Timeout? Retry once.
- L2 — Model Degrade: OpenAI GPT-4o keeps failing? Try GPT-4o-mini. Same API, lower cost, higher availability.
- L3 — Provider Failover: All OpenAI models failing? Switch to Anthropic Claude, then DeepSeek.
- L4 — Self-Learning: Record the failure pattern. Next time the same error appears, skip straight to the solution.
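The tiers above can be sketched as a single loop over an ordered list of (provider, model) pairs. This is a rough outline, not a definitive implementation: `RetryableError`, `cascade`, and the injected `call` function are hypothetical names, and the failure log only records what happened without acting on it.

```python
import asyncio


class RetryableError(Exception):
    """Hypothetical marker for transient failures (429, 5xx, timeouts)."""


async def cascade(prompt, tiers, call, max_retries=3, base_delay=1.0, failure_log=None):
    """Try each (provider, model) tier in order, retrying transient errors per tier."""
    failure_log = failure_log if failure_log is not None else []
    for provider, model in tiers:
        for attempt in range(max_retries):
            try:
                return await call(provider, model, prompt)
            except RetryableError as exc:
                # Tier 1: retry the same model with exponential backoff.
                failure_log.append((provider, model, type(exc).__name__))
                if attempt < max_retries - 1:
                    await asyncio.sleep(base_delay * (2 ** attempt))
            except Exception as exc:
                # Non-retryable: stop retrying this tier and move to the next one.
                failure_log.append((provider, model, type(exc).__name__))
                break
    raise RuntimeError("All tiers failed")
```

Ordering the tiers as gpt-4o, then gpt-4o-mini, then another provider's model gives the degrade-then-failover sequence. The accumulated `failure_log` is the raw material a learning step could use to reorder tiers later.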
The Silent Failure Problem
There's a catch. Providers sometimes return 200 OK with garbage content — empty responses, "I cannot answer that" refusals, or JSON responses missing required fields. These are the most dangerous errors because your error handler thinks everything is fine.
A production fallback strategy must validate each response:
def validate_response(query, response, expected_schema=None):
    # is_refusal, validate_json_schema, and semantic_similarity are placeholders
    # for your own helpers; they are not defined in this guide.
    checks = []
    # Check 1: Was it a refusal disguised as a normal response?
    checks.append(not is_refusal(response))
    # Check 2: Does JSON output have all required fields?
    if expected_schema:
        checks.append(validate_json_schema(response, expected_schema))
    # Check 3: Is the response semantically relevant to the query?
    checks.append(semantic_similarity(query, response) > 0.3)
    # Check 4: Is the response empty or too short to be useful?
    checks.append(len(response.content) > 20)
    return all(checks)
If validation fails, treat it like a provider error — degrade or failover.
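The validator above leaves its helpers undefined. As a rough illustration of the two cheapest checks, here is a standard-library sketch that works on the reply text as a plain string. The names `looks_like_refusal` and `missing_fields` and the phrase list are assumptions for this example; a production refusal detector would need a phrase list tuned on real traffic, or a classifier.

```python
import json

# Illustrative phrases only; tune this list against refusals you actually observe.
REFUSAL_MARKERS = ("i cannot answer", "i can't help with", "i'm unable to")


def looks_like_refusal(text):
    """Heuristic: does the start of the reply contain a known refusal phrase?"""
    head = text.strip().lower()[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)


def missing_fields(text, required):
    """Return required keys absent from a JSON object reply.

    If the text is not valid JSON, or is not a JSON object, every
    required key counts as missing.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return list(required)
    if not isinstance(data, dict):
        return list(required)
    return [key for key in required if key not in data]
```

An empty list from `missing_fields` means the structural check passed. Anything else can be treated as a failed response and routed to the next tier.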
Putting It All Together
Here's what production-ready LLM API error handling looks like:
# nb refers to the NeuralBridge SDK module; its import line is not shown here.
# The await below must run inside an async function.
engine = nb.SelfHealingEngine()
# Configure multiple providers
engine.add_provider("openai", models=["gpt-4o", "gpt-4o-mini"])
engine.add_provider("anthropic", models=["claude-sonnet-4-20250514"])
engine.add_provider("deepseek", models=["deepseek-v4-flash"])
# Enable all 4 tiers: retry → degrade → failover → learn
result = await engine.call(
    "Process this customer refund request",
    fallback_strategy="cascade"  # graceful degradation
)
When you call an LLM through this engine, it automatically:
- Retries on 429/500/timeout with smart backoff
- Degrades to a cheaper model under load
- Fails over to another provider when needed
- Validates every response for silent failures
- Learns from each failure to make future recovery faster
Summary
| Problem | Solution |
|---|---|
| 429 rate limits | Exponential backoff with jitter + Retry-After respect |
| Provider down | Multi-provider failover with health routing |
| Silent failures | Per-response validation (refusal, schema, relevance, and length checks) |
| Production reliability | 4-tier cascading fallback strategy |
Don't write retry logic for every provider. Use a unified error handling SDK that handles all these cases in one import. Your code stays clean, your app stays up.
Built with NeuralBridge SDK — open-source Python LLM API error handling. One dependency, one line of code, zero gateways.