DEV Community

hhhfs9s7y9-code

Python LLM API Error Handling: A Complete Guide to 429 Rate Limits, Retries, and Failover

If you're building AI-powered applications in Python, you've probably hit this wall: your LLM provider returns a 429 (rate limit), a 502 (bad gateway), or just hangs until timeout. The first time it happens, you add a time.sleep(). The second time, you write a retry loop. By the tenth time, you're wondering if there's a better way to handle LLM API errors in production.

This guide covers the three layers of LLM API error handling every Python developer needs to know: retry logic, multi-provider failover, and fallback strategies.


Layer 1: Handle 429 Rate Limits and Transient Errors

The most common LLM API error is the 429 Rate Limit. Every provider has them — OpenAI, Anthropic, DeepSeek. The naive fix is:

import time
import openai

def call_with_retry(prompt, max_retries=3):
    for i in range(max_retries):
        try:
            return openai.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}]
            )
        except openai.RateLimitError:
            time.sleep(2 ** i)  # exponential backoff
    raise Exception("All retries exhausted")

This works but has problems: it doesn't respect the Retry-After header, doesn't distinguish between error types, and when retries are exhausted, your app still fails.

The Right Way: Exponential Backoff with Jitter

Production retry logic needs:

  • Exponential backoff — double the wait between each attempt
  • Jitter — randomize the wait to avoid thundering herd
  • Retry-After respect — honor the provider's specified wait time
  • Error classification — treat 429 (retryable) differently from 401 (not retryable)

A Python implementation:

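import asyncio
import random
from openai import APIConnectionError, APITimeoutError, InternalServerError, RateLimitError

async def smart_retry(make_call, max_retries=5, base_delay=1.0):
    # make_call is a zero-argument callable returning a fresh coroutine on each
    # attempt; a coroutine object can only be awaited once, so it can't be reused.
    # Usage: await smart_retry(lambda: client.chat.completions.create(...))
    for attempt in range(max_retries):
        try:
            return await make_call()
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Retry-After may be absent or an HTTP date; fall back to backoff then.
            try:
                retry_after = float(e.response.headers.get("Retry-After", 0))
            except ValueError:
                retry_after = 0.0
            wait = retry_after or (base_delay * (2 ** attempt) + random.uniform(0, 0.5))
            print(f"429 rate limit, retrying in {wait:.1f}s...")
            await asyncio.sleep(wait)
        except (APITimeoutError, APIConnectionError, InternalServerError):
            # Transient failures only; 401, 400, and other client errors propagate at once.
            if attempt == max_retries - 1:
                raise
            wait = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(wait)
    raise RuntimeError("Max retries exceeded")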
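
The building blocks of the retry loop above can also be factored into small, testable helpers. The sketch below uses only the standard library; the names parse_retry_after, backoff_delay, is_retryable, and the RETRYABLE_STATUS set are illustrative assumptions, not part of any provider SDK. It assumes Retry-After may arrive either as a number of seconds or as an HTTP date, and it uses "full jitter" (a random wait between zero and the capped exponential delay) rather than the fixed 0 to 0.5 second jitter used above.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumed set of status codes worth retrying; adjust to your provider's docs.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def is_retryable(status_code):
    """Classify an HTTP status: True means back off and retry."""
    return status_code in RETRYABLE_STATUS


def parse_retry_after(value, now=None):
    """Return the wait in seconds from a Retry-After header, or None if unusable."""
    if value is None:
        return None
    value = value.strip()
    try:
        # Form 1: delay in seconds, e.g. "7"
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def backoff_delay(attempt, base_delay=1.0, max_delay=30.0, rng=random):
    """Full jitter: a uniform random wait between 0 and the capped exponential delay."""
    cap = min(max_delay, base_delay * (2 ** attempt))
    return rng.uniform(0, cap)
```

A retry loop would then wait for parse_retry_after(...) when the header yields a value and for backoff_delay(attempt) otherwise. The max_delay cap keeps late attempts from sleeping for minutes.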

But this only handles transient errors. What if your provider is down for 30 minutes?


Layer 2: Multi-Provider Failover

A single-provider retry loop can't help when the provider itself is unavailable. The Claude outage of June 2026 took Anthropic offline for 3 hours. OpenAI has had multi-hour partial outages. DeepSeek experiences periodic congestion.

Multi-provider failover means your application automatically switches to a backup provider when the primary is unavailable.

Manual Approach

providers = [
    ("openai", "sk-..."),
    ("anthropic", "sk-ant-..."),
    ("deepseek", "sk-..."),
]

async def call_with_failover(prompt):
    # call_provider is a placeholder for your own per-provider client wrapper.
    for name, key in providers:
        try:
            return await call_provider(name, key, prompt)
        except Exception as e:
            print(f"{name} failed: {e}, trying next...")
    raise RuntimeError("All providers failed")

This is better, but still naive:

  • It doesn't test provider health before calling
  • It switches providers on any error, even retryable ones
  • No validation that the fallback output is actually correct
  • Latency adds up as you try each provider in sequence
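
One way to limit the last problem is to give each provider a fixed time budget, so a hung provider costs a bounded delay before the next one is tried. The following is a minimal sketch using asyncio.wait_for from the standard library; the function name first_within_budget and the call(name, prompt) signature are assumptions made for illustration.

```python
import asyncio


async def first_within_budget(prompt, providers, call, per_provider_timeout=10.0):
    """Try providers in order, giving each at most per_provider_timeout seconds.

    call is a hypothetical async function call(name, prompt) that you supply.
    Returns (provider_name, result) from the first provider that answers in time.
    """
    errors = {}
    for name in providers:
        try:
            result = await asyncio.wait_for(call(name, prompt), timeout=per_provider_timeout)
            return name, result
        except asyncio.TimeoutError:
            errors[name] = "timed out"
        except Exception as exc:  # any other failure: note it and move on
            errors[name] = repr(exc)
    raise RuntimeError(f"All providers failed: {errors}")
```

With three providers and a 10 second budget, the worst case is about 30 seconds instead of three full client timeouts. It still switches on any error, so it addresses only the latency point.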

The Production Pattern: Health Monitoring + Smart Routing

A production failover system should:

  1. Track per-provider error rates and latency (P50/P95/P99)
  2. Route to the healthiest provider, not just the first
  3. Distinguish between retryable errors (back off and retry on the same provider first) and non-retryable ones (fail over immediately)
  4. Validate output after failover — model responses differ between providers
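
The first two points can be sketched with a rolling window of recent call outcomes per provider. This is an illustrative outline, not a reference implementation: ProviderHealth, rank_providers, the window size of 100, and the 0.5 error-rate cutoff are all assumptions chosen for the example.

```python
import math
from collections import deque


class ProviderHealth:
    """Rolling window of recent call outcomes for one provider."""

    def __init__(self, name, window=100):
        self.name = name
        # Each entry is (succeeded, latency_seconds); old entries fall off the left.
        self.outcomes = deque(maxlen=window)

    def record(self, succeeded, latency):
        self.outcomes.append((succeeded, latency))

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok, _ in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def p95_latency(self):
        """Nearest-rank 95th percentile over successful calls; inf if there are none."""
        latencies = sorted(lat for ok, lat in self.outcomes if ok)
        if not latencies:
            return float("inf")
        index = max(0, math.ceil(0.95 * len(latencies)) - 1)
        return latencies[index]


def rank_providers(healths, max_error_rate=0.5):
    """Order providers healthiest first, dropping those above the error cutoff.

    If every provider is above the cutoff, all are kept so there is still
    something to try.
    """
    healthy = [h for h in healths if h.error_rate() <= max_error_rate]
    candidates = healthy or list(healths)
    return sorted(candidates, key=lambda h: (h.error_rate(), h.p95_latency()))
```

Each call site would record (succeeded, latency) after every request and ask rank_providers for the order to try next.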

Layer 3: LLM Fallback Strategy — Graceful Degradation

The most sophisticated error handling strategy is a cascading fallback:

Request → Retry (transient errors)
  → Model Degrade (switch to cheaper model in same provider)
    → Provider Failover (switch to different provider)
      → Self-Learning (record patterns for faster diagnosis)

This means your application rarely fails outright; it degrades gracefully:

  1. L1 — Smart Retry: 429 rate limit? Wait and retry with exponential backoff. Timeout? Retry once.
  2. L2 — Model Degrade: OpenAI GPT-4o keeps failing? Try GPT-4o-mini. Same API, lower cost, higher availability.
  3. L3 — Provider Failover: All OpenAI models failing? Switch to Anthropic Claude, then DeepSeek.
  4. L4 — Self-Learning: Record the failure pattern. Next time the same error appears, skip straight to the solution.
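
The first three tiers can be expressed as one ordered list of (provider, model) routes, with a small retry loop per route. The sketch below is a simplified illustration under stated assumptions: RetryableError and FatalError are hypothetical exception classes that your own call function would raise after classifying the provider's error, and the failures list is only a stand-in for the "record the failure pattern" step.

```python
import asyncio


class RetryableError(Exception):
    """Hypothetical: raised by call() for 429s, timeouts, and 5xx responses."""


class FatalError(Exception):
    """Hypothetical: raised by call() for errors retrying this route won't fix."""


async def cascade(prompt, routes, call, retries_per_route=2, base_delay=0.5,
                  sleep=asyncio.sleep):
    """Try each (provider, model) route in order, retrying transient errors.

    routes lists same-provider fallbacks first, so model degradation happens
    before provider failover.
    """
    failures = []  # record of what went wrong, kept for later diagnosis
    for provider, model in routes:
        for attempt in range(retries_per_route):
            try:
                return await call(provider, model, prompt)
            except RetryableError as exc:
                failures.append((provider, model, repr(exc)))
                if attempt < retries_per_route - 1:
                    await sleep(base_delay * (2 ** attempt))
            except FatalError as exc:
                failures.append((provider, model, repr(exc)))
                break  # skip remaining retries, move to the next route
    raise RuntimeError(f"All routes failed: {failures}")
```

A route list such as [("openai", "gpt-4o"), ("openai", "gpt-4o-mini"), ("anthropic", "claude-sonnet-4-20250514")] reproduces the retry, degrade, failover order described above.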

The Silent Failure Problem

There's a catch. Providers sometimes return 200 OK with garbage content — empty responses, "I cannot answer that" refusals, or JSON responses missing required fields. These are the most dangerous errors because your error handler thinks everything is fine.

A production fallback strategy must validate each response:

def validate_response(query, response, expected_schema=None):
    # is_refusal, validate_json_schema, and semantic_similarity are placeholders
    # for helpers you supply; response is assumed to expose its text as .content.
    checks = []
    # Check 1: Was it a refusal disguised as a normal response?
    checks.append(not is_refusal(response))
    # Check 2: Does JSON output have all required fields?
    if expected_schema:
        checks.append(validate_json_schema(response, expected_schema))
    # Check 3: Is the response semantically relevant to the query?
    checks.append(semantic_similarity(query, response) > 0.3)
    # Check 4: Is the response empty or boilerplate?
    checks.append(len(response.content) > 20)
    return all(checks)

If validation fails, treat it like a provider error — degrade or failover.
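
As a rough idea of what two of those checks might look like, here is a standard-library sketch. The names looks_like_refusal and missing_fields, and the REFUSAL_MARKERS phrases, are assumptions for illustration; a phrase list is a crude heuristic that will miss many refusals and can flag legitimate answers, and the field check is a simplified stand-in for full JSON Schema validation.

```python
import json

# Assumed marker phrases; tune these against refusals you actually observe.
REFUSAL_MARKERS = (
    "i cannot answer",
    "i can't help with",
    "i'm unable to",
    "i am unable to",
)


def looks_like_refusal(text):
    """Heuristic: does the start of the response contain a known refusal phrase?"""
    head = text.strip().lower()[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)


def missing_fields(text, required):
    """Return the required keys absent from a JSON object response.

    If the text is not valid JSON, or is not a JSON object, every required
    key is reported as missing.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return list(required)
    if not isinstance(data, dict):
        return list(required)
    return [key for key in required if key not in data]
```

A validator could treat the response as failed when looks_like_refusal returns True or missing_fields returns a non-empty list.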


Putting It All Together

Here's what production-ready LLM API error handling looks like with the NeuralBridge SDK (referred to as nb below), called from inside an async function:

engine = nb.SelfHealingEngine()

# Configure multiple providers
engine.add_provider("openai", models=["gpt-4o", "gpt-4o-mini"])
engine.add_provider("anthropic", models=["claude-sonnet-4-20250514"])
engine.add_provider("deepseek", models=["deepseek-v4-flash"])

# Enable all 4 tiers: retry → degrade → failover → learn
result = await engine.call(
    "Process this customer refund request",
    fallback_strategy="cascade"  # graceful degradation
)

When you call an LLM through this engine, it automatically:

  1. Retries on 429/500/timeout with smart backoff
  2. Degrades to a cheaper model under load
  3. Fails over to another provider when needed
  4. Validates every response for silent failures
  5. Learns from each failure to make future recovery faster

Summary

| Problem | Solution |
| --- | --- |
| 429 rate limits | Exponential backoff with jitter, honoring Retry-After |
| Provider down | Multi-provider failover with health-based routing |
| Silent failures | Response validation (refusal, schema, relevance, and length checks) |
| Production reliability | 4-tier cascading fallback strategy |

Don't write retry logic for every provider. Use a unified error handling SDK that handles all these cases in one import. Your code stays clean, your app stays up.


Built with NeuralBridge SDK — open-source Python LLM API error handling. One dependency, one line of code, zero gateways.
