Deep Mehta

Circuit Breakers for LLM APIs: Applying SRE Patterns to AI Infrastructure

You've shipped your AI feature. Users love it. Then OpenAI returns a 429 Too Many Requests at 2pm on a Tuesday and your entire product goes down.

If you've built anything on LLM APIs, you've felt this pain. The irony is that we solved this exact problem in distributed systems fifteen years ago. Circuit breakers, health checks, failover chains — these patterns are standard in every microservice architecture. But most LLM integrations today are just raw API calls with maybe a try/except block and a prayer.

I'm an SRE who spent the last decade building reliability into production systems at scale. When I started building AI products, I was shocked at how fragile the infrastructure layer was. So I applied what I knew.

This post covers the patterns that actually work.

The Problem: LLM APIs Are Unreliable by Design

LLM providers aren't like your typical REST API. They have unique failure modes:

  • Rate limits are aggressive and unpredictable. OpenAI's 429s can hit mid-conversation with no warning.
  • Latency variance is extreme. The same prompt might take 800ms or 12 seconds depending on load.
  • Outages are frequent. Every major provider has had multi-hour outages in the past year.
  • Degraded performance is silent. A model might respond but with noticeably worse quality during high load.

Most developers handle this with retry logic:

# The "prayer-based reliability" approach
import time

import openai

def chat_with_retries(messages):
    for attempt in range(3):
        try:
            return openai.chat.completions.create(
                model="gpt-4o",
                messages=messages,
            )
        except Exception:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s backoff
    raise Exception("All retries failed")

This is better than nothing, but it has serious problems. You're waiting through all retries before failing (potentially 15+ seconds of user-facing latency). You're hammering a provider that's already struggling. And you're not learning anything from the failures.
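If you do keep a retry layer, a strictly better baseline adds jitter and a hard latency budget, so the user never sits through the full retry ladder. A minimal sketch, where `call_api` is a stand-in for the real client call and the budget values are illustrative:

```python
import random
import time

def retry_with_deadline(call_api, max_attempts=3, base_delay=0.5, deadline_s=8.0):
    """Exponential backoff with full jitter, capped by a total latency budget."""
    start = time.monotonic()
    last_error = None
    for attempt in range(max_attempts):
        try:
            return call_api()
        except Exception as e:
            last_error = e
            if attempt == max_attempts - 1:
                break  # out of attempts
            # Full jitter: random delay up to base * 2^attempt, so a fleet of
            # clients doesn't retry in lockstep against a struggling provider
            delay = random.uniform(0, base_delay * (2 ** attempt))
            if time.monotonic() - start + delay > deadline_s:
                break  # don't blow the latency budget on sleeps
            time.sleep(delay)
    raise RuntimeError(f"all retries failed: {last_error!r}")
```

This still hammers-and-hopes; it just does so politely and with a bounded worst case. The patterns below fix the deeper problem.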

Pattern 1: The Circuit Breaker

The circuit breaker pattern comes from electrical engineering — when current exceeds safe levels, the breaker trips and stops the flow. In software, it means: if a service is failing, stop sending it traffic immediately instead of waiting for each request to time out.

Here's the state machine:

CLOSED (healthy)
  │
  ├── Request succeeds → stay CLOSED, reset failure count
  │
  └── Request fails → increment failure count
        │
        └── Failures >= threshold? → trip to OPEN
                                       │
OPEN (broken)                          │
  │                                    │
  ├── All requests → instant fail      │
  │   (no API call made)               │
  └── Cool-down timer expires?         │
        │                              │
        └── Move to HALF-OPEN ─────────┘
                │
HALF-OPEN (testing)
  │
  ├── Send ONE probe request
  │     │
  │     ├── Succeeds → back to CLOSED
  │     └── Fails → back to OPEN (reset timer)
  │
  └── All other requests → instant fail

In practice, this means:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cool_down_seconds=30):
        self.state = "CLOSED"
        self.failure_count = 0
        self.threshold = failure_threshold
        self.cool_down = cool_down_seconds
        self.last_failure_time = None

    def can_execute(self):
        if self.state == "CLOSED":
            return True

        if self.state == "OPEN":
            # Check if cool-down period has passed
            elapsed = time.time() - self.last_failure_time
            if elapsed >= self.cool_down:
                self.state = "HALF_OPEN"
                return True  # Allow one probe request
            return False  # Fail fast

        if self.state == "HALF_OPEN":
            return False  # Only one probe at a time

        return False

    def record_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.threshold:
            self.state = "OPEN"

The key insight: when a circuit is open, you fail instantly — zero latency wasted. Your user gets a fast fallback instead of watching a spinner for 30 seconds while retries pile up against a dead endpoint.
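To make the fail-fast behavior concrete, here is the same breaker logic driven through a simulated outage and recovery (restated inline so the snippet runs standalone; the tiny cool-down is just for the demo):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cool_down_seconds=30):
        self.state = "CLOSED"
        self.failure_count = 0
        self.threshold = failure_threshold
        self.cool_down = cool_down_seconds
        self.last_failure_time = None

    def can_execute(self):
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN":
            if time.time() - self.last_failure_time >= self.cool_down:
                self.state = "HALF_OPEN"
                return True  # allow one probe request
            return False  # fail fast, no API call made
        return False  # HALF_OPEN: a probe is already in flight

    def record_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.threshold:
            self.state = "OPEN"

# Simulated outage: three consecutive failures trip the breaker
breaker = CircuitBreaker(failure_threshold=3, cool_down_seconds=0.05)
for _ in range(3):
    breaker.record_failure()
print(breaker.state)          # OPEN
print(breaker.can_execute())  # False: requests fail instantly

time.sleep(0.06)              # cool-down elapses
print(breaker.can_execute())  # True: breaker moves to HALF_OPEN
breaker.record_success()      # the probe succeeded
print(breaker.state)          # CLOSED
```

Note that while the circuit is open, the provider gets zero traffic, which also gives it room to recover.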

Pattern 2: Failover Chains

A circuit breaker alone just tells you something is broken. You still need somewhere to send the traffic. That's where failover chains come in.

Instead of one model, you define an ordered list:

failover_chain = [
    {"model": "gpt-5.2", "provider": "openai"},
    {"model": "claude-sonnet-4.5", "provider": "anthropic"},
    {"model": "gemini-3-flash", "provider": "google"},
]

The routing logic walks the chain:

import time

async def route_with_failover(messages, chain):
    route_trace = []

    for i, target in enumerate(chain):
        model = target["model"]
        breaker = get_circuit_breaker(model)

        # Skip if circuit is open
        if not breaker.can_execute():
            route_trace.append({
                "model": model,
                "action": "skipped",
                "reason": "circuit_open"
            })
            continue

        try:
            start = time.monotonic()
            response = await call_model(model, messages)
            latency = time.monotonic() - start

            breaker.record_success()
            update_latency_tracker(model, latency)

            route_trace.append({
                "model": model,
                "action": "success",
                "latency_ms": int(latency * 1000)
            })

            return response, route_trace

        except RateLimitError:
            breaker.record_failure()
            route_trace.append({
                "model": model,
                "action": "failed",
                "reason": "rate_limit_429"
            })
            continue

        except Exception as e:
            breaker.record_failure()
            route_trace.append({
                "model": model,
                "action": "failed",
                "reason": str(e)
            })
            continue

    raise AllProvidersFailedError(route_trace)

The route trace is important — it gives you observability into what happened. A typical trace might look like:

{
  "trace": [
    {"model": "gpt-5.2", "action": "failed", "reason": "rate_limit_429"},
    {"model": "claude-sonnet-4.5", "action": "success", "latency_ms": 1847}
  ],
  "saved_time": "~12.4s vs waiting for rate limit reset"
}

Your user got a response in under 2 seconds instead of waiting for OpenAI's rate limit to reset (typically 20-60 seconds).
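The routing function above leans on two exception types it doesn't define: a provider-agnostic `RateLimitError` and an `AllProvidersFailedError` that carries the trace. Minimal stand-ins might look like this; in a real adapter layer you would translate each SDK's own rate-limit exception into the generic one:

```python
class RateLimitError(Exception):
    """Provider-agnostic 429; adapters map SDK-specific errors onto this."""

class AllProvidersFailedError(Exception):
    """Raised when every model in the chain failed; carries the route trace."""
    def __init__(self, route_trace):
        self.route_trace = route_trace
        super().__init__(f"all providers failed after {len(route_trace)} attempts")
```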

Pattern 3: Latency Tracking With Exponential Smoothing

Static failover chains aren't enough. Model performance changes throughout the day. You need to track which models are fast right now and route accordingly.

Simple averaging doesn't work because it weighs a latency spike from yesterday the same as current performance. Instead, use exponential smoothing:

def update_latency(model, new_latency_ms):
    alpha = 0.2  # Weight for new measurement
    current_avg = get_avg_latency(model)

    # Recent measurements matter more
    smoothed = (current_avg * (1 - alpha)) + (new_latency_ms * alpha)

    set_avg_latency(model, smoothed)
    return smoothed

With alpha = 0.2, the old average's weight decays to 0.8³ ≈ 0.51 after just three updates, so roughly half of the smoothed value reflects the last three measurements. Both recovery and degradation show up within a handful of requests, which is far more responsive than a long simple moving average.

You can use this to make routing decisions:

def select_fastest_healthy_model(chain):
    candidates = []
    for target in chain:
        model = target["model"]
        breaker = get_circuit_breaker(model)
        if breaker.can_execute():
            candidates.append({
                "model": model,
                "avg_latency": get_avg_latency(model)
            })

    # Sort by latency, pick fastest
    candidates.sort(key=lambda x: x["avg_latency"])
    return candidates[0]["model"] if candidates else None

Pattern 4: Multi-Strategy Routing

In production, you don't always want the fastest model. Sometimes you want the cheapest. Sometimes you need the most reliable. The routing strategy should be configurable per request or per use case.

Define weight vectors for different goals:

ROUTING_STRATEGIES = {
    "balanced":    {"success_rate": 0.4, "latency": 0.3, "cost": 0.3},
    "speed":       {"success_rate": 0.2, "latency": 0.6, "cost": 0.2},
    "cost":        {"success_rate": 0.2, "latency": 0.2, "cost": 0.6},
    "reliability": {"success_rate": 0.7, "latency": 0.2, "cost": 0.1},
}

def score_model(model, strategy="balanced"):
    weights = ROUTING_STRATEGIES[strategy]
    stats = get_model_stats(model)

    # Normalize metrics to [0, 1] (higher = better); max_latency and
    # max_cost are fleet-wide maxima tracked alongside the per-model stats
    inv_latency = 1.0 - (stats["avg_latency"] / max_latency)
    inv_cost = 1.0 - (stats["avg_cost"] / max_cost)

    score = (
        weights["success_rate"] * stats["success_rate"] +
        weights["latency"] * inv_latency +
        weights["cost"] * inv_cost
    )

    return score

This lets you set "optimization_goal": "cost" for a background batch job and "optimization_goal": "speed" for a real-time chat interface — using the same infrastructure.
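A self-contained sketch of the scoring over two hypothetical models shows the strategy flipping the decision (the model names, stats, and maxima here are made-up numbers; in production they come from the latency tracker and billing data):

```python
ROUTING_STRATEGIES = {
    "speed": {"success_rate": 0.2, "latency": 0.6, "cost": 0.2},
    "cost":  {"success_rate": 0.2, "latency": 0.2, "cost": 0.6},
}

# Hypothetical fleet stats
STATS = {
    "model-fast-pricey": {"success_rate": 0.98, "avg_latency": 900,  "avg_cost": 0.010},
    "model-slow-cheap":  {"success_rate": 0.97, "avg_latency": 2400, "avg_cost": 0.002},
}
MAX_LATENCY = 3000.0
MAX_COST = 0.012

def score_model(model, strategy):
    w = ROUTING_STRATEGIES[strategy]
    s = STATS[model]
    inv_latency = 1.0 - s["avg_latency"] / MAX_LATENCY
    inv_cost = 1.0 - s["avg_cost"] / MAX_COST
    return (w["success_rate"] * s["success_rate"]
            + w["latency"] * inv_latency
            + w["cost"] * inv_cost)

def pick(strategy):
    return max(STATS, key=lambda m: score_model(m, strategy))

print(pick("speed"))  # model-fast-pricey
print(pick("cost"))   # model-slow-cheap
```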

Pattern 5: Replay Testing

Here's the pattern most people skip, and it's arguably the most valuable: testing routing policy changes against historical traffic before deploying them.

The concept is simple. You log every request (model chosen, latency, cost, success/failure). When you want to change your routing policy — say, switching from "balanced" to "speed" or adding a new model to your chain — you replay recent traffic through the new policy and compare:

def replay_test(historical_requests, old_policy, new_policy):
    old_results = simulate(historical_requests, old_policy)
    new_results = simulate(historical_requests, new_policy)

    return {
        "cost_delta": new_results["total_cost"] - old_results["total_cost"],
        "latency_delta": new_results["avg_latency"] - old_results["avg_latency"],
        "success_rate_delta": (
            new_results["success_rate"] - old_results["success_rate"]
        ),
    }

This prevents the "I changed the routing and now everything is slower" surprise. You get confidence in the change before it hits production.
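What `simulate` looks like depends on how much you log. A minimal version, assuming each logged request recorded per-model outcomes (e.g. via shadow traffic or provider status sampling), just re-picks the model under the candidate policy and sums what that model actually did at the time:

```python
def simulate(historical_requests, policy):
    """policy: a function mapping a logged request to a model name."""
    total_cost = 0.0
    latencies = []
    successes = 0
    for req in historical_requests:
        model = policy(req)
        outcome = req["outcomes"][model]  # what that model did at the time
        total_cost += outcome["cost"]
        latencies.append(outcome["latency_ms"])
        successes += 1 if outcome["success"] else 0
    n = max(len(historical_requests), 1)
    return {
        "total_cost": total_cost,
        "avg_latency": sum(latencies) / n,
        "success_rate": successes / n,
    }

# Hypothetical log: two requests with outcomes recorded for both models
log = [
    {"outcomes": {"a": {"cost": 0.01,  "latency_ms": 900,  "success": True},
                  "b": {"cost": 0.002, "latency_ms": 2100, "success": True}}},
    {"outcomes": {"a": {"cost": 0.01,  "latency_ms": 950,  "success": True},
                  "b": {"cost": 0.002, "latency_ms": 2300, "success": False}}},
]

old_policy = lambda req: "a"
new_policy = lambda req: "b"
print(simulate(log, old_policy)["total_cost"])    # 0.02
print(simulate(log, new_policy)["success_rate"])  # 0.5
```

In this toy log, the "cheaper" policy would have cut cost by 80% but dropped the success rate to 50%, which is exactly the kind of trade-off you want surfaced before deploy.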

Putting It All Together

Here's what a production-ready LLM routing layer looks like when you combine these patterns:

Request comes in
    │
    ├── Check optimization goal (speed/cost/balanced/reliability)
    │
    ├── Score available models using strategy weights
    │   (filtered by circuit breaker state)
    │
    ├── Route to top-scored model
    │     │
    │     ├── Success → update latency tracker, return response
    │     │
    │     └── Failure → record failure, update circuit breaker
    │           │
    │           └── Try next model in failover chain
    │                 │
    │                 ├── Success → return with route trace
    │                 └── All failed → return error with full trace
    │
    └── Log everything for replay testing

Each request produces a route trace showing exactly what happened. Your monitoring dashboard can show which models are healthy, which circuits are open, and where your latency is going.

The Practical Takeaway

If you're building on LLM APIs and you're not using these patterns, you're one provider outage away from a bad day. The good news is that none of this is complicated — these are well-understood patterns that have been battle-tested in distributed systems for over a decade.

Start with the circuit breaker. It's the highest-impact, lowest-effort change. Then add a two-model failover chain. Then add latency tracking. Each layer compounds the reliability.

I built LLMWise specifically to give developers this entire reliability stack out of the box — circuit breakers, failover chains, latency-optimized routing, and replay testing — without building it from scratch. But whether you use a managed solution or build your own, these patterns should be part of any production LLM integration.

The full technical documentation covering all the algorithms in detail is at llmwise.ai/llms-full.txt if you want to dig deeper.


What reliability patterns are you using in your LLM integrations? I'd love to hear what's working (or breaking) in your setup.
