Stop calling a dead API. Shed load fast, recover automatically, and stay consistent across restarts with Redis-backed failure state.
## Why this matters
Every LLM-powered application depends on an external provider - OpenAI, Anthropic, Google, or a self-hosted model. These providers go down. Rate limits spike. Latency balloons. Without a circuit breaker, your application keeps sending requests into a black hole, burning through your budget, stacking up timeouts, and delivering a terrible experience to every user in the queue.
A circuit breaker detects that the downstream service is failing and stops trying for a cooldown period. This is not about retrying harder - it's about failing fast and deliberately so the rest of your system stays healthy.
## The problem
Without a circuit breaker: When your LLM provider starts returning 429s or 500s, every new user request still attempts a full API call. Each call waits for a timeout (often 30-60 seconds). Your concurrency pool fills up. Healthy requests get queued behind doomed ones. Your entire application appears frozen.
## Naive approach vs production approach
| Naive: retry and hope | Production: circuit breaker |
|---|---|
| Retry every failed request 3 times | Track failure count in a window |
| Log the error and move on | Trip open after N failures |
| No memory of past failures | Reject instantly while open |
| Each request rediscovers the outage | Probe with single test request |
| Timeouts pile up, pool exhausted | Close when probe succeeds |
## How I implemented it
I implemented a circuit breaker in the NL2SQL agent that wraps every LLM provider call. When the failure count within a sliding window exceeds a threshold, the breaker trips open and all subsequent requests return an error immediately - no API call, no timeout, no wasted concurrency slot.
```python
# Pseudo-code: circuit breaker wrapping an LLM call
# (simplified: the sliding window and single-flight probing are not shown)
import time

class CircuitOpenError(Exception):
    """Raised immediately while the breaker is open: no API call is made."""

class ProviderError(Exception):
    """Stand-in for whatever errors your LLM client raises."""

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown_sec=60):
        self.state = "CLOSED"
        self.failure_count = 0
        self.threshold = threshold
        self.cooldown_sec = cooldown_sec
        self.last_failure_time = None

    async def call(self, fn, *args):
        if self.state == "OPEN":
            if time.monotonic() - self.last_failure_time > self.cooldown_sec:
                self.state = "HALF_OPEN"  # cooldown elapsed: allow one probe
            else:
                raise CircuitOpenError("Provider unavailable")
        try:
            result = await fn(*args)
        except ProviderError:
            self._record_failure()
            raise
        if self.state == "HALF_OPEN":
            self.reset()  # probe succeeded: close the circuit
        return result

    def _record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.state == "HALF_OPEN" or self.failure_count >= self.threshold:
            self.state = "OPEN"  # a failed probe re-opens immediately

    def reset(self):
        self.state = "CLOSED"
        self.failure_count = 0
```
**Key design choice:** The circuit breaker state is stored in Redis, not in-process memory. This matters because in a multi-replica deployment, one replica discovering the outage should protect all replicas from burning through the same dead endpoint. Without shared state, each pod independently rediscovers the failure.
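To make the shared-state idea concrete, here is a minimal sketch. The key names and the `FakeRedis` class are illustrative stand-ins, not the actual implementation; in production you would point this at a real Redis client (e.g. redis-py) and add TTLs so stale counts expire.

```python
import time

class FakeRedis:
    """Dict-backed stand-in for a Redis client, just enough for this sketch.
    A real deployment would use redis-py and EXPIRE the failure counter."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def set(self, key, value):
        self.store[key] = value
    def incr(self, key):
        self.store[key] = int(self.store.get(key, 0)) + 1
        return self.store[key]

class SharedBreakerState:
    """Failure count and breaker state shared across replicas via Redis."""
    def __init__(self, client, name="llm"):
        self.r = client
        self.name = name
    def record_failure(self, threshold=5):
        count = self.r.incr(f"cb:{self.name}:failures")
        if count >= threshold:
            self.r.set(f"cb:{self.name}:state", "OPEN")
            self.r.set(f"cb:{self.name}:opened_at", time.monotonic())
        return count
    def is_open(self):
        return self.r.get(f"cb:{self.name}:state") == "OPEN"
```

The point of the shared keys: once one replica's failures push the counter past the threshold, every replica's `is_open()` check sees `OPEN` and fails fast without touching the dead endpoint.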
## Bug story: the in-process fallback
**Bug:** During local development, Redis wasn't always running. The circuit breaker tried to read state from Redis, failed, and threw an unhandled exception - crashing the entire request before it even reached the LLM provider.
**The fix:** Detect Redis connection failure and fall back to an in-process circuit breaker with the same interface. This is a classic example of a reliability mechanism introducing its own failure mode.
The lesson is important: every reliability layer must itself have a fallback. If your circuit breaker depends on Redis, and Redis is down, your circuit breaker shouldn't make things worse. The in-process fallback loses cross-replica consistency but keeps the application functional.
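One way to sketch that fallback is a small factory that tries Redis at startup and degrades to in-memory state when it can't connect. `RedisState` here is a hypothetical Redis-backed implementation (not shown); the redis-py package is assumed for the happy path, and any failure to reach it must never break the request path.

```python
class InProcessState:
    """In-memory breaker state with the same interface as the Redis-backed one.
    Loses cross-replica consistency, but keeps the application functional."""
    def __init__(self):
        self.failures = 0
        self.open = False
    def record_failure(self, threshold=5):
        self.failures += 1
        if self.failures >= threshold:
            self.open = True
    def is_open(self):
        return self.open

def make_breaker_state(redis_url=None):
    """Prefer shared Redis state; fall back to in-process if Redis is unreachable."""
    if redis_url:
        try:
            import redis  # assumes the redis-py package is installed
            client = redis.Redis.from_url(redis_url, socket_connect_timeout=0.5)
            client.ping()  # fail fast if Redis is down
            return RedisState(client)  # hypothetical Redis-backed implementation
        except Exception:
            pass  # the reliability layer must not add its own failure mode
    return InProcessState()
```

The short `socket_connect_timeout` matters: without it, a down Redis turns every breaker check into its own slow timeout, which is exactly the behavior the breaker exists to prevent.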
## Generalized lesson
Circuit breakers aren't specific to LLM applications. They appear anywhere you call an external service that can fail: payment processors, search indices, notification services, databases. The pattern is the same:
**The general pattern:** Track failures within a window. When failures cross a threshold, stop calling the service. After a cooldown, send one probe. If it works, resume. If it doesn't, extend the cooldown. Always degrade gracefully - never let a dead dependency take down your entire system.
## How to apply in other projects
If you're wrapping any external API call, you can introduce a circuit breaker in three steps. First, wrap the call in a try/except that increments a failure counter. Second, before each call, check the counter - if it's above your threshold and the cooldown hasn't elapsed, return an error immediately. Third, after the cooldown, allow one request through and reset if it succeeds.
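Those three steps fit in a few dozen lines. A minimal synchronous sketch (names are illustrative; the injectable clock exists only to make the cooldown testable):

```python
import time

class Guard:
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failures = 0
        self.opened_at = None
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so tests can fake the passage of time

    def call(self, fn):
        # Step 2: check the counter before calling
        if self.failures >= self.threshold:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            # Step 3: cooldown elapsed, let this one probe through
        try:
            result = fn()
        except Exception:
            # Step 1: increment the failure counter
            self.failures += 1
            self.opened_at = self.clock()
            raise
        self.failures = 0  # success (including a successful probe) resets the breaker
        return result
```

This is deliberately simpler than a production breaker - no sliding window, no distinct half-open state - but it captures the core behavior: fail fast while the counter is over threshold, probe once after the cooldown.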
For single-process applications, an in-memory counter is sufficient. For distributed systems, shared state lets replicas coordinate breaker behavior: one replica's discovery of an outage protects the rest. Redis is a common choice, and a database-backed approach can also work. Per-instance breakers remain sufficient for many deployments, depending on traffic shape and failure tolerance.
## Common mistakes
**No cooldown backoff.** A fixed 60-second cooldown means the breaker reopens and gets punched again immediately during a sustained outage. Use exponential backoff on the cooldown duration.
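The backoff itself is a one-liner. A sketch with a cap and jitter (the parameter values are illustrative defaults, not recommendations):

```python
import random

def next_cooldown(base=30.0, consecutive_trips=0, cap=600.0, jitter=0.1):
    """Cooldown doubles with each consecutive failed probe, capped at `cap`.
    Jitter spreads probe times so replicas don't all hammer the dependency
    at the same instant when it recovers."""
    cooldown = min(base * (2 ** consecutive_trips), cap)
    return cooldown * (1 + random.uniform(-jitter, jitter))
```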
**Counting all errors equally.** A 429 (rate limit) is different from a 500 (server error). Rate limits often clear within seconds - tripping a 60-second breaker for a 429 is overkill. Differentiate transient vs persistent failures.
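A simple classifier makes this concrete. The mapping below is illustrative - tune it to your provider's actual error semantics:

```python
def should_trip(status_code):
    """Decide whether an error response should count toward tripping the breaker."""
    if status_code == 429:
        return False  # rate limit: back off briefly with retries, don't trip
    if status_code in (500, 502, 503, 504):
        return True   # server-side failure: evidence the provider is down
    if 400 <= status_code < 500:
        return False  # client error: retrying won't help, but the provider is up
    return False
```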
**Forgetting the fallback for the breaker itself.** If your circuit breaker state lives in Redis and Redis goes down, you have two things broken instead of one. Always have an in-process fallback.
## Notes / production caveats
This post focuses on the pattern, not a fully hardened implementation. The pseudo-code is intentionally simplified: it does not show a true sliding window, concurrency control, single-flight probing in half-open, backoff strategy, or differentiated handling for different error classes.
A few practical caveats are worth calling out:
- Shared Redis-backed state is useful in multi-replica systems, but half-open coordination needs care. Without guardrails, multiple replicas can probe the dependency at once and create noisy recovery behavior.
- Redis is one valid production design, not the only one. Many systems work well with per-instance breakers combined with load-shedding, jittered retries, and strict client-side timeouts.
- For distributed coordination, Redis is a practical option. A database-backed approach can also work in some systems, but a shared file is usually not a serious production coordination mechanism.
- Failing fast should usually be paired with a fallback path: degraded mode, cached responses, queueing, or explicit messaging that the provider is temporarily unavailable.
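For the half-open coordination caveat above, the per-process version of a single-flight probe is a non-blocking lock; in a multi-replica setup the same idea maps to an atomic Redis `SET NX` with a short TTL so a crashed prober can't hold the gate forever. A minimal per-process sketch:

```python
import threading

class ProbeGate:
    """Lets exactly one caller probe a half-open circuit at a time.
    Other callers keep failing fast instead of piling onto the recovery probe."""
    def __init__(self):
        self._lock = threading.Lock()
    def try_acquire(self):
        return self._lock.acquire(blocking=False)
    def release(self):
        self._lock.release()
```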
I'm still learning these reliability patterns by applying them in real projects. If you have suggestions, corrections, or better ways to think about this, I'd genuinely appreciate your feedback. Thank you!