I spent last Tuesday watching my AI pipeline melt down in slow motion.
It started with a transient 503 from the LLM API. My retry logic kicked in — that's fine, right? But the LLM was already struggling under load, and my retries just made it worse. Meanwhile, downstream services started timing out because they were waiting for my pipeline. By the time I killed the process, three different services had cascaded into failure.
The lesson: retries without circuit breakers are just amplified damage.
The anatomy of an AI pipeline failure
Here's what happened, step by step:
- The LLM API returned a 503 (service temporarily unavailable)
- My retry logic fired after 2 seconds
- The API was still down, returned another 503
- Retry #2 fired, then #3, then #4 — each one adding load to an already strained system
- My service's connection pool filled up
- Other endpoints started timing out
- The monitoring system flagged my service as unhealthy
- Kubernetes restarted the pod — losing in-flight requests
The 503 was the spark. But the fire was caused by my retry logic.
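To put rough numbers on that amplification, here is a back-of-the-envelope sketch. The function name and the request counts are made up for illustration; they are not measurements from the incident.

```python
def requests_during_outage(in_flight, retries_per_request):
    """Total calls that reach the API when every call fails and is retried.

    Each request makes 1 original attempt plus `retries_per_request` retries.
    """
    return in_flight * (1 + retries_per_request)


# Hypothetical load: 200 in-flight requests, 4 retries each.
baseline = requests_during_outage(in_flight=200, retries_per_request=0)
with_retries = requests_during_outage(in_flight=200, retries_per_request=4)
print(baseline, with_retries)  # 200 1000
```

Under those assumed numbers, the struggling API sees five times the traffic it would have seen with no retries at all.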
What a circuit breaker buys you
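A circuit breaker sits between your code and the API. It tracks failures and, once they cross a threshold (a failure rate or, in the simple version shown in this post, a count of consecutive failures), it "opens" the circuit: calls fail immediately without hitting the API until a recovery timeout has passed and a test request is allowed through.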
This gives three critical benefits:
Fast failure: When the circuit is open, requests fail in milliseconds instead of waiting for a 30-second timeout. Your users get an error response instantly rather than hanging.
API protection: By stopping retries when the API is already struggling, you prevent your client from becoming part of the problem. This is especially important with LLM APIs that queue requests — your retries are literally making the outage worse for everyone.
Graceful degradation: When the circuit opens, you can fall back to a simpler model, cached responses, or a user-friendly error message. The system doesn't break — it degrades.
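As a rough illustration of that degradation path, here is a minimal sketch. The names `CircuitOpenError` and `answer_with_fallback`, the dict-based cache, and the fallback message are all assumptions made for this example, not part of any particular library.

```python
class CircuitOpenError(RuntimeError):
    """Hypothetical error raised by the primary call when the breaker is open."""


DEFAULT_MESSAGE = "The assistant is temporarily unavailable. Please try again shortly."


def answer_with_fallback(prompt, primary, cache, message=DEFAULT_MESSAGE):
    """Try the primary model; degrade to a cached answer, then to a message.

    Returns (text, source) so callers can see which path produced the answer.
    """
    try:
        result = primary(prompt)
        cache[prompt] = result  # remember the last good answer
        return result, "primary"
    except CircuitOpenError:
        if prompt in cache:
            return cache[prompt], "cache"
        return message, "message"


# Demo with a stub primary that behaves as if the circuit is open.
def unavailable_primary(prompt):
    raise CircuitOpenError("circuit open")


cache = {"hello": "Hi there!"}
print(answer_with_fallback("hello", unavailable_primary, cache))
# ('Hi there!', 'cache')
print(answer_with_fallback("new question", unavailable_primary, cache)[1])
# message
```

A call to a smaller model could slot in as another step before the cache lookup; the shape stays the same.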
The implementation I use
Here's the pattern I've settled on after trying several approaches:
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation, requests flow through
    OPEN = "open"            # Circuit tripped, requests fail fast
    HALF_OPEN = "half_open"  # Testing if service recovered


class CircuitBreaker:
    # Note: not thread-safe. Guard with a lock if shared across threads.
    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 half_open_max_calls=1):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.recovery_timeout = recovery_timeout    # seconds to stay open before probing
        self.half_open_max_calls = half_open_max_calls
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self.half_open_calls = 0

    def can_execute(self):
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            # time.monotonic() is unaffected by system clock adjustments
            elapsed_ok = (self.last_failure_time is not None and
                          time.monotonic() - self.last_failure_time > self.recovery_timeout)
            if not elapsed_ok:
                return False
            self.state = CircuitState.HALF_OPEN
            self.half_open_calls = 0
        # HALF_OPEN: admit at most half_open_max_calls probe requests
        if self.half_open_calls < self.half_open_max_calls:
            self.half_open_calls += 1
            return True
        return False

    def record_success(self):
        self.failure_count = 0
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.CLOSED

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        # Trip on too many consecutive failures, or on any failed probe
        if (self.failure_count >= self.failure_threshold
                or self.state == CircuitState.HALF_OPEN):
            self.state = CircuitState.OPEN

    def get_state(self):
        return self.state.value
Usage is straightforward:
breaker = CircuitBreaker(
    failure_threshold=3,    # Trip after 3 consecutive failures
    recovery_timeout=60,    # Wait 60s before testing
    half_open_max_calls=1   # Test with one call
)


def call_llm(prompt):
    if not breaker.can_execute():
        raise RuntimeError("Circuit breaker open: service degraded")
    try:
        # llm_client stands in for whatever API client you use
        response = llm_client.complete(prompt)
        breaker.record_success()
        return response
    except Exception:
        breaker.record_failure()
        raise
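To see the state transitions without waiting a real 60 seconds, it can help to drive a breaker with a fake clock. The sketch below uses a condensed, hypothetical `TinyBreaker` with an injectable `clock` argument. It is a stand-in written for this walkthrough, not the class above, though it follows the same closed / open / half-open rules.

```python
class TinyBreaker:
    """Condensed consecutive-failure breaker with an injectable clock."""

    def __init__(self, failure_threshold, recovery_timeout, clock):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def can_execute(self):
        if self.state == "open" and self.clock() - self.opened_at >= self.recovery_timeout:
            self.state = "half_open"  # let exactly one probe call through
            return True
        return self.state == "closed"

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()


now = [0.0]  # fake clock, advanced by hand
breaker = TinyBreaker(failure_threshold=3, recovery_timeout=60, clock=lambda: now[0])

for _ in range(3):
    breaker.record_failure()
print(breaker.state, breaker.can_execute())  # open False

now[0] = 61.0                                # recovery timeout has passed
print(breaker.can_execute(), breaker.state)  # True half_open

breaker.record_failure()                     # the probe failed
print(breaker.state)                         # open

now[0] = 122.0
breaker.can_execute()                        # second probe is admitted
breaker.record_success()
print(breaker.state)                         # closed
```

Injecting the clock is a design choice for testability; in production the default would be a real monotonic clock.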
The retry strategy that doesn't make things worse
Circuit breakers handle the "should I retry?" question. But when you do retry, the strategy matters enormously.
Exponential backoff with jitter is essential. A fixed delay causes a "thundering herd" — all clients retry at the same time, overwhelming the recovering service. Random jitter spreads retries across time.
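import random


def wait_before_retry(attempt, base_delay=1.0, max_delay=60.0):
    """Return the number of seconds to wait before retry `attempt` (0-based).

    This only computes the delay; the caller is responsible for sleeping.
    """
    # Exponential growth, capped at max_delay
    delay = min(max_delay, base_delay * (2 ** attempt))
    # Up to 50% extra, so clients that failed together don't retry together.
    # The jitter is added on top of the cap, so the result can reach 1.5 * max_delay.
    jitter = random.uniform(0, delay * 0.5)
    return delay + jitter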
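Here is one way the delay calculation might be wired into an actual retry loop. This is a sketch under assumptions: `call_with_retries`, its `max_attempts` default, and the injectable `sleep` parameter are names invented for this example, and `backoff_delay` repeats the same formula as the helper above so the block runs on its own.

```python
import random
import time


def backoff_delay(attempt, base_delay=1.0, max_delay=60.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    delay = min(max_delay, base_delay * (2 ** attempt))
    return delay + random.uniform(0, delay * 0.5)


def call_with_retries(func, max_attempts=4, sleep=time.sleep):
    """Call func(); on failure, back off and try again, up to max_attempts calls."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt))


# Demo with a stub that fails twice, then succeeds. Delays are recorded
# instead of slept so the example runs instantly.
calls = {"n": 0}
delays = []


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("503")
    return "ok"


result = call_with_retries(flaky, sleep=delays.append)
print(result, calls["n"], len(delays))  # ok 3 2
```

In a real pipeline the function passed in would itself check the circuit breaker first, so that an open circuit ends the loop quickly instead of burning through the remaining attempts.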
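Never blindly retry write operations. If you're generating content, creating records, or triggering side effects, a retry can create duplicates, because a request that timed out on your side may still have succeeded on the server. If a write has to be retried, attach an idempotency key or add deduplication logic so the repeat is recognized as the same operation.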
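The idempotency-key idea can be sketched with an in-memory stand-in for the server. `FakeRecordStore` and its `create` method are hypothetical. Real APIs that support idempotency keys each define their own header or parameter for it, so check the documentation of the service you call.

```python
import uuid


class FakeRecordStore:
    """In-memory stand-in for a server that deduplicates by idempotency key."""

    def __init__(self):
        self.records = []
        self._seen = {}  # idempotency key -> id of the record it created

    def create(self, payload, idempotency_key):
        if idempotency_key in self._seen:
            # Replay of an earlier request: return the original result
            return self._seen[idempotency_key]
        self.records.append(payload)
        record_id = len(self.records)
        self._seen[idempotency_key] = record_id
        return record_id


store = FakeRecordStore()

# Generate the key once per logical operation and reuse it on every retry.
key = str(uuid.uuid4())

first = store.create({"title": "draft"}, idempotency_key=key)
retry = store.create({"title": "draft"}, idempotency_key=key)  # e.g. after a timeout
print(first == retry, len(store.records))  # True 1
```

The key point is where the key is generated: outside the retry loop, so every attempt for the same operation carries the same value.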
Cap the total retry time. Don't let retries consume your entire timeout budget. Reserve 20-30% of your timeout for the final attempt.
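One possible way to enforce that budget is sketched below. The function name, the `reserve_fraction` default of 0.25, and the injectable `clock` and `sleep` parameters are all assumptions for illustration. The sketch also only limits the waits between attempts, so each call still needs its own per-request timeout.

```python
import time


def retry_within_budget(func, total_timeout, delays, reserve_fraction=0.25,
                        clock=time.monotonic, sleep=time.sleep):
    """Retry func(), never sleeping into the slice reserved for the last attempt."""
    deadline = clock() + total_timeout
    reserve = total_timeout * reserve_fraction
    for delay in delays:
        try:
            return func()
        except Exception:
            # Shorten the wait if a full delay would eat into the reserve
            wait = min(delay, (deadline - clock()) - reserve)
            if wait <= 0:
                raise  # already inside the reserved window: give up
            sleep(wait)
    return func()  # final attempt; any error propagates to the caller


# Demo with a fake clock that only advances when we "sleep".
now = [0.0]
sleeps = []


def fake_sleep(seconds):
    sleeps.append(seconds)
    now[0] += seconds


def always_503():
    raise ConnectionError("503")


try:
    retry_within_budget(always_503, total_timeout=10, delays=[1, 2, 4, 8],
                        clock=lambda: now[0], sleep=fake_sleep)
except ConnectionError:
    pass

print(sleeps, now[0])  # [1, 2, 4, 0.5] 7.5
```

In the demo the last wait is cut from 8 seconds to 0.5, so the final attempt starts at 7.5 seconds with the reserved 2.5 seconds still available.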
What I'd do differently
I wish I'd implemented circuit breakers from day one. Instead, I spent three months debugging intermittent latency spikes that were actually symptoms of API calls running with no circuit breaker in front of them. The pattern is simple enough that there's no excuse for not having it in place.
The biggest misconception I had was that "the API will recover on its own." It does — but by the time it does, your connection pool is exhausted and your error rate is spiking. A circuit breaker gives the API room to recover without your client amplifying the problem.
When NOT to use circuit breakers
- Internal service calls with SLAs: If you control both sides of the call, fixing the root cause is better than hiding it behind a breaker.
- Fire-and-forget metrics: If a failed call just means "skip this metric," a breaker adds unnecessary complexity.
- Systems with built-in resilience: Some managed AI services (like the one at ai.interwestinfo.com) handle their own load balancing and queuing. In those cases, a client-side breaker may be redundant.
The bottom line
AI pipelines are distributed systems. They fail. The question isn't if they'll fail, but how they'll fail.
Circuit breakers don't prevent failures — they prevent failure cascades. And in a world where LLM APIs can be unreliable, that distinction is everything.
What's your approach to handling AI API failures? I'm curious what patterns others have found useful.