zhongqiyue

When AI Calls Go Wrong: Circuit Breakers, Retries, and Failing Smart

I spent last Tuesday watching my AI pipeline melt down in slow motion.

It started with a transient 503 from the LLM API. My retry logic kicked in — that's fine, right? But the LLM was already struggling under load, and my retries just made it worse. Meanwhile, downstream services started timing out because they were waiting for my pipeline. By the time I killed the process, three different services had cascaded into failure.

The lesson: retries without circuit breakers are just amplified damage.

The anatomy of an AI pipeline failure

Here's what happened, step by step:

  1. The LLM API returned a 503 (service temporarily unavailable)
  2. My retry logic fired after 2 seconds
  3. The API was still down, returned another 503
  4. Retry #2 fired, then #3, then #4 — each one adding load to an already strained system
  5. My service's connection pool filled up
  6. Other endpoints started timing out
  7. The monitoring system flagged my service as unhealthy
  8. Kubernetes restarted the pod — losing in-flight requests

The 503 was the spark. But the fire was caused by my retry logic.
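
For reference, the steps above come from a loop shaped roughly like the one below. This is a reconstruction of the anti-pattern, not the original pipeline code: the name `naive_retry`, the `call` argument, and the injectable `sleep` are assumptions made for illustration. The 2-second delay and four retries match the timeline above.

```python
import time

def naive_retry(call, max_retries=4, delay=2.0, sleep=time.sleep):
    """Anti-pattern: fixed-delay retries with no circuit breaker."""
    last_error = None
    for attempt in range(1 + max_retries):
        try:
            return call()
        except Exception as exc:  # retries on everything, including overload
            last_error = exc
            if attempt < max_retries:
                sleep(delay)  # same delay for every client, every time
    raise last_error
```

With four retries, one failed request becomes five requests against an API that is already struggling, and the caller holds its connection for at least 8 seconds of sleep while it waits.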

What a circuit breaker buys you

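A circuit breaker sits between your code and the API. It tracks failures and, once they cross a threshold (a failure rate or, as in the implementation below, a count of consecutive failures), it "opens" the circuit: all subsequent calls fail immediately without hitting the API at all. After a cooldown, it lets a trial request through to check whether the service has recovered.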

This gives three critical benefits:

Fast failure: When the circuit is open, requests fail in milliseconds instead of waiting for a 30-second timeout. Your users get an error response instantly rather than hanging.

API protection: By stopping retries when the API is already struggling, you prevent your client from becoming part of the problem. This is especially important with LLM APIs that queue requests — your retries are literally making the outage worse for everyone.

Graceful degradation: When the circuit opens, you can fall back to a simpler model, cached responses, or a user-friendly error message. The system doesn't break — it degrades.
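
A minimal sketch of that degradation path, under some assumptions: `CircuitOpenError`, `complete_with_fallback`, and the `primary`/`fallback` callables are hypothetical names, and the cache is a plain dict keyed by prompt. A real system would likely use its own client objects and cache layer.

```python
class CircuitOpenError(RuntimeError):
    """Hypothetical error raised when the breaker refuses a request."""

def complete_with_fallback(prompt, primary, fallback=None, cache=None):
    """Try the primary model, then a simpler model, then a cached answer."""
    try:
        return primary(prompt)
    except CircuitOpenError:
        pass  # primary is fenced off; degrade instead of failing outright

    if fallback is not None:
        try:
            return fallback(prompt)
        except Exception:
            pass  # the simpler model is best-effort

    if cache is not None and prompt in cache:
        return cache[prompt]

    # Last resort: a user-friendly message instead of a stack trace
    return "The assistant is temporarily unavailable. Please try again shortly."
```

Only the breaker-open case triggers the fallback chain here. Other errors from `primary` still propagate, so real bugs are not silently masked as degradation.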

The implementation I use

Here's the pattern I've settled on after trying several approaches:

import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation, requests flow through
    OPEN = "open"          # Circuit tripped, requests fail fast
    HALF_OPEN = "half_open"  # Testing if service recovered

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30, 
                 half_open_max_calls=1):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls

        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self.half_open_calls = 0

    def can_execute(self):
        if self.state == CircuitState.CLOSED:
            return True

        if self.state == CircuitState.OPEN:
            # Stay open until the recovery timeout has elapsed
            if (self.last_failure_time is None
                    or time.time() - self.last_failure_time <= self.recovery_timeout):
                return False
            self.state = CircuitState.HALF_OPEN
            self.half_open_calls = 0

        # HALF_OPEN: admit a limited number of trial calls, counting each one
        if self.half_open_calls < self.half_open_max_calls:
            self.half_open_calls += 1
            return True
        return False

    def record_success(self):
        self.failure_count = 0
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.CLOSED

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()

        # A failed trial call in HALF_OPEN reopens the circuit immediately;
        # otherwise trip once the consecutive-failure threshold is reached
        if (self.state == CircuitState.HALF_OPEN
                or self.failure_count >= self.failure_threshold):
            self.state = CircuitState.OPEN

    def get_state(self):
        return self.state.value

Usage is straightforward:

breaker = CircuitBreaker(
    failure_threshold=3,      # Trip after 3 failures
    recovery_timeout=60,      # Wait 60s before testing
    half_open_max_calls=1     # Test with one call
)

def call_llm(prompt):
    if not breaker.can_execute():
        raise RuntimeError("Circuit breaker open — service degraded")

    try:
        response = llm_client.complete(prompt)
        breaker.record_success()
        return response
    except Exception:
        breaker.record_failure()
        raise
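
If several functions call the same API, the check/try/record pattern above can be factored into a decorator. This is one possible sketch, not part of the implementation above: `with_breaker` is a hypothetical helper, and it assumes only that the breaker object exposes `can_execute()`, `record_success()`, and `record_failure()`.

```python
import functools

def with_breaker(breaker):
    """Wrap a function so every call goes through `breaker`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not breaker.can_execute():
                raise RuntimeError("Circuit breaker open: service degraded")
            try:
                result = fn(*args, **kwargs)
            except Exception:
                breaker.record_failure()
                raise
            breaker.record_success()
            return result
        return wrapper
    return decorator
```

With this in place, `call_llm` shrinks to a function decorated with `@with_breaker(breaker)` whose body is just the `llm_client.complete(prompt)` call.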

The retry strategy that doesn't make things worse

Circuit breakers handle the "should I retry?" question. But when you do retry, the strategy matters enormously.

Exponential backoff with jitter is essential. A fixed delay causes a "thundering herd" — all clients retry at the same time, overwhelming the recovering service. Random jitter spreads retries across time.

import random

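def wait_before_retry(attempt, base_delay=1.0, max_delay=60.0):
    """Return how many seconds to sleep before retry number `attempt` (0-based)."""
    # Exponential growth, capped at max_delay
    delay = min(max_delay, base_delay * (2 ** attempt))
    # Up to 50% extra random jitter, so the result can reach 1.5 * max_delay
    jitter = random.uniform(0, delay * 0.5)
    return delay + jitter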

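Never blindly retry write operations. If you're generating content, creating records, or triggering side effects, a retry can create duplicates, because the first request may have succeeded even though the response never reached you. Retry writes only when an idempotency key or deduplication logic makes the repeat safe.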
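
A rough sketch of the idempotency-key idea, with an in-memory dict standing in for whatever deduplication the receiving service actually provides. `DedupStore`, `execute_once`, and `new_idempotency_key` are hypothetical names for illustration, not a real API.

```python
import uuid

class DedupStore:
    """In-memory stand-in for server-side deduplication (illustrative only)."""

    def __init__(self):
        self._results = {}

    def execute_once(self, key, operation):
        # Replay the stored result instead of repeating the side effect.
        # If the operation raises, nothing is stored, so a retry runs it again.
        if key not in self._results:
            self._results[key] = operation()
        return self._results[key]

def new_idempotency_key():
    # Generate once per logical operation, then reuse it on every retry
    return str(uuid.uuid4())
```

The key point is that the key is created before the first attempt and reused unchanged on every retry. Generating a fresh key per attempt would defeat the deduplication.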

Cap the total retry time. Don't let retries consume your entire timeout budget. Reserve 20-30% of your timeout for the final attempt.
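
One way to enforce that budget is to track a deadline and stop scheduling retries once the reserved share would be eaten into. The sketch below is an assumption-laden illustration: `retry_with_budget`, `reserve_fraction`, and the injectable `clock` and `sleep` are hypothetical, and `call` is assumed to accept the remaining time in seconds as its timeout.

```python
import random
import time

def retry_with_budget(call, total_timeout=30.0, reserve_fraction=0.25,
                      base_delay=1.0, max_delay=60.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Retry `call` with jittered backoff until the retry budget is spent."""
    deadline = clock() + total_timeout
    # No attempt may start after this point, so the last attempt always
    # has at least reserve_fraction of the total timeout left to run
    retry_cutoff = deadline - total_timeout * reserve_fraction
    attempt = 0
    while True:
        try:
            return call(deadline - clock())  # remaining time as the timeout
        except Exception:
            delay = min(max_delay, base_delay * (2 ** attempt))
            delay += random.uniform(0, delay * 0.5)
            if clock() + delay > retry_cutoff:
                raise  # budget spent: surface the last error
            sleep(delay)
            attempt += 1
```

With the defaults, a 30-second budget stops scheduling new attempts after 22.5 seconds, which leaves the final attempt at least 7.5 seconds to complete.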

What I'd do differently

I wish I'd implemented circuit breakers from day one. Instead, I spent three months debugging intermittent latency spikes that were actually symptoms of API calls with no circuit breaker in front of them. The pattern is simple enough that there's no excuse for not having it in place.

The biggest misconception I had was that "the API will recover on its own." It does — but by the time it does, your connection pool is exhausted and your error rate is spiking. A circuit breaker gives the API room to recover without your client amplifying the problem.

When NOT to use circuit breakers

  • Internal service calls with SLAs: If you control both sides of the call, fixing the root cause is better than hiding it behind a breaker.
  • Fire-and-forget metrics: If a failed call just means "skip this metric," a breaker adds unnecessary complexity.
  • Systems with built-in resilience: Some managed AI services (like the one at ai.interwestinfo.com) handle their own load balancing and queuing. In those cases, a client-side breaker may be redundant.

The bottom line

AI pipelines are distributed systems. They fail. The question isn't if they'll fail, but how they'll fail.

Circuit breakers don't prevent failures — they prevent failure cascades. And in a world where LLM APIs can be unreliable, that distinction is everything.

What's your approach to handling AI API failures? I'm curious what patterns others have found useful.
