Mukunda Rao Katta

Posted on May 25

llm-circuit-breaker-py: Open the Circuit Before Your Agent Hammers a Down Provider

#hermeschallenge #ai #python #agents

I had a batch job running overnight. Forty tasks, each one calling Claude to summarize a document. The run started at midnight. By 2am, Anthropic had a partial outage. My code had retry logic: three attempts, exponential backoff. That sounds responsible.

What actually happened was this: every task hit the endpoint, got a timeout, waited 30 seconds, retried, waited 60 seconds, retried again, waited 120 seconds. Each task held a thread open for the full retry budget. With 40 concurrent tasks, that meant 40 threads sitting open for up to three minutes each. The process ran out of memory before the API recovered. I came in the next morning to a dead process, zero completed tasks, and a bill for all the API calls that timed out.

The problem was not the retry logic. Retry logic is correct. The problem was that the retry logic had no way to know the circuit was broken. A circuit breaker solves exactly this: after enough failures, stop touching the endpoint entirely. Let it recover. Probe once. Resume when it works. llm-circuit-breaker-py is that circuit breaker for Python, with sync and async support, zero dependencies, and thread-safe state tracking.

Shape of the fix

from llm_circuit_breaker_py import CircuitBreaker, CircuitOpenError

breaker = CircuitBreaker(
    failure_threshold=5,      # open after 5 consecutive failures
    recovery_timeout=60.0,    # wait 60s before probing in HalfOpen
    success_threshold=1,      # close again after 1 successful probe
)

# Sync usage
try:
    result = breaker.call(lambda: claude_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    ))
except CircuitOpenError as e:
    print(f"Circuit is open. Retry after {e.retry_after:.0f}s")
    result = serve_fallback()

# Async usage: same config, different method
try:
    result = await breaker.async_call(lambda: async_claude_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    ))
except CircuitOpenError as e:
    print(f"Circuit is open. Retry after {e.retry_after:.0f}s")
    result = serve_fallback()

# Query state directly without making a call
if breaker.is_open():
    return cached_result_or_fallback()

# Read current state and failure count for metrics
state = breaker.state        # "closed", "open", or "half_open"
failures = breaker.failures  # consecutive failure count

The CircuitOpenError includes a retry_after field in seconds. You can surface that to callers or use it to schedule a retry attempt instead of immediately calling a fallback.

What it does NOT do

It does not retry calls for you. That is a separate concern. llm-retry-py handles exponential backoff and per-call retries. The circuit breaker sits above the retry layer: once retries are exhausted and counted as a failure, the circuit tracks that failure. If you want retry plus circuit breaking, wrap the retried call in the breaker, not the other way around. It also does not implement per-provider error classification. It counts whatever exceptions your wrapped function raises. You decide which exceptions mean "the endpoint is broken" by catching provider-specific exceptions before they reach the breaker or by passing them through. It does not do distributed state: the breaker is in-process only. Two processes each get their own independent circuit.

Inside the lib

The state machine has three states: Closed, Open, and HalfOpen. In Closed state, calls go through and failures are counted. When the consecutive failure count reaches failure_threshold, the breaker transitions to Open. In Open state, every call raises CircuitOpenError immediately without touching the wrapped function. No thread held, no timeout waited, no memory ballooned.

After recovery_timeout seconds have passed since the circuit opened, the next incoming call transitions the breaker to HalfOpen and lets that call through as a probe. If the probe succeeds, the breaker returns to Closed and resets the failure count. If the probe fails, it returns to Open and resets the recovery timer. The success_threshold parameter controls how many consecutive successes in HalfOpen are required before the breaker fully closes. Default is 1.

Thread safety is handled with a threading.Lock that is held only during state reads and transitions, not during the actual wrapped call. The wrapped call runs outside the lock so that a long-running API call does not block other threads from reading circuit state. For async usage, a separate asyncio.Lock is used in async_call. The two variants do not share a lock, so you can safely mix sync and async calls against the same breaker instance if your workload calls for it.

23 tests cover: closed-to-open transition at threshold, open fast-fail behavior, HalfOpen probe success and failure paths, recovery timeout expiry, retry_after accuracy, async state transitions, thread-safe concurrent access under a ThreadPoolExecutor, and the is_open() / state inspection API.

When useful

Agent loops with retry logic that are currently burning threads and memory during sustained outages
Batch jobs where "fail fast and move on" is better than "retry and wait" when the provider is down
Multi-tenant services where one broken provider should not cause timeout queuing across all tenants
Any code that polls is_open() to route around a known-broken endpoint to a cached result or secondary provider
Situations where you want retry_after exposed to callers so they can schedule their own retry rather than blocking

When not useful

One-off scripts: the state machine is pointless if you are making one API call
Cases where failures are random and independent rather than correlated: a circuit breaker trips on patterns, and random independent failures will false-trip it
Workloads where you need per-endpoint circuits with different thresholds: you can instantiate multiple breakers, but this library does not manage a registry of them
Distributed deployments where circuit state needs to be shared across processes: use a distributed store for that

Install

pip install llm-circuit-breaker-py

Zero dependencies. Python 3.9+. No provider SDK required. Works with any callable that raises exceptions on failure.

Siblings

Library	Language	What it does
llm-circuit-breaker	Rust	Original Rust implementation with the same Closed/Open/HalfOpen model
llm-retry-py	Python	Exponential backoff with full jitter, sits below the circuit breaker
llm-fallback-router	Python	Route to a secondary provider when the primary circuit opens
agent-deadline	Python	Cooperative per-task time deadline, pairs with circuit breaking
llm-stop-conditions	Python	Composable conditions for ending an agent loop, including on circuit open
llm-fallback-chain	Python	Ordered failover across multiple providers after circuit trips

What's next

The two obvious additions are state-change callbacks (on_open, on_close, on_half_open) and a sliding-window failure rate instead of consecutive failure count. Callbacks would make it easy to emit a metric or alert when the circuit opens in production. You would call your logging or metrics function inside on_open and then know in real time how long each outage window lasted. Sliding window handles soft degradation better than consecutive count does: if 60% of requests fail but not in a strict sequence, consecutive count never trips, but sliding window will. For now, consecutive count ships because it is simpler to reason about and it handles the hard-outage scenario that caused this library to exist. The sliding-window variant is the next branch.

Part of the Hermes Agent Challenge sprint. Source at github.com/MukundaKatta/llm-circuit-breaker-py. PyPI: pip install llm-circuit-breaker-py.

DEV Community