Mukunda Rao Katta

Posted on May 25

My LLM provider went down for 11 minutes. My code spent 4 of them in connect timeouts.

#hermeschallenge #ai #llm #rust

Last month Anthropic had an 11-minute regional incident. My agent kept calling. Every call sat in connect timeout for 30 seconds before failing. By the time my retry logic gave up, the agent had spent four minutes of those eleven just waiting for sockets to close.

That is not a model outage problem. That is my problem. The model was clearly down. I had no reason to keep calling.

A circuit breaker fixes this. Once you see N consecutive failures, you stop trying for a while. The next call returns instantly with a "circuit is open" error instead of opening another doomed connection. After a timeout, you let one call through to test. If that one succeeds, you close the circuit again and normal traffic resumes. If not, you wait again.

llm-circuit-breaker is the small Rust crate I wrote that does exactly that. It is on crates.io as llm-circuit-breaker. The full surface is three states and two thresholds.

The shape of the fix

use llm_circuit_breaker::{BreakerError, CircuitBreaker, Config, State};
use std::time::Duration;

let cb = CircuitBreaker::new(Config {
    failure_threshold: 5,        // 5 consecutive failures opens the circuit
    timeout: Duration::from_secs(30),  // stay open 30 seconds
    half_open_max_calls: 1,      // let 1 call through to test
});

// In your call path
match cb.call(|| call_claude(prompt)) {
    Ok(response) => use_response(response),
    Err(BreakerError::Open) => return fallback_or_fail_fast(),
    Err(BreakerError::Inner(e)) => log_and_retry(e),
}

That is the whole integration. One wrap around the existing call.

The three states

Closed is the normal state. Calls pass through. Failures get counted. Hit the threshold and the circuit transitions to Open.

Open rejects calls immediately with BreakerError::Open. No network call, no socket wait, no token cost on a request that was always going to fail. After the configured timeout elapses, the circuit transitions to HalfOpen.

HalfOpen lets a small number of calls through (default one). If they succeed, transition back to Closed. If any fail, transition back to Open for another timeout.

The whole state machine is a few lines.
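To make the transitions concrete, here is a minimal sketch of such a state machine. It is not the crate's actual source. MiniBreaker, allow, on_success and on_failure are names I made up for illustration, and the sketch assumes a single-threaded caller, so the half-open probe limit is not tracked separately.

```rust
use std::time::{Duration, Instant};

/// The three states described above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Closed,
    Open,
    HalfOpen,
}

/// Hypothetical minimal breaker, for illustration only.
pub struct MiniBreaker {
    state: State,
    consecutive_failures: u32,
    failure_threshold: u32,
    timeout: Duration,
    opened_at: Option<Instant>,
}

impl MiniBreaker {
    pub fn new(failure_threshold: u32, timeout: Duration) -> Self {
        Self {
            state: State::Closed,
            consecutive_failures: 0,
            failure_threshold,
            timeout,
            opened_at: None,
        }
    }

    /// Returns true if a call may go out at time `now`.
    /// An Open circuit moves to HalfOpen once the timeout has elapsed.
    pub fn allow(&mut self, now: Instant) -> bool {
        if self.state == State::Open {
            match self.opened_at {
                Some(opened) if now.duration_since(opened) >= self.timeout => {
                    self.state = State::HalfOpen;
                }
                _ => return false, // still open: reject without any network call
            }
        }
        true
    }

    /// Any success closes the circuit and resets the failure count.
    pub fn on_success(&mut self) {
        self.consecutive_failures = 0;
        self.state = State::Closed;
        self.opened_at = None;
    }

    /// A failure in HalfOpen re-opens immediately; in Closed it is counted.
    pub fn on_failure(&mut self, now: Instant) {
        match self.state {
            State::HalfOpen => self.trip(now),
            State::Closed => {
                self.consecutive_failures += 1;
                if self.consecutive_failures >= self.failure_threshold {
                    self.trip(now);
                }
            }
            State::Open => {}
        }
    }

    fn trip(&mut self, now: Instant) {
        self.state = State::Open;
        self.opened_at = Some(now);
    }

    pub fn state(&self) -> State {
        self.state
    }
}
```

The caller asks allow before each call and reports the result afterwards. Passing the clock in as an argument keeps the sketch easy to test without sleeping.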

What it does NOT do

It is not a retry library. If you want to retry transient failures, use llm-retry alongside this. Circuit breaker is about giving up faster on persistent failures, not about persisting through transient ones.
It is not a rate limiter. If you want to throttle calls per second, use token-budget-pool or a separate semaphore.
It is not provider-aware. If a 429 from Anthropic should open the circuit but a 429 from OpenAI should retry, you express that with two different CircuitBreaker instances.
It is not a fallback chain. If a circuit opens and you want to fall through to a backup provider, see llm-fallback-router.
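The boundary between the breaker and a fallback layer can be sketched in a few lines. This is an illustration under assumptions, not how llm-fallback-router works. The BreakerError enum below is a local stand-in that mirrors the two variants used earlier, and with_backup is a hypothetical helper.

```rust
/// Local stand-in mirroring the two error variants used in the snippet above.
#[derive(Debug, PartialEq)]
pub enum BreakerError<E> {
    Open,
    Inner(E),
}

/// Hypothetical helper: only an open circuit triggers the backup call.
/// Inner errors are handed back so a retry layer can decide what to do.
pub fn with_backup<T, E>(
    primary: impl FnOnce() -> Result<T, BreakerError<E>>,
    backup: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    match primary() {
        Ok(value) => Ok(value),
        // The breaker rejected the call without touching the network.
        Err(BreakerError::Open) => backup(),
        // The call went out and failed: not the fallback's decision.
        Err(BreakerError::Inner(e)) => Err(e),
    }
}
```

The point of the split is that each layer reacts to one signal. The breaker decides when to stop calling, and whatever sits above it decides where to go instead.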

Inside the lib: one design choice worth showing

The hard part of a circuit breaker is not the state machine. It is what counts as a failure.

A 400 Bad Request from the model provider is your fault. The model is fine. Counting that as a circuit-breaker failure means a malformed prompt opens the circuit and blocks every other call site. That is wrong.

A 502 Bad Gateway or a connect timeout is the provider's fault. Counting those is exactly what the breaker is for.

The crate's answer is a failure_classifier callback: a boxed closure that takes &E and returns bool. You decide which errors count.

let cb = CircuitBreaker::new(Config {
    failure_threshold: 5,
    timeout: Duration::from_secs(30),
    half_open_max_calls: 1,
    failure_classifier: Box::new(|e: &MyError| match e {
        MyError::HttpStatus(s) => *s >= 500 || *s == 429,
        MyError::Timeout => true,
        MyError::Connect => true,
        MyError::BadRequest(_) => false,  // caller's fault, don't trip
        MyError::AuthFailed => false,     // we are misconfigured, not the provider
    }),
});

A 400 returned from the inner closure does not count toward the failure threshold. A 502, a timeout, a connect error, or a 429 does. The breaker trips on the right signal.
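Here is a self-contained sketch of how a classifier can gate the failure count. MyError mirrors the variants in the snippet above, while is_provider_failure and FailureCounter are hypothetical names and not part of the crate. One assumption to flag: in this sketch an unclassified error neither increments nor resets the count. The crate may handle that case differently.

```rust
/// Hypothetical error type mirroring the variants used above.
#[derive(Debug)]
pub enum MyError {
    HttpStatus(u16),
    Timeout,
    Connect,
    BadRequest(String),
    AuthFailed,
}

/// Returns true when the error should count against the provider.
pub fn is_provider_failure(e: &MyError) -> bool {
    match e {
        MyError::HttpStatus(s) => *s >= 500 || *s == 429,
        MyError::Timeout | MyError::Connect => true,
        MyError::BadRequest(_) | MyError::AuthFailed => false,
    }
}

/// Counts consecutive classified failures (illustrative, not the crate's type).
pub struct FailureCounter {
    classifier: Box<dyn Fn(&MyError) -> bool>,
    consecutive: u32,
    threshold: u32,
}

impl FailureCounter {
    pub fn new(threshold: u32, classifier: Box<dyn Fn(&MyError) -> bool>) -> Self {
        Self { classifier, consecutive: 0, threshold }
    }

    /// Records one call result. Returns true once the threshold is reached.
    pub fn record<T>(&mut self, result: &Result<T, MyError>) -> bool {
        match result {
            Ok(_) => self.consecutive = 0,
            Err(e) if (self.classifier)(e) => self.consecutive += 1,
            // Assumption: caller-side errors leave the count untouched.
            Err(_) => {}
        }
        self.consecutive >= self.threshold
    }
}
```

With a threshold of 2, a BadRequest followed by a 502 and a Timeout trips on the Timeout, not on the 502. The malformed prompt never enters the count.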

When this is useful

You call an LLM provider from a hot path and a regional outage would otherwise burn budget on timeouts.
You wrap an MCP tool that occasionally goes down and want to fail fast instead of holding up the agent loop.
You run a batch agent over thousands of items and want a provider outage to fail the remaining items fast rather than retrying each one.
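The batch case is the easiest one to picture in code. The sketch below is a simplified illustration, not the crate's API. Outcome and run_batch are hypothetical names, and the sketch never half-opens, so once the threshold is hit every remaining item is skipped without a call.

```rust
/// What happened to each item in the batch (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Outcome<T> {
    Done(T),
    Failed,
    Skipped, // circuit was open, so no call was made
}

/// Runs `call` over `items`, skipping the rest after
/// `failure_threshold` consecutive failures.
pub fn run_batch<I, T, E>(
    items: &[I],
    failure_threshold: u32,
    mut call: impl FnMut(&I) -> Result<T, E>,
) -> Vec<Outcome<T>> {
    let mut consecutive = 0u32;
    let mut out = Vec::with_capacity(items.len());
    for item in items {
        if consecutive >= failure_threshold {
            // Fail fast: no socket, no timeout, no retry.
            out.push(Outcome::Skipped);
            continue;
        }
        match call(item) {
            Ok(value) => {
                consecutive = 0;
                out.push(Outcome::Done(value));
            }
            Err(_) => {
                consecutive += 1;
                out.push(Outcome::Failed);
            }
        }
    }
    out
}
```

With thousands of items and a 30 second connect timeout, the difference between Skipped and Failed is the difference between microseconds and half a minute per item.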

When this is NOT what you want

For a single-call CLI where one failure is fatal anyway. The breaker has no state to accumulate.
For an interactive chat where you would rather show a spinner than a fast-fail error. The breaker is about saving time and money on doomed calls, not about UX.

Install

[dependencies]
llm-circuit-breaker = "0.1"

Repo: https://github.com/MukundaKatta/llm-circuit-breaker

Sibling libraries

Lib	Boundary	Repo
llm-circuit-breaker	Fail fast on persistent failure	this repo
llm-retry	Retry transient failures with backoff	https://github.com/MukundaKatta/llm-retry
llm-fallback-router	Switch providers when one fails	https://github.com/MukundaKatta/llm-fallback-router
token-budget-pool	Concurrent token + USD budget cap	https://github.com/MukundaKatta/token-budget-pool

What's next

A built-in tokio interceptor so users do not have to write the wrap themselves for the common case. Already prototyped. Will land in v0.2 once I have a clean trait for the wrapped future.
