Last month Anthropic had an 11-minute regional incident. My agent kept calling. Every call sat in connect timeout for 30 seconds before failing. By the time my retry logic gave up, the agent had spent four minutes of those eleven just waiting for sockets to close.
That is not a model outage problem. That is my problem. The model was clearly down. I had no reason to keep calling.
A circuit breaker fixes this. Once you see N consecutive failures, you stop trying for a while. The next call returns instantly with a "circuit is open" error instead of opening another doomed connection. After a timeout, you let one call through to test. If that one succeeds, you close the circuit and normal traffic resumes. If not, you wait again.
llm-circuit-breaker is the small Rust crate I wrote that does exactly that. It is on crates.io as llm-circuit-breaker. The full surface is three states and two thresholds.
The shape of the fix
// BreakerError is assumed to be exported at the crate root alongside the other types.
use llm_circuit_breaker::{BreakerError, CircuitBreaker, Config, State};
use std::time::Duration;

let cb = CircuitBreaker::new(Config {
    failure_threshold: 5,             // 5 consecutive failures opens the circuit
    timeout: Duration::from_secs(30), // stay open 30 seconds
    half_open_max_calls: 1,           // let 1 call through to test
});

// In your call path
match cb.call(|| call_claude(prompt)) {
    Ok(response) => use_response(response),
    Err(BreakerError::Open) => return fallback_or_fail_fast(),
    Err(BreakerError::Inner(e)) => log_and_retry(e),
}
That is the whole integration. One wrap around the existing call.
The three states
Closed is the normal state. Calls pass through. Failures get counted. Hit the threshold and the circuit transitions to Open.
Open rejects calls immediately with BreakerError::Open. No network call, no socket wait, no token cost on a request that was always going to fail. After the configured timeout elapses, the circuit transitions to HalfOpen.
HalfOpen lets a small number of calls through (default one). If they succeed, transition back to Closed. If any fail, transition back to Open for another timeout.
The whole state machine is a few lines.
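To make the transitions above concrete, here is a minimal sketch of such a state machine. It is not the crate's actual source: the `Breaker` struct, its field names, and the `allow` / `on_success` / `on_failure` methods are all made up for illustration, and the sketch skips the half_open_max_calls cap. Time is passed in explicitly so the transitions are easy to test.

```rust
use std::time::{Duration, Instant};

/// The three breaker states described above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed,
    Open,
    HalfOpen,
}

/// Illustrative breaker core (hypothetical, not the crate's internals).
struct Breaker {
    state: State,
    consecutive_failures: u32,
    failure_threshold: u32,
    timeout: Duration,
    opened_at: Option<Instant>,
}

impl Breaker {
    fn new(failure_threshold: u32, timeout: Duration) -> Self {
        Breaker {
            state: State::Closed,
            consecutive_failures: 0,
            failure_threshold,
            timeout,
            opened_at: None,
        }
    }

    /// Returns true if a call may proceed at time `now`.
    /// Moves Open -> HalfOpen once the timeout has elapsed.
    fn allow(&mut self, now: Instant) -> bool {
        if self.state == State::Open {
            let opened = self.opened_at.expect("Open state always records opened_at");
            if now.duration_since(opened) >= self.timeout {
                self.state = State::HalfOpen;
            }
        }
        self.state != State::Open
    }

    /// Any success closes the circuit and resets the failure count.
    fn on_success(&mut self) {
        self.state = State::Closed;
        self.consecutive_failures = 0;
        self.opened_at = None;
    }

    /// A failure in HalfOpen reopens immediately; in Closed it counts
    /// toward the threshold and opens the circuit once the threshold is hit.
    fn on_failure(&mut self, now: Instant) {
        self.consecutive_failures += 1;
        if self.state == State::HalfOpen || self.consecutive_failures >= self.failure_threshold {
            self.state = State::Open;
            self.opened_at = Some(now);
        }
    }
}
```

A caller would check `allow` before the network call and report the outcome with `on_success` or `on_failure` afterwards. A real implementation also needs to be safe to share across threads, which this sketch ignores.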
What it does NOT do
- It is not a retry library. If you want to retry transient failures, use llm-retry alongside this. Circuit breaker is about giving up faster on persistent failures, not about persisting through transient ones.
- It is not a rate limiter. If you want to throttle calls per second, use token-budget-pool or a separate semaphore.
- It is not provider-aware. If a 429 from Anthropic should open the circuit but a 429 from OpenAI should retry, you express that with two different CircuitBreaker instances.
- It is not a fallback chain. If a circuit opens and you want to fall through to a backup provider, see llm-fallback-router.
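The "not provider-aware" point can be sketched without the crate at all: keep one failure policy per provider and give each provider its own breaker built from that policy. Everything below is hypothetical (the `Classifier` alias, the two policy functions, and the provider keys); it only shows the 429 split described above.

```rust
use std::collections::HashMap;

/// Decides whether an HTTP status counts toward a provider's breaker.
/// Hypothetical alias for illustration only.
type Classifier = fn(u16) -> bool;

/// Policy where a 429 counts as a breaker failure.
fn trip_on_429(status: u16) -> bool {
    status >= 500 || status == 429
}

/// Policy where a 429 is left to the retry layer and only 5xx counts.
fn ignore_429(status: u16) -> bool {
    status >= 500
}

/// One policy per provider; each would back its own breaker instance.
fn classifiers() -> HashMap<&'static str, Classifier> {
    let mut m: HashMap<&'static str, Classifier> = HashMap::new();
    m.insert("anthropic", trip_on_429 as Classifier);
    m.insert("openai", ignore_429 as Classifier);
    m
}
```

Keeping the policy next to the provider name means a new provider is one more map entry plus one more breaker, with no branching inside the breaker itself.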
Inside the lib: one design choice worth showing
The hard part of a circuit breaker is not the state machine. It is deciding what counts as a failure.
A 400 Bad Request from the model provider is your fault. The model is fine. Counting that as a circuit-breaker failure means a malformed prompt opens the circuit and blocks every other call site. That is wrong.
A 502 Bad Gateway or a connect timeout is the provider's fault. Counting those is exactly what the breaker is for.
The crate's answer is a failure_classifier callback that takes a reference to the error and returns a bool. You decide which errors count.
let cb = CircuitBreaker::new(Config {
    failure_threshold: 5,
    timeout: Duration::from_secs(30),
    half_open_max_calls: 1,
    failure_classifier: Box::new(|e: &MyError| match e {
        MyError::HttpStatus(s) => *s >= 500 || *s == 429,
        MyError::Timeout => true,
        MyError::Connect => true,
        MyError::BadRequest(_) => false, // caller's fault, don't trip
        MyError::AuthFailed => false,    // we are misconfigured, not the provider
    }),
});
A 400 returned from the inner closure does not count toward the failure threshold. A 502, a timeout, a connect error, or a 429 does. The breaker trips on the right signal.
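Here is a standalone sketch of how a classifier changes the consecutive-failure count. The `MyError` enum mirrors the example above but is hypothetical, as are `provider_fault` and `consecutive_failures`. One assumption to flag: in this sketch an unclassified error neither increments nor resets the count. The crate may handle that case differently.

```rust
/// Hypothetical error type mirroring the example above.
#[derive(Debug)]
enum MyError {
    HttpStatus(u16),
    Timeout,
    Connect,
    BadRequest(String),
    AuthFailed,
}

/// True when the error is the provider's fault and should count.
fn provider_fault(e: &MyError) -> bool {
    match e {
        MyError::HttpStatus(s) => *s >= 500 || *s == 429,
        MyError::Timeout | MyError::Connect => true,
        MyError::BadRequest(_) | MyError::AuthFailed => false,
    }
}

/// Replays a sequence of call outcomes and returns the consecutive-failure
/// count a breaker would hold at the end.
fn consecutive_failures(
    outcomes: &[Result<(), MyError>],
    classify: fn(&MyError) -> bool,
) -> u32 {
    let mut count = 0;
    for outcome in outcomes {
        match outcome {
            Ok(()) => count = 0,                  // success resets the run
            Err(e) if classify(e) => count += 1,  // provider fault counts
            Err(_) => {}                          // assumption: caller-side errors are ignored
        }
    }
    count
}
```

With this policy, a run of malformed prompts never moves the count, so one buggy call site cannot open the circuit for everyone else.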
When this is useful
- You call an LLM provider from a hot path and a regional outage would otherwise burn budget on timeouts.
- You wrap an MCP tool that occasionally goes down and want to fail fast instead of holding up the agent loop.
- You run a batch agent over thousands of items and want a provider outage to fail the remaining items fast rather than wait out a timeout and retry on each one.
When this is NOT what you want
- For a single-call CLI where one failure is fatal anyway. The breaker has no state to accumulate.
- For an interactive chat where you would rather show a spinner than a fast-fail error. The breaker is about saving time and money on doomed calls, not about UX.
Install
[dependencies]
llm-circuit-breaker = "0.1"
Repo: https://github.com/MukundaKatta/llm-circuit-breaker
Sibling libraries
| Lib | Boundary | Repo |
|---|---|---|
| llm-circuit-breaker | Fail fast on persistent failure | this repo |
| llm-retry | Retry transient failures with backoff | https://github.com/MukundaKatta/llm-retry |
| llm-fallback-router | Switch providers when one fails | https://github.com/MukundaKatta/llm-fallback-router |
| token-budget-pool | Concurrent token + USD budget cap | https://github.com/MukundaKatta/token-budget-pool |
What's next
A built-in tokio interceptor so users do not have to write the wrap themselves for the common case. Already prototyped. Will land in v0.2 once I have a clean trait for the wrapped future.