Mukunda Rao Katta

Posted on May 25

Stop Hammering a Broken API: Circuit Breakers for LLM Calls in Rust

#hermeschallenge #ai #rust #agents

The outage that made things worse

Anthropic had a partial outage. Twenty minutes, partial degradation, some requests timing out at the edge. Normal stuff. What was not normal was what happened on our side.

The retry logic kept firing. Each retry hit the endpoint and waited 30 seconds before timing out. We had 40 concurrent agents running. That is 40 threads, each holding a 30-second timeout, each retrying up to 3 times. Do the math: 40 threads times 3 retries times 30 seconds each is 3,600 seconds of cumulative waiting, with every agent pinned for up to 90 seconds per request. The service did not wait for the outage to end. It OOMed first.

The original problem was a 20-minute partial API outage. The actual disaster was self-inflicted, caused by retry logic that had no idea the endpoint was already failing. Retry is good. Retry without a circuit breaker is a footgun.

The shape of the fix

A circuit breaker sits in front of your API call. It tracks consecutive failures and has three states:

Closed: Normal operation. Calls go through. Failures get counted.
Open: Too many consecutive failures. All calls fail fast without touching the API. No timeouts, no retries eating memory.
HalfOpen: After a configured timeout, the circuit allows exactly one probe call through. If it succeeds, back to Closed. If it fails, back to Open.

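// CircuitError is matched on below; importing it from the crate root is an assumption.
use llm_circuit_breaker::{CircuitBreaker, CircuitError, Config};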

let cb = CircuitBreaker::new(Config {
    failure_threshold: 5,   // open after 5 consecutive failures
    success_threshold: 1,   // close after 1 success in half-open
    timeout_secs: 30,       // wait 30s before probing
});

let result = cb.call(|| async {
    client.messages().create(params.clone()).await
}).await;

match result {
    Ok(response) => handle(response),
    Err(CircuitError::Open) => serve_fallback(),  // fail fast, no API call made
    Err(CircuitError::ApiError(e)) => handle_api_error(e),
}

The call method checks the circuit state before touching the network. If the circuit is Open, it returns CircuitError::Open immediately. No timeout. No thread held. No memory balloon.
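To make the fail-fast path concrete, here is a minimal synchronous sketch. It is not the crate's code: `MiniBreaker`, `MiniError`, and the plain (non-async) closure signature are stand-ins I made up for illustration, and it only models Closed and Open, with no HalfOpen and no timer.

```rust
/// Hypothetical stand-in for the crate's error type.
#[derive(Debug, PartialEq)]
pub enum MiniError<E> {
    /// The circuit is open; the closure was never invoked.
    Open,
    /// The closure ran and returned an error.
    Api(E),
}

/// Hypothetical two-state breaker: counts consecutive failures, then opens.
pub struct MiniBreaker {
    consecutive_failures: u32,
    failure_threshold: u32,
}

impl MiniBreaker {
    pub fn new(failure_threshold: u32) -> Self {
        MiniBreaker { consecutive_failures: 0, failure_threshold }
    }

    pub fn is_open(&self) -> bool {
        self.consecutive_failures >= self.failure_threshold
    }

    pub fn call<T, E>(
        &mut self,
        f: impl FnOnce() -> Result<T, E>,
    ) -> Result<T, MiniError<E>> {
        // Check state first. When open, return before `f` is ever touched.
        if self.is_open() {
            return Err(MiniError::Open);
        }
        match f() {
            Ok(value) => {
                self.consecutive_failures = 0; // any success resets the count
                Ok(value)
            }
            Err(e) => {
                self.consecutive_failures += 1;
                Err(MiniError::Api(e))
            }
        }
    }
}
```

With a threshold of 2, five calls against a dead endpoint only invoke the closure twice. The other three return `MiniError::Open` without doing any work.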

What this is NOT

This is not a retry library. llm-retry handles exponential backoff and per-call retries within the closed state. The circuit breaker sits above that layer.
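A rough sketch of that layering, under my own assumptions: `with_retries` and `TinyBreaker` are hypothetical stand-ins (not the llm-retry or llm-circuit-breaker APIs), everything is synchronous, and one exhausted retry budget counts as one failure for the breaker.

```rust
/// Hypothetical retry layer: calls `f` up to `max_attempts` times.
pub fn with_retries<T, E>(
    max_attempts: u32,
    mut f: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match f() {
            Ok(value) => return Ok(value),
            Err(e) => {
                if attempt >= max_attempts {
                    return Err(e); // budget exhausted, surface the last error
                }
                attempt += 1; // a real retry layer would sleep with backoff here
            }
        }
    }
}

/// Hypothetical breaker layer that wraps whatever closure it is given.
pub struct TinyBreaker {
    consecutive_failures: u32,
    failure_threshold: u32,
}

impl TinyBreaker {
    pub fn new(failure_threshold: u32) -> Self {
        TinyBreaker { consecutive_failures: 0, failure_threshold }
    }

    /// Returns `None` when the circuit is open and the closure was skipped.
    pub fn call<T, E>(
        &mut self,
        f: impl FnOnce() -> Result<T, E>,
    ) -> Option<Result<T, E>> {
        if self.consecutive_failures >= self.failure_threshold {
            return None;
        }
        let result = f();
        match &result {
            Ok(_) => self.consecutive_failures = 0,
            Err(_) => self.consecutive_failures += 1,
        }
        Some(result)
    }
}
```

Usage is `breaker.call(|| with_retries(3, || api_call()))`. With a breaker threshold of 2 and a dead endpoint, the first two calls burn 3 attempts each, and every call after that costs zero attempts.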

This is not a rate limiter. Rate limiters control request volume from the client side. A circuit breaker responds to failure signals from the server side.

This is not a health check system. You do not need a separate ping endpoint. The circuit is probed by real traffic, which is a deliberate choice I will explain below.

Inside the lib

The state machine is straightforward. The interesting part is thread safety without holding a lock across the API call.

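use std::sync::{Arc, Mutex};
use std::time::Instant;

pub struct CircuitBreaker {
    state: Arc<Mutex<State>>, // shared across concurrent callers, locked only briefly
    config: Config,           // thresholds and timeout
}

enum State {
    Closed { consecutive_failures: u32 }, // normal operation, counting failures
    Open { opened_at: Instant },          // failing fast since this instant
    HalfOpen,                             // a probe call is allowed through
}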

The Mutex is only locked to read or transition state, not during the actual API call. The call itself runs without holding any lock. This means 40 concurrent agents share one circuit breaker without one agent's API timeout blocking another from reading circuit state.
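Here is one way that locking discipline can look, as a hedged sketch rather than the crate's actual code. `SharedBreaker`, `Gate`, and `lock_is_free` are hypothetical names, the closure is synchronous, and HalfOpen is left out to keep the focus on lock scope.

```rust
use std::sync::{Arc, Mutex};

#[derive(Debug, Clone, Copy, PartialEq)]
enum Gate {
    Closed { consecutive_failures: u32 },
    Open,
}

/// Hypothetical breaker whose state lives behind Arc<Mutex<..>>.
#[derive(Clone)]
pub struct SharedBreaker {
    state: Arc<Mutex<Gate>>,
    failure_threshold: u32,
}

impl SharedBreaker {
    pub fn new(failure_threshold: u32) -> Self {
        SharedBreaker {
            state: Arc::new(Mutex::new(Gate::Closed { consecutive_failures: 0 })),
            failure_threshold,
        }
    }

    /// Returns `None` when the circuit is open and `f` was skipped.
    pub fn call<T, E>(&self, f: impl FnOnce() -> Result<T, E>) -> Option<Result<T, E>> {
        // Step 1: lock just long enough to read the state.
        {
            let guard = self.state.lock().unwrap();
            if *guard == Gate::Open {
                return None;
            }
        } // guard dropped here, before the slow part

        // Step 2: the slow call runs with no lock held.
        let result = f();

        // Step 3: lock again, briefly, to record the outcome.
        let mut guard = self.state.lock().unwrap();
        let current = *guard;
        *guard = match (&result, current) {
            (Ok(_), _) => Gate::Closed { consecutive_failures: 0 },
            (Err(_), Gate::Closed { consecutive_failures })
                if consecutive_failures + 1 >= self.failure_threshold =>
            {
                Gate::Open
            }
            (Err(_), Gate::Closed { consecutive_failures }) => Gate::Closed {
                consecutive_failures: consecutive_failures + 1,
            },
            (Err(_), Gate::Open) => Gate::Open,
        };
        Some(result)
    }

    /// Test helper: true if nobody holds the state lock right now.
    pub fn lock_is_free(&self) -> bool {
        self.state.try_lock().is_ok()
    }
}
```

The check for this is to call `lock_is_free()` on a clone from inside the closure. It reports the lock as free, which is what lets other agents read circuit state while one call is stuck on a timeout.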

State transitions:

Closed, failure recorded, count reaches failure_threshold: transition to Open, record timestamp.
Open, time elapsed since opened_at exceeds timeout_secs: transition to HalfOpen on next call attempt.
HalfOpen, success: transition to Closed once success_threshold successes have been seen (1 by default), reset failure count.
HalfOpen, failure: transition to Open, reset timestamp.

The success_threshold config is there for cases where you want two or three consecutive successes before fully trusting the circuit again. Default is 1.

The probe design choice

Here is the part that is worth pausing on. In HalfOpen state, the probe is the caller's next real request, not a synthetic health check.

Why? Synthetic health checks require you to define what a healthy response looks like, maintain a separate check endpoint or lightweight request format, and deal with the case where the health check passes but your actual request shape still fails.

Using real traffic as the probe sidesteps all of that. The circuit recovers exactly when real traffic can flow, which is the actual condition you care about. The downside is that one real request gets to fail if the probe fails. For most LLM agent workloads, that is acceptable because you are handling errors anyway and the probe failure re-opens the circuit immediately rather than letting a flood of requests through.

If your workload has a request that is too expensive to use as a probe, you can wrap a lightweight version in the call closure. The circuit breaker does not care what is inside the closure.

When this is useful

Agent loops with retries. Without a circuit breaker, a broken API endpoint turns every retry budget into a timeout budget. You get the worst of both: slow failure and resource exhaustion.

Multi-tenant services. If one tenant's request pattern trips the circuit, other tenants still get fast fails instead of timeout queuing.

Fallback routing. When the circuit is Open, you know immediately. You can route to a secondary provider, serve a cached response, or degrade gracefully. The circuit state is queryable:

if cb.is_open() {
    return serve_cached_or_fallback();
}
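The routing decision itself can be sketched in a few lines. `Provider`, `Route`, and `pick_route` are hypothetical names for illustration. In real code the `circuit_open` flag would come from something like the `is_open()` check shown above, one breaker per provider.

```rust
/// Hypothetical provider entry: a name plus whether its breaker is open.
#[derive(Debug, PartialEq)]
pub struct Provider {
    pub name: &'static str,
    pub circuit_open: bool,
}

#[derive(Debug, PartialEq)]
pub enum Route<'a> {
    /// Send the request to this provider.
    Call(&'a Provider),
    /// Every circuit is open: serve a cached response or degrade.
    CachedOrFallback,
}

/// Picks the first provider (in priority order) whose circuit is not open.
pub fn pick_route(providers: &[Provider]) -> Route<'_> {
    match providers.iter().find(|p| !p.circuit_open) {
        Some(provider) => Route::Call(provider),
        None => Route::CachedOrFallback,
    }
}
```

No network call is needed to make this decision, which is the point of a queryable circuit state.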

When NOT to use this

Single-request scripts. If you are running a one-off prompt, retry is enough. The circuit breaker adds state management overhead that is pointless without concurrency.

Uncorrelated API failures. Circuit breakers are designed for cascading failure scenarios where the API is genuinely degraded. If failures are random and independent, a breaker with a low failure_threshold can false-trip during normal noise.
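How often would random noise trip a consecutive-failure breaker? A back-of-the-envelope sketch, assuming every call fails independently with the same probability. That is an idealization, and both function names are mine. The second function uses the standard waiting-time formula for a run of n consecutive events, (p^-n - 1) / (1 - p).

```rust
/// Probability that `run_length` specific consecutive calls all fail,
/// assuming each call fails independently with probability `failure_rate`.
pub fn run_probability(failure_rate: f64, run_length: i32) -> f64 {
    failure_rate.powi(run_length)
}

/// Expected number of calls until the first run of `run_length` consecutive
/// failures, for 0 < failure_rate < 1 under the same independence assumption.
pub fn expected_calls_until_trip(failure_rate: f64, run_length: i32) -> f64 {
    (failure_rate.powi(-run_length) - 1.0) / (1.0 - failure_rate)
}
```

With a threshold of 5, a 5% independent failure rate works out to roughly 3.4 million calls between false trips on average, while a 60% rate trips about every 30 calls. So the false-trip risk is mostly about low thresholds combined with a noisy baseline.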

Per-model-endpoint circuits out of the box. This crate gives you one circuit per instance. You can still instantiate multiple breakers, one per provider or endpoint, and compose them yourself.
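Composing them yourself can be as simple as a map keyed by endpoint. A sketch with stand-in types: `EndpointBreaker` and `BreakerRegistry` are hypothetical, the threshold of 5 is hard-coded for brevity, and with the real crate the map values would be `CircuitBreaker` instances instead.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a real breaker: consecutive-failure count only.
#[derive(Debug, Default)]
pub struct EndpointBreaker {
    consecutive_failures: u32,
}

impl EndpointBreaker {
    const FAILURE_THRESHOLD: u32 = 5;

    pub fn record_failure(&mut self) {
        self.consecutive_failures += 1;
    }

    pub fn record_success(&mut self) {
        self.consecutive_failures = 0;
    }

    pub fn is_open(&self) -> bool {
        self.consecutive_failures >= Self::FAILURE_THRESHOLD
    }
}

/// One breaker per endpoint key, created lazily on first use.
#[derive(Default)]
pub struct BreakerRegistry {
    breakers: HashMap<String, EndpointBreaker>,
}

impl BreakerRegistry {
    pub fn for_endpoint(&mut self, key: &str) -> &mut EndpointBreaker {
        self.breakers.entry(key.to_string()).or_default()
    }
}
```

An outage on one key opens only that key's circuit. Traffic to the other endpoints keeps flowing.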

Install

[dependencies]
llm-circuit-breaker = "0.1"

GitHub: MukundaKatta/llm-circuit-breaker

crates.io: llm-circuit-breaker

Siblings

Crate	What it does
llm-retry	Exponential backoff + per-call retry budgets
llm-budget-window	Time-windowed token/USD budgets
token-budget-pool	Shared concurrent token cap across agents
llm-fallback-router	Route to secondary provider when primary fails

What is next

The obvious extension is per-state callbacks: on_open, on_close, on_half_open. Useful for metrics and alerting. The internal state transition points are already clean enough that adding callbacks would be a small diff.

A sliding window failure rate (instead of consecutive count) is also on the list. Consecutive failures are simple and work well for hard outages. Sliding window handles soft degradation better, where 60% of requests fail but not in a row.
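Since the sliding window is not in the crate yet, here is only a sketch of what the idea could look like. `FailureWindow` and `longest_failure_run` are hypothetical, and the window is count-based (last N outcomes) rather than time-based.

```rust
use std::collections::VecDeque;

/// Hypothetical sliding-window tracker: remembers the last `capacity` outcomes.
pub struct FailureWindow {
    outcomes: VecDeque<bool>, // true = the call failed
    capacity: usize,
}

impl FailureWindow {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "window must hold at least one outcome");
        FailureWindow { outcomes: VecDeque::with_capacity(capacity), capacity }
    }

    pub fn record(&mut self, failed: bool) {
        if self.outcomes.len() == self.capacity {
            self.outcomes.pop_front(); // evict the oldest outcome
        }
        self.outcomes.push_back(failed);
    }

    pub fn failure_rate(&self) -> f64 {
        if self.outcomes.is_empty() {
            return 0.0;
        }
        let failures = self.outcomes.iter().filter(|&&failed| failed).count();
        failures as f64 / self.outcomes.len() as f64
    }

    /// Trip only once the window is full, so one early failure cannot open it.
    pub fn should_open(&self, max_failure_rate: f64) -> bool {
        self.outcomes.len() == self.capacity && self.failure_rate() >= max_failure_rate
    }
}

/// Longest streak of failures, which is all a consecutive counter can see.
pub fn longest_failure_run(outcomes: &[bool]) -> u32 {
    let mut best: u32 = 0;
    let mut run: u32 = 0;
    for &failed in outcomes {
        if failed {
            run += 1;
            best = best.max(run);
        } else {
            run = 0;
        }
    }
    best
}
```

Feed it a pattern like fail, fail, ok, fail, fail, ok, fail, fail, ok, ok. That is 6 failures in 10 calls, but the longest streak is 2. A consecutive threshold of 5 never trips, while a 50% window does.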

For now, the simple version ships. You can wire it up in five minutes, and it will save you from the OOM scenario that caused this crate to exist in the first place.

DEV Community