Our retry loop made an outage worse. The circuit breaker stopped the cascade.

#hermesagent #ai #llm #rust

A few weeks back there was a 22-minute window where Anthropic returned a high rate of 5xx responses. Not a full outage. Degraded.

Our agent service had a retry policy that backed off and tried again on 5xx. With six workers and a shared retry budget that I had set too high, we were re-issuing failed calls about as fast as the API was returning errors. When the API recovered, we had a backlog of in-flight retries that pushed us straight back into rate limiting.

Total cost of the bad decision: about 18,000 wasted Anthropic calls and 9 minutes of additional recovery time after their side came back. Nothing user-visible blew up, but I felt bad about it.

I wrote llm-circuit-breaker the next day. It is small. The whole crate is under 400 lines of Rust. It pairs with llm-retry.

The state machine

              failures >= threshold
   +-------+ ----------------------> +------+
   | Closed|                          | Open |
   +-------+ <---------------------- +------+
       ^      half_open success       |
       |                               |
       |     half_open failure         v
       |   <----------------------- +-----------+
       +---------------------------- | HalfOpen  |
            cooldown elapsed         +-----------+

Closed: calls go through. Failures are counted.
Open: calls return BreakerError::Open immediately without hitting the API. After a cooldown, the breaker transitions to HalfOpen.
HalfOpen: exactly one trial call is allowed through. If it succeeds, back to Closed. If it fails, back to Open with the cooldown reset.

That is the entire state machine. No leaky bucket. No fancy sliding window. Just enough to stop a runaway retry from making a partial outage into a full one.

What it looks like in code

use llm_circuit_breaker::{Breaker, BreakerConfig};
use std::time::Duration;

let breaker = Breaker::new(BreakerConfig {
    failure_threshold: 5,
    success_threshold: 1,
    cooldown: Duration::from_secs(30),
});

let result = breaker.call(|| async {
    client.messages().create(payload).await
}).await;

match result {
    Ok(resp) => handle(resp),
    Err(BreakerError::Open) => {
        // skip the call entirely, return cached or fallback
    }
    Err(BreakerError::Inner(e)) => {
        // the upstream failed; the breaker counted it
    }
}

Everything is Arc<Mutex<...>> inside. Safe to share across tasks. There is also an is_open() for cheap pre-check without taking the lock long.

Threshold tuning, honestly

I tuned the numbers by accident at first and got them wrong. Here is what worked after a few iterations.

For Anthropic messages.create from a single-worker process:

failure_threshold: 5. Three is too jumpy. Ten is too slow. Five is a few seconds of bad calls before tripping.
cooldown: Duration::from_secs(30). Long enough that you do not spam the half-open probe. Short enough that recovery is fast.
success_threshold: 1. One good response is enough to close. The breaker is not a health system, it is a stop-loss.

For a multi-worker pool sharing one breaker (which is what we have in production now):

Scale the failure threshold by sqrt of worker count, not linearly. Six workers do not need 30 failures to trip. They need maybe 12.
Keep the cooldown the same. Cooldown is about the upstream, not your side.

What I learned the hard way: a per-worker breaker is not the same as a shared breaker. Per-worker, every worker has to learn the upstream is down independently. Shared, one worker's failure protects the others. We use one shared breaker per upstream provider now.

Composing with retry

llm-retry does exponential backoff with jitter. llm-circuit-breaker cuts off retries when the breaker is open. The two together prevent both the rare-flake case (retry handles it) and the cascading-failure case (breaker handles it).

use llm_retry::{retry, RetryConfig};
use llm_circuit_breaker::Breaker;

let cfg = RetryConfig {
    max_attempts: 4,
    base_delay: Duration::from_millis(500),
    max_delay: Duration::from_secs(8),
    ..Default::default()
};

retry(cfg, || async {
    breaker.call(|| async { client.messages().create(p.clone()).await }).await
}).await

If the breaker is open, the inner closure returns BreakerError::Open immediately. llm-retry sees that as non-retryable (you configure it that way) and bails out fast. No retry storm.

Numbers from the dry-run

I ran a simulation against a mock server that returns 5xx for the first 60 seconds and then 200s.

Without breaker, with a 4-attempt retry policy: 1,140 wasted requests during the bad window, full backlog hit on recovery.
With breaker (threshold=5, cooldown=30s): 19 wasted requests, recovered cleanly with one half-open probe.

That is the difference between "we noticed the outage" and "we made the outage worse for ourselves."

What this does not solve

It does not detect that an upstream is slow but not erroring. If every call takes 25 seconds but eventually returns 200, the breaker stays closed. You would want a latency-based tripwire for that, and I have not added one. agenttrace-rs will at least surface the p95 so you notice.
It is not coordinated across processes. Each replica of your service has its own breaker. If you want a globally-coordinated breaker, you need a shared store (Redis, etc.) and this crate is not it.
The cooldown is fixed. There is no adaptive cooldown that grows on repeated trips. I considered it and decided fixed-cooldown is honest enough.
Half-open allows exactly one probe. If your traffic is very high and you do not want the recovery to bottleneck on a single probe, you would want a token-bucket half-open. Not implemented.

The crate has no async-runtime lock-in. It works under tokio, async-std, or sync (the closure shape changes).

Repo: https://github.com/MukundaKatta/llm-circuit-breaker
crates.io: llm-circuit-breaker = "0.1"

Part of a small stack of Rust crates I publish for the unglamorous LLM plumbing (retry, budget, repair, cost, trace). Pairs cleanly with llm-retry and agenttrace-rs.