The thundering herd
Rolled a custom retry loop. Fixed 2-second delays. Every worker waited the same 2 seconds after a 429 burst.
After the wait, all of them retried at the same instant. The retry flood hit the API harder than the original traffic. The 429s came back. The loop ran again. Fixed delays caused the retries to stay synchronized across the whole fleet.
This is the thundering herd problem. The fix is full-jitter.
Shape of the fix
[dependencies]
llm-retry = "0.1"
# For async support:
tokio = { version = "1", features = ["full"] }
Sync usage:
use llm_retry::{RetryConfig, Provider, retry_sync};
let config = RetryConfig {
    provider: Provider::Anthropic,
    max_attempts: 5,
    base_delay_ms: 500,
    max_delay_ms: 30_000,
};
let result = retry_sync(&config, || {
    my_anthropic_client.complete(&prompt)
})?;
Async usage with tokio:
use llm_retry::{RetryConfig, Provider, retry_async};
let config = RetryConfig {
    provider: Provider::Anthropic,
    max_attempts: 5,
    base_delay_ms: 500,
    max_delay_ms: 30_000,
};
let result = retry_async(&config, || async {
    my_anthropic_client.complete(&prompt).await
}).await?;
Use a different provider:
let config = RetryConfig {
    provider: Provider::OpenAI,
    max_attempts: 4,
    base_delay_ms: 1_000,
    max_delay_ms: 60_000,
};
let config = RetryConfig {
    provider: Provider::Bedrock,
    max_attempts: 6,
    base_delay_ms: 250,
    max_delay_ms: 20_000,
};
Custom retryable codes:
use llm_retry::{RetryConfig, Provider, RetryableCheck};
let config = RetryConfig {
    provider: Provider::Custom(RetryableCheck::StatusCodes(vec![429, 503, 504])),
    max_attempts: 5,
    base_delay_ms: 500,
    max_delay_ms: 30_000,
};
What it does NOT do
- No circuit breaking. Retrying indefinitely against a down provider is the circuit breaker problem. Compose with llm-circuit-breaker or llm-circuit-breaker-py for that.
- No Retry-After header parsing. Some APIs return a Retry-After header with a specific wait time. This crate uses jitter-based backoff regardless. Honoring Retry-After is on the roadmap.
- No fallback provider switching. If all retries fail, you get an error. For cross-provider failover, compose with llm-fallback-chain.
- No per-request retry budgets. If you need to cap total retry time across concurrent requests, compose with agent-deadline.
Inside the lib
Full-jitter vs plain exponential backoff is worth explaining once.
Plain exponential backoff: after attempt N, wait base * 2^N milliseconds. Deterministic. If 100 clients all hit a 429 at time T, they all wait the same amount and retry at time T + base * 2^N. They stay synchronized. The retry flood arrives as a spike.
Full-jitter: after attempt N, wait a random value uniformly sampled from [0, base * 2^N]. Each client picks a different delay. If 100 clients all hit a 429 at time T, their retries arrive spread across the interval [T, T + base * 2^N] at roughly uniform density. The API sees a smooth arrival curve instead of a spike.
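To make the contrast concrete, here is a minimal sketch of both strategies for a fleet of clients that all fail at the same moment. It is not the crate's code: `SplitMix64`, `plain_backoff_ms`, `full_jitter_ms`, and `distinct_retry_times` are hypothetical names, and the tiny hand-rolled generator stands in for whatever RNG the crate actually uses so the sketch has no dependencies.

```rust
/// Tiny deterministic PRNG (SplitMix64) so the sketch needs no external crates.
/// Assumption: a real implementation would more likely use the `rand` crate.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Integer in [0, upper], inclusive. Modulo bias is ignored for this sketch.
    fn below_inclusive(&mut self, upper: u64) -> u64 {
        if upper == u64::MAX {
            return self.next_u64();
        }
        self.next_u64() % (upper + 1)
    }
}

/// Plain exponential backoff: every client computes the same delay.
fn plain_backoff_ms(base_ms: u64, attempt: u32) -> u64 {
    base_ms.saturating_mul(2u64.saturating_pow(attempt))
}

/// Full jitter: each client samples from [0, base * 2^attempt].
fn full_jitter_ms(rng: &mut SplitMix64, base_ms: u64, attempt: u32) -> u64 {
    rng.below_inclusive(plain_backoff_ms(base_ms, attempt))
}

/// How many distinct retry times do `clients` clients produce for one attempt?
fn distinct_retry_times(clients: usize, base_ms: u64, attempt: u32, jitter: bool) -> usize {
    let mut rng = SplitMix64(42);
    let mut times: Vec<u64> = (0..clients)
        .map(|_| {
            if jitter {
                full_jitter_ms(&mut rng, base_ms, attempt)
            } else {
                plain_backoff_ms(base_ms, attempt)
            }
        })
        .collect();
    times.sort_unstable();
    times.dedup();
    times.len()
}
```

With plain backoff, 100 clients collapse onto a single retry time, which is the spike. With full jitter the same 100 clients land on many different times inside the window.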
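The math comes from AWS's write-up on exponential backoff and jitter (the Architecture Blog post "Exponential Backoff and Jitter"; the Builders' Library covers the same ground in "Timeouts, retries, and backoff with jitter"). Its simulations show that jittered backoff finishes the same work with far fewer calls than un-jittered exponential backoff, and that full jitter does less work than equal jitter. Decorrelated jitter performs comparably to full jitter, so the choice between those two is closer than the others.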
The implementation:
delay = random(0, min(max_delay_ms, base_delay_ms * 2^attempt))
The min(max_delay_ms, ...) cap prevents the delay from growing past a configured ceiling. Without the cap, 2^attempt grows without bound and after enough failures you are waiting hours.
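The formula above can be sketched in Rust as follows. The function names are hypothetical and the random value is passed in as a parameter to keep the example dependency-free; the crate's real internals may differ. The point it illustrates is that the cap also has to survive integer overflow once `attempt` gets large.

```rust
/// Upper bound of the jitter window: min(max_delay_ms, base_delay_ms * 2^attempt).
/// Checked and saturating arithmetic keep a large `attempt` from overflowing u64.
fn jitter_ceiling_ms(base_delay_ms: u64, max_delay_ms: u64, attempt: u32) -> u64 {
    let factor = 2u64.checked_pow(attempt).unwrap_or(u64::MAX);
    base_delay_ms.saturating_mul(factor).min(max_delay_ms)
}

/// Map a raw random u64 into [0, ceiling]. The caller supplies the randomness
/// (assumption: in practice this would come from an RNG such as the `rand` crate).
fn full_jitter_delay_ms(
    raw_random: u64,
    base_delay_ms: u64,
    max_delay_ms: u64,
    attempt: u32,
) -> u64 {
    let ceiling = jitter_ceiling_ms(base_delay_ms, max_delay_ms, attempt);
    if ceiling == u64::MAX {
        raw_random
    } else {
        raw_random % (ceiling + 1)
    }
}
```

With base_delay_ms = 500 and max_delay_ms = 30_000, the window reaches the cap at attempt 6, since 500 * 2^6 = 32,000. Whether the crate counts attempts from zero or one is an assumption here.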
The retryable status code lists per provider:
- Anthropic: 429, 529 (overloaded), 500 (server error), 503.
- OpenAI: 429, 500, 503.
- Bedrock: 429, 500, 502, 503, 504 (plus Bedrock-specific throttling codes).
- Gemini: 429, 500, 503.
Non-retryable codes (400, 401, 403, 404) are returned immediately. There is no point retrying a bad request or an auth failure.
The Provider enum is the mechanism for selecting the code list. This avoids a string comparison on every attempt.
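A minimal sketch of how an enum can carry the code lists, using the lists given above. This mirrors the description, not the crate's source: the `is_retryable` method is a hypothetical name, `Custom` is simplified to hold a plain `Vec<u16>` rather than the crate's `RetryableCheck`, and the Bedrock-specific throttling codes are left out.

```rust
/// Hypothetical stand-in for the crate's provider selector.
#[derive(Debug, Clone, PartialEq)]
enum Provider {
    Anthropic,
    OpenAI,
    Bedrock,
    Gemini,
    /// Simplified: the article's API wraps this in RetryableCheck::StatusCodes.
    Custom(Vec<u16>),
}

impl Provider {
    /// True if `status` is worth retrying for this provider.
    /// Everything else (400, 401, 403, 404, ...) is returned to the caller immediately.
    fn is_retryable(&self, status: u16) -> bool {
        match self {
            Provider::Anthropic => matches!(status, 429 | 500 | 503 | 529),
            Provider::OpenAI | Provider::Gemini => matches!(status, 429 | 500 | 503),
            Provider::Bedrock => matches!(status, 429 | 500 | 502 | 503 | 504),
            Provider::Custom(codes) => codes.contains(&status),
        }
    }
}
```

The match compiles down to integer comparisons on the variant and the status code, which is the cheap per-attempt check the enum is meant to give you.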
When useful
- Any Rust application making LLM API calls where you need retry on rate limits or transient errors.
- Fleet applications where many workers share an API key and thundering-herd retry is a real risk.
- Eval pipelines that run many requests in parallel and need resilient calls without manual retry boilerplate.
- Applications that target multiple providers and want a single retry abstraction.
When NOT
- If you are making a single LLM call per user request and latency is critical, retry adds tail latency. Consider whether a fast fail and a user-visible error is better than an invisible retry.
- If the error is persistent (wrong API key, invalid model name), retry wastes time. The built-in non-retryable code list handles this for status codes, but application-level semantic errors need application-level handling.
- If you already have a retry library in your stack that handles status codes, do not add a second one. Compose at the circuit-breaker or fallback layer instead.
Install
[dependencies]
llm-retry = "0.1"
Crates.io: llm-retry
GitHub: MukundaKatta/llm-retry
Siblings
| Lib | Boundary | Repo |
|---|---|---|
| llm-circuit-breaker | Closed/Open/HalfOpen state machine to stop retrying a down provider | MukundaKatta/llm-circuit-breaker |
| llm-fallback-chain | Ordered provider failover when all retries for a provider fail | MukundaKatta/llm-fallback-chain |
| agent-deadline | Cooperative per-task deadline; pair with retry to cap total wait time | MukundaKatta/agent-deadline |
| llm-retry-py | Python port of this crate with async support | MukundaKatta/llm-retry-py |
| token-budget-pool | Shared token/USD budget; stops retries once the budget is exhausted | MukundaKatta/token-budget-pool |
What is next
Retry-After header support is the clearest missing piece. When the API returns a Retry-After: 30 header, the client should wait at least 30 seconds on that attempt rather than computing a jitter delay that may be shorter. Honoring the server's guidance is better than ignoring it.
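One possible shape for that feature, as a sketch only since it is not in the crate yet. `parse_retry_after_secs` and `next_wait` are hypothetical names, and only the delta-seconds form of the header is handled; the HTTP-date form falls through to the jitter delay.

```rust
use std::time::Duration;

/// Parse the delta-seconds form of a Retry-After value, e.g. "30".
/// The HTTP-date form is not handled in this sketch and yields None.
fn parse_retry_after_secs(value: &str) -> Option<Duration> {
    value.trim().parse::<u64>().ok().map(Duration::from_secs)
}

/// Pick the wait for this attempt: the server's hint if one was sent and parsed,
/// otherwise the jitter delay the retry loop already computed.
fn next_wait(retry_after: Option<&str>, jitter_delay: Duration) -> Duration {
    retry_after
        .and_then(parse_retry_after_secs)
        .unwrap_or(jitter_delay)
}
```

A real version would also need to decide whether the server's hint should be clamped by max_delay_ms; this sketch leaves it unclamped.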
A RetryEvent callback parameter would also help. Right now, retry attempts are silent. Being able to log attempt number, delay, and status code from a caller-provided closure would make debugging rate limit behavior much easier in production.
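Here is a rough sketch of what such a hook could look like. Everything in it is hypothetical: the `RetryEvent` fields, the `retry_with_events` function, and its signature are illustrations of the idea, not a planned API. The sleep function is injected so the loop can be exercised without real waiting.

```rust
use std::time::Duration;

/// Hypothetical event handed to the caller's hook before each wait.
#[derive(Debug, Clone, PartialEq)]
struct RetryEvent {
    attempt: u32,
    delay: Duration,
    status: u16,
}

/// Minimal synchronous retry loop with an observer hook.
/// `op` returns Ok(T) or Err(status code); `delay_for` computes the wait for a
/// given attempt; `on_retry` sees every retry; `sleep` performs the wait.
fn retry_with_events<T>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, u16>,
    delay_for: impl Fn(u32) -> Duration,
    mut on_retry: impl FnMut(&RetryEvent),
    mut sleep: impl FnMut(Duration),
) -> Result<T, u16> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(status) => {
                attempt += 1;
                // Out of attempts: hand the last status back to the caller.
                if attempt >= max_attempts {
                    return Err(status);
                }
                let event = RetryEvent { attempt, delay: delay_for(attempt), status };
                on_retry(&event);
                sleep(event.delay);
            }
        }
    }
}
```

In production the hook would typically log or emit a metric, and `sleep` would be `std::thread::sleep`. The sketch retries every error; a real loop would first check whether the status code is retryable.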
Part of the Hermes Agent Challenge sprint. All crates shipped on crates.io.