Mukunda Rao Katta

llm-retry: Full-Jitter Exponential Backoff for LLM API Calls in Rust

The thundering herd

I rolled a custom retry loop with fixed 2-second delays. Every worker waited the same 2 seconds after a 429 burst.

After the wait, all of them retried at the same instant. The retry flood hit the API harder than the original traffic. The 429s came back. The loop ran again. Fixed delays caused the retries to stay synchronized across the whole fleet.

This is the thundering herd problem. The fix is full-jitter.

Shape of the fix

[dependencies]
llm-retry = "0.1"
# For async support:
tokio = { version = "1", features = ["full"] }

Sync usage:

use llm_retry::{RetryConfig, Provider, retry_sync};

let config = RetryConfig {
    provider: Provider::Anthropic,
    max_attempts: 5,
    base_delay_ms: 500,
    max_delay_ms: 30_000,
};

let result = retry_sync(&config, || {
    my_anthropic_client.complete(&prompt)
})?;

Async usage with tokio:

use llm_retry::{RetryConfig, Provider, retry_async};

let config = RetryConfig {
    provider: Provider::Anthropic,
    max_attempts: 5,
    base_delay_ms: 500,
    max_delay_ms: 30_000,
};

let result = retry_async(&config, || async {
    my_anthropic_client.complete(&prompt).await
}).await?;

Use a different provider:

let config = RetryConfig {
    provider: Provider::OpenAI,
    max_attempts: 4,
    base_delay_ms: 1_000,
    max_delay_ms: 60_000,
};

let config = RetryConfig {
    provider: Provider::Bedrock,
    max_attempts: 6,
    base_delay_ms: 250,
    max_delay_ms: 20_000,
};

Custom retryable codes:

use llm_retry::{RetryConfig, Provider, RetryableCheck};

let config = RetryConfig {
    provider: Provider::Custom(RetryableCheck::StatusCodes(vec![429, 503, 504])),
    max_attempts: 5,
    base_delay_ms: 500,
    max_delay_ms: 30_000,
};

What it does NOT do

  • No circuit breaking. Retrying indefinitely against a down provider is the circuit breaker problem. Compose with llm-circuit-breaker or llm-circuit-breaker-py for that.
  • No Retry-After header parsing. Some APIs return a Retry-After header with a specific wait time. This crate uses jitter-based backoff regardless. Honoring Retry-After is on the roadmap.
  • No fallback provider switching. If all retries fail, you get an error. For cross-provider failover, compose with llm-fallback-chain.
  • No per-request retry budgets. If you need to cap total retry time across concurrent requests, compose with agent-deadline.

Inside the lib

Full-jitter vs plain exponential backoff is worth explaining once.

Plain exponential backoff: after attempt N, wait base * 2^N milliseconds. Deterministic. If 100 clients all hit a 429 at time T, they all wait the same amount and retry at time T + base * 2^N. They stay synchronized. The retry flood arrives as a spike.

Full-jitter: after attempt N, wait a random value uniformly sampled from [0, base * 2^N]. Each client picks a different delay. If 100 clients all hit a 429 at time T, their retries arrive spread across the interval [T, T + base * 2^N] at roughly uniform density. The API sees a smooth arrival curve instead of a spike.
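To put numbers on the difference, here is a small simulation sketch. It is not code from llm-retry: XorShift64, plain_delays, full_jitter_delays, and distinct are names made up for this example, and the toy generator only stands in for a real RNG such as the rand crate.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
/// Assumption: a real client would use a proper RNG instead.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift must not start from zero, so fall back to a fixed constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Roughly uniform integer in [0, upper]; modulo bias is ignored here.
    fn up_to(&mut self, upper: u64) -> u64 {
        if upper == u64::MAX {
            self.next_u64()
        } else {
            self.next_u64() % (upper + 1)
        }
    }
}

/// Plain exponential backoff: every client computes the same delay.
/// Assumes small attempt numbers, so base_ms * 2^attempt does not overflow.
fn plain_delays(clients: usize, base_ms: u64, attempt: u32) -> Vec<u64> {
    vec![base_ms * 2u64.pow(attempt); clients]
}

/// Full jitter: each client samples from [0, base_ms * 2^attempt].
fn full_jitter_delays(
    clients: usize,
    base_ms: u64,
    attempt: u32,
    rng: &mut XorShift64,
) -> Vec<u64> {
    let ceiling = base_ms * 2u64.pow(attempt);
    (0..clients).map(|_| rng.up_to(ceiling)).collect()
}

/// Number of distinct retry instants; 1 means a fully synchronized herd.
fn distinct(delays: &[u64]) -> usize {
    let mut sorted = delays.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    sorted.len()
}
```

With 100 clients, a 500 ms base, and attempt 2, plain_delays returns one hundred copies of 2000, so distinct reports 1. The jittered version returns values spread over 0 to 2000 ms, so the retries no longer land on the same instant.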

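The analysis comes from the AWS Architecture Blog post "Exponential Backoff And Jitter" (the Amazon Builders' Library covers the same ground in "Timeouts, retries, and backoff with jitter"). In its simulations, every jittered strategy sharply reduces client work and server load compared with un-jittered exponential backoff. Full jitter needs the fewest total calls. Decorrelated jitter is close and finishes slightly sooner, so the choice between the two is a judgment call rather than a clear win.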

The implementation:

delay = random(0, min(max_delay_ms, base_delay_ms * 2^attempt))

The min(max_delay_ms, ...) cap prevents the delay from growing past a configured ceiling. Without the cap, 2^attempt grows without bound and after enough failures you are waiting hours.
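A minimal sketch of the capped ceiling, assuming zero-based attempt numbers and u64 milliseconds. The function name backoff_ceiling_ms is made up for this example, and the crate's internals may differ. The point is that the upper bound of the jitter window can be computed without overflow even for absurd attempt counts.

```rust
/// Upper bound of the jitter window for a zero-based attempt number:
/// min(max_delay_ms, base_delay_ms * 2^attempt), computed without overflow.
/// Hypothetical helper, not part of the llm-retry API.
fn backoff_ceiling_ms(base_delay_ms: u64, max_delay_ms: u64, attempt: u32) -> u64 {
    // 1 << attempt is invalid once attempt reaches 64; treat that as "huge".
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    // saturating_mul clamps to u64::MAX instead of wrapping or panicking.
    base_delay_ms.saturating_mul(factor).min(max_delay_ms)
}
```

With base_delay_ms = 500 and max_delay_ms = 30_000, attempts 0 through 6 give ceilings of 500, 1000, 2000, 4000, 8000, 16000, and 30000 ms. The last one would be 32000 without the cap. The actual delay is then a uniform sample between 0 and that ceiling.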

The retryable status code lists per provider:

  • Anthropic: 429, 529 (overloaded), 500 (server error), 503.
  • OpenAI: 429, 500, 503.
  • Bedrock: 429, 500, 502, 503, 504 (plus Bedrock-specific throttling codes).
  • Gemini: 429, 500, 503.

Non-retryable codes (400, 401, 403, 404) are returned immediately. There is no point retrying a bad request or an auth failure.

The Provider enum is the mechanism for selecting the code list. This avoids a string comparison on every attempt.
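The lookup can be sketched as a match on an enum. SketchProvider below is a stand-in written for this post, not the crate's actual Provider definition. The code lists are copied from the bullets above, and the Bedrock-specific throttling codes are left out.

```rust
/// Hypothetical stand-in for the crate's `Provider` enum. Variant names
/// mirror the article, but the real definition may differ.
#[derive(Debug, Clone, PartialEq)]
enum SketchProvider {
    Anthropic,
    OpenAI,
    Bedrock,
    Gemini,
    /// Caller-supplied list of retryable status codes.
    Custom(Vec<u16>),
}

impl SketchProvider {
    /// True when the HTTP status is worth retrying for this provider.
    fn is_retryable(&self, status: u16) -> bool {
        match self {
            SketchProvider::Anthropic => matches!(status, 429 | 500 | 503 | 529),
            SketchProvider::OpenAI | SketchProvider::Gemini => {
                matches!(status, 429 | 500 | 503)
            }
            SketchProvider::Bedrock => matches!(status, 429 | 500 | 502 | 503 | 504),
            SketchProvider::Custom(codes) => codes.contains(&status),
        }
    }
}
```

The match compiles down to integer comparisons, and anything outside the list (400, 401, 403, 404, and so on) falls through to false, so the caller can return the error immediately.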

When useful

  • Any Rust application making LLM API calls where you need retry on rate limits or transient errors.
  • Fleet applications where many workers share an API key and thundering-herd retry is a real risk.
  • Eval pipelines that run many requests in parallel and need resilient calls without manual retry boilerplate.
  • Applications that target multiple providers and want a single retry abstraction.

When NOT

  • If you are making a single LLM call per user request and latency is critical, retry adds tail latency. Consider whether a fast fail and a user-visible error is better than an invisible retry.
  • If the error is persistent (wrong API key, invalid model name), retry wastes time. The built-in non-retryable code list handles this for status codes, but application-level semantic errors need application-level handling.
  • If you already have a retry library in your stack that handles status codes, do not add a second one. Compose at the circuit-breaker or fallback layer instead.

Install

[dependencies]
llm-retry = "0.1"

Crates.io: llm-retry
GitHub: MukundaKatta/llm-retry

Siblings

| Lib | Boundary | Repo |
| --- | --- | --- |
| llm-circuit-breaker | Closed/Open/HalfOpen state machine to stop retrying a down provider | MukundaKatta/llm-circuit-breaker |
| llm-fallback-chain | Ordered provider failover when all retries for a provider fail | MukundaKatta/llm-fallback-chain |
| agent-deadline | Cooperative per-task deadline; pair with retry to cap total wait time | MukundaKatta/agent-deadline |
| llm-retry-py | Python port of this crate with async support | MukundaKatta/llm-retry-py |
| token-budget-pool | Shared token/USD budget; stops retries once the budget is exhausted | MukundaKatta/token-budget-pool |

What is next

Retry-After header support is the clearest missing piece. When the API returns a Retry-After: 30 header, the client should wait at least 30 seconds on that attempt rather than relying only on a computed jitter delay. Honoring the server's guidance is almost always better than ignoring it.
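A rough sketch of what that could look like. None of this exists in the crate today: parse_retry_after_secs and choose_delay are hypothetical names. Only the delta-seconds form of the header is handled. The HTTP-date form falls back to the jittered delay.

```rust
use std::time::Duration;

/// Parse the delta-seconds form of a Retry-After header value ("30").
/// The HTTP-date form is not handled in this sketch and yields None.
fn parse_retry_after_secs(value: &str) -> Option<Duration> {
    value.trim().parse::<u64>().ok().map(Duration::from_secs)
}

/// Prefer the server's hint when it parses, otherwise use the jittered delay.
/// Both function names are hypothetical, not llm-retry API.
fn choose_delay(retry_after: Option<&str>, jittered: Duration) -> Duration {
    retry_after
        .and_then(parse_retry_after_secs)
        .unwrap_or(jittered)
}
```

One open design question is whether a server hint larger than max_delay_ms should be clamped or should end the retry loop early. The sketch does neither and returns the hint as-is.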

A RetryEvent callback parameter would also help. Right now, retry attempts are silent. Being able to log attempt number, delay, and status code from a caller-provided closure would make debugging rate limit behavior much easier in production.
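One possible shape for that callback, as a sketch only. RetryEvent and retry_with_events are hypothetical and not part of the current API. Errors are reduced to a bare status code, and sleep is injected so the loop can be exercised without real waiting.

```rust
use std::time::Duration;

/// Hypothetical event payload; the crate does not expose this type today.
#[derive(Debug, Clone, PartialEq)]
struct RetryEvent {
    attempt: u32,
    delay: Duration,
    status: u16,
}

/// Sketch of a sync retry loop that reports each retry to `on_retry`.
/// `op` returns Err(status) on failure, `delay_for` supplies the wait for a
/// given attempt, and `sleep` is injected so tests can skip real waiting.
fn retry_with_events<T>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, u16>,
    is_retryable: impl Fn(u16) -> bool,
    delay_for: impl Fn(u32) -> Duration,
    mut on_retry: impl FnMut(&RetryEvent),
    sleep: impl Fn(Duration),
) -> Result<T, u16> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(status) => {
                attempt += 1;
                // Give up on non-retryable codes or when attempts run out.
                if !is_retryable(status) || attempt >= max_attempts {
                    return Err(status);
                }
                let event = RetryEvent {
                    attempt,
                    delay: delay_for(attempt),
                    status,
                };
                // Tell the caller before waiting, so logs show the planned delay.
                on_retry(&event);
                sleep(event.delay);
            }
        }
    }
}
```

In production the caller would pass std::thread::sleep for sleep and a closure that logs the event for on_retry. In tests both can be no-ops or counters.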


Part of the Hermes Agent Challenge sprint. All crates shipped on crates.io.
