Mukunda Rao Katta

Posted on May 25

llm-retry-py: Full-Jitter Retry for LLM Calls in Python

#hermeschallenge #ai #python #agents

The Anthropic API returned a 529. I had never seen a 529 before. The agent I was running was mid-task: it had already done three tool calls, gathered results, and was about to synthesize a final answer. The 529 crashed the whole thing. I had to restart the job manually. Ten minutes later I looked up the status code: 529 means "API temporarily overloaded, please retry." The API was literally asking my code to wait a moment and try again. My code had no idea how to do that.

I added a quick retry loop in the spot where the failure happened. It worked, and then I had the same problem in a different part of the codebase two weeks later. Then again in the async path. I ended up with three slightly different retry implementations, each with slightly different behavior, each missing something the others had. One used time.sleep(2) with no jitter. One had jitter but did not distinguish retryable errors from non-retryable errors. The async one just had asyncio.sleep(1) because I was in a hurry.

The standard shape for LLM retry is well understood: exponential backoff with full jitter, a cap on the maximum wait, a cap on the number of attempts, and a per-provider list of which errors are safe to retry. Writing it fresh every time is a waste. llm-retry-py packages it with presets for Anthropic, OpenAI, Bedrock, and Gemini, and gives you a decorator so you apply it in one line.

Shape of the fix

import anthropic
from llm_retry_py import retry_llm, AnthropicPreset

# The client reads ANTHROPIC_API_KEY from the environment
anthropic_client = anthropic.Anthropic()

@retry_llm(preset=AnthropicPreset, max_attempts=5)
def call_claude(prompt: str) -> str:
    return anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    ).content[0].text

# Async version: same arguments, separate decorator with an async_ prefix
from openai import AsyncOpenAI
from llm_retry_py import async_retry_llm, OpenAIPreset

# The client reads OPENAI_API_KEY from the environment
openai_client = AsyncOpenAI()

@async_retry_llm(preset=OpenAIPreset, max_attempts=4)
async def call_gpt(prompt: str) -> str:
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Custom config: bring your own retryable codes and timing
from llm_retry_py import RetryConfig

config = RetryConfig(
    retryable_status_codes=[429, 500, 529],
    retryable_exception_names=["overloaded_error", "ServiceUnavailable"],
    base_delay_s=1.0,
    max_delay_s=60.0,
    max_attempts=6,
)

@retry_llm(config=config)
def call_custom_provider(prompt: str) -> str:
    ...

# Bedrock and Gemini presets also available
from llm_retry_py import BedrockPreset, GeminiPreset

The preset knows which HTTP status codes and SDK exception type names are safe to retry for that provider. You do not need to look those up.

What it does NOT do

It does not fail over to a different provider when all retry attempts are exhausted. If the endpoint is down for ten minutes and max_attempts=5, it will make five attempts and then raise the last exception. For cross-provider failover after exhausting retries, use llm-fallback-router or llm-fallback-chain. It does not implement a circuit breaker: there is no open/closed/half-open state machine that stops calling the endpoint after a pattern of failures. For that, use llm-circuit-breaker-py. It does not track retry counts across calls or across process restarts: each decorated call gets its own attempt counter that starts at 1. It also makes no guarantees about idempotency: if your LLM call has side effects (writes to a database, sends a message), retrying it may replay those side effects.

Inside the lib

The jitter strategy is "full jitter" from the AWS exponential backoff blog post: sleep = random.uniform(0, min(max_delay_s, base_delay_s * 2 ** attempt)). This is deliberately randomized so that a fleet of workers that all hit a rate limit at the same moment do not all retry at the same moment. Without jitter you get a thundering herd: all workers sleep for the same two seconds, then all hammer the endpoint at the same time, then all get rate-limited again. Full jitter spreads the retries out across the window so the load arrives in a smooth curve instead of a spike.
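To make the formula concrete, here is a minimal standalone sketch of the delay calculation. It is not the library's source: the function names are hypothetical, and it assumes attempt is counted from 0 for the first retry, which the formula above does not specify.

```python
import random

def backoff_ceiling(attempt: int, base_delay_s: float = 1.0, max_delay_s: float = 60.0) -> float:
    """Upper bound of the sleep window: exponential growth, capped at max_delay_s."""
    return min(max_delay_s, base_delay_s * 2 ** attempt)

def full_jitter_delay(attempt: int, base_delay_s: float = 1.0, max_delay_s: float = 60.0) -> float:
    """Full jitter: pick a uniformly random delay anywhere between 0 and the ceiling."""
    return random.uniform(0.0, backoff_ceiling(attempt, base_delay_s, max_delay_s))

# Ceilings for attempts 0..7 with the defaults above (attempt 0 assumed to be the first retry)
ceilings = [backoff_ceiling(a) for a in range(8)]

# One random draw per attempt; every draw stays inside its own window
delays = [full_jitter_delay(a) for a in range(8)]
```

With base_delay_s=1.0 and max_delay_s=60.0 the ceilings are 1, 2, 4, 8, 16, 32, 60, 60 seconds. The cap takes over once the exponential term passes 60, and each worker draws its own point inside the window, which is what spreads a fleet's retries apart.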

The decorator inspects the raised exception on each attempt. For requests and httpx response errors, it checks the HTTP status code against the preset's retryable_status_codes. For provider SDK exceptions, it checks the exception class name and, where the SDK exposes it, the error.type or error code string. If the exception is not in the retryable list, the decorator re-raises it immediately without any delay or additional attempts. Only retryable exceptions trigger the wait-and-retry loop.
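The inspection logic described above could look roughly like the sketch below. This is an illustration under stated assumptions, not the library's code: is_retryable and retry_sketch are hypothetical names, the attribute names probed (status_code, response.status_code, type) are guesses at what HTTP-shaped exceptions commonly expose, and the retryable sets reuse the values from the custom config example.

```python
import random
import time
from functools import wraps

# Assumed values, copied from the custom RetryConfig example earlier in the post
RETRYABLE_STATUS_CODES = {429, 500, 529}
RETRYABLE_EXCEPTION_NAMES = {"overloaded_error", "ServiceUnavailable"}

def is_retryable(exc: BaseException) -> bool:
    """Classify an exception by status code, class name, or error type string."""
    status = getattr(exc, "status_code", None)
    if status is None:
        # requests/httpx style errors usually carry the response object instead
        status = getattr(getattr(exc, "response", None), "status_code", None)
    if status in RETRYABLE_STATUS_CODES:
        return True
    if type(exc).__name__ in RETRYABLE_EXCEPTION_NAMES:
        return True
    return getattr(exc, "type", None) in RETRYABLE_EXCEPTION_NAMES

def retry_sketch(max_attempts=5, base_delay_s=1.0, max_delay_s=60.0, sleep=time.sleep):
    """Hypothetical decorator: retry only retryable errors, with full-jitter waits."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    out_of_attempts = attempt == max_attempts - 1
                    if out_of_attempts or not is_retryable(exc):
                        raise  # non-retryable or exhausted: surface immediately
                    sleep(random.uniform(0, min(max_delay_s, base_delay_s * 2 ** attempt)))
        return wrapper
    return decorator
```

The sleep parameter is injectable only so the sketch can be exercised without real waiting. A non-retryable exception, such as a ValueError with no status code, leaves the wrapper on the first attempt with no delay.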

The async variant uses asyncio.sleep instead of time.sleep. The rest of the logic is identical: same jitter formula, same retryable-code check, same exception inspection. The two decorators (retry_llm and async_retry_llm) are separate so you apply the right one for your calling context rather than having the library guess.
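A rough sketch of what an async counterpart might look like, to show that the only structural differences are the awaits. The name async_retry_sketch and the is_retryable predicate parameter are hypothetical, introduced here for illustration. The real decorator takes a preset or config instead.

```python
import asyncio
import random
from functools import wraps

def async_retry_sketch(is_retryable, max_attempts=4, base_delay_s=1.0, max_delay_s=60.0):
    """Hypothetical async retry decorator; is_retryable is a caller-supplied predicate."""
    def decorator(fn):
        @wraps(fn)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return await fn(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_attempts - 1 or not is_retryable(exc):
                        raise
                    delay = random.uniform(0, min(max_delay_s, base_delay_s * 2 ** attempt))
                    # asyncio.sleep yields to the event loop, so other tasks keep running
                    await asyncio.sleep(delay)
        return wrapper
    return decorator
```

Using time.sleep inside a coroutine would block the whole event loop for the duration of the wait, which is one reason to keep the two decorators separate.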

The test suite has 31 tests covering max-attempts exhaustion, non-retryable exception pass-through, jitter bounds (zero to max_delay_s), async event loop behavior, each of the four provider presets, custom config composition, and the RetryConfig API.

When useful

Production agents where a crash on a transient 429 or 529 is unacceptable and you need it handled automatically
Batch jobs where retrying a few calls is much cheaper than restarting the whole run when one call bounces
Any codebase that currently has no retry logic and needs the standard pattern added quickly
Testing: the RetryConfig API lets you set max_attempts=1 in tests to disable retrying entirely without mocking time.sleep
Situations where you want per-provider behavior without reading through each provider's error code documentation yourself

When not useful

LLM calls that are not idempotent and must not be repeated because they have side effects
Situations where you need a circuit breaker that stops retrying after a sustained outage pattern
Providers where you need failover to a backup endpoint rather than retry of the same endpoint
Anywhere you need deterministic delay intervals without randomness, for example in tests that measure timing

Install

pip install llm-retry-py

Zero dependencies. Python 3.9+. No requests, httpx, or provider SDK required at import time: the decorator inspects exceptions by class name and attribute, so it works with any provider SDK that raises HTTP-shaped exceptions.

Siblings

Library	Language	What it does
llm-retry	Rust	Original Rust implementation with the same jitter strategy
llm-circuit-breaker-py	Python	Circuit breaker for LLM calls, open/closed/half-open state machine
llm-fallback-router	Python	Multi-provider failover after exhausting retries
llm-fallback-chain	Python	Ordered sync and async failover chain across providers
agent-deadline	Python	Cooperative per-task time deadline for agent loops
llm-rate-limit-bucket	Python	Token-bucket rate limiter to prevent hitting rate limits in the first place

What's next

The next feature I want is an on_retry callback so callers can log each attempt with the attempt number, the exception, and the next sleep duration. Right now the retry behavior is silent. A structured log on each retry would make it much easier to see in production how often you are hitting rate limits and for how long. If you have a provider preset you want added, open an issue and include the status codes and exception names.

Part of the Hermes Agent Challenge sprint. Source at github.com/MukundaKatta/llm-retry-py. PyPI: pip install llm-retry-py.
