You hit a 429. Your agent is retrying with exponential backoff. The backoff is 2 seconds, 4 seconds, 8 seconds. Meanwhile, the rest of your system is still sending requests. Each new request also hits 429. Each new request also backs off. Your API key is in a 429 spiral.
The root cause is not the retry strategy. It is that your outbound request rate exceeded what the provider allows. You need rate limiting on the sending side, not just better retry behavior.
llm-rate-limit-bucket is a token-bucket rate limiter for outbound LLM calls.
The Shape of the Fix
from anthropic import AsyncAnthropic
from llm_rate_limit_bucket import RateLimitBucket

anthropic_client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

# Anthropic's claude-sonnet-4-6 tier-1 limit: 50 RPM
bucket = RateLimitBucket(rate=50, per=60)  # 50 requests per 60 seconds

async def call_llm_rate_limited(**kwargs):
    await bucket.acquire()  # Wait until a token is available
    return await anthropic_client.messages.create(**kwargs)  # returns a Message object
Before every LLM call, acquire() waits until the rate limit allows the request. If you are under the limit, it returns immediately. If you are at the limit, it waits the appropriate amount of time before returning.
What It Does NOT Do
llm-rate-limit-bucket does not share state across processes. The bucket is in-memory per process. If you have five workers, each has its own 50 RPM bucket, so the combined rate can reach 250 RPM. For shared rate limiting across processes, use Redis with the INCR+EXPIRE pattern or a distributed rate limiter.
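As a rough illustration of the INCR+EXPIRE pattern mentioned above, here is a minimal fixed-window sketch. It is not part of llm-rate-limit-bucket: the function name allow_request, the key layout, and the now parameter are all hypothetical, and client is only assumed to expose redis-py style incr(name) and expire(name, seconds) methods.

```python
import time
from typing import Optional


def allow_request(client, key: str, limit: int, window_s: int,
                  now: Optional[float] = None) -> bool:
    """Fixed-window counter shared by every process that talks to `client`.

    Returns True if this request fits in the current window, False otherwise.
    `now` exists only so the window can be pinned in tests (hypothetical parameter).
    """
    current = time.time() if now is None else now
    # One counter per window; the window index is part of the key.
    window = int(current // window_s)
    counter_key = f"{key}:{window}"
    count = client.incr(counter_key)  # INCR is atomic across workers
    if count == 1:
        # First hit in this window: let the counter expire on its own later.
        client.expire(counter_key, window_s * 2)
    return count <= limit
```

Unlike the in-process bucket, this sketch rejects instead of waiting, so the caller would sleep until the next window and try again. A fixed window can also admit up to twice the limit across a window boundary, which is one reason a dedicated distributed rate limiter may be the better choice.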
It does not adapt to 429 responses automatically. When you receive a 429, you should still back off via llm-retry. The rate limiter prevents you from reaching the limit in the first place; the retry handles cases where you do hit it anyway.
It does not track token-per-minute (TPM) limits for you. A bucket limits one quantity per time window, and by default that quantity is requests (RPM). Providers also impose TPM limits, and large requests (high token count) can exhaust TPM even if RPM stays under the limit. For TPM-aware rate limiting, you have to estimate token counts yourself and charge them to a separate bucket.
Inside the Library
The token bucket algorithm: a bucket fills at a fixed rate (tokens/second). Each request consumes one token. If the bucket is empty, wait until a token is available:
import asyncio
import time

class RateLimitBucket:
    def __init__(self, rate: float, per: float = 1.0):
        self._rate = rate  # tokens added per window (also the bucket capacity)
        self._per = per  # window size in seconds
        self._tokens = rate  # start full
        self._last_refill = time.monotonic()
        self._lock = asyncio.Lock()

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self._last_refill
        new_tokens = elapsed * (self._rate / self._per)
        self._tokens = min(self._rate, self._tokens + new_tokens)
        self._last_refill = now

    async def acquire(self, tokens: float = 1.0) -> None:
        if tokens > self._rate:
            # The bucket never holds more than `rate` tokens, so this would wait forever
            raise ValueError("tokens exceeds bucket capacity")
        while True:
            async with self._lock:
                self._refill()
                if self._tokens >= tokens:
                    self._tokens -= tokens
                    return
                # Calculate wait time
                deficit = tokens - self._tokens
                wait = deficit / (self._rate / self._per)
            # Sleep outside the lock so other callers are not blocked and a
            # cancelled sleep cannot leave the lock in an inconsistent state
            await asyncio.sleep(wait)

    def sync_acquire(self) -> None:
        """Synchronous version for non-async contexts (takes no lock, so not thread-safe)."""
        while True:
            self._refill()
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return
            deficit = 1.0 - self._tokens
            wait = deficit / (self._rate / self._per)
            time.sleep(wait)
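The bucket starts full, so the first calls (up to rate of them) go through immediately. Once the bucket is empty, each call sleeps just long enough for its missing tokens to refill, which is 1.2 seconds per request at 50 RPM. This holds the sustained rate at the limit and avoids over-waiting. Because the full bucket is spent up front, the first window can carry close to twice the nominal rate; if your provider enforces the limit strictly, configure a rate somewhat below it.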
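To make the refill arithmetic concrete, the sketch below replays the same math against a simulated clock instead of real time. SimBucket is a hypothetical helper written for this illustration, not a class from the library.

```python
class SimBucket:
    """Token-bucket refill math driven by a manual clock (illustration only)."""

    def __init__(self, rate: float, per: float):
        self.rate, self.per = rate, per
        self.tokens = rate  # start full
        self.now = 0.0  # simulated seconds since start

    def take(self) -> float:
        """Consume one token and return how long the caller had to wait."""
        wait = 0.0
        if self.tokens < 1.0:
            # Same formula as acquire(): deficit divided by tokens per second.
            wait = (1.0 - self.tokens) / (self.rate / self.per)
            self.now += wait
            self.tokens = min(self.rate, self.tokens + wait * self.rate / self.per)
        self.tokens -= 1.0
        return wait


bucket = SimBucket(rate=50, per=60)
first_wait = None
sent_in_first_minute = 0
while True:
    wait = bucket.take()
    if bucket.now >= 60.0:
        break  # this request lands at or after the 60 second mark
    if wait > 0 and first_wait is None:
        first_wait = wait
    sent_in_first_minute += 1

print(f"first non-zero wait: {first_wait:.2f}s")
print(f"requests sent in the first 60s: {sent_in_first_minute}")
```

At 50 RPM the first non-zero wait comes out at 1.2 seconds, and the count for the first minute lands near 100 rather than 50: the initial full bucket plus one minute of refill.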
For TPM-aware limiting, pass a token count to acquire():
# Anthropic TPM limit: 100K tokens/minute
tpm_bucket = RateLimitBucket(rate=100_000, per=60)
# Before a request with estimated 5K tokens:
await tpm_bucket.acquire(tokens=5_000)
When to Use It
Use it in any multi-agent or multi-worker system where concurrent workers share an API key. Each worker alone may stay under the RPM limit, but five workers at the same rate can easily hit it. A shared bucket (in-process across coroutines, or Redis for cross-process) prevents a combined burst.
Use it in event-driven agents that process queues. If your queue bursts (100 events arrive at once), and each event triggers an LLM call, you will 429 without rate limiting. The bucket spreads the calls out to stay under the RPM limit.
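The queue-burst case can be sketched as follows. To keep the example runnable on its own, it defines a compact stand-in for RateLimitBucket instead of importing the library; handle_event and drain_burst are hypothetical names, and the LLM call is replaced by a timestamp.

```python
import asyncio
import time


class RateLimitBucket:
    """Compact stand-in for the library class so the sketch is self-contained."""

    def __init__(self, rate: float, per: float = 1.0):
        self._rate, self._per = rate, per
        self._tokens = rate
        self._last = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        while True:
            async with self._lock:
                now = time.monotonic()
                refill = (now - self._last) * self._rate / self._per
                self._tokens = min(self._rate, self._tokens + refill)
                self._last = now
                if self._tokens >= 1.0:
                    self._tokens -= 1.0
                    return
                wait = (1.0 - self._tokens) * self._per / self._rate
            await asyncio.sleep(wait)  # sleep without holding the lock


async def handle_event(bucket: RateLimitBucket, started: float) -> float:
    await bucket.acquire()
    # The real LLM call would go here; the sketch only records when it was allowed.
    return time.monotonic() - started


async def drain_burst(n_events: int, rate: float, per: float) -> list:
    bucket = RateLimitBucket(rate=rate, per=per)  # one bucket shared by all handlers
    started = time.monotonic()
    tasks = [handle_event(bucket, started) for _ in range(n_events)]
    return sorted(await asyncio.gather(*tasks))


if __name__ == "__main__":
    # 10 events arrive at once; the limit is 5 requests per 0.5 seconds.
    for offset in asyncio.run(drain_burst(10, rate=5, per=0.5)):
        print(f"allowed after {offset:.2f}s")
```

The first five handlers pass immediately and the rest are released roughly 0.1 seconds apart, because every coroutine draws from the same bucket.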
Use it for testing and development against production API keys. Rate limiting your test runs prevents them from consuming quota that production traffic needs.
Skip it for single-threaded, single-request agents with well-spaced requests. If requests arrive one at a time and you have spare RPM headroom, rate limiting adds latency without preventing any real problem.
Install
pip install git+https://github.com/MukundaKatta/llm-rate-limit-bucket
# Or from PyPI
pip install llm-rate-limit-bucket
from anthropic import AsyncAnthropic
from llm_rate_limit_bucket import RateLimitBucket

anthropic_client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

# Match your API tier limits
rpm_bucket = RateLimitBucket(rate=50, per=60)  # 50 RPM
tpm_bucket = RateLimitBucket(rate=40_000, per=60)  # 40K TPM

async def call_anthropic_safe(messages: list[dict], model: str, max_tokens: int = 1024):
    # Estimate tokens for TPM limit check (rough heuristic: ~4 characters per token)
    estimated_tokens = sum(
        len(str(m.get("content", ""))) // 4
        for m in messages
    ) + max_tokens

    # Acquire both RPM and TPM limits
    await rpm_bucket.acquire()
    await tpm_bucket.acquire(tokens=estimated_tokens)

    return await anthropic_client.messages.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
    )
Sibling Libraries
| Library | What it solves |
|---|---|
| llm-retry | Exponential backoff retry for 429s that still occur |
| llm-circuit-breaker-py | Open circuit when provider is overwhelmed |
| llm-fallback-chain | Route to another provider when rate limited |
| token-budget-pool | Track cumulative token usage against a budget |
| agent-budget-coordinator | Coordinate rate limits with other budget constraints |
The reliability stack: llm-rate-limit-bucket prevents hitting limits, llm-retry handles limits that slip through, llm-circuit-breaker-py opens the circuit when the provider is down, llm-fallback-chain routes to an alternative.
What's Next
Adaptive rate adjustment: when a 429 is received, automatically lower the effective rate by 20% for 60 seconds. Recover slowly as requests succeed. This makes the bucket self-tuning instead of requiring manual calibration.
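One way the adaptive behavior could look is sketched below. This is not the library's API: the AdaptiveRate class, its method names, and the 5% recovery step are assumptions made for illustration. Only the 20% cut and the 60 second hold come from the description above.

```python
import time
from typing import Callable


class AdaptiveRate:
    """Tracks an effective rate: cut on a 429, restore step by step on success."""

    def __init__(self, base_rate: float, cut: float = 0.2, hold_s: float = 60.0,
                 recover_step: float = 0.05,
                 clock: Callable[[], float] = time.monotonic):
        self.base_rate = base_rate
        self.cut = cut  # fraction removed after a 429
        self.hold_s = hold_s  # seconds to stay at the reduced rate
        self.recover_step = recover_step  # assumed recovery per success
        self._clock = clock
        self._factor = 1.0  # fraction of base_rate currently allowed
        self._hold_until = 0.0  # no recovery before this time

    @property
    def effective_rate(self) -> float:
        return self.base_rate * self._factor

    def on_rate_limited(self) -> None:
        """Call when the provider returns a 429."""
        self._factor = 1.0 - self.cut
        self._hold_until = self._clock() + self.hold_s

    def on_success(self) -> None:
        """Call after a successful request; recovers only once the hold has passed."""
        if self._clock() >= self._hold_until:
            self._factor = min(1.0, self._factor + self.recover_step)
```

A bucket would then read effective_rate in its refill step instead of a fixed rate. How that wiring should look inside RateLimitBucket is left open here.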
Multi-provider buckets: RateLimitRegistry({"anthropic": {...}, "openai": {...}}) that holds one bucket per provider. When routing to a provider, acquire from that provider's bucket. When a provider is exhausted, the registry can redirect to another with remaining capacity.
Burst mode: RateLimitBucket(rate=50, per=60, burst=100) that allows a burst of up to 100 immediate requests while still averaging to 50 RPM over 60 seconds. This models provider limits more accurately (providers often allow short bursts above the sustained rate).
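The core of burst mode is separating bucket capacity from refill rate. The following is a minimal sketch under that assumption; BurstBucket and try_acquire are hypothetical names, and the non-blocking try_acquire is used only to keep the example short.

```python
import time
from typing import Callable, Optional


class BurstBucket:
    """Token bucket whose capacity (burst) is independent of its refill rate."""

    def __init__(self, rate: float, per: float, burst: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic):
        self._refill_per_s = rate / per  # sustained rate in tokens per second
        self._capacity = burst if burst is not None else rate
        self._tokens = self._capacity  # start full
        self._clock = clock
        self._last = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take tokens if available right now; never blocks."""
        now = self._clock()
        refill = (now - self._last) * self._refill_per_s
        self._tokens = min(self._capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

With rate=50, per=60, burst=100, the first 100 calls succeed immediately, after which tokens return at the sustained 50 per minute.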
Built as part of the agent-stack family: composable Python primitives for production LLM agents.