Token-Bucket Rate Limiting for LLM Calls: Don't 429 Your Own Agent


You hit a 429. Your agent is retrying with exponential backoff. The backoff is 2 seconds, 4 seconds, 8 seconds. Meanwhile, the rest of your system is still sending requests. Each new request also hits 429. Each new request also backs off. Your API key is in a 429 spiral.

The root cause is not the retry strategy. It is that your outbound request rate exceeded what the provider allows. You need rate limiting on the sending side, not just better retry behavior.

llm-rate-limit-bucket is a token-bucket rate limiter for outbound LLM calls.

The Shape of the Fix

from anthropic import AsyncAnthropic
from anthropic.types import Message
from llm_rate_limit_bucket import RateLimitBucket

anthropic_client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

# Anthropic's claude-sonnet-4-6 tier-1 limit: 50 RPM
bucket = RateLimitBucket(rate=50, per=60)  # 50 requests per 60 seconds

async def call_llm_rate_limited(**kwargs) -> Message:
    await bucket.acquire()  # Wait until a token is available
    return await anthropic_client.messages.create(**kwargs)

Before every LLM call, acquire() waits until the rate limit allows the request. If you are under the limit, it returns immediately. If you are at the limit, it waits the appropriate amount of time before returning.

What It Does NOT Do

llm-rate-limit-bucket does not share state across processes. The bucket is in-memory per process. If you have five workers, each has its own bucket at 50 RPM, so the combined rate can reach 250 RPM. Either divide the limit between workers (five buckets at 10 RPM each) or share the count across processes, for example with a Redis fixed-window counter (the INCR+EXPIRE pattern) or a distributed rate limiter.
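The INCR+EXPIRE pattern can be sketched as a fixed-window counter. This is a minimal illustration, not part of llm-rate-limit-bucket: try_acquire_fixed_window, the key format, and InMemoryCounterStore are hypothetical names, and the in-memory store only stands in for a Redis client so the example runs without a server.

```python
import time
from typing import Optional


class InMemoryCounterStore:
    """Hypothetical stand-in exposing the two Redis calls used below (incr, expire)."""

    def __init__(self) -> None:
        self.counts = {}
        self.ttls = {}

    def incr(self, key: str) -> int:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key: str, seconds: int) -> bool:
        self.ttls[key] = seconds
        return True


def try_acquire_fixed_window(store, name: str, limit: int,
                             window_s: int = 60, now: Optional[float] = None) -> bool:
    """Return True if this request fits in the current window's shared budget."""
    now = time.time() if now is None else now
    window_id = int(now // window_s)           # every worker computes the same window
    key = f"ratelimit:{name}:{window_id}"
    count = store.incr(key)                    # INCR is atomic in Redis
    if count == 1:
        # First hit in this window: set a TTL so old window keys expire on their own
        store.expire(key, window_s * 2)
    return count <= limit
```

A redis-py client has incr and expire methods with compatible call shapes, so it could be passed as store, assuming a reachable Redis server. A caller that gets False should sleep and try again. Note that a fixed window can admit up to twice the limit across a window boundary, so this is coarser than a token bucket.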

It does not adapt to 429 responses automatically. When you receive a 429, you should still back off via llm-retry. The rate limiter keeps you from reaching the limit in the first place; the retry handles the cases where you hit it anyway.
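How the two layers fit together can be sketched with a hand-rolled loop. The article does not show llm-retry's API, so nothing here uses it: call_with_limit_and_retry and RateLimitedError are hypothetical names, and limiter is any object with an async acquire() method, such as the bucket above. With the Anthropic SDK, the exception to catch would be anthropic.RateLimitError.

```python
import asyncio


class RateLimitedError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


async def call_with_limit_and_retry(limiter, call, max_attempts: int = 4,
                                    base_delay: float = 2.0, sleep=asyncio.sleep):
    """Acquire a token before every attempt; back off exponentially on a 429."""
    for attempt in range(max_attempts):
        await limiter.acquire()  # retries spend tokens too, so they cannot stampede
        try:
            return await call()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the 429 to the caller
            await sleep(base_delay * 2 ** attempt)  # 2 s, 4 s, 8 s, ...
```

The point of the design is that a retry goes back through the limiter instead of around it, which is what breaks the 429 spiral described at the top.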

It does not count tokens for you. By default each acquire() consumes one unit, which models requests per minute (RPM). Providers also impose tokens-per-minute (TPM) limits, and large requests can exhaust TPM even while RPM stays under the limit. For TPM-aware rate limiting you need a second bucket and your own per-request token estimate.

Inside the Library

The token bucket algorithm: a bucket fills at a fixed rate (tokens/second). Each request consumes one token. If the bucket is empty, wait until a token is available:

import asyncio
import time

class RateLimitBucket:
    def __init__(self, rate: float, per: float = 1.0):
        self._rate = rate        # tokens added per window
        self._per = per          # window size in seconds
        self._tokens = rate      # start full
        self._last_refill = time.monotonic()
        self._lock = asyncio.Lock()

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self._last_refill
        new_tokens = elapsed * (self._rate / self._per)
        self._tokens = min(self._rate, self._tokens + new_tokens)
        self._last_refill = now

    async def acquire(self, tokens: float = 1.0) -> None:
        if tokens > self._rate:
            # Capacity is `rate`, so a larger request could never be satisfied
            raise ValueError("tokens exceeds bucket capacity")
        while True:
            async with self._lock:
                self._refill()
                if self._tokens >= tokens:
                    self._tokens -= tokens
                    return

                # Not enough tokens: compute how long the deficit takes to refill
                deficit = tokens - self._tokens
                wait = deficit / (self._rate / self._per)

            # Sleep outside the lock so other waiters can proceed, then re-check
            await asyncio.sleep(wait)

    def sync_acquire(self) -> None:
        """Synchronous version for non-async contexts. Takes no lock, so it is not thread-safe."""
        while True:
            self._refill()
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return
            deficit = 1.0 - self._tokens
            wait = deficit / (self._rate / self._per)
            time.sleep(wait)

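The bucket starts full, so early calls go through immediately. When the bucket is empty, a call sleeps just long enough for its deficit to refill (1.2 seconds per token at 50 RPM), which avoids over-waiting. Capacity equals rate, so at most rate requests can go out back to back. A full bucket plus steady refill can still admit close to twice the nominal rate during the first window, so leave headroom if your provider enforces a strict per-minute count.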
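The start-full behavior is easy to check with arithmetic. The sketch below is a model of the algorithm described above, not library code: admitted_within is a hypothetical helper that assumes callers always have a request waiting, and it uses exact fractions to avoid floating-point drift.

```python
from fractions import Fraction


def admitted_within(rate: int, per: int, horizon_s: int) -> int:
    """Count requests a full bucket admits in [0, horizon_s] under saturating demand."""
    interval = Fraction(per, rate)  # seconds needed to refill one token
    admitted = rate                 # the initial full bucket drains at t = 0
    t = interval
    while t <= horizon_s:           # afterwards, one request per refill interval
        admitted += 1
        t += interval
    return admitted


print(admitted_within(50, 60, 0))    # 50: the initial burst
print(admitted_within(50, 60, 60))   # 100: burst plus 50 refilled tokens
print(admitted_within(50, 60, 120))  # 150: 50 per minute from then on
```

The long-run average settles at 50 RPM, but the first 60 seconds can carry 100 requests. If that matters for your tier, one option is to configure the bucket below the provider's limit.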

For TPM-aware limiting, pass a token count to acquire():

# Anthropic TPM limit: 100K tokens/minute
tpm_bucket = RateLimitBucket(rate=100_000, per=60)

# Before a request with estimated 5K tokens:
await tpm_bucket.acquire(tokens=5_000)

When to Use It

Use it in any multi-agent or multi-worker system where concurrent workers share an API key. Each worker alone may stay under the RPM limit, but five workers sending at that rate will exceed it together. A shared bucket (in-process across coroutines, or a Redis-backed limiter across processes) keeps the combined rate under the limit.

Use it in event-driven agents that process queues. If your queue bursts (100 events arrive at once), and each event triggers an LLM call, you will 429 without rate limiting. The bucket spreads the calls out to stay under the RPM limit.

Use it for testing and development against production API keys. Rate limiting your test runs prevents them from consuming quota that production traffic needs.

Skip it for single-threaded, single-request agents with well-spaced requests. If requests arrive one at a time and you have spare RPM headroom, rate limiting adds latency without preventing any real problem.

Install

pip install git+https://github.com/MukundaKatta/llm-rate-limit-bucket

# Or from PyPI
pip install llm-rate-limit-bucket

from anthropic import AsyncAnthropic
from anthropic.types import Message
from llm_rate_limit_bucket import RateLimitBucket

anthropic_client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

# Match your API tier limits
rpm_bucket = RateLimitBucket(rate=50, per=60)    # 50 RPM
tpm_bucket = RateLimitBucket(rate=40_000, per=60)  # 40K TPM

async def call_anthropic_safe(messages: list[dict], model: str, max_tokens: int = 1024) -> Message:
    # Rough estimate for the TPM check: about 4 characters per input token,
    # plus the requested output budget (max_tokens)
    estimated_tokens = sum(
        len(str(m.get("content", ""))) // 4
        for m in messages
    ) + max_tokens

    # Acquire from both buckets: one request, then the estimated LLM tokens
    await rpm_bucket.acquire()
    await tpm_bucket.acquire(tokens=estimated_tokens)

    return await anthropic_client.messages.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
    )

Sibling Libraries

Library	What it solves
`llm-retry`	Exponential backoff retry for 429s that still occur
`llm-circuit-breaker-py`	Open circuit when provider is overwhelmed
`llm-fallback-chain`	Route to another provider when rate limited
`token-budget-pool`	Track cumulative token usage against a budget
`agent-budget-coordinator`	Coordinate rate limits with other budget constraints

The reliability stack: llm-rate-limit-bucket prevents hitting limits, llm-retry handles limits that slip through, llm-circuit-breaker-py opens the circuit when the provider is down, llm-fallback-chain routes to an alternative.

What's Next

Adaptive rate adjustment: when a 429 is received, automatically lower the effective rate by 20% for 60 seconds. Recover slowly as requests succeed. This makes the bucket self-tuning instead of requiring manual calibration.
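One possible shape for that adjustment, as a sketch only: AdaptiveRate and its parameters are hypothetical, the 5% recovery step is an assumed value for "recover slowly", and wiring effective_rate into a bucket is left open because the article shows no hook for it. The clock is injectable so the behavior can be tested without waiting.

```python
import time


class AdaptiveRate:
    """Tracks an effective rate that drops after a 429 and recovers gradually."""

    def __init__(self, base_rate: float, cut: float = 0.20, hold_s: float = 60.0,
                 recover_step: float = 0.05, clock=time.monotonic):
        self.base_rate = base_rate
        self.effective_rate = base_rate
        self._cut = cut                    # fraction removed on each 429
        self._hold_s = hold_s              # seconds to stay lowered before recovering
        self._recover_step = recover_step  # fraction of base_rate restored per success
        self._clock = clock
        self._hold_until = 0.0

    def on_rate_limited(self) -> None:
        """Lower the rate and restart the hold period."""
        self.effective_rate *= 1 - self._cut
        self._hold_until = self._clock() + self._hold_s

    def on_success(self) -> None:
        """After the hold period, step back toward base_rate without overshooting."""
        if self._clock() < self._hold_until:
            return
        step = self.base_rate * self._recover_step
        self.effective_rate = min(self.base_rate, self.effective_rate + step)
```

With base_rate=50, one 429 lowers the effective rate to 40 for 60 seconds; after that, each success restores 2.5 RPM until the rate is back at 50.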

Multi-provider buckets: RateLimitRegistry({"anthropic": {...}, "openai": {...}}) that holds one bucket per provider. When routing to a provider, acquire from that provider's bucket. When a provider is exhausted, the registry can redirect to another with remaining capacity.

Burst mode: RateLimitBucket(rate=50, per=60, burst=100) that allows a burst of up to 100 immediate requests while still averaging to 50 RPM over 60 seconds. This models provider limits more accurately (providers often allow short bursts above the sustained rate).
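The core of burst mode is separating capacity from refill rate. The sketch below illustrates that idea under assumptions: BurstBucket and try_acquire are hypothetical names rather than the planned API, the method returns False instead of waiting, and the clock is injectable for testing.

```python
import time


class BurstBucket:
    """Token bucket whose capacity (burst) is independent of its refill rate."""

    def __init__(self, rate: float, per: float, burst: float, clock=time.monotonic):
        self._refill_per_s = rate / per  # sustained rate in tokens per second
        self._capacity = burst           # maximum tokens the bucket can hold
        self._tokens = burst             # start full: the first `burst` calls pass at once
        self._clock = clock
        self._last = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take tokens if available; return False instead of waiting."""
        now = self._clock()
        refilled = (now - self._last) * self._refill_per_s
        self._tokens = min(self._capacity, self._tokens + refilled)
        self._last = now
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

With rate=50, per=60, burst=100, the first 100 calls succeed immediately and later calls are admitted at 50 per minute. Setting burst equal to rate reproduces the capacity of the current bucket.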

Built as part of the agent-stack family: composable Python primitives for production LLM agents.