Every enterprise AI deployment eventually hits a rate limiting problem. The question is whether you designed for it or discovered it during a production incident.
The discovery version usually looks like this: a batch process kicks off, hammers the LLM API with hundreds of concurrent requests, hits rate limits, and either crashes, produces partial results that nobody knows are partial, or creates a thundering herd that takes down unrelated services. The post-mortem is always the same: "we need to add rate limiting."
Here is how to design for it from the start.
Understanding what you are actually rate limited on
LLM APIs rate limit on multiple dimensions simultaneously, and hitting any one of them causes failures. The main ones are:
Requests per minute (RPM): total API calls per minute, regardless of size.
Tokens per minute (TPM): total tokens processed across all requests in a minute.
Tokens per day (TPD): a daily token budget, which some providers and tiers enforce.
Concurrent requests: the number of requests in flight at any moment.
Your backpressure strategy needs to handle all of these, not just the one you hit first.
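To see why more than one dimension matters, it can help to work out which limit binds for a given workload. The sketch below is illustrative only: `effective_rpm` and its parameters are hypothetical names, and the numbers are the example defaults used later in this article, not any provider's actual limits.

```python
def effective_rpm(rpm_limit: int, tpm_limit: int, avg_tokens_per_request: int):
    """Return (sustainable requests per minute, name of the binding limit)."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    # How many requests per minute the token budget alone would allow.
    tpm_bound = tpm_limit / avg_tokens_per_request
    if tpm_bound < rpm_limit:
        return tpm_bound, "TPM"
    return float(rpm_limit), "RPM"


# Small requests: the token budget allows 180/min, so RPM is the ceiling.
small = effective_rpm(60, 90_000, 500)
# Large requests: the token budget allows only 30/min, so TPM binds first.
large = effective_rpm(60, 90_000, 3_000)
```

The same account can be RPM-bound for one job and TPM-bound for another, which is why a limiter that tracks only one dimension tends to fail when the workload mix changes.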
import asyncio
import time
from dataclasses import dataclass


@dataclass
class RateLimitConfig:
    # TPD is not modeled here; a daily budget needs a longer-lived counter.
    requests_per_minute: int = 60
    tokens_per_minute: int = 90000
    max_concurrent: int = 10
    retry_base_delay: float = 1.0   # seconds
    retry_max_delay: float = 60.0   # seconds
    retry_max_attempts: int = 5


class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.monotonic()

    def consume(self, tokens: int) -> bool:
        # Refill based on elapsed time, capped at capacity.
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

    async def wait_and_consume(self, tokens: int):
        # A request larger than the bucket can never succeed; fail fast
        # instead of polling forever.
        if tokens > self.capacity:
            raise ValueError(f"requested {tokens} tokens exceeds bucket capacity {self.capacity}")
        while not self.consume(tokens):
            await asyncio.sleep(0.1)
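The polling loop above wakes every 100 ms until enough tokens have accumulated. One alternative, sketched here under the assumption of the same linear refill model, is to compute how long the bucket needs and sleep once. `seconds_until_available` is a hypothetical helper, not part of the class above.

```python
def seconds_until_available(current_tokens: float, needed: float, refill_rate: float) -> float:
    """Seconds until a bucket refilling at `refill_rate` tokens/sec holds `needed` tokens."""
    if refill_rate <= 0:
        raise ValueError("refill_rate must be positive")
    deficit = needed - current_tokens
    # No wait is required if the bucket already holds enough.
    return max(0.0, deficit / refill_rate)


# A 90000 tokens-per-minute bucket refills at 1500 tokens per second.
wait_s = seconds_until_available(current_tokens=0.0, needed=3000, refill_rate=1500.0)
no_wait = seconds_until_available(current_tokens=500.0, needed=100, refill_rate=1.0)
```

With several coroutines waiting on the same bucket, the caller would still need to re-check after sleeping, since another waiter may have taken the tokens first.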
The async rate-limited client
import asyncio
from openai import AsyncOpenAI, RateLimitError


class RateLimitedLLMClient:
    def __init__(self, config: RateLimitConfig):
        self.config = config
        self.client = AsyncOpenAI()
        self.semaphore = asyncio.Semaphore(config.max_concurrent)
        self.rpm_bucket = TokenBucket(
            capacity=config.requests_per_minute,
            refill_rate=config.requests_per_minute / 60
        )
        self.tpm_bucket = TokenBucket(
            capacity=config.tokens_per_minute,
            refill_rate=config.tokens_per_minute / 60
        )

    async def complete(
        self,
        messages: list,
        estimated_tokens: int = 1000,
        model: str = "gpt-4o"
    ) -> str:
        # Wait for request and token budget before taking a concurrency slot
        await self.rpm_bucket.wait_and_consume(1)
        await self.tpm_bucket.wait_and_consume(estimated_tokens)
        async with self.semaphore:
            for attempt in range(self.config.retry_max_attempts):
                try:
                    response = await self.client.chat.completions.create(
                        model=model,
                        messages=messages
                    )
                    # content can be None (for example on tool calls)
                    return response.choices[0].message.content or ""
                except RateLimitError:
                    # Catch the SDK's 429 error type instead of matching on the
                    # message text; any other exception propagates unchanged.
                    delay = min(
                        self.config.retry_base_delay * (2 ** attempt),
                        self.config.retry_max_delay
                    )
                    await asyncio.sleep(delay)
            raise RuntimeError(f"Rate limit retry exhausted after {self.config.retry_max_attempts} attempts")
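The retry loop above uses plain exponential backoff, so clients that were throttled together tend to retry together. A common mitigation is to add random jitter. The following is a minimal sketch of the "full jitter" variant; `backoff_delay` is a hypothetical helper, and its defaults simply mirror `retry_base_delay` and `retry_max_delay` from the config.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# A seeded generator makes the schedule reproducible for testing.
seeded = random.Random(0)
delays = [backoff_delay(attempt, rng=seeded) for attempt in range(8)]
```

Swapping this in for the fixed `delay` calculation would spread retries across the backoff window instead of clustering them at its end, at the cost of less predictable individual wait times.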
Queue-based backpressure for batch workloads
For batch processing where you are sending many requests, a queue with a controlled worker pool is cleaner than managing concurrency ad hoc.
import asyncio
from asyncio import Queue


async def batch_process_with_backpressure(
    items: list,
    process_func,
    client: RateLimitedLLMClient,
    max_workers: int = 5
) -> list:
    queue = Queue()
    results = [None] * len(items)
    for i, item in enumerate(items):
        await queue.put((i, item))

    async def worker():
        while True:
            try:
                idx, item = queue.get_nowait()
            except asyncio.QueueEmpty:
                break
            try:
                results[idx] = await process_func(item, client)
            except Exception as e:
                results[idx] = {"error": str(e), "item": item}
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(max_workers)]
    await asyncio.gather(*workers)
    return results
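Because failures are stored in the results list as `{"error": ..., "item": ...}` dicts, the caller can check whether a batch only partially succeeded instead of discovering it later. A small sketch, assuming successful results are never dicts containing an `"error"` key; `split_results` is a hypothetical helper.

```python
def split_results(results: list):
    """Separate successes from the error dicts recorded by the batch worker."""
    successes, failures = [], []
    for result in results:
        if isinstance(result, dict) and "error" in result:
            failures.append(result)
        else:
            successes.append(result)
    return successes, failures


# Example output shape from a three-item batch where the second item failed.
sample = ["summary A", {"error": "timeout", "item": "doc-2"}, "summary C"]
ok, failed = split_results(sample)
```

If `process_func` can legitimately return dicts with an `"error"` key, a dedicated wrapper type for failures would be a safer marker than this key check.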
What to monitor
Once this is in production, you want three metrics on your dashboard:
Rate limit hit rate: how often requests are delayed or retried because of limits.
Queue depth: for batch workloads, how many items are waiting.
p95 latency including wait time: the wall-clock time from request to response, including any queuing delay, not just inference latency.
The queue depth metric is the early warning. If it is growing faster than it is draining, you have a capacity problem that needs addressing before it becomes a user-facing problem.
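As a rough illustration of how these three numbers could be collected in process, here is a minimal in-memory sketch. `LimiterMetrics` and its method names are hypothetical; a real deployment would more likely export these values to whatever metrics system it already uses.

```python
from dataclasses import dataclass, field


@dataclass
class LimiterMetrics:
    latencies_s: list = field(default_factory=list)   # wall-clock, including wait
    queue_depths: list = field(default_factory=list)  # periodic samples
    rate_limited: int = 0
    total: int = 0

    def record_request(self, wall_clock_s: float, was_rate_limited: bool) -> None:
        self.latencies_s.append(wall_clock_s)
        self.total += 1
        self.rate_limited += int(was_rate_limited)

    def record_queue_depth(self, depth: int) -> None:
        self.queue_depths.append(depth)

    def hit_rate(self) -> float:
        return self.rate_limited / self.total if self.total else 0.0

    def p95_latency(self) -> float:
        # Nearest-rank percentile: the value at rank ceil(0.95 * n).
        if not self.latencies_s:
            return 0.0
        ordered = sorted(self.latencies_s)
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]

    def queue_is_growing(self, window: int = 5) -> bool:
        # Crude trend check: is the newest sample above the oldest in the window?
        recent = self.queue_depths[-window:]
        return len(recent) >= 2 and recent[-1] > recent[0]


metrics = LimiterMetrics()
for i in range(1, 21):
    metrics.record_request(wall_clock_s=float(i), was_rate_limited=(i % 10 == 0))
for depth in (3, 5, 9):
    metrics.record_queue_depth(depth)
```

The trend check here is deliberately simple; comparing enqueue and dequeue rates over a fixed interval would give a less noisy signal.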
One last thing: if you are self-hosting your LLM inference with vLLM or similar, the rate limiting design is the same but the limits are yours to set. Configure them deliberately based on your hardware capacity and expected workload distribution. The default vLLM configuration has no rate limiting, which means a runaway batch job can starve interactive users. Set the limits before you find out what happens without them.
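One way to keep a batch job from starving interactive users is to give each traffic class its own concurrency budget in front of the inference server. The sketch below is an assumption about how that might look in the calling application. It does not use any vLLM API, and `ClassLimiter`, the class names, and the slot counts are all hypothetical.

```python
import asyncio


class ClassLimiter:
    """Separate concurrency budgets per traffic class (hypothetical helper)."""

    def __init__(self, slots: dict):
        self._sems = {name: asyncio.Semaphore(n) for name, n in slots.items()}

    async def run(self, traffic_class: str, coro_func, *args):
        # Each class waits only on its own semaphore, so a flood of batch
        # work cannot take the slots reserved for interactive requests.
        async with self._sems[traffic_class]:
            return await coro_func(*args)


async def demo() -> dict:
    limiter = ClassLimiter({"interactive": 4, "batch": 2})
    in_flight = {"interactive": 0, "batch": 0}
    peak = {"interactive": 0, "batch": 0}

    async def fake_inference(kind: str) -> None:
        in_flight[kind] += 1
        peak[kind] = max(peak[kind], in_flight[kind])
        await asyncio.sleep(0.01)  # stand-in for a model call
        in_flight[kind] -= 1

    jobs = [limiter.run("batch", fake_inference, "batch") for _ in range(20)]
    jobs += [limiter.run("interactive", fake_inference, "interactive") for _ in range(3)]
    await asyncio.gather(*jobs)
    return peak


peak = asyncio.run(demo())
```

In this run the twenty batch calls never exceed two in flight, while the three interactive calls proceed together without waiting behind them.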