Rate Limiting and Backpressure for Enterprise AI APIs: The Part Nobody Designs Until It Breaks

#ai #api #llm #systemdesign

Every enterprise AI deployment eventually hits a rate limiting problem. The question is whether you designed for it or discovered it during a production incident.

The discovery version usually looks like this: a batch process kicks off, hammers the LLM API with hundreds of concurrent requests, hits rate limits, and either crashes, produces partial results that nobody knows are partial, or creates a thundering herd that takes down unrelated services. The post-mortem is always the same: "we need to add rate limiting."

Here is how to design for it from the start.

Understanding what you are actually rate limited on

LLM APIs rate limit on multiple dimensions simultaneously, and hitting any one of them causes failures. The main ones are:

Requests per minute (RPM): total API calls per minute, regardless of size.
Tokens per minute (TPM): total tokens processed across all requests.
Tokens per day (TPD): a daily token budget, on tiers that have one.
Concurrent requests: requests in flight at any moment.

Your backpressure strategy needs to handle all of these, not just the one you hit first.

import asyncio
import time
from dataclasses import dataclass

@dataclass
class RateLimitConfig:
    requests_per_minute: int = 60
    tokens_per_minute: int = 90000
    max_concurrent: int = 10
    retry_base_delay: float = 1.0
    retry_max_delay: float = 60.0
    retry_max_attempts: int = 5

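class TokenBucket:
    """Holds up to `capacity` tokens and refills continuously over time."""

    def __init__(self, capacity: int, refill_rate: float):
        if capacity <= 0 or refill_rate <= 0:
            raise ValueError("capacity and refill_rate must be positive")
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.monotonic()

    def consume(self, tokens: int) -> bool:
        # Credit the tokens that accrued since the last call, up to capacity
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

    async def wait_and_consume(self, tokens: int):
        # A request larger than the bucket can never succeed, so fail fast
        # instead of waiting forever
        if tokens > self.capacity:
            raise ValueError(f"requested {tokens} tokens, bucket capacity is {self.capacity}")
        while not self.consume(tokens):
            # Sleep roughly until the deficit has refilled
            deficit = tokens - self.tokens
            await asyncio.sleep(max(deficit / self.refill_rate, 0.01))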
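
The config above covers RPM, TPM, and concurrency, but not the daily dimension. One simple way to track it is a fixed-window counter. The sketch below is illustrative only: `DailyTokenBudget` and its `clock` parameter are hypothetical names, and it assumes the daily window resets at midnight UTC, which you should check against your provider's documentation.

```python
import time
from typing import Callable


class DailyTokenBudget:
    """Counts tokens spent in the current UTC day against a fixed budget."""

    def __init__(self, tokens_per_day: int, clock: Callable[[], float] = time.time):
        self.tokens_per_day = tokens_per_day
        self._clock = clock  # injectable so the rollover can be tested
        self._day = self._current_day()
        self.used = 0

    def _current_day(self) -> int:
        # Whole days since the Unix epoch (ASSUMPTION: window resets at 00:00 UTC)
        return int(self._clock() // 86400)

    def _roll_over(self) -> None:
        today = self._current_day()
        if today != self._day:
            self._day = today
            self.used = 0

    def try_spend(self, tokens: int) -> bool:
        """Reserve tokens if the daily budget allows it; return False otherwise."""
        self._roll_over()
        if self.used + tokens > self.tokens_per_day:
            return False
        self.used += tokens
        return True

    def remaining(self) -> int:
        self._roll_over()
        return self.tokens_per_day - self.used
```

Unlike the per-minute buckets, waiting is rarely the right response when this returns False. A batch job that runs out of daily budget should usually stop and report how far it got.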

The async rate-limited client

import asyncio
import random
from openai import AsyncOpenAI, RateLimitError

class RateLimitedLLMClient:
    def __init__(self, config: RateLimitConfig):
        self.config = config
        # Turn off the SDK's own retries so they do not stack with the loop below
        self.client = AsyncOpenAI(max_retries=0)
        self.semaphore = asyncio.Semaphore(config.max_concurrent)
        self.rpm_bucket = TokenBucket(
            capacity=config.requests_per_minute,
            refill_rate=config.requests_per_minute / 60
        )
        self.tpm_bucket = TokenBucket(
            capacity=config.tokens_per_minute,
            refill_rate=config.tokens_per_minute / 60
        )

    async def complete(
        self,
        messages: list,
        estimated_tokens: int = 1000,
        model: str = "gpt-4o"
    ) -> str:
        # Wait for request budget, then token budget
        await self.rpm_bucket.wait_and_consume(1)
        await self.tpm_bucket.wait_and_consume(estimated_tokens)

        async with self.semaphore:
            for attempt in range(self.config.retry_max_attempts):
                try:
                    response = await self.client.chat.completions.create(
                        model=model,
                        messages=messages
                    )
                    return response.choices[0].message.content

                except RateLimitError:
                    # Exponential backoff, capped, with jitter so that
                    # concurrent callers do not all retry at the same instant
                    delay = min(
                        self.config.retry_base_delay * (2 ** attempt),
                        self.config.retry_max_delay
                    )
                    await asyncio.sleep(delay * random.uniform(0.5, 1.0))

            raise RuntimeError(f"Rate limit retry exhausted after {self.config.retry_max_attempts} attempts")
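
The `estimated_tokens` argument defaults to 1000, which will be badly wrong for long prompts. A rough estimate is better than a constant. The helper below is a sketch, not a tokenizer: `estimate_tokens`, `CHARS_PER_TOKEN`, and `PER_MESSAGE_OVERHEAD` are hypothetical names, and the four-characters-per-token ratio is only a rule of thumb for English text. If you need exact counts, use the tokenizer that matches your model.

```python
import math

CHARS_PER_TOKEN = 4       # ASSUMPTION: rough average for English text
PER_MESSAGE_OVERHEAD = 4  # ASSUMPTION: small allowance for role/formatting tokens


def estimate_tokens(messages: list[dict], max_output_tokens: int = 500) -> int:
    """Approximate the TPM cost of a chat request: prompt plus expected output."""
    prompt_chars = sum(len(str(m.get("content", ""))) for m in messages)
    prompt_tokens = math.ceil(prompt_chars / CHARS_PER_TOKEN)
    overhead = PER_MESSAGE_OVERHEAD * len(messages)
    # Output tokens count toward the limit too, so budget for them up front
    return prompt_tokens + overhead + max_output_tokens
```

The result can be passed straight through, for example `await client.complete(messages, estimated_tokens=estimate_tokens(messages))`. Erring high wastes a little throughput; erring low means the local bucket lets through requests the provider will reject.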

Queue-based backpressure for batch workloads

For batch processing where you are sending many requests, a queue with a controlled worker pool is cleaner than managing concurrency ad hoc.

import asyncio
from asyncio import Queue

async def batch_process_with_backpressure(
    items: list,
    process_func,
    client: RateLimitedLLMClient,
    max_workers: int = 5
) -> list:
    queue = Queue()
    results = [None] * len(items)

    for i, item in enumerate(items):
        await queue.put((i, item))

    async def worker():
        while True:
            try:
                idx, item = queue.get_nowait()
            except asyncio.QueueEmpty:
                break

            try:
                results[idx] = await process_func(item, client)
            except Exception as e:
                results[idx] = {"error": str(e), "item": item}
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(max_workers)]
    await asyncio.gather(*workers)

    return results
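
The function above loads every item into the queue before any worker starts, so it bounds concurrency but not memory. When items come from a stream or a large file, a bounded queue makes the producer wait for the workers. The variant below is a minimal sketch under that assumption; `process_stream`, `handler`, and `max_queued` are hypothetical names, and results come back in completion order rather than input order.

```python
import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable

_DONE = object()  # sentinel that tells a worker to exit


async def process_stream(
    source: AsyncIterator[Any],
    handler: Callable[[Any], Awaitable[Any]],
    max_workers: int = 5,
    max_queued: int = 20,
) -> list:
    """Process items from `source` through a bounded queue and a worker pool."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_queued)
    results: list = []

    async def producer() -> None:
        async for item in source:
            # put() suspends while the queue is full, so the source is read
            # only as fast as the workers drain it
            await queue.put(item)
        for _ in range(max_workers):
            await queue.put(_DONE)  # one sentinel per worker

    async def worker() -> None:
        while True:
            item = await queue.get()
            if item is _DONE:
                return
            try:
                results.append((item, await handler(item)))
            except Exception as exc:
                # Record the failure so a partial batch is visible as partial
                results.append((item, exc))

    await asyncio.gather(producer(), *(worker() for _ in range(max_workers)))
    return results
```

Each entry in the returned list pairs the input item with either its result or the exception it raised, so the caller can count failures before trusting the output.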

What to monitor

Once this is in production, you want three metrics on your dashboard:

Rate limit hit rate: how often requests are delayed or retried because of limits.
Queue depth: for batch workloads, how many items are waiting.
p95 latency including wait time: not just inference latency, but the wall-clock time from request to response, including any queuing delay.

The queue depth metric is the early warning. If it is growing faster than it is draining, you have a capacity problem that needs addressing before it becomes a user-facing problem.
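
As a rough illustration of how these three numbers can be computed in-process before you wire them into a real metrics system, here is a small sketch. `LimiterMetrics` and `queue_is_backing_up` are hypothetical names, the p95 uses the nearest-rank method, and the growth check is deliberately naive (it compares only the first and last depth samples).

```python
from dataclasses import dataclass, field


@dataclass
class LimiterMetrics:
    requests: int = 0
    rate_limited: int = 0
    latencies: list = field(default_factory=list)  # seconds, including wait time

    def record(self, latency_s: float, was_rate_limited: bool) -> None:
        """Record one finished request, measured from enqueue to response."""
        self.requests += 1
        if was_rate_limited:
            self.rate_limited += 1
        self.latencies.append(latency_s)

    def rate_limit_hit_rate(self) -> float:
        return self.rate_limited / self.requests if self.requests else 0.0

    def p95_latency(self) -> float:
        if not self.latencies:
            return 0.0
        ordered = sorted(self.latencies)
        # Nearest-rank percentile: ceil(0.95 * n), done in integer arithmetic
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]


def queue_is_backing_up(depth_samples: list[int]) -> bool:
    """True if queue depth was higher at the end of the window than at the start."""
    return len(depth_samples) >= 2 and depth_samples[-1] > depth_samples[0]
```

The important detail is where the latency timer starts: at the moment the caller submits the request, not at the moment the HTTP call goes out.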

One last thing: if you are self-hosting your LLM inference with vLLM or similar, the rate limiting design is the same but the limits are yours to set. Configure them deliberately based on your hardware capacity and expected workload distribution. Out of the box, vLLM does not rate limit individual clients, which means a runaway batch job can starve interactive users. Set the limits before you find out what happens without them.

DEV Community
