TL;DR: Rate limiting for LLM APIs requires counting tokens, not requests. A single call with a 200K-token context window costs as much as 50 calls with 4K-token prompts. This post covers the gap between request-count limits and token-aware limits, and walks through implementation at both the application layer and the gateway layer.
This post assumes familiarity with LLM APIs (OpenAI, Anthropic), basic Redis or caching concepts, and running AI applications in production.
Why Standard Rate Limiting Falls Short
Most developers who have shipped web services know how to rate limit: count requests per user per time window, return 429 when the limit is exceeded. That model breaks down with LLM APIs.
LLM APIs charge by the token, not the request. A single API call with a 200,000-token context window costs as much as 50 calls with 4,000-token prompts. Request-count limits do nothing to prevent a single runaway call from consuming your entire daily budget.
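To make the arithmetic concrete, here is the ratio under a hypothetical flat price of $3 per million input tokens (the ratio holds at any flat per-token price):

```python
# Hypothetical pricing for illustration -- substitute your provider's rates.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00

large_call = 200_000 * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000  # $0.60
small_call = 4_000 * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000    # $0.012
print(large_call / small_call)  # 50.0 -- one large call = fifty small ones
```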
OpenAI's production limits expose this directly. Their rate limit tiers use tokens-per-minute (TPM) alongside requests-per-minute (RPM). Hitting the TPM ceiling causes 429s even when you are nowhere near the RPM limit. If your own rate limiting only tracks requests, your application will hit provider limits your own counters never predicted.
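You can watch both ceilings in the response headers. A sketch using the openai Python SDK's raw-response access (header names are from OpenAI's rate limit documentation; verify against your account tier):

```python
from openai import OpenAI

client = OpenAI()

# with_raw_response exposes HTTP headers alongside the parsed body.
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "ping"}],
)
print(raw.headers.get("x-ratelimit-remaining-requests"))  # RPM headroom
print(raw.headers.get("x-ratelimit-remaining-tokens"))    # TPM headroom
completion = raw.parse()  # the usual ChatCompletion object
```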
Multi-tenant applications add another layer. A single customer running a batch job at 3am can exhaust your provider budget before the rest of your users wake up. Without per-customer limits, one heavy user affects everyone.
What You Actually Need to Limit
Four distinct limit types matter in production LLM applications:
- Request rate — calls per minute or hour. Prevents burst abuse but does not control cost.
- Token rate — tokens per minute or day. Directly correlates to cost and provider headroom.
- Budget cap — total spend per period per customer or team. Hard stop before costs escalate.
- Scope — the dimension the other three are enforced along: per user, per team, per customer, and per provider, each independently.
Most teams implement request rate first, add token rate after their first surprise invoice, and add budget caps after their second.
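One way to see how the four interact is to model them as a single per-scope policy object. The field names here are illustrative, not from any particular library:

```python
from dataclasses import dataclass


@dataclass
class LimitPolicy:
    scope: str                    # "user:123", "team:eng", "customer:acme", "provider:openai"
    requests_per_minute: int      # request rate: stops bursts, not cost
    tokens_per_day: int           # token rate: tracks cost and provider headroom
    budget_usd_per_month: float   # budget cap: the hard stop


acme = LimitPolicy("customer:acme", requests_per_minute=200,
                   tokens_per_day=500_000, budget_usd_per_month=100.0)
```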
Option 1: Application-Level Implementation
The direct approach is middleware that intercepts outgoing API calls, estimates token count before the request leaves your system, and rejects requests that would exceed the limit.
```python
import time

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def check_token_limit(
    customer_id: str,
    estimated_tokens: int,
    limit: int = 500_000,
    window_seconds: int = 86400,
) -> bool:
    """Fixed-window token counter in Redis. True if the request fits the limit."""
    window_key = int(time.time() // window_seconds)
    key = f"token_usage:{customer_id}:{window_key}"

    pipe = r.pipeline()
    pipe.incrby(key, estimated_tokens)
    # Keep the key for two windows so expiry never races the window rollover.
    pipe.expire(key, window_seconds * 2)
    result = pipe.execute()

    # Note: a rejected request has already been counted; decrement here if
    # failed attempts should not consume the budget.
    return result[0] <= limit


def estimate_tokens(messages: list) -> int:
    """Rough pre-call estimate: ~4 characters per token for English text."""
    total_chars = sum(len(m.get("content", "")) for m in messages)
    return total_chars // 4
```
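Wired together, the middleware sits in front of every outgoing call. A minimal sketch; `call_llm` is a placeholder for your actual provider client:

```python
def guarded_completion(customer_id: str, messages: list):
    estimated = estimate_tokens(messages)
    if not check_token_limit(customer_id, estimated):
        # Surface the same signal the provider would send: a 429.
        raise PermissionError("429: daily token limit exceeded")
    return call_llm(messages)  # placeholder: your OpenAI/Anthropic client call
```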
This works, but requires every service that makes LLM calls to implement the same logic. In a monolith, manageable. Across microservices, it becomes duplicated state tracking with consistency problems at the edges.
Option 2: Gateway-Level Rate Limiting
A gateway that proxies all LLM traffic enforces limits in one place. Every service routes through the gateway. The gateway handles counting, enforcement, and resets.
Bifrost handles this through Virtual Keys, each scoped to a customer or team, with request and token limits defined per key:
```yaml
virtual_keys:
  - key_name: "customer-acme"
    key: "vk-acme-abc123"
    rate_limit:
      request_limit: 200
      request_limit_duration: "1h"
      token_limit: 500000
      token_limit_duration: "1d"
      budget_limit: 100.00
      budget_duration: "1M"
    allowed_models:
      - "gpt-4o"
      - "claude-sonnet-4-6"
```
When customer-acme exhausts their daily token limit, Bifrost rejects further requests for that key until the window resets. Other customers are unaffected.
Resets are calendar-aligned for day, week, month, and year durations. A 1d limit resets at UTC midnight rather than 24 hours after the first request. For billing cycles that align to calendar months, this matters.
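A DIY equivalent derives the window key from the calendar date instead of dividing epoch seconds. Monthly budgets are the case fixed-size windows cannot express, since months vary in length; a sketch:

```python
import datetime


def monthly_window_key(customer_id: str) -> str:
    # All requests in the same calendar month share one key, so the
    # counter resets on the 1st at UTC midnight.
    now = datetime.datetime.now(datetime.timezone.utc)
    return f"budget:{customer_id}:{now.year}-{now.month:02d}"
```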
LiteLLM offers comparable virtual key functionality. The primary runtime difference: LiteLLM is Python-based with roughly 8ms overhead per request. Bifrost is Go-based with 11 microseconds overhead per request.
Comparison
| Approach | Token-aware | Per-customer limits | Budget cap | Overhead |
|---|---|---|---|---|
| Redis middleware (DIY) | Manual | Yes | Manual | Negligible |
| LiteLLM proxy | Yes | Yes | Yes | ~8ms |
| Bifrost | Yes | Yes (Virtual Keys) | Yes (4-tier) | ~11µs |
| Kong AI Gateway | Plugin-based | Yes | Limited (OSS) | ~2-5ms |
Bifrost's four-tier budget hierarchy is worth noting: Customer, Team, Virtual Key, and Provider Config limits all apply independently. A request must pass all four tiers. This allows organization-wide caps alongside fine-grained per-key limits without separate enforcement logic.
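Conceptually, hierarchical enforcement reduces to an all-tiers check. This is a sketch of the idea, not Bifrost's internals:

```python
def admit(request_tokens: int, tiers: dict[str, tuple[int, int]]) -> bool:
    """tiers maps tier name -> (current_usage, limit)."""
    return all(used + request_tokens <= limit for used, limit in tiers.values())


admit(1_500, {
    "customer":    (40_000, 1_000_000),
    "team":        (900_000, 2_000_000),
    "virtual_key": (499_000, 500_000),      # no headroom left at this tier
    "provider":    (3_000_000, 10_000_000),
})  # -> False: one exhausted tier rejects the request
```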
If a Provider Config limit is exceeded, Bifrost excludes that provider but keeps others available. Requests do not fail outright when one provider is saturated.
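The same idea as a sketch: walk the provider list and skip any whose budget tier is exhausted, failing only when every provider is saturated. Names are illustrative:

```python
def pick_provider(request_tokens: int, providers: dict[str, tuple[int, int]]) -> str | None:
    for name, (used, limit) in providers.items():
        if used + request_tokens <= limit:
            return name
    return None  # every provider saturated -- only now does the request fail
```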
Trade-offs and Limitations
Application-level rate limiting gives you more control over enforcement logic. You can implement business rules a gateway does not support: tiered limits based on subscription plan, grace period overrides for specific customers, or custom token counting that accounts for your system prompt overhead.
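For example, plan-tiered limits with per-customer overrides take a few lines at the application layer (names and numbers here are hypothetical):

```python
PLAN_LIMITS = {"free": 50_000, "pro": 500_000, "enterprise": 5_000_000}
OVERRIDES = {"customer-acme": 2_000_000}  # hypothetical grace-period bump


def daily_limit_for(customer_id: str, plan: str) -> int:
    return OVERRIDES.get(customer_id, PLAN_LIMITS[plan])

# Plugs straight into the middleware above:
# check_token_limit(customer_id, estimated, limit=daily_limit_for(customer_id, plan))
```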
Gateway-level enforcement applies regardless of which service makes the call. The trade-off is an additional network hop and a new dependency in your infrastructure.
Bifrost is self-hosted only, no managed version. The project is newer than LiteLLM with a smaller community. Factor in that maturity difference when evaluating it against more established options.
Token counting before a request completes is an estimate. Actual token counts, including generated output tokens, only come back in the API response. Most gateway implementations use pre-call estimates for limits and reconcile against actual usage in the response.
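A reconciliation step can reuse the Redis client and window scheme from the middleware above: once the response arrives, adjust the counter by the difference between actual and estimated usage. This assumes the provider reports token usage in the response, as OpenAI- and Anthropic-style APIs do:

```python
def reconcile(customer_id: str, estimated: int, actual: int,
              window_seconds: int = 86400) -> None:
    window_key = int(time.time() // window_seconds)
    key = f"token_usage:{customer_id}:{window_key}"
    # Negative delta when we over-estimated; positive when output tokens
    # pushed actual usage above the pre-call estimate.
    r.incrby(key, actual - estimated)
```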
Quick Recap
- Request-count limits do not prevent token budget overruns
- Multi-tenant apps need per-customer token limits, not global ones
- Application-level implementation works but duplicates logic across services
- Gateway-level enforcement centralizes limits with no per-service code changes
- Bifrost and LiteLLM both support virtual key rate limiting; the primary difference is runtime overhead
Links
- Bifrost on GitHub: https://git.new/bifrost
- Bifrost Docs: https://getmax.im/bifrostdocs
- Virtual Keys: https://docs.getbifrost.ai/features/governance/virtual-keys
- Budget and Limits: https://docs.getbifrost.ai/features/governance/budget-and-limits