Problem Framing: The Cost of Naiveté

#systemdesign #llm #ratelimiting #backendengineering

Most rate limiters are designed to manage request volume, preventing system overload and abuse. But with LLM API calls, a single request isn't just "one request": it can be a $5 transaction or take 60 seconds to complete. A standard distributed counter or token bucket that only counts requests lets callers burn through budgets and exhaust critical resources while staying well under the limit.

Imagine you're building an AI-powered assistant. Users interact with it, triggering calls to an expensive LLM API. A simple rate limit, say 10 requests per second per user, seems reasonable. Now, consider a user who sends one complex prompt that generates a 50,000-token response, costing $10 and taking 30 seconds. With a naive rate limit, this user still has 9 "requests" remaining for that second, which could be another 9 expensive calls, costing a further $90 and congesting your LLM gateway. Meanwhile, another user needing a quick, cheap 100-token summary might be blocked because the first user's long-running requests are tying up the underlying LLM capacity. You're not just preventing DDoS; you're managing a financial burn rate and ensuring fair resource allocation for non-uniform work. The system fails when it treats a $0.001 request the same as a $10 request.

Core Concept: Cost-Aware Rate Limiting

Effective rate limiting for LLMs needs to go beyond simple request counts. It requires a cost-aware or resource-aware approach. Instead of merely counting requests, you assign a "weight" or "cost unit" to each potential API call. This cost can be an estimation of:

- Tokens: Input + estimated output tokens.
- Monetary Cost: Based on provider pricing (e.g., $X per 1k tokens).
- Processing Time: Estimated latency for the specific model and prompt complexity.

Your rate limiter then operates on these cost units. For example, a user might be allowed 100,000 cost units per minute, where a simple call consumes 100 units and a complex one consumes 10,000 units. A common pattern is to use a token bucket or leaky bucket, but instead of "tokens" representing requests, they represent these "cost units."
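
As a rough illustration of that pattern, here is a minimal in-memory token bucket whose tokens are cost units rather than requests. The class name `CostBucket`, its parameters, and the injectable `clock` are assumptions made for this sketch; a production limiter would keep this state in a shared store rather than in process memory.

```python
import time


class CostBucket:
    """Token bucket whose tokens are cost units rather than requests."""

    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity              # maximum cost units that can accumulate
        self.refill_per_sec = refill_per_sec  # cost units restored per second
        self.clock = clock                    # injectable so the bucket is testable
        self.available = capacity
        self.last_refill = clock()

    def _refill(self) -> None:
        # Add back the units earned since the last check, capped at capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.available = min(self.capacity, self.available + elapsed * self.refill_per_sec)
        self.last_refill = now

    def try_consume(self, cost_units: float) -> bool:
        """Deduct cost_units if the bucket can cover them; otherwise deny."""
        self._refill()
        if cost_units > self.available:
            return False
        self.available -= cost_units
        return True


# 100,000 cost units per minute, as in the example above.
bucket = CostBucket(capacity=100_000, refill_per_sec=100_000 / 60)
print(bucket.try_consume(100))      # a simple call
print(bucket.try_consume(10_000))   # a complex call
```

With these numbers, one user can make roughly ten complex calls or a thousand simple ones per minute, which is the asymmetry a plain request counter cannot express.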

Here's how a cost-aware rate limiter might integrate into your LLM service:

+---------------------+        +---------------------+        +---------------------+
|  Incoming LLM Call  | ---->  |  Request Parser     | ---->  |  Policy Engine      |
| (user_id, model_id, |        | (Extracts prompt,   |        | (Defines cost rules:|
|     prompt)         |        |  params, headers)   |        |  e.g., model_A = $X/|
+---------------------+        +---------------------+        |  token, user_tier_Y |
                                                              |  has budget $Z/min) |
                                                              +----------+----------+
                                                                         |
                                                                         V
                                                           +---------------------------+
                                                           |  Cost Estimator           |
                                                           | (Calculates estimated cost|
                                                           |  for this request based   |
                                                           |  on policy and input)     |
                                                           +-------------+-------------+
                                                                         |
                                                                         V
                                                           +---------------------------+
                                                           |  Rate Limiter Backend     |
                                                           | (e.g., Redis HSET user_id |
                                                           |  { 'cost_spent_min': X,   |
                                                           |    'req_count_min': Y,    |
                                                           |    'last_reset': TS })    |
                                                           |  Decision: ALLOW/DENY     |
                                                           +-------------+-------------+
                                                                         | (ALLOW)
                                                                         V
                                                              +---------------------+
                                                              |  LLM Service Proxy  |
                                                              | (Forwards request to|
                                                              |  LLM Provider)      |
                                                              +---------------------+

When a request arrives, the Request Parser extracts relevant details. The Policy Engine defines the rules (e.g., gpt-4-turbo costs $10/1M input tokens, $30/1M output tokens; premium users get 5x standard budget). The Cost Estimator then calculates the estimated cost of the incoming request. This estimation considers factors like input token count, chosen model, and a heuristic for expected output tokens (e.g., average response length, or a configurable maximum).
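
The estimation step could look something like the sketch below. It uses the example gpt-4-turbo prices quoted above; the `ModelPrice` type, the `PRICES` table, and the `estimate_cost` function are hypothetical names, and the input token count is assumed to be supplied by the caller (for instance from a tokenizer), since counting tokens is model-specific.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ModelPrice:
    input_per_million: Decimal   # dollars per 1M input tokens
    output_per_million: Decimal  # dollars per 1M output tokens


# Illustrative table using the example prices from the text; a real Policy
# Engine would load this from configuration rather than hardcode it.
PRICES = {"gpt-4-turbo": ModelPrice(Decimal("10"), Decimal("30"))}


def estimate_cost(model_id: str, input_tokens: int, max_output_tokens: int) -> Decimal:
    """Conservative dollar estimate: known input plus worst-case output."""
    price = PRICES[model_id]
    total = (input_tokens * price.input_per_million
             + max_output_tokens * price.output_per_million)
    return total / Decimal(1_000_000)


# 2,000 input tokens and up to 1,000 output tokens on gpt-4-turbo.
print(estimate_cost("gpt-4-turbo", 2_000, 1_000))  # 0.05
```

Using `Decimal` rather than floats keeps the running budget free of rounding drift when many small costs are added and subtracted.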

The Rate Limiter Backend (often Redis for distributed counters) then checks if the user/tenant has enough "budget" (cost units) remaining within the defined time window. If allowed, the estimated cost is deducted, and the request is forwarded.

Real-World Application: OpenAI's Token-Based Limits

OpenAI itself uses a form of cost-aware rate limiting. Instead of just "Requests Per Minute" (RPM), they impose "Tokens Per Minute" (TPM) limits. For example, a gpt-4 model might have a limit of 10,000 RPM and 1,000,000 TPM. This means you could theoretically send many small requests that sum up to 1M tokens, or fewer, larger requests.

This combined limit forces developers to consider both the sheer volume and the computational/cost weight of their API calls. If you hit your TPM limit, even if you haven't hit your RPM limit, your requests are throttled. This effectively manages the load on their GPUs and the financial burden for users.
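
To make the interaction between the two limits concrete, here is a simplified fixed-window sketch that enforces both RPM and TPM. It is not a description of how OpenAI implements its limits; the class `DualWindowLimiter` and its return strings are invented for this illustration.

```python
import time


class DualWindowLimiter:
    """Fixed one-minute window that enforces both RPM and TPM."""

    def __init__(self, rpm: int, tpm: int, clock=time.time):
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock
        self.window = None   # index of the minute currently being counted
        self.requests = 0
        self.tokens = 0

    def check(self, tokens: int) -> str:
        window = int(self.clock() // 60)
        if window != self.window:
            # A new minute has started: reset both counters.
            self.window, self.requests, self.tokens = window, 0, 0
        if self.requests + 1 > self.rpm:
            return "deny:rpm"
        if self.tokens + tokens > self.tpm:
            return "deny:tpm"
        self.requests += 1
        self.tokens += tokens
        return "allow"


limiter = DualWindowLimiter(rpm=3, tpm=1_000)
print(limiter.check(900))  # allow
print(limiter.check(200))  # deny:tpm, even though only one request was made
```

A request is admitted only if it fits under both ceilings, so whichever limit is tighter for a given traffic shape is the one that throttles.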

Organizations building on top of LLMs (for example, a company like Stripe using AI for internal fraud detection, or one like Uber summarizing customer support conversations) would likely implement similar cost-aware strategies. They might allocate a specific budget to each internal team or external customer, measured in tokens or estimated dollars per hour/day. When a request comes in, it's checked against that team's remaining budget. If a request is estimated to cost $0.50 and the team only has $0.20 remaining for the hour, the request is denied or queued. Post-call, actual token usage and cost can be reconciled, and overages might incur penalties or stricter temporary limits.
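
The budget check and post-call reconciliation described here can be sketched as a reserve-then-reconcile pair. The `TeamBudget` class and its method names are assumptions for illustration, and the hourly reset is left out to keep the example short.

```python
from decimal import Decimal


class TeamBudget:
    """Per-team dollar budget with reserve-then-reconcile accounting."""

    def __init__(self, hourly_budget: Decimal):
        self.remaining = hourly_budget
        self.overage = Decimal("0")  # spend that exceeded the budget

    def reserve(self, estimated: Decimal) -> bool:
        """Hold the estimated cost up front; deny if the budget cannot cover it."""
        if estimated > self.remaining:
            return False
        self.remaining -= estimated
        return True

    def reconcile(self, estimated: Decimal, actual: Decimal) -> None:
        """After the call, replace the estimate with the actual cost."""
        diff = estimated - actual
        if diff >= 0:
            self.remaining += diff  # refund the unused part of the hold
        else:
            extra = -diff
            charged = min(extra, self.remaining)
            self.remaining -= charged
            self.overage += extra - charged  # recorded for penalties or alerts


budget = TeamBudget(Decimal("0.20"))
print(budget.reserve(Decimal("0.50")))  # False: $0.50 does not fit in $0.20
```

What happens to a denied request (reject, queue, or downgrade to a cheaper model) is a policy decision that sits outside this class.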

Common Mistakes

Treating all LLM requests equally: The most fundamental mistake. A simple "hello world" prompt to a cheap model is not the same as a complex prompt engineering chain for code generation on an expensive model. Failing to differentiate leads to uneven resource consumption and inaccurate billing/budgeting.
Ignoring non-determinism in LLM responses: LLM output length (and thus token count) is often non-deterministic. If you estimate cost solely on input tokens, you'll frequently under-allocate budget. Strong solutions pre-allocate based on a conservative estimate (e.g., input tokens + max expected output tokens or a high percentile of historical output), then reconcile the actual cost after the LLM call. If the actual cost exceeds the pre-allocated budget, you might temporarily penalize the user or mark it as an overage.
Only applying limits at the service ingress: If your rate limiter is only at the API Gateway, it might catch basic abuse. However, for LLM-specific limits, you often need context from the request payload (e.g., the prompt length, specific model ID). This requires the rate limiter to be closer to the application logic, often implemented as a middleware or proxy before the call leaves your infrastructure for the LLM provider.
Static pricing/cost models: LLM costs and model capabilities evolve rapidly. Hardcoding cost units or assuming fixed pricing is brittle. Your Policy Engine must be configurable, ideally pulling pricing and model details from a dynamic source or a regularly updated configuration.
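
For the "high percentile of historical output" idea mentioned in the list above, one possible estimator is a nearest-rank percentile over past output token counts, capped by a configured maximum. The function name, its parameters, and the sample history are hypothetical.

```python
import math


def percentile_output_estimate(history: list[int], percentile: float, fallback_max: int) -> int:
    """Nearest-rank percentile of past output token counts, capped at fallback_max.

    With no history, fall back to the configured maximum so the estimate
    stays conservative.
    """
    if not history:
        return fallback_max
    ordered = sorted(history)
    # Nearest-rank method: the smallest value with at least `percentile`
    # percent of observations at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return min(ordered[rank - 1], fallback_max)


past_outputs = [100, 200, 300, 400, 500, 600, 700, 800, 900, 5000]
print(percentile_output_estimate(past_outputs, 90, fallback_max=4096))  # 900
print(percentile_output_estimate(past_outputs, 95, fallback_max=4096))  # 4096 (capped)
```

A higher percentile wastes more reserved budget on typical calls but produces fewer overages, so the right value depends on how costly an overage is for you.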

Interview Angle

Interviewers will test your understanding of these nuances:

"How do you handle the non-deterministic nature of LLM output tokens when estimating cost for rate limiting?"
- Strong Answer: "You can't get it perfectly upfront. I'd implement a two-phase approach: reserve, then reconcile. First, estimate based on input tokens plus a generous, configurable max_output_tokens, or a percentile from historical data for that (user_id, model_id) pair. Deduct this estimated cost. After the LLM call returns, get the actual token usage. If the actual is less than estimated, credit the difference back. If it's significantly more, log an overage, potentially apply a temporary stricter limit, or trigger an alert. This balances immediate enforcement with accurate accounting after the fact."
"What if a user tries to overload the system, either with a flood of short, cheap prompts or with a few very expensive ones?"
- Strong Answer: "This is why you need multi-dimensional limits. We'd have limits on both 'cost units per minute' and 'requests per minute.' The cost unit limit handles expensive calls, while the request limit prevents flooding with many cheap calls. For expensive prompts, you might also introduce a 'concurrent expensive requests' limit to prevent single users from monopolizing LLM capacity."
"How would you store and manage these cost-aware rate limiting states in a distributed system?"
- Strong Answer: "We'd use a distributed key-value store like Redis. For each user_id (or client_id, tenant_id), we'd store a hash containing current_cost_spent, current_request_count, and last_reset_timestamp for each time window (e.g., minute, hour). We'd use Redis's HINCRBY on the hash fields (plain INCRBY only works on string keys) and EXPIRE for the time window reset. The check and the deduction must happen as one atomic step, for example inside a Lua script or a WATCH/MULTI/EXEC transaction, to prevent race conditions during updates."
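
One way to make the check and the deduction atomic is a small Lua script run through the client's `eval` method. The sketch below assumes a redis-py style client (`eval(script, numkeys, *keys_and_args)`), integer cost units, and one hash per user per fixed window; the key layout, field names, and the `try_spend` helper are invented for this example.

```python
import time
from typing import Optional

# Redis runs a Lua script atomically, so the read, the comparison and the
# increments cannot interleave with another client's update.
CHECK_AND_SPEND = """
local spent = tonumber(redis.call('HGET', KEYS[1], 'cost_spent') or '0')
if spent + tonumber(ARGV[1]) > tonumber(ARGV[2]) then
  return 0
end
redis.call('HINCRBY', KEYS[1], 'cost_spent', ARGV[1])
redis.call('HINCRBY', KEYS[1], 'req_count', 1)
redis.call('EXPIRE', KEYS[1], ARGV[3])
return 1
"""


def try_spend(client, user_id: str, cost_units: int, budget: int,
              window_sec: int = 60, now: Optional[float] = None) -> bool:
    """Atomically deduct cost_units from the user's budget for the current window."""
    now = time.time() if now is None else now
    # One key per user per window; old windows simply expire.
    window_start = int(now // window_sec) * window_sec
    key = f"ratelimit:{user_id}:{window_start}"
    # TTL of two windows so the key outlives its window, then cleans itself up.
    allowed = client.eval(CHECK_AND_SPEND, 1, key, cost_units, budget, window_sec * 2)
    return allowed == 1
```

Called as `try_spend(redis.Redis(), "user-42", 500, 100_000)`, this returns True while the user is within budget for the current minute. Because each window has its own key, no explicit reset step is needed.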
