Most teams size their LLM usage off one number: projected monthly cost. You take your requests per minute, multiply by tokens, multiply by the per-token price, and you have a budget. That number is correct and almost completely useless for answering the question that actually wakes you up: when does production start throwing 429s?
Cost is a smooth, linear function of tokens. Rate limits are a step function of several independent dimensions, and the one that binds first is usually not the one you watched. Datadog's State of AI Engineering 2026 puts a number on how common this failure is: roughly 5% of LLM call spans error, about 60% of those errors are rate limits, and March 2026 alone saw ~8.4M 429s across their fleet. That is not a tail event. That is the default outcome of capacity-planning from the bill.
## Why cost and capacity are different functions
When you call a frontier model API, the provider does not enforce one limit. It enforces several, separately, and a 429 fires the instant any single one is exceeded.
Anthropic measures, per model and per usage tier:
- RPM — requests per minute
- ITPM — input tokens per minute
- OTPM — output tokens per minute
These are tracked independently. You can be at 4% of your RPM budget and 100% of your OTPM budget and you will get throttled, because OTPM binds first. Cost-based planning never surfaces this, because cost collapses input and output tokens into one dollar figure and never looks at the per-minute rate at all.
OpenAI does the same thing with a different cut: RPM, TPM (combined tokens/min), plus daily ceilings RPD and TPD. The daily caps are the sneaky ones — a workload that is comfortably under TPM can still walk into a wall at hour 19 of a steady-state day because it crossed TPD. Nothing in your cost model has a "day" in it.
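To make the daily-cap arithmetic concrete, here is a minimal sketch. The quota numbers are placeholders, not any real OpenAI tier; the point is that TPD is a running sum, so a per-minute rate that looks comfortable can still exhaust the day's budget before the day ends. The same arithmetic applies to RPM against RPD.

```python
# Sketch: per-minute comfort vs. a daily ceiling. Quota values below are
# placeholders, not real OpenAI tiers -- substitute your own from the console.

tpm_limit = 800_000          # tokens per minute (placeholder)
tpd_limit = 250_000_000      # tokens per day (placeholder)

tpm_load = 220_000           # steady-state tokens/min for the workload

# Minute by minute this looks fine...
print(f"TPM: {tpm_load / tpm_limit:.0%} used")              # ~28%

# ...but the daily cap is cumulative, and it runs out before the day does.
hours_until_tpd = tpd_limit / (tpm_load * 60)
print(f"TPD exhausted after ~{hours_until_tpd:.0f} hours")  # ~19 hours
```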
## The per-second quantization trap
Here is the detail that catches even teams who do think about RPM.
A "60 RPM" limit is not "60 requests at any point within a 60-second window." Providers quantize per-minute limits down to a shorter bucket — effectively per-second. A 60 RPM limit behaves much closer to "1 request per second," not "60 requests you can fire in a burst at t=0 and then idle." If your traffic is bursty (a queue drains, a cron fans out, a retry storm kicks in), you can be averaging well under 60 RPM over the minute and still 429, because you exceeded the instantaneous allowance. The minute-average looks healthy in your dashboard. The per-second bucket does not.
Two more nuances worth internalizing:
- Anthropic cached reads do not count toward ITPM the same way fresh input does. If you use prompt caching heavily, your effective ITPM headroom is larger than a naive requests x input_tokens estimate suggests. Planning without modeling the cache underestimates how much real traffic you can push (a minimal sketch of the adjustment follows this list).
- OTPM is the most underestimated dimension. Output is where reasoning models and long generations blow up. Teams routinely provision for input volume, eyeball output as "smaller," and get paged when a feature that generates long responses ships.
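Here is that cache adjustment as a sketch. It assumes cached reads are excluded from ITPM entirely; verify the exact accounting against Anthropic's docs, since the real rule may be more nuanced than a full exclusion.

```python
# Sketch: naive vs. cache-adjusted ITPM load. Assumes cached input tokens
# don't count toward ITPM at all -- check your provider's docs, the real
# accounting may differ.

def itpm_load(rpm: float, input_tokens_per_request: float,
              cache_hit_fraction: float = 0.0) -> float:
    fresh = input_tokens_per_request * (1.0 - cache_hit_fraction)
    return rpm * fresh

print(itpm_load(100, 2_000))                           # naive: 200,000 tokens/min
print(itpm_load(100, 2_000, cache_hit_fraction=0.8))   # cached: 40,000 tokens/min
```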
## A worked example
Say you're running Claude Sonnet, and you've reasoned about load like this: "100 requests/min, ~2,000 input tokens, ~500 output tokens each. We're a high tier, limits are huge, we're fine."
Let's actually compute the three dimensions instead of eyeballing:
- requests/min = 100
- input tokens/min = 100 * 2,000 = 200,000 -> ITPM load
- output tokens/min = 100 * 500 = 50,000 -> OTPM load
Sonnet, high tier (snapshot 2026-05-15):

| Dimension | Limit | Load | Used |
| --- | --- | --- | --- |
| RPM | 4,000 | 100 | 2.5% |
| ITPM | 2,000,000 | 200,000 | 10.0% |
| OTPM | 400,000 | 50,000 | 12.5% |
You have plenty of RPM headroom (40x). But OTPM is the binding dimension: it's the one closest to saturation, and it's the one that will 429 you first if traffic grows. If you'd planned from cost alone, your mental model would have been "we're at 2.5% of capacity" (the RPM number, because requests are what people count). You're actually at 12.5%, on a dimension you weren't watching, and anything past an 8x traffic increase, plausible for a growing product, puts you over the OTPM ceiling while RPM is still around 20%. The page comes from the dimension cost never showed you.
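The same check in a few lines, using the snapshot limits above; the saturation multiple on each dimension makes the binding one obvious. Substitute your own account's quotas before trusting the output.

```python
# Sketch of the worked example. Limits are the snapshot values above
# (high-tier Sonnet, 2026-05-15); substitute your own account's quotas.

limits = {"RPM": 4_000, "ITPM": 2_000_000, "OTPM": 400_000}
load   = {"RPM": 100,   "ITPM": 200_000,   "OTPM": 50_000}

for dim in limits:
    used = load[dim] / limits[dim]
    saturation_multiple = limits[dim] / load[dim]
    print(f"{dim}: {used:.1%} used, saturates at {saturation_multiple:.0f}x traffic")

# RPM: 2.5% used, saturates at 40x traffic
# ITPM: 10.0% used, saturates at 10x traffic
# OTPM: 12.5% used, saturates at 8x traffic   <- binds first
```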
Flip the model to a different tier or to OpenAI's combined-TPM-plus-daily-cap scheme and the binding dimension changes. There is no single rule of thumb. It depends on your token shape, your model, and your tier — which is exactly why eyeballing it fails.
## I built a calculator for this
I got tired of doing this arithmetic in a scratch buffer every time a workload changed, so I built a small tool:
https://llmcapplanner.vercel.app/
It's free, client-side, and has no signup — nothing you type leaves your browser. You pick a model (Anthropic or OpenAI), a usage tier, and your expected requests/min plus average input/output tokens per request. It returns:
- Your projected monthly cost (the number you already had).
- Which rate-limit dimension binds first, with the exact headroom remaining on every dimension — RPM, ITPM, OTPM for Anthropic; RPM, TPM, RPD, TPD for OpenAI.
It carries a dated pricing-and-limits snapshot ("as of 2026-05-15") with links to the official provider pricing and rate-limit docs, so you can verify every number against your own dashboard rather than trusting a stale blog. Limits change and tiers differ per account — the tool is a fast first-pass model, not a substitute for your provider console, and it says so.
The takeaway, with or without the tool: stop sizing LLM workloads from the bill. Compute RPM, input-tokens/min, and output-tokens/min separately, against your model and tier's actual limits, find the one that saturates first, and watch that one. Account for per-second quantization on bursty traffic, count cached reads correctly, and don't forget the daily caps. The 429 cascade is preventable. It just isn't preventable with a cost spreadsheet.