<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: LLM Cap Planner</title>
    <description>The latest articles on DEV Community by LLM Cap Planner (@llmcapplanner).</description>
    <link>https://dev.to/llmcapplanner</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3933210%2F41e60419-727e-4046-9e4f-346e675541d5.png</url>
      <title>DEV Community: LLM Cap Planner</title>
      <link>https://dev.to/llmcapplanner</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/llmcapplanner"/>
    <language>en</language>
    <item>
      <title>The LLM 429 you didn't plan for: which rate-limit dimension binds first</title>
      <dc:creator>LLM Cap Planner</dc:creator>
      <pubDate>Fri, 15 May 2026 13:04:52 +0000</pubDate>
      <link>https://dev.to/llmcapplanner/the-llm-429-you-didnt-plan-for-which-rate-limit-dimension-binds-first-4e3l</link>
      <guid>https://dev.to/llmcapplanner/the-llm-429-you-didnt-plan-for-which-rate-limit-dimension-binds-first-4e3l</guid>
      <description>&lt;p&gt;Most LLM-app incidents I've watched over the past year were not model-quality problems. They were &lt;code&gt;429 Too Many Requests&lt;/code&gt;. And almost every team that hit one had sized capacity off a blog table that was already stale by the time they read it.&lt;/p&gt;

&lt;p&gt;This is a short writeup of that failure mode, the part of provider rate limiting that is genuinely under-documented, and a small client-side tool I built so I could stop guessing.&lt;/p&gt;

&lt;h2&gt;429s are the dominant production failure mode&lt;/h2&gt;

&lt;p&gt;Aggregate production telemetry across LLM apps tells a consistent story: a meaningful fraction of all LLM call spans end in an error, and the majority of those errors are rate-limit rejections, not 5xx responses and not timeouts. The reason is structural. Inference is expensive, so providers meter aggressively, and the default limits are low enough that a modest traffic increase crosses them. What makes it nasty is that you usually discover the ceiling by getting paged, not by reading docs.&lt;/p&gt;

&lt;h2&gt;The part people get wrong: limits are multi-dimensional and per-model&lt;/h2&gt;

&lt;p&gt;Here is the nuance generic "cost calculators" miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anthropic meters at least three independent dimensions, separately:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RPM&lt;/strong&gt; — requests per minute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ITPM&lt;/strong&gt; — input tokens per minute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OTPM&lt;/strong&gt; — output tokens per minute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can sit at 10% of your RPM and still get 429ed because your average prompt is large and you hit ITPM first. A single combined "tokens per minute" number cannot represent this — the binding constraint depends on your input/output shape, not just your request rate.&lt;/p&gt;
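
&lt;p&gt;A quick worked example with round, hypothetical numbers: six requests a minute against a 60 RPM cap is 10% RPM utilization, but if each request carries 8,000 input tokens that is 48,000 input tokens per minute, well past a 30,000 ITPM cap. The 429s arrive while the request-rate graph still looks healthy.&lt;/p&gt;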

&lt;p&gt;&lt;strong&gt;These limits are per model, not per tier.&lt;/strong&gt; This is the one that surprises people. At the same tier, Claude Opus and Claude Sonnet do not share an ITPM number — Opus's input-token allowance is many times larger than Sonnet's at an equivalent tier. Concretely, in the snapshot I maintain: at Tier 1, Opus 4.7 ITPM is 500,000 while Sonnet 4.6 ITPM is 30,000. Any tool that prints "Tier 1 = X tokens/min" without asking which model is structurally wrong.&lt;/p&gt;
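
&lt;p&gt;To make both points concrete, here is a minimal TypeScript sketch of the binding-dimension check, in the spirit of what the planner does. The limit values and model name are placeholders I made up, not a quote of any provider's table; read real numbers off your own dashboard.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Per-model, per-tier limits. Placeholder values, for illustration only.
interface ModelLimits {
  rpm: number;   // requests per minute
  itpm: number;  // input tokens per minute
  otpm: number;  // output tokens per minute
}

interface Workload {
  requestsPerMin: number;
  avgInputTokens: number;   // uncached input tokens per request
  avgOutputTokens: number;
}

// Utilization of each dimension, and the one you hit first.
function bindingDimension(limits: ModelLimits, w: Workload) {
  const usage = {
    rpm: w.requestsPerMin / limits.rpm,
    itpm: (w.requestsPerMin * w.avgInputTokens) / limits.itpm,
    otpm: (w.requestsPerMin * w.avgOutputTokens) / limits.otpm,
  };
  // The dimension with the highest utilization is the binding one.
  const binding = Object.entries(usage).sort((a, b) =&gt; b[1] - a[1])[0][0];
  return { usage, binding };
}

// Hypothetical Tier 1 limits, not a quote of any provider's table.
const exampleTier1: ModelLimits = { rpm: 50, itpm: 30_000, otpm: 8_000 };
console.log(bindingDimension(exampleTier1, {
  requestsPerMin: 5,      // 10% of RPM
  avgInputTokens: 8_000,  // large prompts
  avgOutputTokens: 300,
}));
// itpm binds first at ~133% utilization while RPM sits at 10%
&lt;/code&gt;&lt;/pre&gt;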

&lt;p&gt;Two more Anthropic specifics worth knowing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cached input reads don't count toward ITPM.&lt;/strong&gt; Prompt-cache hits change your effective ceiling, so a cache-heavy workload has very different headroom than the naive math suggests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Per-minute" is enforced closer to per-second.&lt;/strong&gt; A 60 RPM limit is not "60 requests anywhere in a 60s window" — it behaves like roughly 1 request per second. A burst of 5 requests inside one second can 429 you while your per-minute average is comfortably under the cap. If your traffic is spiky, size for &lt;code&gt;ceil(RPM / 60)&lt;/code&gt; per second, not the per-minute figure.&lt;/li&gt;
&lt;/ul&gt;
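
&lt;p&gt;Both bullets reduce to small adjustments on the same arithmetic. A minimal sketch, again with assumed numbers; the cache-hit rate, caps, and burst size are illustrative, not measured:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Per-second request budget implied by a per-minute cap,
// matching the ceil(RPM / 60) rule of thumb above.
function perSecondBudget(rpmLimit: number): number {
  return Math.ceil(rpmLimit / 60);
}

// Effective ITPM usage when a fraction of input tokens is read from the
// prompt cache and therefore does not count toward ITPM.
function effectiveItpmUsage(
  requestsPerMin: number,
  avgInputTokens: number,
  cacheHitRate: number, // 0 to 1, fraction of input tokens served from cache
): number {
  return requestsPerMin * avgInputTokens * (1 - cacheHitRate);
}

// Hypothetical numbers: a 60 RPM cap quantizes to 1 request/second,
// so a 5-request burst inside one second can 429 even at low average load.
console.log(perSecondBudget(60)); // 1

// Hypothetical cache-heavy workload against a 30,000 ITPM cap:
// 80% cache hits cut effective usage from 80,000 to 16,000 tokens/min.
console.log(effectiveItpmUsage(10, 8_000, 0.8)); // 16000
&lt;/code&gt;&lt;/pre&gt;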

&lt;p&gt;&lt;strong&gt;OpenAI meters four independent dimensions:&lt;/strong&gt; RPM and TPM, plus per-day RPD and TPD ceilings. The per-day ones specifically bite batch and backfill jobs — you pass every per-minute check and then die at hour 18.&lt;/p&gt;
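
&lt;p&gt;The per-day failure is predictable in advance if you run the arithmetic once. A minimal sketch with placeholder limits, not OpenAI's actual tier numbers:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// How long can a sustained batch run before a per-day ceiling cuts it off?
// Placeholder limits, not a quote of any real tier.
interface DayLimits {
  rpm: number; tpm: number;  // per-minute caps
  rpd: number; tpd: number;  // per-day caps
}

function hoursUntilDailyCap(limits: DayLimits, requestsPerMin: number, tokensPerRequest: number): number {
  const tokensPerMin = requestsPerMin * tokensPerRequest;
  const minutesUntilRpd = limits.rpd / requestsPerMin;
  const minutesUntilTpd = limits.tpd / tokensPerMin;
  return Math.min(minutesUntilRpd, minutesUntilTpd, 24 * 60) / 60;
}

// Hypothetical tier: the backfill passes every per-minute check
// (40 of 60 RPM, 80k of 150k TPM) but TPD lands at roughly hour 19.
const limits: DayLimits = { rpm: 60, tpm: 150_000, rpd: 100_000, tpd: 90_000_000 };
console.log(hoursUntilDailyCap(limits, 40, 2_000)); // 18.75
&lt;/code&gt;&lt;/pre&gt;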

&lt;h2&gt;Why a dated snapshot matters&lt;/h2&gt;

&lt;p&gt;Stale tables are dangerous because of churn, not laziness. The model lineup changed twice in about five months, with both pricing and the set of available models shifting. Any capacity number you wrote down a year ago may describe models that no longer exist. A planning table is only useful if it carries the date it was true and gets re-verified against the provider dashboard.&lt;/p&gt;

&lt;h2&gt;The tool&lt;/h2&gt;

&lt;p&gt;I put the math behind a single static page: &lt;a href="https://llmcapplanner.vercel.app/" rel="noopener noreferrer"&gt;https://llmcapplanner.vercel.app/&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick a provider + model + tier, enter requests/min and average input/output tokens.&lt;/li&gt;
&lt;li&gt;It shows projected monthly cost at sustained load &lt;strong&gt;and&lt;/strong&gt; which dimension (RPM / ITPM / OTPM / TPM) binds first, with the headroom on each.&lt;/li&gt;
&lt;li&gt;It carries a dated snapshot (currently 2026-05-15) with &lt;strong&gt;per-model&lt;/strong&gt; Anthropic limits, and flags the per-second quantization when RPM is the binding dimension.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is fully client-side and deterministic. No API calls, no signup, nothing leaves the browser — it is arithmetic over a dated constants table, not a service. The official provider doc links are in the footer so you can check every number against your own dashboard.&lt;/p&gt;
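
&lt;p&gt;For the curious, the "dated constants table" is nothing more exotic than a frozen object that carries its verification date. The shape below is a hypothetical illustration, not the tool's actual source:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical snapshot entry, not the tool's real data. The point is
// that every number travels with the date it was last verified.
const SNAPSHOT = {
  verifiedOn: "2026-05-15",
  anthropic: {
    // Per-model, per-tier limits; the values here are placeholders.
    "example-model": {
      tier1: { rpm: 50, itpm: 30_000, otpm: 8_000 },
    },
  },
} as const;
&lt;/code&gt;&lt;/pre&gt;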

&lt;p&gt;Try it here: &lt;a href="https://llmcapplanner.vercel.app/" rel="noopener noreferrer"&gt;https://llmcapplanner.vercel.app/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The honest caveat: presets drift. If you spot a pricing or rate-limit number that has gone stale, please flag it — the maintenance is the entire point of the thing, and a wrong number that looks authoritative is worse than no number at all.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>llm</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
