DEV Community: SolvoHQ

The LLM rate limit that 429s you first is rarely the one you sized for — so I gave my agent a tool to compute it

SolvoHQ — Fri, 15 May 2026 15:16:53 +0000

You size an LLM workload by looking at two numbers: the price per million tokens, and the requests-per-minute ceiling on the pricing page. You multiply, you eyeball the RPM limit, you decide you have headroom. Then you scale up and start eating 429 Too Many Requests — and the dimension that's throttling you is not the one you checked.

This is not a cost problem. It's a "which constraint binds first" problem, and the binding constraint moves depending on your token mix and your tier. Eyeballing the pricing page cannot tell you which one it is. So I built a deterministic tool that computes it — usable as a web app, and as an MCP server you plug into Claude or your coding agent so it answers capacity questions with arithmetic instead of a guess.

Anthropic doesn't have one rate limit

For Anthropic, a model+tier has three independent ceilings, all enforced per minute:

RPM — requests/minute
ITPM — input tokens/minute
OTPM — output tokens/minute

They are not proportional to each other, and the one that 429s you depends entirely on your average input/output token shape. A retrieval-heavy app with 8K-token prompts and 200-token answers is ITPM-bound. A short-prompt, long-generation agent loop is OTPM-bound. Same model, same tier, opposite binding dimension.

A worked example with today's real numbers

Using the live snapshot dated 2026-05-15, here are claude-sonnet-4-6 limits:

Tier	RPM	ITPM	OTPM
Tier 1	50	30,000	8,000
Tier 4	4,000	2,000,000	400,000

Pricing: $3 / 1M input, $15 / 1M output.

Now take a real traffic profile: 600 requests/minute, 2,000 input tokens, 500 output tokens per request.

Per-minute demand:

RPM: 600
ITPM: 600 × 2,000 = 1,200,000
OTPM: 600 × 500 = 300,000

At Tier 4, utilization per dimension:

RPM: 600 / 4,000 = 15%
ITPM: 1,200,000 / 2,000,000 = 60%
OTPM: 300,000 / 400,000 = 75%

You are within all limits, but OTPM binds first at 75%. Not RPM (the number everyone checks: a comfortable 15%). Not ITPM. When you scale this workload, output tokens per minute is the wall you hit, and a quota increase on anything else buys you nothing. The binding dimension was non-obvious from the pricing page, and it would have flipped to ITPM with larger prompts: the Tier 4 ITPM ceiling is 5× the OTPM ceiling, so ITPM binds first once input exceeds 5× output, which here means anything above 2,500 input tokens per request.
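The utilization arithmetic above can be sketched in a few lines of Python. This is an illustration, not the tool's source: the function names are made up for this example, and the limits are the Tier 4 figures quoted in the table.

```python
# Tier 4 limits for claude-sonnet-4-6, copied from the table above.
TIER_4_LIMITS = {"RPM": 4_000, "ITPM": 2_000_000, "OTPM": 400_000}


def per_minute_demand(rpm: int, in_tok: int, out_tok: int) -> dict:
    """Turn a traffic profile into load on each rate-limit dimension."""
    return {"RPM": rpm, "ITPM": rpm * in_tok, "OTPM": rpm * out_tok}


def utilization(demand: dict, limits: dict) -> dict:
    """Fraction of each ceiling consumed (1.0 means exactly at the limit)."""
    return {dim: demand[dim] / limits[dim] for dim in limits}


def binding_dimension(util: dict) -> str:
    """The dimension closest to (or furthest past) its ceiling."""
    return max(util, key=util.get)


demand = per_minute_demand(rpm=600, in_tok=2_000, out_tok=500)
util = utilization(demand, TIER_4_LIMITS)
print(util)                     # {'RPM': 0.15, 'ITPM': 0.6, 'OTPM': 0.75}
print(binding_dimension(util))  # OTPM
```

Rerunning with `in_tok=3_000` moves ITPM to 90% and makes it the binding dimension, which is the flip described above.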

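Monthly cost for that profile, assuming a 30-day month of constant traffic: input is 600 × 2,000 × 60 × 24 × 30 = 51.84B tokens; output is 600 × 500 × 60 × 24 × 30 = 12.96B tokens. At $3 and $15 per million, that is $155,520 for input plus $194,400 for output:

≈ $349,920 / month.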
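For anyone who wants to rerun the cost figure with their own profile, here is a minimal sketch. It assumes constant traffic over a 30-day month, as the arithmetic above does; `monthly_cost_usd` is a name invented for this example.

```python
MINUTES_PER_MONTH = 60 * 24 * 30  # 30-day month, matching the arithmetic above


def monthly_cost_usd(rpm, in_tok, out_tok, in_price_per_m, out_price_per_m):
    """Monthly spend for a steady profile; prices are USD per 1M tokens."""
    input_tokens = rpm * in_tok * MINUTES_PER_MONTH
    output_tokens = rpm * out_tok * MINUTES_PER_MONTH
    total = input_tokens * in_price_per_m + output_tokens * out_price_per_m
    return total / 1_000_000


cost = monthly_cost_usd(600, 2_000, 500, in_price_per_m=3, out_price_per_m=15)
print(cost)  # 349920.0
```

Output is a quarter of the token volume here but more than half of the bill, because it is priced at 5× the input rate.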

And at Tier 1? The same profile doesn't "cost more"; it doesn't run at all. ITPM demand of 1,200,000/min against a 30,000/min ceiling is a 40× overshoot, OTPM is 37.5× over (300,000 against 8,000), and even RPM is 12× over (600 against 50). You hard-429 immediately, with ITPM the worst offender, long before cost is relevant. The constraint that ends you changes with the tier. Eyeballing the table won't surface that.
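Dividing the same demand by the Tier 1 ceilings shows how far over each dimension lands. A rough sketch, with `overshoot` as a hypothetical helper name:

```python
# Tier 1 limits from the table above; demand is the 600 rpm profile.
TIER_1_LIMITS = {"RPM": 50, "ITPM": 30_000, "OTPM": 8_000}
demand = {"RPM": 600, "ITPM": 600 * 2_000, "OTPM": 600 * 500}


def overshoot(demand: dict, limits: dict) -> dict:
    """demand / limit for every dimension whose ceiling is exceeded."""
    return {
        dim: demand[dim] / limits[dim]
        for dim in limits
        if demand[dim] > limits[dim]
    }


over = overshoot(demand, TIER_1_LIMITS)
print(over)                      # {'RPM': 12.0, 'ITPM': 40.0, 'OTPM': 37.5}
print(max(over, key=over.get))   # ITPM
```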

The tool: one function, deterministic, dated

The computation is pure arithmetic against a date-stamped snapshot — no provider API call, no network, fully offline. The web app is at https://llmcapplanner.vercel.app. The open-source code and the MCP server live at https://github.com/SolvoHQ/llmcapplanner (the server is in the mcp/ directory).

The MCP server exposes a single tool:

llm_capacity_plan(provider, model, tier, rpm, in_tok, out_tok)
  -> {
       monthly_cost,
       first_binding_429_dim,
       headroom_per_dim,
       will_429,
       snapshot_version
     }

For the example above it returns, deterministically:

{
  "monthly_cost": 349920,
  "first_binding_429_dim": "OTPM",
  "headroom_per_dim": { "RPM": 3400, "ITPM": 800000, "OTPM": 100000 },
  "will_429": false,
  "snapshot_version": "2026-05-15"
}

headroom_per_dim is per-minute slack; first_binding_429_dim is the one you must raise before scaling.
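The headroom and will_429 fields can be reproduced by hand. The sketch below is not the repository's implementation: the helper name is hypothetical, and it assumes a 429 only when demand strictly exceeds a limit.

```python
def headroom_and_will_429(demand: dict, limits: dict):
    """Per-minute slack on each dimension, plus whether any ceiling is exceeded."""
    headroom = {dim: limits[dim] - demand[dim] for dim in limits}
    # ASSUMPTION: sitting exactly at a limit is still allowed.
    will_429 = any(slack < 0 for slack in headroom.values())
    return headroom, will_429


limits = {"RPM": 4_000, "ITPM": 2_000_000, "OTPM": 400_000}  # Tier 4, from the table
demand = {"RPM": 600, "ITPM": 1_200_000, "OTPM": 300_000}    # 600 rpm profile

headroom, will_429 = headroom_and_will_429(demand, limits)
print(headroom)  # {'RPM': 3400, 'ITPM': 800000, 'OTPM': 100000}
print(will_429)  # False
```

Note that the largest absolute headroom (ITPM, 800,000) is not the safest dimension; headroom only ranks dimensions once it is divided by the limit.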

Why the MCP server matters

When your AI coding agent is asked "can we run this at 600 rpm on Sonnet Tier 4, and what will it cost?", it has two options: hallucinate a plausible-looking number, or call a tool that does the arithmetic. Wire this in and it does the latter — and every answer carries snapshot_version, so the agent (and you) know exactly how fresh the numbers are instead of trusting a model's stale memory of a 2024 pricing page.

The compiled server is committed to the repo, so it runs from a clone with no build step:

{
  "mcpServers": {
    "llmcapplanner": {
      "command": "node",
      "args": ["/absolute/path/to/llmcapplanner/mcp/dist/index.js"]
    }
  }
}

Clone, point your MCP client at mcp/dist/index.js, done.

The actual differentiator: the snapshot is dated and maintained

Most LLM cost calculators are a hardcoded table someone typed in 2024 and never touched. That table is now wrong, and worse, it won't tell you it's wrong. Model lineups churn fast — in 2026 alone Anthropic moved Opus 4.6 → 4.7 and OpenAI went GPT-5.4 → 5.5, each with its own pricing and limit deltas. An undated table is not a convenience; it's a liability that produces confident wrong answers.

This tool's data is a single versioned snapshot (2026-05-15), and that version string rides along in every response — web and MCP. If the answer is stale, you can see that it's stale. That's the whole point: capacity math is only useful if you know the date it was true.

The numbers above were produced by the committed MCP server, not typed by hand. Try the web app at https://llmcapplanner.vercel.app, or read the source and run the server from https://github.com/SolvoHQ/llmcapplanner.

Your LLM bill is not your capacity plan. Here's the math that pages you at 2am.

SolvoHQ — Fri, 15 May 2026 12:18:53 +0000

Most teams size their LLM usage off one number: projected monthly cost. You take your requests per minute, multiply by tokens, multiply by the per-token price, and you have a budget. That number is correct and almost completely useless for answering the question that actually wakes you up: when does production start throwing 429s?

Cost is a smooth, linear function of tokens. Rate limits are a step function of several independent dimensions, and the one that binds first is usually not the one you watched. Datadog's State of AI Engineering 2026 puts a number on how common this failure is: roughly 5% of LLM call spans error, about 60% of those errors are rate limits, and March 2026 alone saw ~8.4M 429s across their fleet. That is not a tail event. That is the default outcome of capacity-planning from the bill.

Why cost and capacity are different functions

When you call a frontier model API, the provider does not enforce one limit. It enforces several, separately, and a 429 fires the instant any single one is exceeded.

Anthropic measures, per model and per usage tier:

RPM — requests per minute
ITPM — input tokens per minute
OTPM — output tokens per minute

These are tracked independently. You can be at 4% of your RPM budget and 100% of your OTPM budget and you will get throttled, because OTPM bound first. Cost-based planning never surfaces this, because cost collapses input and output tokens into one dollar figure and never looks at the per-minute rate at all.

OpenAI does the same thing with a different cut: RPM, TPM (combined tokens/min), plus daily ceilings RPD and TPD. The daily caps are the sneaky ones — a workload that is comfortably under TPM can still walk into a wall at hour 19 of a steady-state day because it crossed TPD. Nothing in your cost model has a "day" in it.
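To see how a daily cap can bite while the per-minute number looks fine, here is a small sketch. The limits and load are made-up illustrative values, not real OpenAI figures, and the code assumes the daily counter resets every 24 hours.

```python
def hours_until_daily_cap(tokens_per_minute: int, tpd_limit: int):
    """Hours of steady traffic before the daily token cap is hit.

    Returns None when a full 24-hour day fits under the cap.
    ASSUMPTION: the daily counter resets every 24 hours.
    """
    hours = tpd_limit / (tokens_per_minute * 60)
    return hours if hours < 24 else None


# HYPOTHETICAL numbers, chosen only to illustrate the effect.
TPM_LIMIT = 1_000_000
TPD_LIMIT = 900_000_000
load_tpm = 800_000

print(load_tpm / TPM_LIMIT)                        # 0.8 -> 80% of TPM, looks fine
print(hours_until_daily_cap(load_tpm, TPD_LIMIT))  # 18.75 -> wall during hour 19
```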

The per-second quantization trap

Here is the detail that catches even teams who do think about RPM.

A "60 RPM" limit is not "60 requests at any point within a 60-second window." Providers can enforce per-minute limits over shorter intervals, effectively per-second. A 60 RPM limit then behaves much closer to "1 request per second" than to "60 requests you can fire in a burst at t=0 and then idle." If your traffic is bursty (a queue drains, a cron fans out, a retry storm kicks in), you can average well under 60 RPM over the minute and still 429, because you exceeded the instantaneous allowance. The minute-average looks healthy in your dashboard. The per-second bucket does not.
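A toy simulation makes the burst effect concrete. It models the limit as a token bucket that refills at limit/60 per second; the bucket capacity of 1 is an assumption for illustration, and real providers may allow a larger burst.

```python
def simulate(arrival_times, rpm_limit, burst_capacity=1.0):
    """Count accepted and rejected requests under a simple token bucket.

    ASSUMPTION: capacity of 1 request, refilled continuously at rpm_limit / 60
    per second. This is a simplified model, not any provider's actual algorithm.
    """
    rate = rpm_limit / 60.0
    tokens = burst_capacity
    last = 0.0
    accepted = rejected = 0
    for t in sorted(arrival_times):
        tokens = min(burst_capacity, tokens + (t - last) * rate)
        last = t
        if tokens >= 1.0:
            tokens -= 1.0
            accepted += 1
        else:
            rejected += 1
    return accepted, rejected


# Both patterns send 30 requests in a minute: half of a 60 RPM limit on average.
burst = [0.0] * 30                      # everything fired at t=0
steady = [2.0 * i for i in range(30)]   # one request every 2 seconds

print(simulate(burst, rpm_limit=60))    # (1, 29)
print(simulate(steady, rpm_limit=60))   # (30, 0)
```

Same minute-average, opposite outcome: spreading requests out (or queueing them client-side) is what keeps the burst under the short-interval allowance.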

Two more nuances worth internalizing:

Anthropic cached reads do not count toward ITPM the same way fresh input does. If you use prompt caching heavily, your effective ITPM headroom is larger than a naive requests × input_tokens estimate suggests. Planning without modeling the cache underestimates how much real traffic you can push.
OTPM is the most underestimated dimension. Output is where reasoning models and long generations blow up. Teams routinely provision for input volume, eyeball output as "smaller," and get paged when a feature that generates long responses ships.
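The caching nuance can be folded into the ITPM estimate with one extra parameter. In this sketch, `cached_weight` is a hypothetical knob for how much a cached token counts toward ITPM; the default of 0.0 assumes cached reads are not counted at all, which you should confirm against the provider docs for your model.

```python
def effective_itpm(rpm, in_tok, cached_fraction, cached_weight=0.0):
    """ITPM load when part of every prompt is served from the prompt cache.

    cached_fraction: share of input tokens read from cache (0.0 to 1.0).
    cached_weight:   ASSUMPTION, how much a cached token counts toward ITPM
                     (0.0 = not counted, 1.0 = counted like fresh input).
    """
    fresh = in_tok * (1 - cached_fraction)
    cached = in_tok * cached_fraction
    return rpm * (fresh + cached * cached_weight)


naive = effective_itpm(600, 2_000, cached_fraction=0.0)
cached = effective_itpm(600, 2_000, cached_fraction=0.75)
print(naive)   # 1200000.0
print(cached)  # 300000.0
```

Against a 2,000,000 ITPM ceiling, that is 60% utilization without caching and 15% with three quarters of each prompt cached, under the stated assumption.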

A worked example

Say you're running Claude Sonnet, and you've reasoned about load like this: "100 requests/min, ~2,000 input tokens, ~500 output tokens each. We're a high tier, limits are huge, we're fine."

Let's actually compute the three dimensions instead of eyeballing:

requests/min            = 100
input  tokens/min        = 100 * 2000 = 200,000   -> ITPM load
output tokens/min        = 100 *  500 =  50,000   -> OTPM load

Sonnet, high tier (snapshot 2026-05-15):
  RPM   limit = 4,000        load 100      ->   2.5% used
  ITPM  limit = 2,000,000    load 200,000  ->  10.0% used
  OTPM  limit =   400,000    load  50,000  ->  12.5% used

You have plenty of RPM headroom: 40x. But OTPM is the binding dimension: it's the one closest to saturation, and it's the one that will 429 you first if traffic grows. If you'd planned from cost alone, your mental model would have been "we're at 2.5% of capacity" (the RPM number, because requests are what people count). You're actually at 12.5%, on a dimension you weren't watching. OTPM saturates at 8x current traffic (400,000 / 50,000), a multiple a growing product can reach quickly, and at that point RPM is still at only 20%. The page comes from the dimension cost never showed you.
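How much growth each dimension tolerates follows from dividing its limit by its current load. A short sketch using the numbers from the block above (the function name is invented for this example, and it assumes the token shape stays constant as traffic grows):

```python
limits = {"RPM": 4_000, "ITPM": 2_000_000, "OTPM": 400_000}
load = {"RPM": 100, "ITPM": 200_000, "OTPM": 50_000}


def growth_to_saturation(load: dict, limits: dict) -> dict:
    """Traffic multiplier at which each dimension reaches 100% of its limit."""
    return {dim: limits[dim] / load[dim] for dim in limits}


growth = growth_to_saturation(load, limits)
print(growth)                        # {'RPM': 40.0, 'ITPM': 10.0, 'OTPM': 8.0}
print(min(growth, key=growth.get))   # OTPM
```

The smallest multiplier is the real ceiling on growth: 8x on OTPM, not the 40x that RPM alone suggests.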

Flip the model to a different tier or to OpenAI's combined-TPM-plus-daily-cap scheme and the binding dimension changes. There is no single rule of thumb. It depends on your token shape, your model, and your tier — which is exactly why eyeballing it fails.

I built a calculator for this

I got tired of doing this arithmetic in a scratch buffer every time a workload changed, so I built a small tool:

https://llmcapplanner.vercel.app/

It's free, client-side, and has no signup — nothing you type leaves your browser. You pick a model (Anthropic or OpenAI), a usage tier, and your expected requests/min plus average input/output tokens per request. It returns:

Your projected monthly cost (the number you already had).
Which rate-limit dimension binds first, with the exact headroom remaining on every dimension — RPM, ITPM, OTPM for Anthropic; RPM, TPM, RPD, TPD for OpenAI.

It carries a dated pricing-and-limits snapshot ("as of 2026-05-15") with links to the official provider pricing and rate-limit docs, so you can verify every number against your own dashboard rather than trusting a stale blog. Limits change and tiers differ per account — the tool is a fast first-pass model, not a substitute for your provider console, and it says so.

The takeaway, with or without the tool: stop sizing LLM workloads from the bill. Compute RPM, input-tokens/min, and output-tokens/min separately, against your model and tier's actual limits, find the one that saturates first, and watch that one. Account for per-second quantization on bursty traffic, count cached reads correctly, and don't forget the daily caps. The 429 cascade is preventable. It just isn't preventable with a cost spreadsheet.

Three no-signup dev tools we shipped this week

SolvoHQ — Fri, 15 May 2026 08:34:03 +0000

SolvoHQ builds small, single-purpose developer utilities that run in the browser with no signup. Here are three we put online this week. Each one takes a paste and gives you typed TypeScript back.

jsontosdk — JSON → typed TypeScript

Stop hand-writing interfaces for an API response you just got. Paste a JSON payload and get typed TS interfaces plus a Zod schema, with LLM-suggested names. Live: https://jsontosdk.vercel.app — Code: https://github.com/SolvoHQ/jsontosdk — Paste your JSON, copy the generated types.

dotenv2types — .env → typed env config

process.env.FOO is string | undefined everywhere and nobody validates it. Paste your .env and get a typed env.ts with a Zod schema and a generated .env.example. Live: https://dotenv2types.vercel.app — Code: https://github.com/SolvoHQ/dotenv2types — Paste your .env file, drop the generated env.ts into your project.

har2sdk — HAR → typed fetch SDK

You captured a network trace and now want a real client. Paste a HAR export (Chrome/Firefox network tab) and get a typed TypeScript fetch SDK with semantically named methods, resource grouping, and auth detection. Live: https://har2sdk.vercel.app — Code: https://github.com/SolvoHQ/har2sdk — Export HAR from devtools, paste it, use the generated client.

Feedback welcome — open an issue at https://github.com/SolvoHQ/jsontosdk/issues

Three no-signup dev tools we shipped this week

SolvoHQ — Fri, 15 May 2026 08:23:12 +0000

SolvoHQ builds small, single-purpose web tools for developers. No login, no install, no account — paste in, copy out. Here are three we put online this week. Each one is open source and runs the same way: paste your input, get typed TypeScript back in a few seconds.

jsontosdk — JSON sample to a typed TypeScript SDK

Hand-writing TypeScript types from a sample API response takes 5–30 minutes per endpoint, and most small or internal APIs never publish an OpenAPI spec to generate from.

Live: https://jsontosdk.vercel.app
Source: https://github.com/SolvoHQ/jsontosdk

Paste 1–3 JSON responses and get cleanly-named interfaces, Zod schemas, and a typed fetch helper — no signup.

dotenv2types — `.env` to a typed env module

Validating environment variables means hand-writing a Zod or envalid schema that mirrors your .env and keeping the two in sync by hand.

Live: https://dotenv2types.vercel.app
Source: https://github.com/SolvoHQ/dotenv2types

Paste a .env and get a typed env.ts with a Zod schema plus a commented .env.example — no signup.

har2sdk — HAR capture to a typed fetch SDK

Turning a browser network capture into a typed client normally means a two-step HAR to OpenAPI to TypeScript detour.

Live: https://har2sdk.vercel.app
Source: https://github.com/SolvoHQ/har2sdk

Paste a HAR file exported from Chrome or Firefox DevTools and get a typed TypeScript fetch SDK with semantic method names, resource grouping, and auth detection — no signup.

All three are open source under the SolvoHQ org. Feedback welcome — open an issue at https://github.com/SolvoHQ/jsontosdk/issues.
