DEV Community

SolvoHQ

Posted on • Originally published at llmcapplanner.vercel.app

The LLM rate limit that 429s you first is rarely the one you sized for — so I gave my agent a tool to compute it

You size an LLM workload by looking at two numbers: the price per million tokens, and the requests-per-minute ceiling on the pricing page. You multiply, you eyeball the RPM limit, you decide you have headroom. Then you scale up and start eating 429 Too Many Requests — and the dimension that's throttling you is not the one you checked.

This is not a cost problem. It's a "which constraint binds first" problem, and the binding constraint moves depending on your token mix and your tier. Eyeballing the pricing page cannot tell you which one it is. So I built a deterministic tool that computes it — usable as a web app, and as an MCP server you plug into Claude or your coding agent so it answers capacity questions with arithmetic instead of a guess.

Anthropic doesn't have one rate limit

For Anthropic, a model+tier has three independent ceilings, all enforced per minute:

  • RPM — requests/minute
  • ITPM — input tokens/minute
  • OTPM — output tokens/minute

They are not proportional to each other, and the one that 429s you depends entirely on your average input/output token shape. A retrieval-heavy app with 8K-token prompts and 200-token answers is ITPM-bound. A short-prompt, long-generation agent loop is OTPM-bound. Same model, same tier, opposite binding dimension.
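To make "the binding dimension depends on token shape" concrete, here is a minimal sketch (my own code, not the repo's; the function and variable names are hypothetical) that computes per-minute demand for each dimension and returns the one with the highest utilization, using the Tier 4 limits from the snapshot table below:

```typescript
// Hypothetical helper: given a traffic shape and a model+tier's three
// per-minute ceilings, find the dimension that binds first (highest
// demand/limit ratio).
type Limits = { RPM: number; ITPM: number; OTPM: number };

function bindingDimension(
  rpm: number,    // requests per minute
  inTok: number,  // average input tokens per request
  outTok: number, // average output tokens per request
  limits: Limits
): keyof Limits {
  const demand: Limits = { RPM: rpm, ITPM: rpm * inTok, OTPM: rpm * outTok };
  let worst: keyof Limits = "RPM";
  for (const dim of ["ITPM", "OTPM"] as const) {
    if (demand[dim] / limits[dim] > demand[worst] / limits[worst]) worst = dim;
  }
  return worst;
}

// Tier 4 limits from the 2026-05-15 snapshot.
const tier4: Limits = { RPM: 4_000, ITPM: 2_000_000, OTPM: 400_000 };

// Retrieval-heavy shape: 8K-token prompts, 200-token answers.
console.log(bindingDimension(100, 8_000, 200, tier4)); // "ITPM"
// Agent loop: short prompts, long generations.
console.log(bindingDimension(100, 500, 4_000, tier4)); // "OTPM"
```

Same model, same tier, same request rate — flipping the input/output shape flips the binding dimension.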

A worked example with today's real numbers

Using the live snapshot dated 2026-05-15, here are the claude-sonnet-4-6 limits:

Tier     RPM     ITPM         OTPM
Tier 1   50      30,000       8,000
Tier 4   4,000   2,000,000    400,000

Pricing: $3 / 1M input, $15 / 1M output.

Now take a real traffic profile: 600 requests/minute, 2,000 input tokens, 500 output tokens per request.

Per-minute demand:

  • RPM: 600
  • ITPM: 600 × 2,000 = 1,200,000
  • OTPM: 600 × 500 = 300,000

At Tier 4, utilization per dimension:

  • RPM: 600 / 4,000 = 15%
  • ITPM: 1,200,000 / 2,000,000 = 60%
  • OTPM: 300,000 / 400,000 = 75%

You are within all limits — but OTPM binds first at 75%. Not RPM (the number everyone checks: a comfortable 15%). Not ITPM. When you scale this workload, output tokens per minute is the wall you hit, and a quota increase on anything else buys you nothing. The binding dimension was non-obvious from the pricing page, and it would have flipped to ITPM if your prompts were larger.
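The utilization figures above are plain division; a few lines reproduce them (a sketch with my own variable names, not the repo's code):

```typescript
// Tier 4 limits and the worked profile's per-minute demand.
const limits = { RPM: 4_000, ITPM: 2_000_000, OTPM: 400_000 };
const demand = { RPM: 600, ITPM: 600 * 2_000, OTPM: 600 * 500 };

// Utilization per dimension, as a percentage of the ceiling.
for (const dim of ["RPM", "ITPM", "OTPM"] as const) {
  const pct = (100 * demand[dim]) / limits[dim];
  console.log(`${dim}: ${demand[dim]} / ${limits[dim]} = ${pct}%`);
}
// RPM 15%, ITPM 60%, OTPM 75% -> OTPM binds first.
```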

Monthly cost for that profile: input is 600 × 2,000 × 60 × 24 × 30 = 51.84B tokens; output is 600 × 500 × 60 × 24 × 30 = 12.96B tokens. At $3 and $15 per million, that's $155,520 for input plus $194,400 for output:

≈ $349,920 / month.

And at Tier 1? The same profile doesn't "cost more" — it doesn't run at all. ITPM demand of 1,200,000/min against a 30,000/min ceiling is a 40× overshoot; you hard-429 on ITPM immediately, long before RPM or cost is relevant. The constraint that ends you changes with the tier. Eyeballing the table won't surface that.
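Both claims — the monthly bill and the Tier 1 overshoot — check out by hand. A quick arithmetic sketch using the article's prices and limits (my own variable names):

```typescript
// 60 min x 24 h x 30 days.
const MIN_PER_MONTH = 60 * 24 * 30; // 43,200

// Monthly token volume for 600 req/min, 2,000 in / 500 out per request.
const inTokMonth = 600 * 2_000 * MIN_PER_MONTH;  // 51.84B input tokens
const outTokMonth = 600 * 500 * MIN_PER_MONTH;   // 12.96B output tokens

// $3 / 1M input, $15 / 1M output.
const monthlyCost = (inTokMonth / 1e6) * 3 + (outTokMonth / 1e6) * 15;
console.log(monthlyCost); // 349920

// Tier 1: ITPM demand of 1,200,000/min vs. a 30,000/min ceiling.
const overshoot = (600 * 2_000) / 30_000;
console.log(overshoot); // 40 -> hard 429 on ITPM, immediately
```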

The tool: one function, deterministic, dated

The computation is pure arithmetic against a date-stamped snapshot — no provider API call, no network, fully offline. The web app is at https://llmcapplanner.vercel.app. The open-source code and the MCP server live at https://github.com/SolvoHQ/llmcapplanner (the server is in the mcp/ directory).

The MCP server exposes a single tool:

llm_capacity_plan(provider, model, tier, rpm, in_tok, out_tok)
  -> {
       monthly_cost,
       first_binding_429_dim,
       headroom_per_dim,
       will_429,
       snapshot_version
     }

For the example above it returns, deterministically:

{
  "monthly_cost": 349920,
  "first_binding_429_dim": "OTPM",
  "headroom_per_dim": { "RPM": 3400, "ITPM": 800000, "OTPM": 100000 },
  "will_429": false,
  "snapshot_version": "2026-05-15"
}

headroom_per_dim is the per-minute slack remaining on each dimension; first_binding_429_dim is the one limit you must raise before scaling.
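Those headroom figures are just limit minus demand per dimension. A sketch of how they could be derived (field names mirror the example response; this is not the repo's implementation):

```typescript
// Tier 4 limits and the worked profile's per-minute demand.
const limits = { RPM: 4_000, ITPM: 2_000_000, OTPM: 400_000 };
const demand = { RPM: 600, ITPM: 1_200_000, OTPM: 300_000 };

// Per-minute slack on each dimension: ceiling minus demand.
const headroom = {
  RPM: limits.RPM - demand.RPM,    // 3400
  ITPM: limits.ITPM - demand.ITPM, // 800000
  OTPM: limits.OTPM - demand.OTPM, // 100000
};

// will_429: true iff any dimension's demand exceeds its ceiling.
const will429 = Object.values(headroom).some((h) => h < 0);
console.log(headroom, will429); // { RPM: 3400, ITPM: 800000, OTPM: 100000 } false
```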

Why the MCP server matters

When your AI coding agent is asked "can we run this at 600 rpm on Sonnet Tier 4, and what will it cost?", it has two options: hallucinate a plausible-looking number, or call a tool that does the arithmetic. Wire this in and it does the latter — and every answer carries snapshot_version, so the agent (and you) know exactly how fresh the numbers are instead of trusting a model's stale memory of a 2024 pricing page.

The compiled server is committed to the repo, so it runs from a clone with no build step:

{
  "mcpServers": {
    "llmcapplanner": {
      "command": "node",
      "args": ["/absolute/path/to/llmcapplanner/mcp/dist/index.js"]
    }
  }
}

Clone, point your MCP client at mcp/dist/index.js, done.

The actual differentiator: the snapshot is dated and maintained

Most LLM cost calculators are a hardcoded table someone typed in 2024 and never touched. That table is now wrong, and worse, it won't tell you it's wrong. Model lineups churn fast — in 2026 alone Anthropic moved Opus 4.6 → 4.7 and OpenAI went GPT-5.4 → 5.5, each with its own pricing and limit deltas. An undated table is not a convenience; it's a liability that produces confident wrong answers.

This tool's data is a single versioned snapshot (2026-05-15), and that version string rides along in every response — web and MCP. If the answer is stale, you can see that it's stale. That's the whole point: capacity math is only useful if you know the date it was true.


The numbers above were produced by the committed MCP server, not typed by hand. Try the web app at https://llmcapplanner.vercel.app, or read the source and run the server from https://github.com/SolvoHQ/llmcapplanner.
