<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: LLM Cap Planner</title>
    <description>The latest articles on DEV Community by LLM Cap Planner (@llmcapplanner).</description>
    <link>https://dev.to/llmcapplanner</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3933210%2F41e60419-727e-4046-9e4f-346e675541d5.png</url>
      <title>DEV Community: LLM Cap Planner</title>
      <link>https://dev.to/llmcapplanner</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/llmcapplanner"/>
    <language>en</language>
    <item>
      <title>The LLM 429 you didn't plan for: which rate-limit dimension binds first</title>
      <dc:creator>LLM Cap Planner</dc:creator>
      <pubDate>Fri, 15 May 2026 13:04:52 +0000</pubDate>
      <link>https://dev.to/llmcapplanner/the-llm-429-you-didnt-plan-for-which-rate-limit-dimension-binds-first-4e3l</link>
      <guid>https://dev.to/llmcapplanner/the-llm-429-you-didnt-plan-for-which-rate-limit-dimension-binds-first-4e3l</guid>
      <description>&lt;p&gt;Most LLM-app incidents I've watched over the past year were not model-quality problems. They were &lt;code&gt;429 Too Many Requests&lt;/code&gt;. And almost every team that hit one had sized capacity off a blog table that was already stale by the time they read it.&lt;/p&gt;

&lt;p&gt;This is a short writeup of that failure mode, the part of provider rate limiting that is genuinely under-documented, and a small client-side tool I built so I could stop guessing.&lt;/p&gt;

&lt;h2&gt;429s are the dominant production failure mode&lt;/h2&gt;

&lt;p&gt;Aggregate production telemetry across LLM apps tells a consistent story: a meaningful fraction of all LLM call spans end in an error, and the majority of those errors are rate-limit rejections, not 5xx responses and not timeouts. The reason is structural. Inference is expensive, so providers meter aggressively, and the default limits are low enough that a modest traffic increase crosses them. What makes it nasty is that you usually discover the ceiling by getting paged, not by reading docs.&lt;/p&gt;

&lt;h2&gt;The part people get wrong: limits are multi-dimensional and per-model&lt;/h2&gt;

&lt;p&gt;Here is the nuance generic "cost calculators" miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anthropic meters at least three independent dimensions, separately:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RPM&lt;/strong&gt; — requests per minute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ITPM&lt;/strong&gt; — input tokens per minute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OTPM&lt;/strong&gt; — output tokens per minute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can sit at 10% of your RPM and still get 429ed because your average prompt is large and you hit ITPM first. A single combined "tokens per minute" number cannot represent this — the binding constraint depends on your input/output shape, not just your request rate.&lt;/p&gt;
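
&lt;p&gt;A quick worked example with round, hypothetical numbers: six requests a minute against a 60 RPM cap is 10% RPM utilization, but if each request carries 8,000 input tokens that is 48,000 input tokens per minute, well past a 30,000 ITPM cap. The 429s arrive while the request-rate graph still looks healthy.&lt;/p&gt;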

&lt;p&gt;&lt;strong&gt;These limits are per model, not per tier.&lt;/strong&gt; This is the one that surprises people. At the same tier, Claude Opus and Claude Sonnet do not share an ITPM number — Opus's input-token allowance is many times larger than Sonnet's at an equivalent tier. Concretely, in the snapshot I maintain: at Tier 1, Opus 4.7 ITPM is 500,000 while Sonnet 4.6 ITPM is 30,000. Any tool that prints "Tier 1 = X tokens/min" without asking which model is structurally wrong.&lt;/p&gt;
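
&lt;p&gt;To make both points concrete, here is a minimal TypeScript sketch of the binding-dimension check, in the spirit of what the planner does. The limit values and model name are placeholders I made up, not a quote of any provider's table; read real numbers off your own dashboard.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Per-model, per-tier limits. Placeholder values, for illustration only.
interface ModelLimits {
  rpm: number;   // requests per minute
  itpm: number;  // input tokens per minute
  otpm: number;  // output tokens per minute
}

interface Workload {
  requestsPerMin: number;
  avgInputTokens: number;   // uncached input tokens per request
  avgOutputTokens: number;
}

// Utilization of each dimension, and the one you hit first.
function bindingDimension(limits: ModelLimits, w: Workload) {
  const usage = {
    rpm: w.requestsPerMin / limits.rpm,
    itpm: (w.requestsPerMin * w.avgInputTokens) / limits.itpm,
    otpm: (w.requestsPerMin * w.avgOutputTokens) / limits.otpm,
  };
  // The dimension with the highest utilization is the binding one.
  const binding = Object.entries(usage).sort((a, b) =&gt; b[1] - a[1])[0][0];
  return { usage, binding };
}

// Hypothetical Tier 1 limits, not a quote of any provider's table.
const exampleTier1: ModelLimits = { rpm: 50, itpm: 30_000, otpm: 8_000 };
console.log(bindingDimension(exampleTier1, {
  requestsPerMin: 5,      // 10% of RPM
  avgInputTokens: 8_000,  // large prompts
  avgOutputTokens: 300,
}));
// itpm binds first at ~133% utilization while RPM sits at 10%
&lt;/code&gt;&lt;/pre&gt;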

&lt;p&gt;Two more Anthropic specifics worth knowing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cached input reads don't count toward ITPM.&lt;/strong&gt; Prompt-cache hits change your effective ceiling, so a cache-heavy workload has very different headroom than the naive math suggests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Per-minute" is enforced closer to per-second.&lt;/strong&gt; A 60 RPM limit is not "60 requests anywhere in a 60s window" — it behaves like roughly 1 request per second. A burst of 5 requests inside one second can 429 you while your per-minute average is comfortably under the cap. If your traffic is spiky, size for &lt;code&gt;ceil(RPM / 60)&lt;/code&gt; per second, not the per-minute figure.&lt;/li&gt;
&lt;/ul&gt;
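
&lt;p&gt;Both bullets reduce to small adjustments on the same arithmetic. A minimal sketch, again with assumed numbers; the cache-hit rate, caps, and burst size are illustrative, not measured:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Per-second request budget implied by a per-minute cap,
// matching the ceil(RPM / 60) rule of thumb above.
function perSecondBudget(rpmLimit: number): number {
  return Math.ceil(rpmLimit / 60);
}

// Effective ITPM usage when a fraction of input tokens is read from the
// prompt cache and therefore does not count toward ITPM.
function effectiveItpmUsage(
  requestsPerMin: number,
  avgInputTokens: number,
  cacheHitRate: number, // 0 to 1, fraction of input tokens served from cache
): number {
  return requestsPerMin * avgInputTokens * (1 - cacheHitRate);
}

// Hypothetical numbers: a 60 RPM cap quantizes to 1 request/second,
// so a 5-request burst inside one second can 429 even at low average load.
console.log(perSecondBudget(60)); // 1

// Hypothetical cache-heavy workload against a 30,000 ITPM cap:
// 80% cache hits cut effective usage from 80,000 to 16,000 tokens/min.
console.log(effectiveItpmUsage(10, 8_000, 0.8)); // 16000
&lt;/code&gt;&lt;/pre&gt;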

&lt;p&gt;&lt;strong&gt;OpenAI meters four independent dimensions:&lt;/strong&gt; RPM and TPM, plus per-day RPD and TPD ceilings. The per-day ones specifically bite batch and backfill jobs — you pass every per-minute check and then die at hour 18.&lt;/p&gt;
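
&lt;p&gt;The per-day failure is predictable in advance if you run the arithmetic once. A minimal sketch with placeholder limits, not OpenAI's actual tier numbers:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// How long can a sustained batch run before a per-day ceiling cuts it off?
// Placeholder limits, not a quote of any real tier.
interface DayLimits {
  rpm: number; tpm: number;  // per-minute caps
  rpd: number; tpd: number;  // per-day caps
}

function hoursUntilDailyCap(limits: DayLimits, requestsPerMin: number, tokensPerRequest: number): number {
  const tokensPerMin = requestsPerMin * tokensPerRequest;
  const minutesUntilRpd = limits.rpd / requestsPerMin;
  const minutesUntilTpd = limits.tpd / tokensPerMin;
  return Math.min(minutesUntilRpd, minutesUntilTpd, 24 * 60) / 60;
}

// Hypothetical tier: the backfill passes every per-minute check
// (40 of 60 RPM, 80k of 150k TPM) but TPD lands at roughly hour 19.
const limits: DayLimits = { rpm: 60, tpm: 150_000, rpd: 100_000, tpd: 90_000_000 };
console.log(hoursUntilDailyCap(limits, 40, 2_000)); // 18.75
&lt;/code&gt;&lt;/pre&gt;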

&lt;h2&gt;Why a dated snapshot matters&lt;/h2&gt;

&lt;p&gt;Stale tables are dangerous because of churn, not laziness. The model lineup changed twice in about five months, with both pricing and the set of available models shifting. Any capacity number you wrote down a year ago may describe models that no longer exist. A planning table is only useful if it carries the date it was true and gets re-verified against the provider dashboard.&lt;/p&gt;

&lt;h2&gt;The tool&lt;/h2&gt;

&lt;p&gt;I put the math behind a single static page: &lt;a href="https://llmcapplanner.vercel.app/" rel="noopener noreferrer"&gt;https://llmcapplanner.vercel.app/&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick a provider + model + tier, enter requests/min and average input/output tokens.&lt;/li&gt;
&lt;li&gt;It shows projected monthly cost at sustained load &lt;strong&gt;and&lt;/strong&gt; which dimension (RPM / ITPM / OTPM / TPM) binds first, with the headroom on each.&lt;/li&gt;
&lt;li&gt;It carries a dated snapshot (currently 2026-05-15) with &lt;strong&gt;per-model&lt;/strong&gt; Anthropic limits, and flags the per-second quantization when RPM is the binding dimension.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is fully client-side and deterministic. No API calls, no signup, nothing leaves the browser — it is arithmetic over a dated constants table, not a service. The official provider doc links are in the footer so you can check every number against your own dashboard.&lt;/p&gt;
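
&lt;p&gt;For the curious, the "dated constants table" is nothing more exotic than a frozen object that carries its verification date. The shape below is a hypothetical illustration, not the tool's actual source:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical snapshot entry, not the tool's real data. The point is
// that every number travels with the date it was last verified.
const SNAPSHOT = {
  verifiedOn: "2026-05-15",
  anthropic: {
    // Per-model, per-tier limits; the values here are placeholders.
    "example-model": {
      tier1: { rpm: 50, itpm: 30_000, otpm: 8_000 },
    },
  },
} as const;
&lt;/code&gt;&lt;/pre&gt;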

&lt;p&gt;Try it here: &lt;a href="https://llmcapplanner.vercel.app/" rel="noopener noreferrer"&gt;https://llmcapplanner.vercel.app/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The honest caveat: presets drift. If you spot a pricing or rate-limit number that has gone stale, please flag it — the maintenance is the entire point of the thing, and a wrong number that looks authoritative is worse than no number at all.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>llm</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
