DEV Community

Sindhu Murthy
API Rate Limits & Throttling: What's Actually Happening and How to Fix It

Rate limiting is the #1 reason AI API calls fail in production. It's not a bug — it's the provider protecting their infrastructure. This guide explains what's happening, how to read the signals, and how to stop it from breaking your app.


The Scenario

Your app has been running fine for weeks. Then on a Monday morning, users start seeing errors. Not everyone — just some. The errors come and go. Sometimes the same question works on the second try.

Your logs are full of this:

HTTP 429 — Too Many Requests

You're being rate limited. And if you handle it wrong, you'll make it worse.


What Is Rate Limiting?

Think of a highway on-ramp with a traffic light. When too many cars try to merge at once, the light turns red and lets them through one at a time. Nobody's banned from the highway — they just have to wait their turn.

AI providers (OpenAI, Anthropic, Google) work the same way. When too many requests come in, they start telling some customers: "Slow down."

That's a rate limit. It's not an error in your code. It's the provider saying: "I can handle your request, just not right now."

| Term | What It Means |
| --- | --- |
| Rate limit | Maximum number of requests allowed in a time window |
| Throttling | The provider actively slowing down or rejecting your requests |
| 429 status code | The HTTP response that means "too many requests" |
| Quota | Your total allocation (per minute, per day, or per month) |

The Three Types of Rate Limits

Most people think there's one rate limit. There are actually three, and they trigger independently.

| Type | What It Limits | Example Limit | How You Hit It |
| --- | --- | --- | --- |
| Requests per minute (RPM) | Number of API calls | 60 RPM | Sending too many questions, even short ones |
| Tokens per minute (TPM) | Total tokens processed | 90,000 TPM | Sending fewer requests, but each one is huge (long documents, big prompts) |
| Tokens per day (TPD) | Daily token budget | 1,000,000 TPD | Sustained high usage over hours |

Important: You can hit TPM while staying under RPM. A single request with a 50,000-token document eats more than half your minute's budget. You only sent one request — but you're already throttled. Always check your provider's current documentation for exact limits — they change frequently and vary by tier.
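To make that arithmetic concrete, here's a hypothetical pre-flight check. The function name and the reserved-output default are my assumptions, not a provider API; real token counts come from your provider's tokenizer:

```python
def fits_in_window(prompt_tokens: int, remaining_tokens: int,
                   reserved_output_tokens: int = 1_000) -> bool:
    """Rough check of whether a request fits the remaining TPM budget.

    Hypothetical helper: providers count both input and expected
    output tokens against TPM, so reserve room for the completion.
    """
    return prompt_tokens + reserved_output_tokens <= remaining_tokens

# One 50,000-token document fits a fresh 90,000 TPM window,
# but it leaves less than half the budget for everything else.
```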


How to Read a 429 Error

When you get rate limited, the provider doesn't just say "no." They tell you when to try again. Most people ignore this information.

The Response Headers

HTTP/1.1 429 Too Many Requests
retry-after: 2
x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 0
x-ratelimit-reset-requests: 12s
x-ratelimit-limit-tokens: 90000
x-ratelimit-remaining-tokens: 0
x-ratelimit-reset-tokens: 28s
| Header | What It Tells You |
| --- | --- |
| retry-after | Seconds to wait before trying again. Use this number. |
| x-ratelimit-limit-requests | Your RPM cap |
| x-ratelimit-remaining-requests | How many requests you have left this window |
| x-ratelimit-reset-requests | When your request limit resets |
| x-ratelimit-limit-tokens | Your TPM cap |
| x-ratelimit-remaining-tokens | How many tokens you have left this window |
| x-ratelimit-reset-tokens | When your token limit resets |

Example: You get a 429. The retry-after header says 2. That means: wait 2 seconds and try again. Not 0 seconds. Not 30 seconds. Exactly 2. The provider is literally telling you the answer.
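A minimal sketch of reading those headers, assuming the x-ratelimit-* naming shown above (check your provider's docs; reset values can use formats beyond plain "12s"):

```python
def wait_time_from_headers(headers: dict) -> float:
    """Return how long to wait before retrying a 429,
    using the provider's own guidance."""
    # retry-after is the authoritative signal when present
    if "retry-after" in headers:
        return float(headers["retry-after"])
    # fall back to the reset header; naive parse of values like "12s"
    reset = headers.get("x-ratelimit-reset-requests", "1s")
    return float(reset.rstrip("s"))
```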


Status Codes: Which Errors to Retry

Not every error is a rate limit. Here's the simple rule:

| Code | Meaning | Retry? | What to Do |
| --- | --- | --- | --- |
| 429 | Too Many Requests | Yes | Wait and retry with backoff |
| 500 | Server Error | Once | Try once more, then check the provider's status page |
| 503 | Service Unavailable | Yes | Provider is overloaded — wait and retry |
| 400 | Bad Request | No | Your request is malformed — fix your code |
| 401 | Unauthorized | No | Your API key is invalid or expired — fix it |
| 403 | Forbidden | No | Your key doesn't have permission for this model or action |

The key rule: Only retry on 429, 500, and 503. Everything else means something is wrong on your end — retrying won't help.
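That rule fits in a few lines. How many times to retry (e.g. only once for a 500) is a separate, caller-side policy:

```python
# Transient failures worth retrying; everything else is a client-side bug
RETRYABLE_STATUS_CODES = {429, 500, 503}

def should_retry(status_code: int) -> bool:
    """Retry only transient failures. Codes like 400, 401, and 403
    mean the request itself is wrong, so retrying can't help."""
    return status_code in RETRYABLE_STATUS_CODES
```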


The Retry Problem (And Why Most Teams Make It Worse)

Here's what happens when teams don't handle rate limits properly:

The Retry Storm

Request fails (429)
  → Code immediately retries
    → Also fails (429) — still in the same window
      → Code retries again
        → Also fails
          → 3 users are now each retrying 5 times
            → 15 requests where there were 3
              → Rate limit is now 5x worse

This is called a retry storm. Your retry logic is creating more traffic, which causes more 429s, which causes more retries. It's a death spiral.

| Retry Approach | What Happens | Result |
| --- | --- | --- |
| No retry | User sees an error | Bad UX, but no damage |
| Immediate retry | Same request hits the same limit | Retry storm — makes it worse |
| Fixed delay (wait 1s every time) | All retries fire at the same time | Thundering herd — same problem |
| Exponential backoff | Wait 1s, 2s, 4s, 8s | Spreads load, gives limits time to reset |
| Exponential backoff + jitter | Same as above + random 0-1s added | Prevents synchronized retries across users |

The Right Way: Exponential Backoff with Jitter

Instead of retrying immediately (which makes things worse), wait a little longer each time:

  1. First retry: wait ~1 second
  2. Second retry: wait ~2 seconds
  3. Third retry: wait ~4 seconds
  4. Keep doubling up to a max of 5 retries
  5. Still failing? Stop and show the user a helpful error

Add a small random delay ("jitter") to each wait so that multiple users don't all retry at the exact same moment.

That's it. Double the wait each time, add a pinch of randomness, and give up after 5 tries.
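The steps above can be sketched in a few lines. `send_request` is assumed to be a zero-argument callable returning `(status_code, body)`; swap in your real client. A production version would also honor the retry-after header instead of relying on the schedule alone:

```python
import random
import time

def call_with_backoff(send_request, max_retries: int = 5, base: float = 1.0):
    """Retry a rate-limited call with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        status, body = send_request()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break                            # out of retries, give up
        delay = base * (2 ** attempt)        # 1s, 2s, 4s, 8s, ...
        delay += random.uniform(0, base)     # jitter: de-synchronize clients
        time.sleep(delay)
    raise RuntimeError("Still rate limited after retries; show a helpful error")
```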


Preventing Rate Limits Before They Happen

Three strategies, in order of impact:

1. Request Queuing

Without a queue, every user hits the API directly. With a queue, your app controls the flow.

WITHOUT QUEUE:
  User A ──→ API
  User B ──→ API     →  100 simultaneous calls  →  429s
  User C ──→ API
  ...
  User Z ──→ API

WITH QUEUE:
  User A ──┐
  User B ──┤
  User C ──┼──→ Queue ──→ 10 requests/sec ──→ API  →  No 429s
  ...      │
  User Z ──┘

Users A and B get instant responses. User Z waits a few seconds. Nobody gets an error. The queue absorbs the traffic spike and releases it at a rate the API can handle.
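A minimal single-process sketch of that queue, assuming `send` is whatever function makes your API call. No retries, result futures, or backpressure here; those belong in a real implementation:

```python
import queue
import threading
import time

def start_rate_limited_worker(send, requests_per_second: float = 10.0):
    """Start a background worker that drains a queue at a steady rate
    instead of letting every user hit the API at once."""
    q = queue.Queue()
    interval = 1.0 / requests_per_second

    def worker():
        while True:
            item = q.get()
            if item is None:          # sentinel: shut the worker down
                q.task_done()
                return
            send(item)
            q.task_done()
            time.sleep(interval)      # release at a rate the API can absorb

    threading.Thread(target=worker, daemon=True).start()
    return q
```

Callers just `q.put(request)` and the worker paces the outflow, which is exactly what absorbs the spike in the diagram above.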

2. Caching

If 200 users ask "How do I reset my password?" in one day — why call the API 200 times?

| Strategy | How It Works | Best For |
| --- | --- | --- |
| Exact match | Same question → cached answer | FAQs, common queries |
| Semantic cache | Similar questions → cached answer | Support bots, knowledge bases |
| TTL-based | Cache expires after X minutes | Data that changes periodically |

Example: 200 identical questions per day. Without cache: 200 API calls. With cache: 1 API call + 199 cache hits. Rate limit usage drops by 99.5%.
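An exact-match cache with a TTL is a few lines. This is a sketch for one process; real apps often use `functools.lru_cache` or a shared store like Redis so the cache survives restarts:

```python
import time

class TTLCache:
    """Minimal exact-match cache whose entries expire after a TTL."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, question):
        entry = self._store.get(question)
        if entry is None:
            return None
        answer, stored_at = entry
        if time.time() - stored_at > self.ttl:
            del self._store[question]      # stale: evict and report a miss
            return None
        return answer

    def set(self, question, answer):
        self._store[question] = (answer, time.time())
```

Check the cache before every API call; only a miss spends rate-limit budget.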

3. Smaller Prompts

TPM limits are about total tokens. A 10,000-token request eats 100x more budget than a 100-token request.

| Optimization | Token Savings |
| --- | --- |
| Send only relevant chunks, not full documents | 30-60% |
| Shorter system prompts | 10-20% |
| Summarize long docs with a cheap model first | 50-70% |

Monitoring: What to Watch

Don't wait for users to report 429s. Watch these numbers:

| Metric | Warning | Critical | Action |
| --- | --- | --- | --- |
| RPM usage % | 70% of limit | 90% of limit | Enable queuing or caching |
| TPM usage % | 70% of limit | 90% of limit | Optimize prompt sizes |
| 429 count/hour | Any | 10+ per hour | Check for retry storms |
| Retry rate | 5% of requests | 15% of requests | Backoff isn't aggressive enough |
| P95 response time | 5 seconds | 15 seconds | Rate limit delays hitting UX |
| Daily token spend | 70% of TPD | 90% of TPD | Will run out of daily quota |
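The utilization thresholds are easy to wire into whatever metrics job you already run. A hypothetical classifier, fed from your logs or the x-ratelimit-* headers:

```python
def usage_status(used: float, limit: float,
                 warn_at: float = 0.70, crit_at: float = 0.90) -> str:
    """Classify rate-limit utilization as 'ok', 'warning', or 'critical'
    against the 70%/90% thresholds."""
    fraction = used / limit
    if fraction >= crit_at:
        return "critical"
    if fraction >= warn_at:
        return "warning"
    return "ok"
```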

Enterprise: The Noisy Neighbor Problem

One enterprise customer runs a batch job — 500 requests in a minute. Your shared API key gets rate limited. Now every customer is affected.

| Problem | Solution |
| --- | --- |
| One customer blocks everyone | Per-tenant rate limiting — your app enforces limits per customer before hitting the API |
| Real-time chat delayed by batch jobs | Priority queues — chat requests go before batch jobs |
| Shared key runs out of quota | Separate API keys — different keys for different customers or use cases |
| Unpredictable usage spikes | Batch vs. real-time separation — batch jobs use a different key with lower priority |
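Per-tenant limiting means checking a per-customer counter before the request ever reaches the shared key. A sliding-window sketch for a single process (multi-process deployments typically keep these counters in Redis instead):

```python
import time
from collections import defaultdict, deque

class PerTenantLimiter:
    """Cap each tenant's requests per minute so one noisy customer
    can't exhaust the shared API key's quota for everyone."""

    def __init__(self, rpm: int):
        self.rpm = rpm
        self._windows = defaultdict(deque)    # tenant -> recent timestamps

    def allow(self, tenant: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        window = self._windows[tenant]
        while window and now - window[0] >= 60:   # drop entries outside 60s
            window.popleft()
        if len(window) >= self.rpm:
            return False                          # tenant is over its cap
        window.append(now)
        return True
```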

Troubleshooting Checklist

When 429s start showing up, work through this in order:

| Step | What to Check |
| --- | --- |
| 1 | Check x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens — which limit did you hit? |
| 2 | Is it RPM or TPM? Too many requests or too many tokens per request? |
| 3 | Check for retry storms — is your retry count multiplying the problem? |
| 4 | Check retry-after header — are you waiting the recommended time? |
| 5 | Check if one user or tenant is consuming disproportionate quota |
| 6 | Check prompt sizes — did someone add a huge system prompt or send large documents? |
| 7 | Check for duplicate requests — is the frontend sending the same request multiple times? |
| 8 | Check your tier — did you recently exceed a billing threshold that changes your limits? |
| 9 | Check provider status page — is the provider having capacity issues? |
| 10 | Check time of day — peak hours (US business hours) have tighter effective limits |

Common Patterns Quick Reference

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| 429s for everyone at once | Shared rate limit exhausted | Per-tenant limits or request queue |
| 429s for one customer only | That customer is sending too much | Per-customer throttling |
| 429s only during peak hours | Hitting RPM at high traffic times | Queue + cache |
| 429s after deploying new feature | New feature sends more or larger requests | Audit token usage |
| 429s that get worse over time | Retry storm | Exponential backoff + jitter |
| 429s on token limit but low RPM | Sending very large prompts | Reduce context and prompt size |
| Intermittent 429s, no pattern | Hovering near the limit | Add 20% buffer below your limit |
| 429s after a billing change | Tier downgrade reduced limits | Check provider dashboard for current tier |

The Bottom Line

Rate limits aren't bugs. They're a feature of every AI API. The difference between a junior and senior engineer:

  • Junior: "The API is broken, it keeps returning errors."
  • Senior: "We're hitting our TPM limit during peak hours. I'm adding a request queue with exponential backoff and caching frequent queries. That should keep us under 70% utilization."

Know your limits. Monitor your usage. Retry smart, not fast. And when in doubt, check the headers — the answer is usually right there.
