You're running a batch job, hitting the API in a loop, and suddenly everything stops. No error message in your app, just silence. Or maybe you see a 429 status code and have no idea what it means or how long to wait.
429 means "Too Many Requests." The server is telling you to slow down. Nearly every API provider throttles traffic in some way, including OpenAI, Anthropic, and DeepSeek. The limits vary by plan, by model, and sometimes by time of day when the service is under heavy load.
There are two types of limits most providers enforce. Requests per minute (RPM) controls how many API calls you can make in a 60-second window. Tokens per minute (TPM) controls how much text you can process in that same window. You might hit either one depending on your usage pattern. A loop that sends 100 small requests fast will hit RPM. A single request with a huge prompt might hit TPM.
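To make the two budgets concrete, here is a minimal client-side sketch that tracks both requests and tokens in a rolling 60-second window and tells you how long to wait before the next call fits. The class name `SlidingWindowLimiter`, its parameters, and the rolling-window model are assumptions for illustration. Providers may count windows differently, and you would need your own token estimate for each request.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Hypothetical client-side tracker for RPM and TPM in a rolling window."""

    def __init__(self, max_requests, max_tokens, window_seconds=60.0,
                 clock=time.monotonic):
        if max_requests < 1 or max_tokens < 1:
            raise ValueError("limits must be at least 1")
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        self.clock = clock
        self.events = deque()  # one (timestamp, tokens) pair per recorded request

    def _prune(self, now):
        # Drop requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window_seconds:
            self.events.popleft()

    def wait_time(self, tokens):
        """Seconds to wait until a request of `tokens` fits both budgets."""
        if tokens > self.max_tokens:
            raise ValueError("request is larger than the whole token budget")
        now = self.clock()
        self._prune(now)
        used_requests = len(self.events)
        used_tokens = sum(t for _, t in self.events)
        wait = 0.0
        # Walk the oldest requests until enough budget would be freed.
        for timestamp, event_tokens in self.events:
            fits_rpm = used_requests < self.max_requests
            fits_tpm = used_tokens + tokens <= self.max_tokens
            if fits_rpm and fits_tpm:
                break
            wait = timestamp + self.window_seconds - now
            used_requests -= 1
            used_tokens -= event_tokens
        return max(wait, 0.0)

    def record(self, tokens):
        """Call this right after sending a request."""
        self.events.append((self.clock(), tokens))
```

In a loop you would call `wait_time(estimated_tokens)`, sleep for that long, send the request, then call `record(estimated_tokens)`. The server's own accounting is still the source of truth, so treat this as a way to avoid most 429s, not all of them.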
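When you get a 429, the response usually includes a Retry-After header. It tells you how long to wait before trying again, most often as a number of seconds, though the HTTP spec also allows a date and time. If the header isn't there, a safe default is to wait 60 seconds, since most limits are counted per minute. Don't immediately retry; that makes the problem worse. The server already told you to back off, and hammering it again adds load and, with providers that count rejected requests, can keep you over the limit longer.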
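Here is a small sketch of reading that header using only the Python standard library. The function name `retry_after_seconds` and the `now` parameter are assumptions for illustration, and `headers` is assumed to be a mapping such as the headers object your HTTP client returns. A plain dict is case-sensitive, so the key has to match exactly.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, default=60.0, now=None):
    """Return how many seconds to wait, based on a Retry-After header.

    Falls back to `default` when the header is missing or unparseable.
    """
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()

    # Form 1: a non-negative integer number of seconds.
    if value.isascii() and value.isdigit():
        return float(value)

    # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        # Treat a date without an offset as UTC.
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means there is nothing left to wait for.
    return max((when - now).total_seconds(), 0.0)
```

Passing `now` explicitly is only there to make the function easy to test. In normal use you would leave it out.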
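If you're building an application that calls the API, implement retry logic with exponential backoff: wait 1 second, then 2, then 4, then 8, up to a maximum number of attempts. Adding a little random jitter to each delay keeps multiple clients from retrying in lockstep. Most HTTP libraries have this built in or available as a plugin. Don't just wrap the call in a while loop with no delay. That's how you turn a temporary rate limit into a longer one and, on some providers, risk having your key suspended.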
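If your library doesn't provide it, a hand-rolled version is short. The sketch below is one way to do it, not the only one. `RateLimitError` is a hypothetical exception standing in for whatever your client raises on a 429, and `call_with_backoff` and its parameters are names made up for this example.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on a 429."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds from Retry-After, if known


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call func(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller handle it
            # 1s, 2s, 4s, 8s, ... capped at max_delay.
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Never wait less than the server asked for.
            if exc.retry_after is not None:
                delay = max(delay, exc.retry_after)
            # Up to 10% extra jitter so parallel clients drift apart.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Only the rate-limit error is retried here. Other failures, such as a malformed request, are raised immediately, because retrying them will not help.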
For teams or projects that need higher limits, there are a few options. Upgrade your plan — most providers offer higher tiers with more generous limits. Spread requests across multiple keys if the provider allows it. Use a gateway that can distribute load across multiple upstream keys automatically.
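If your provider does allow multiple keys, spreading load across them can be as simple as round-robin with a cooldown for any key that just got a 429. This is a rough sketch of the idea rather than how any particular gateway works. `KeyPool`, `acquire`, and `mark_limited` are hypothetical names.

```python
import time


class KeyPool:
    """Hypothetical round-robin pool that skips keys that are cooling down."""

    def __init__(self, keys, clock=time.monotonic):
        if not keys:
            raise ValueError("at least one key is required")
        self.keys = list(keys)
        self.clock = clock
        self.blocked_until = {key: 0.0 for key in self.keys}
        self.next_index = 0

    def acquire(self):
        """Return the next usable key, or None if every key is cooling down."""
        now = self.clock()
        for offset in range(len(self.keys)):
            index = (self.next_index + offset) % len(self.keys)
            key = self.keys[index]
            if self.blocked_until[key] <= now:
                self.next_index = (index + 1) % len(self.keys)
                return key
        return None

    def mark_limited(self, key, cooldown_seconds):
        """Call this when a request with `key` came back as a 429."""
        self.blocked_until[key] = self.clock() + cooldown_seconds
```

When `acquire()` returns None, every key is cooling down and the right move is to wait, not to keep cycling.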
One thing that trips people up: where the limit is actually applied. Some providers enforce limits per key, so 3 keys on the same account each get their own limit. Others enforce them at the account or organization level, so extra keys add no capacity, or they cap the account at a total that is lower than the sum of your individual keys. Check the docs.
Another gotcha: some providers count failed requests against your rate limit. If your request is malformed and returns a 400, that still counts as a request. Fix your request format before retrying, or you'll burn through your limit on errors.
If you're hitting rate limits consistently, it's worth checking if your usage pattern can be optimized. Batch multiple messages into one request instead of sending them individually. Use streaming to get partial results faster instead of waiting for the full response; this makes your app feel more responsive, though it doesn't reduce the number of requests or tokens you use. Cache results where possible: if you're asking the same question multiple times, store the answer.
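Caching is the easiest of these to add. Below is a minimal in-memory sketch that assumes identical model and prompt pairs can safely reuse an earlier answer, which is only true if you don't need fresh or randomized output. `ResponseCache` and the `fetch` callable are hypothetical. `fetch` stands in for whatever function actually calls your provider.

```python
import hashlib
import json


class ResponseCache:
    """Hypothetical in-memory cache keyed on the model and prompt."""

    def __init__(self, fetch):
        self.fetch = fetch  # callable(model, prompt) that hits the real API
        self.store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model, prompt):
        # Hash a canonical JSON form so long prompts make short keys.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, model, prompt):
        key = self._key(model, prompt)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        result = self.fetch(model, prompt)  # only cache misses use up quota
        self.store[key] = result
        return result
```

For anything long-running you would want an expiry time and a size cap, or a shared store if several processes make the same calls.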
Rate limits are annoying, but they exist for a reason. Without them, a single bad actor could monopolize the service and everyone else would suffer. The key is building your application to handle them gracefully instead of treating them as unexpected errors.