You ship a function that calls the GPT API. It works in staging. Then the first hundred production users hit it, and your logs fill with 429 Too Many Requests. Before changing code, you need to identify which limit failed: requests per minute, tokens per minute, daily quota, usage tier, or a model-specific cap.
💡 This guide shows you how to verify live GPT API limits with response headers, reproduce rate-limit failures with a small load test in Apidog, and save the workflow as a reusable request collection for your team.
OpenAI rate limits vary by model, endpoint, and usage tier. GPT-5.5 may not have the same limits as GPT-4.1. Image, audio, embedding, and batch endpoints use different dimensions. Your tier can also change as your spend grows.
Apidog gives you one workspace to:
- Send GPT API requests.
- Inspect x-ratelimit-* response headers.
- Run concurrent request tests.
- Capture 429 responses.
- Reuse the same request collection across environments.
The four GPT API limits that matter
OpenAI applies several limits to every API key. In production, these are the ones you need to measure:
| Limit | Meaning | Why it matters |
|---|---|---|
| RPM | Requests per minute | Usually the first limit hit by many small requests |
| TPM | Tokens per minute | Hit by large prompts, RAG payloads, and high max_tokens values |
| RPD | Requests per day | Common on free and lower-tier accounts |
| IPM / TPD / batch queue limits | Endpoint-specific limits | Used by image, audio, embedding, and batch APIs |
When you exceed a limit, the API returns HTTP 429 with a body similar to:
{
"error": {
"message": "Rate limit reached for gpt-5.5 in organization org-abc on tokens per min (TPM): Limit 30000, Used 28432, Requested 3120.",
"type": "tokens",
"param": null,
"code": "rate_limit_exceeded"
}
}
Read the error body before changing your retry logic. The type and message usually tell you whether you hit:
- tokens: token-per-minute pressure.
- requests: request-per-minute pressure.
- tokens_usage_based: usage-based token pressure.
- quota/billing errors: account or payment limits, not request throttling.
A 429 caused by RPM needs a different fix than a 429 caused by TPM.
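If you want your retry logic to make that distinction automatically, a small classifier over the error body helps. This is a heuristic sketch based on the error shape shown above; the exact type, code, and message strings can vary by account and API version.

```javascript
// Heuristic classification of a 429 body with the shape shown above.
// Exact type/code/message strings vary, so treat the matching as approximate.
function classifyRateLimit(body) {
  const err = body.error ?? {};
  const text = `${err.type ?? ""} ${err.code ?? ""} ${err.message ?? ""}`;
  if (text.includes("quota")) return "billing";   // fix billing, retries will not help
  if (text.includes("tokens") || text.includes("TPM")) return "tpm"; // shrink prompts or max_tokens
  if (text.includes("requests") || text.includes("RPM")) return "rpm"; // queue or slow down
  return "unknown";
}
```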
For HTTP-level behavior, see the MDN 429 documentation and RFC 6585. For OpenAI-specific limits, retry headers, and tier behavior, bookmark the official OpenAI rate-limits guide.
How usage tiers affect GPT API limits
Your API key belongs to an OpenAI organization, and that organization has a usage tier. The tier controls your RPM and TPM caps.
OpenAI tiers are based on:
- Total account spend.
- Time since your first successful payment.
A simplified tier shape for text models looks like this:
| Tier | Spend gate | Wait gate | Text RPM | Text TPM |
|---|---|---|---|---|
| Free | none | none | 3 | 40k |
| 1 | $5 paid | none | 500 | 30k–200k by model |
| 2 | $50 paid | 7 days | 5,000 | 450k |
| 3 | $100 paid | 7 days | 5,000 | 1M |
| 4 | $250 paid | 14 days | 10,000 | 2M |
| 5 | $1,000 paid | 30 days | 10,000 | 2M+ |
These numbers are illustrative. Exact caps change over time and vary by model. Always verify your live limits from the dashboard or response headers before sizing production traffic.
Two operational details matter:
- Tier promotion is automatic. When your spend crosses a tier gate and the wait gate has passed, later requests can run against higher caps.
- Limits can change after billing changes. Payment failures, account inactivity, or spend limits can affect access. Re-test after billing changes.
For comparison with other providers, see the OpenAI API user rate limits explainer, Claude API rate limits guide, and Grok-3 API rate limits guide.
Read live GPT limits from response headers
You do not need to guess your current limits. GPT API responses include rate-limit headers.
Look for:
x-ratelimit-limit-requests
x-ratelimit-remaining-requests
x-ratelimit-limit-tokens
x-ratelimit-remaining-tokens
You may also see reset headers:
x-ratelimit-reset-requests
x-ratelimit-reset-tokens
Those reset values tell you how long until a bucket refills, for example 6s or 1m30s.
Use this workflow:
- Send one cheap GPT request.
- Inspect the response headers.
- Record RPM and TPM.
- Run a small burst test to confirm behavior at the limit.
Step 1: configure a GPT request in Apidog
Create a new Apidog project and add a request.
POST https://api.openai.com/v1/chat/completions
Add these headers:
| Key | Value |
|---|---|
| Authorization | Bearer {{OPENAI_API_KEY}} |
| Content-Type | application/json |
Use an Apidog environment variable for the API key:
OPENAI_API_KEY=sk-...
This keeps the key out of the saved request. You can create separate environments for:
- Local testing.
- Staging.
- Production.
- Personal keys.
- Shared organization keys.
In the request body, choose JSON and paste:
{
"model": "gpt-5.5",
"messages": [
{
"role": "user",
"content": "ping"
}
],
"max_tokens": 10
}
Send the request.
Then open the response headers and find the x-ratelimit-* values. These are your current limits for that model and endpoint.
Record at least:
x-ratelimit-limit-requests
x-ratelimit-remaining-requests
x-ratelimit-limit-tokens
x-ratelimit-remaining-tokens
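If you also want this check outside Apidog, for example in a quick script or CI smoke test, a minimal fetch sketch looks like this. It assumes OPENAI_API_KEY is set in the environment and reuses the gpt-5.5 ping body from above.

```javascript
// Sketch: send one cheap request and print the current rate-limit headers.
const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-5.5",
    messages: [{ role: "user", content: "ping" }],
    max_tokens: 10,
  }),
});

for (const name of [
  "x-ratelimit-limit-requests",
  "x-ratelimit-remaining-requests",
  "x-ratelimit-limit-tokens",
  "x-ratelimit-remaining-tokens",
]) {
  console.log(name, res.headers.get(name));
}
```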
If you want a more detailed Apidog setup walkthrough, see how to test the ChatGPT API with Apidog.
Step 2: confirm RPM with a burst test
A single request shows your limits, but it does not prove how your app behaves near the cap. Use a small burst test to reproduce throttling.
In Apidog:
- Open the saved GPT request.
- Click the dropdown next to Send.
- Choose Run in Test Scenario.
- Configure a short concurrent run.
Example settings:
Iterations: 50
Concurrency: 10
Delay between iterations: 0 ms
Run the scenario.
Useful outcomes:
| Result | Meaning |
|---|---|
| Some responses return 429 | You confirmed the cap and can inspect the failure body |
| All responses return 200 | Your limit is higher than the burst, or another dimension is not exhausted |
| Headers decrement predictably | Your observed traffic matches the reported limit |
After the run, sort responses by status code. Open each 429 and inspect the body.
If the message says RPM, your app is sending too many requests per minute. If it says TPM, your prompts or completions are too large.
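If you prefer to reproduce the same burst from code instead of Apidog, a sketch like the one below counts throttled responses. callGpt is a hypothetical helper that sends the ping request above and returns the raw fetch response; the burst deliberately skips any throttling safeguards because the point is to hit the cap.

```javascript
// Fire a small burst and count how many responses were throttled.
// callGpt() is a hypothetical helper returning the raw fetch Response.
const results = await Promise.all(
  Array.from({ length: 50 }, () => callGpt().then(res => res.status))
);

const throttled = results.filter(status => status === 429).length;
console.log(`429s: ${throttled} of ${results.length}`);
```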
For more examples of 429 responses, see the rate limit exceeded guide.
Step 3: separate RPM failures from TPM failures
The previous test uses tiny requests, so it mostly tests RPM. To test TPM, send fewer requests with larger payloads.
Change the body to something like:
{
"model": "gpt-5.5",
"messages": [
{
"role": "system",
"content": "<3,000 tokens of context here>"
},
{
"role": "user",
"content": "Summarise the above in one sentence."
}
],
"max_tokens": 200
}
Then run a smaller scenario:
Iterations: 20
Concurrency: 5
Delay between iterations: 0 ms
If you are on a low TPM tier, you may hit token limits before request limits.
Use this decision table:
| If you hit | Common cause | Fix |
|---|---|---|
| RPM | Too many small calls | Queue, batch, stagger, or cap concurrency |
| TPM | Large prompts or high max_tokens | Trim prompts, split requests, cache context, lower max_tokens |
| Daily quota | Free or lower-tier daily cap | Reduce usage, upgrade tier, or move non-urgent work to batch |
| Billing quota | Spend cap, failed payment, or zero balance | Fix billing settings |
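If TPM is your usual failure mode, it helps to estimate the token cost of a request before sending it. A rough sketch, assuming the common ~4 characters per token heuristic; real counts vary by model and tokenizer, so use a proper tokenizer when accuracy matters.

```javascript
// Very rough token estimate: ~4 characters per token for English text.
// Swap in a real tokenizer (for example a tiktoken port) for accurate counts.
function estimateRequestTokens(messages, maxTokens) {
  const promptChars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(promptChars / 4) + maxTokens;
}

const messages = [
  { role: "system", content: "You are a concise summariser." },
  { role: "user", content: "Summarise the attached report in one sentence." },
];

// Compare this against x-ratelimit-remaining-tokens before sending.
console.log(estimateRequestTokens(messages, 200));
```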
Step 4: simulate concurrent users
Production traffic is rarely a clean burst. You usually have:
- Multiple users.
- Different prompt sizes.
- Spikes on top of baseline traffic.
- Retries from your own app.
- Background jobs competing with user-facing requests.
In Apidog, create a test scenario with small, medium, and large request variants.
Example request sizes:
| Variant | Purpose |
|---|---|
| Small | Chat message or short classification |
| Medium | Typical app request |
| Large | RAG request with retrieved context |
Use pre-request or post-request scripts to:
- Pick a random request body.
- Add random sleep between requests.
- Read x-ratelimit-remaining-tokens.
- Stop the scenario when remaining tokens drop below a threshold.
- Track latency separately for 200 and 429 responses.
Pseudo-logic:
const remainingTokens = Number(response.headers.get("x-ratelimit-remaining-tokens"));
if (remainingTokens < 5000) {
// Stop or slow down the scenario before hard throttling
}
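For the variant-picking step, a pre-request sketch along these lines works if you use Apidog's Postman-style scripting object (pm); the variable name gptBody and the variant contents are placeholders to adapt to your own scenario.

```javascript
// Pre-request sketch: pick a random body variant and store it in a variable
// that the request body can reference as {{gptBody}}.
const variants = {
  small:  { model: "gpt-5.5", max_tokens: 10,  messages: [{ role: "user", content: "ping" }] },
  medium: { model: "gpt-5.5", max_tokens: 100, messages: [{ role: "user", content: "Classify this ticket: ..." }] },
  large:  { model: "gpt-5.5", max_tokens: 200, messages: [{ role: "user", content: "<retrieved context here>" }] },
};

const keys = Object.keys(variants);
const pick = keys[Math.floor(Math.random() * keys.length)];
pm.variables.set("gptBody", JSON.stringify(variants[pick]));
```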
When the run finishes, keep the status-code histogram in your runbook. The next time someone asks, “Are we rate-limited?”, rerun the same scenario and compare results.
What to do when GPT requests get throttled
Once you know which limit failed, use the right mitigation.
1. Back off using reset headers
For 429 responses, inspect:
x-ratelimit-reset-requests
x-ratelimit-reset-tokens
Use the relevant reset header as your first retry delay.
A basic exponential-backoff wrapper looks like this:
async function callWithRetry(fn, maxRetries = 5) {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
const res = await fn();
if (res.status !== 429) {
return res;
}
const resetTokens = res.headers.get("x-ratelimit-reset-tokens");
const resetRequests = res.headers.get("x-ratelimit-reset-requests");
const delayMs = parseResetHeader(resetTokens || resetRequests) ?? Math.pow(2, attempt) * 1000;
await sleep(delayMs);
}
throw new Error("GPT API request failed after retries");
}
function sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
function parseResetHeader(value) {
  // Reset headers look like "250ms", "6s", or combined values such as "1m30s".
  if (!value) return null;
  const match = value.match(/^(?:(\d+)m(?!s))?(?:([\d.]+)s)?(?:(\d+)ms)?$/);
  if (!match) return null;
  const ms =
    (Number(match[1]) || 0) * 60 * 1000 +
    (Number(match[2]) || 0) * 1000 +
    (Number(match[3]) || 0);
  return ms > 0 ? ms : null;
}
In production, also add jitter to avoid synchronized retries.
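One way to add that jitter is to randomize a fraction of the computed delay:

```javascript
// Add up to 25% random jitter so parallel workers do not retry in lockstep.
function withJitter(delayMs, spread = 0.25) {
  return delayMs * (1 + Math.random() * spread);
}

// In callWithRetry above: await sleep(withJitter(delayMs));
```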
2. Queue requests below your cap
If traffic is bursty, do not send every request immediately. Put requests on a queue and drain below your RPM or TPM ceiling.
A simple rule:
safe_rpm = observed_rpm_limit * 0.7
safe_tpm = observed_tpm_limit * 0.7
Then configure workers to stay under those values.
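A minimal drain loop that paces a worker under safe_rpm could look like the sketch below. callGpt and the queue contents are placeholders, and a production version also needs error handling plus a TPM check.

```javascript
// Sketch: drain a queue at a pace that stays under safe_rpm.
// callGpt() is a hypothetical helper that sends one chat completion.
const observedRpmLimit = 500; // from x-ratelimit-limit-requests
const safeRpm = Math.floor(observedRpmLimit * 0.7);
const intervalMs = 60_000 / safeRpm;

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function drainQueue(queue) {
  for (const job of queue) {
    await callGpt(job);       // one request at a time
    await sleep(intervalMs);  // pacing keeps this worker under the RPM ceiling
  }
}
```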
For implementation patterns, see how to implement API rate limiting and implementing rate limiting in APIs.
3. Batch non-urgent work
If the workload does not need an immediate response, move it away from synchronous user traffic.
Good candidates:
- Overnight enrichment.
- Bulk classification.
- Document processing.
- Embedding rebuilds.
- Data cleanup jobs.
OpenAI’s Batch API is designed for asynchronous workloads and can free synchronous quota for user-facing requests.
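A sketch of the usual flow with the official openai Node SDK, assuming your requests are already written to a jobs.jsonl file (one JSON object per line with a custom_id, method, url, and body):

```javascript
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Upload the JSONL file of requests, then create a batch against it.
const file = await client.files.create({
  file: fs.createReadStream("jobs.jsonl"),
  purpose: "batch",
});

const batch = await client.batches.create({
  input_file_id: file.id,
  endpoint: "/v1/chat/completions",
  completion_window: "24h",
});

console.log(batch.id, batch.status); // poll this batch until it completes
```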
For terminology around throttling and limits, see throttle vs. rate limit.
Common GPT 429 errors
Rate limit reached ... on requests per min (RPM)
Your code is sending too many calls per minute.
Fixes:
- Limit worker concurrency.
- Avoid unbounded Promise.all.
- Add a queue.
- Stagger background jobs.
- Batch small tasks where possible.
Example anti-pattern:
await Promise.all(records.map(record => callGpt(record)));
Safer pattern:
import pLimit from "p-limit";
const limit = pLimit(5);
await Promise.all(
records.map(record => limit(() => callGpt(record)))
);
Rate limit reached ... on tokens per min (TPM)
Your requests consume too many tokens per minute.
Fixes:
- Reduce system prompt size.
- Lower max_tokens.
- Split large documents.
- Avoid sending full documents when excerpts are enough.
- Cache repeated context.
- Review RAG chunking and retrieval count.
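For the trimming and max_tokens fixes, even a crude character budget helps. A sketch, assuming you keep the most recent context when you truncate; retrievedContext stands in for whatever your RAG step returned.

```javascript
// Crude context trim: keep the tail of the context within a character budget.
// A real implementation should cut on sentence or chunk boundaries instead.
function trimContext(text, maxChars = 12_000) {
  return text.length <= maxChars ? text : text.slice(-maxChars);
}

const retrievedContext = "<retrieved context here>"; // placeholder for RAG output

const body = {
  model: "gpt-5.5",
  max_tokens: 150, // keep the completion ceiling low, too
  messages: [
    { role: "system", content: trimContext(retrievedContext) },
    { role: "user", content: "Summarise the above in one sentence." },
  ],
};
```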
You exceeded your current quota, please check your plan and billing details
This is usually a billing or quota issue, not a normal rate-limit problem.
Check:
- Monthly spend cap.
- Payment method.
- Prepaid balance.
- Organization billing settings.
- Project-level limits.
Retry logic will not fix this class of error.
FAQ
Does Apidog cost anything to test GPT rate limits?
No. The free plan supports single-request testing and small concurrent test runs. Larger test loads, team workspaces, and scheduled runs may require a paid plan. See Apidog pricing.
Can I test rate limits without spending many tokens?
Partially.
For a cheap baseline check, send a tiny request:
{
"model": "gpt-5.5",
"messages": [
{
"role": "user",
"content": "x"
}
],
"max_tokens": 1
}
The response headers still include rate-limit data.
Burst tests do consume real tokens, so keep each request small unless you are specifically testing TPM. For offline retry testing, use Apidog’s mock server to simulate 429 responses without calling OpenAI.
Why does my tier 1 key behave differently from a colleague’s tier 1 key?
Limits are usually applied at the organization level, not just the key level. If your key belongs to an organization with other active users, their traffic can consume shared capacity.
To compare:
- Save the same Apidog request.
- Create one environment per key.
- Run the request with each environment.
- Compare x-ratelimit-remaining-tokens and x-ratelimit-remaining-requests.
How do I know which model has which limit?
Send one cheap request to each model and read the headers.
Do not rely only on static tables. Model limits can vary by:
- Model family.
- Snapshot version.
- Endpoint.
- Organization tier.
- Project settings.
For example, gpt-5.5 and gpt-5.5-0901 may not have the same caps.
Do streaming requests count differently?
Yes, especially for TPM. A streaming request can reserve tokens based on max_tokens, even if the actual completion is shorter.
Set max_tokens to the smallest realistic ceiling for the response.
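A request sketch that follows that advice while streaming; the 150-token ceiling is an arbitrary example to adjust for your own responses.

```javascript
// Streaming sketch: stream: true plus a tight max_tokens ceiling,
// so the TPM reservation stays close to what the response actually uses.
const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-5.5",
    stream: true,
    max_tokens: 150, // smallest realistic ceiling for this response
    messages: [{ role: "user", content: "Summarise this thread in two sentences." }],
  }),
});
```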
For more on streaming setup, see how to test the ChatGPT API with Apidog.
Can I share my Apidog rate-limit test with my team?
Yes. Save the request and test scenario in a shared Apidog project. Teammates can run the same scenario with their own keys by switching environments.
That turns “is my key throttled or is the whole org throttled?” into a quick repeatable test.