You ship a function that calls the GPT API. It works in staging. Then the first hundred production users hit it, and your logs fill with 429 Too Many Requests. Before changing code, you need to identify which limit failed: requests per minute, tokens per minute, daily quota, usage tier, or a model-specific cap.
💡 This guide shows you how to verify live GPT API limits with response headers, reproduce rate-limit failures with a small load test in Apidog, and save the workflow as a reusable request collection for your team.
OpenAI rate limits vary by model, endpoint, and usage tier. GPT-5.5 may not have the same limits as GPT-4.1. Image, audio, embedding, and batch endpoints use different dimensions. Your tier can also change as your spend grows.
Apidog gives you one workspace to:
- Send GPT API requests.
- Inspect x-ratelimit-* response headers.
- Run concurrent request tests.
- Capture 429 responses.
- Reuse the same request collection across environments.
The four GPT API limits that matter
OpenAI applies several limits to every API key. In production, these are the ones you need to measure:
| Limit | Meaning | Why it matters |
|---|---|---|
| RPM | Requests per minute | Usually the first limit hit by many small requests |
| TPM | Tokens per minute | Hit by large prompts, RAG payloads, and high max_tokens values |
| RPD | Requests per day | Common on free and lower-tier accounts |
| IPM / TPD / batch queue limits | Endpoint-specific limits | Used by image, audio, embedding, and batch APIs |
When you exceed a limit, the API returns HTTP 429 with a body similar to:
{
"error": {
"message": "Rate limit reached for gpt-5.5 in organization org-abc on tokens per min (TPM): Limit 30000, Used 28432, Requested 3120.",
"type": "tokens",
"param": null,
"code": "rate_limit_exceeded"
}
}
Read the error body before changing your retry logic. The type and message usually tell you whether you hit:
- tokens: token-per-minute pressure.
- requests: request-per-minute pressure.
- tokens_usage_based: usage-based token pressure.
- quota/billing errors: account or payment limits, not request throttling.
A 429 caused by RPM needs a different fix than a 429 caused by TPM.
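If you want your retry logic to make that distinction automatically, a small classifier over the error body helps. This is a heuristic sketch based on the error shape shown above; the exact type, code, and message strings can vary by account and API version.

```javascript
// Heuristic classification of a 429 body with the shape shown above.
// Exact type/code/message strings vary, so treat the matching as approximate.
function classifyRateLimit(body) {
  const err = body.error ?? {};
  const text = `${err.type ?? ""} ${err.code ?? ""} ${err.message ?? ""}`;
  if (text.includes("quota")) return "billing";   // fix billing, retries will not help
  if (text.includes("tokens") || text.includes("TPM")) return "tpm"; // shrink prompts or max_tokens
  if (text.includes("requests") || text.includes("RPM")) return "rpm"; // queue or slow down
  return "unknown";
}
```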
For HTTP-level behavior, see the MDN 429 documentation and RFC 6585. For OpenAI-specific limits, retry headers, and tier behavior, bookmark the official OpenAI rate-limits guide.
How usage tiers affect GPT API limits
Your API key belongs to an OpenAI organization, and that organization has a usage tier. The tier controls your RPM and TPM caps.
OpenAI tiers are based on:
- Total account spend.
- Time since your first successful payment.
A simplified tier shape for text models looks like this:
| Tier | Spend gate | Wait gate | Text RPM | Text TPM |
|---|---|---|---|---|
| Free | none | none | 3 | 40k |
| 1 | $5 paid | none | 500 | 30k–200k by model |
| 2 | $50 paid | 7 days | 5,000 | 450k |
| 3 | $100 paid | 7 days | 5,000 | 1M |
| 4 | $250 paid | 14 days | 10,000 | 2M |
| 5 | $1,000 paid | 30 days | 10,000 | 2M+ |
These numbers are illustrative. Exact caps change over time and vary by model. Always verify your live limits from the dashboard or response headers before sizing production traffic.
Two operational details matter:
- Tier promotion is automatic. When your spend crosses a tier gate and the wait gate has passed, later requests can run against higher caps.
- Limits can change after billing changes. Payment failures, account inactivity, or spend limits can affect access. Re-test after billing changes.
For comparison with other providers, see the OpenAI API user rate limits explainer, Claude API rate limits guide, and Grok-3 API rate limits guide.
Read live GPT limits from response headers
You do not need to guess your current limits. GPT API responses include rate-limit headers.
Look for:
x-ratelimit-limit-requests
x-ratelimit-remaining-requests
x-ratelimit-limit-tokens
x-ratelimit-remaining-tokens
You may also see reset headers:
x-ratelimit-reset-requests
x-ratelimit-reset-tokens
Those reset values tell you how long until a bucket refills, for example 6s or 1m30s.
Use this workflow:
- Send one cheap GPT request.
- Inspect the response headers.
- Record RPM and TPM.
- Run a small burst test to confirm behavior at the limit.
Step 1: configure a GPT request in Apidog
Create a new Apidog project and add a request.
POST https://api.openai.com/v1/chat/completions
Add these headers:
| Key | Value |
|---|---|
| Authorization | Bearer {{OPENAI_API_KEY}} |
| Content-Type | application/json |
Use an Apidog environment variable for the API key:
OPENAI_API_KEY=sk-...
This keeps the key out of the saved request. You can create separate environments for:
- Local testing.
- Staging.
- Production.
- Personal keys.
- Shared organization keys.
In the request body, choose JSON and paste:
{
"model": "gpt-5.5",
"messages": [
{
"role": "user",
"content": "ping"
}
],
"max_tokens": 10
}
Send the request.
Then open the response headers and find the x-ratelimit-* values. These are your current limits for that model and endpoint.
Record at least:
x-ratelimit-limit-requests
x-ratelimit-remaining-requests
x-ratelimit-limit-tokens
x-ratelimit-remaining-tokens
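If you also want this check outside Apidog, for example in a quick script or CI smoke test, a minimal fetch sketch looks like this. It assumes OPENAI_API_KEY is set in the environment and reuses the gpt-5.5 ping body from above.

```javascript
// Sketch: send one cheap request and print the current rate-limit headers.
const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-5.5",
    messages: [{ role: "user", content: "ping" }],
    max_tokens: 10,
  }),
});

for (const name of [
  "x-ratelimit-limit-requests",
  "x-ratelimit-remaining-requests",
  "x-ratelimit-limit-tokens",
  "x-ratelimit-remaining-tokens",
]) {
  console.log(name, res.headers.get(name));
}
```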
If you want a more detailed Apidog setup walkthrough, see how to test the ChatGPT API with Apidog.
Step 2: confirm RPM with a burst test
A single request shows your limits, but it does not prove how your app behaves near the cap. Use a small burst test to reproduce throttling.
In Apidog:
- Open the saved GPT request.
- Click the dropdown next to Send.
- Choose Run in Test Scenario.
- Configure a short concurrent run.
Example settings:
Iterations: 50
Concurrency: 10
Delay between iterations: 0 ms
Run the scenario.
Useful outcomes:
| Result | Meaning |
|---|---|
| Some responses return 429 | You confirmed the cap and can inspect the failure body |
| All responses return 200 | Your limit is higher than the burst, or another dimension is not exhausted |
| Headers decrement predictably | Your observed traffic matches the reported limit |
After the run, sort responses by status code. Open each 429 and inspect the body.
If the message says RPM, your app is sending too many requests per minute. If it says TPM, your prompts or completions are too large.
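If you prefer to reproduce the same burst from code instead of Apidog, a sketch like the one below counts throttled responses. callGpt is a hypothetical helper that sends the ping request above and returns the raw fetch response; the burst deliberately skips any throttling safeguards because the point is to hit the cap.

```javascript
// Fire a small burst and count how many responses were throttled.
// callGpt() is a hypothetical helper returning the raw fetch Response.
const results = await Promise.all(
  Array.from({ length: 50 }, () => callGpt().then(res => res.status))
);

const throttled = results.filter(status => status === 429).length;
console.log(`429s: ${throttled} of ${results.length}`);
```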
For more examples of 429 responses, see the rate limit exceeded guide.
Step 3: separate RPM failures from TPM failures
The previous test uses tiny requests, so it mostly tests RPM. To test TPM, send fewer requests with larger payloads.
Change the body to something like:
{
"model": "gpt-5.5",
"messages": [
{
"role": "system",
"content": "<3,000 tokens of context here>"
},
{
"role": "user",
"content": "Summarise the above in one sentence."
}
],
"max_tokens": 200
}
Then run a smaller scenario:
Iterations: 20
Concurrency: 5
Delay between iterations: 0 ms
If you are on a low TPM tier, you may hit token limits before request limits.
Use this decision table:
| If you hit | Common cause | Fix |
|---|---|---|
| RPM | Too many small calls | Queue, batch, stagger, or cap concurrency |
| TPM | Large prompts or high max_tokens | Trim prompts, split requests, cache context, lower max_tokens |
| Daily quota | Free or lower-tier daily cap | Reduce usage, upgrade tier, or move non-urgent work to batch |
| Billing quota | Spend cap, failed payment, or zero balance | Fix billing settings |
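If TPM is your usual failure mode, it helps to estimate the token cost of a request before sending it. A rough sketch, assuming the common ~4 characters per token heuristic; real counts vary by model and tokenizer, so use a proper tokenizer when accuracy matters.

```javascript
// Very rough token estimate: ~4 characters per token for English text.
// Swap in a real tokenizer (for example a tiktoken port) for accurate counts.
function estimateRequestTokens(messages, maxTokens) {
  const promptChars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(promptChars / 4) + maxTokens;
}

const messages = [
  { role: "system", content: "You are a concise summariser." },
  { role: "user", content: "Summarise the attached report in one sentence." },
];

// Compare this against x-ratelimit-remaining-tokens before sending.
console.log(estimateRequestTokens(messages, 200));
```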
Step 4: simulate concurrent users
Production traffic is rarely a clean burst. You usually have:
- Multiple users.
- Different prompt sizes.
- Spikes on top of baseline traffic.
- Retries from your own app.
- Background jobs competing with user-facing requests.
In Apidog, create a test scenario with small, medium, and large request variants.
Example request sizes:
| Variant | Purpose |
|---|---|
| Small | Chat message or short classification |
| Medium | Typical app request |
| Large | RAG request with retrieved context |
Use pre-request or post-request scripts to:
- Pick a random request body.
- Add random sleep between requests.
- Read x-ratelimit-remaining-tokens.
- Stop the scenario when remaining tokens drop below a threshold.
- Track latency separately for 200 and 429 responses.
Pseudo-logic:
const remainingTokens = Number(response.headers.get("x-ratelimit-remaining-tokens"));
if (remainingTokens < 5000) {
// Stop or slow down the scenario before hard throttling
}
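For the variant-picking step, a pre-request sketch along these lines works if you use Apidog's Postman-style scripting object (pm); the variable name gptBody and the variant contents are placeholders to adapt to your own scenario.

```javascript
// Pre-request sketch: pick a random body variant and store it in a variable
// that the request body can reference as {{gptBody}}.
const variants = {
  small:  { model: "gpt-5.5", max_tokens: 10,  messages: [{ role: "user", content: "ping" }] },
  medium: { model: "gpt-5.5", max_tokens: 100, messages: [{ role: "user", content: "Classify this ticket: ..." }] },
  large:  { model: "gpt-5.5", max_tokens: 200, messages: [{ role: "user", content: "<retrieved context here>" }] },
};

const keys = Object.keys(variants);
const pick = keys[Math.floor(Math.random() * keys.length)];
pm.variables.set("gptBody", JSON.stringify(variants[pick]));
```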
When the run finishes, keep the status-code histogram in your runbook. The next time someone asks, “Are we rate-limited?”, rerun the same scenario and compare results.
What to do when GPT requests get throttled
Once you know which limit failed, use the right mitigation.
1. Back off using reset headers
For 429 responses, inspect:
x-ratelimit-reset-requests
x-ratelimit-reset-tokens
Use the relevant reset header as your first retry delay.
A basic exponential-backoff wrapper looks like this:
async function callWithRetry(fn, maxRetries = 5) {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
const res = await fn();
if (res.status !== 429) {
return res;
}
const resetTokens = res.headers.get("x-ratelimit-reset-tokens");
const resetRequests = res.headers.get("x-ratelimit-reset-requests");
const delayMs = parseResetHeader(resetTokens || resetRequests) ?? Math.pow(2, attempt) * 1000;
await sleep(delayMs);
}
throw new Error("GPT API request failed after retries");
}
function sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
function parseResetHeader(value) {
  // Reset headers look like "250ms", "6s", or combined values such as "1m30s".
  if (!value) return null;
  const match = value.match(/^(?:(\d+)m(?!s))?(?:([\d.]+)s)?(?:(\d+)ms)?$/);
  if (!match) return null;
  const ms =
    (Number(match[1]) || 0) * 60 * 1000 +
    (Number(match[2]) || 0) * 1000 +
    (Number(match[3]) || 0);
  return ms > 0 ? ms : null;
}
In production, also add jitter to avoid synchronized retries.
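One way to add that jitter is to randomize a fraction of the computed delay:

```javascript
// Add up to 25% random jitter so parallel workers do not retry in lockstep.
function withJitter(delayMs, spread = 0.25) {
  return delayMs * (1 + Math.random() * spread);
}

// In callWithRetry above: await sleep(withJitter(delayMs));
```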
2. Queue requests below your cap
If traffic is bursty, do not send every request immediately. Put requests on a queue and drain below your RPM or TPM ceiling.
A simple rule:
safe_rpm = observed_rpm_limit * 0.7
safe_tpm = observed_tpm_limit * 0.7
Then configure workers to stay under those values.
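A minimal drain loop that paces a worker under safe_rpm could look like the sketch below. callGpt and the queue contents are placeholders, and a production version also needs error handling plus a TPM check.

```javascript
// Sketch: drain a queue at a pace that stays under safe_rpm.
// callGpt() is a hypothetical helper that sends one chat completion.
const observedRpmLimit = 500; // from x-ratelimit-limit-requests
const safeRpm = Math.floor(observedRpmLimit * 0.7);
const intervalMs = 60_000 / safeRpm;

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function drainQueue(queue) {
  for (const job of queue) {
    await callGpt(job);       // one request at a time
    await sleep(intervalMs);  // pacing keeps this worker under the RPM ceiling
  }
}
```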
For implementation patterns, see how to implement API rate limiting and implementing rate limiting in APIs.
3. Batch non-urgent work
If the workload does not need an immediate response, move it away from synchronous user traffic.
Good candidates:
- Overnight enrichment.
- Bulk classification.
- Document processing.
- Embedding rebuilds.
- Data cleanup jobs.
OpenAI’s Batch API is designed for asynchronous workloads and can free synchronous quota for user-facing requests.
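A sketch of the usual flow with the official openai Node SDK, assuming your requests are already written to a jobs.jsonl file (one JSON object per line with a custom_id, method, url, and body):

```javascript
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Upload the JSONL file of requests, then create a batch against it.
const file = await client.files.create({
  file: fs.createReadStream("jobs.jsonl"),
  purpose: "batch",
});

const batch = await client.batches.create({
  input_file_id: file.id,
  endpoint: "/v1/chat/completions",
  completion_window: "24h",
});

console.log(batch.id, batch.status); // poll this batch until it completes
```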
For terminology around throttling and limits, see throttle vs. rate limit.
Common GPT 429 errors
Rate limit reached ... on requests per min (RPM)
Your code is sending too many calls per minute.
Fixes:
- Limit worker concurrency.
- Avoid unbounded Promise.all.
- Add a queue.
- Stagger background jobs.
- Batch small tasks where possible.
Example anti-pattern:
await Promise.all(records.map(record => callGpt(record)));
Safer pattern:
import pLimit from "p-limit";
const limit = pLimit(5);
await Promise.all(
records.map(record => limit(() => callGpt(record)))
);
Rate limit reached ... on tokens per min (TPM)
Your requests consume too many tokens per minute.
Fixes:
- Reduce system prompt size.
- Lower max_tokens.
- Split large documents.
- Avoid sending full documents when excerpts are enough.
- Cache repeated context.
- Review RAG chunking and retrieval count.
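For the trimming and max_tokens fixes, even a crude character budget helps. A sketch, assuming you keep the most recent context when you truncate; retrievedContext stands in for whatever your RAG step returned.

```javascript
// Crude context trim: keep the tail of the context within a character budget.
// A real implementation should cut on sentence or chunk boundaries instead.
function trimContext(text, maxChars = 12_000) {
  return text.length <= maxChars ? text : text.slice(-maxChars);
}

const retrievedContext = "<retrieved context here>"; // placeholder for RAG output

const body = {
  model: "gpt-5.5",
  max_tokens: 150, // keep the completion ceiling low, too
  messages: [
    { role: "system", content: trimContext(retrievedContext) },
    { role: "user", content: "Summarise the above in one sentence." },
  ],
};
```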
You exceeded your current quota, please check your plan and billing details
This is usually a billing or quota issue, not a normal rate-limit problem.
Check:
- Monthly spend cap.
- Payment method.
- Prepaid balance.
- Organization billing settings.
- Project-level limits.
Retry logic will not fix this class of error.
FAQ
Does Apidog cost anything to test GPT rate limits?
No. The free plan supports single-request testing and small concurrent test runs. Larger test loads, team workspaces, and scheduled runs may require a paid plan. See Apidog pricing.
Can I test rate limits without spending many tokens?
Partially.
For a cheap baseline check, send a tiny request:
{
"model": "gpt-5.5",
"messages": [
{
"role": "user",
"content": "x"
}
],
"max_tokens": 1
}
The response headers still include rate-limit data.
Burst tests do consume real tokens, so keep each request small unless you are specifically testing TPM. For offline retry testing, use Apidog’s mock server to simulate 429 responses without calling OpenAI.
Why does my tier 1 key behave differently from a colleague’s tier 1 key?
Limits are usually applied at the organization level, not just the key level. If your key belongs to an organization with other active users, their traffic can consume shared capacity.
To compare:
- Save the same Apidog request.
- Create one environment per key.
- Run the request with each environment.
- Compare x-ratelimit-remaining-tokens and x-ratelimit-remaining-requests.
How do I know which model has which limit?
Send one cheap request to each model and read the headers.
Do not rely only on static tables. Model limits can vary by:
- Model family.
- Snapshot version.
- Endpoint.
- Organization tier.
- Project settings.
For example, gpt-5.5 and gpt-5.5-0901 may not have the same caps.
Do streaming requests count differently?
Yes, especially for TPM. A streaming request can reserve tokens based on max_tokens, even if the actual completion is shorter.
Set max_tokens to the smallest realistic ceiling for the response.
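A request sketch that follows that advice while streaming; the 150-token ceiling is an arbitrary example to adjust for your own responses.

```javascript
// Streaming sketch: stream: true plus a tight max_tokens ceiling,
// so the TPM reservation stays close to what the response actually uses.
const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-5.5",
    stream: true,
    max_tokens: 150, // smallest realistic ceiling for this response
    messages: [{ role: "user", content: "Summarise this thread in two sentences." }],
  }),
});
```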
For more on streaming setup, see how to test the ChatGPT API with Apidog.
Can I share my Apidog rate-limit test with my team?
Yes. Save the request and test scenario in a shared Apidog project. Teammates can run the same scenario with their own keys by switching environments.
That turns “is my key throttled or is the whole org throttled?” into a quick repeatable test.