I hit OpenAI's rate limit at 2:47am on a Tuesday, 40 minutes into a bulk generation job for 800 articles.
The job crashed. Not gracefully — it just stopped. No retry, no queue fallback, no partial save. When I woke up, I had 312 completed articles sitting in a database and 488 that existed only as prompts in a spreadsheet.
That was six months ago. Since then, I've rebuilt PostAll's content generation pipeline three times. What follows is everything I've learned about handling LLM rate limits in a way that doesn't eat your users' data.
Why Rate Limits Hit Harder Than You Expect
OpenAI's limits come in two flavors that are easy to confuse until they separately ruin your day:
- RPM (Requests Per Minute) — how many API calls you can make
- TPM (Tokens Per Minute) — how many tokens you can process
You can hit TPM while staying well under RPM. This happens when your prompts are long (system prompt + few-shot examples + user content adds up fast). My 800-article job hit TPM about 40 minutes in because each article prompt was ~2,100 tokens — I'd done the math on RPM and felt safe. I hadn't done it on TPM.
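A quick way to run that check against both limits before starting a batch is sketched here. It assumes a 50 RPM / 30,000 TPM tier (the numbers used in this article's examples), a 2,100-token prompt, and a 1,500-token output budget, and it conservatively counts the whole output budget against TPM. The helper name `sustainableRequestsPerMinute` is made up for illustration.

```javascript
// Hypothetical helper: how many requests per minute fit under BOTH limits.
// Assumption: each request is budgeted as prompt tokens + max output tokens,
// which is deliberately conservative.
function sustainableRequestsPerMinute({ rpmLimit, tpmLimit, promptTokens, maxOutputTokens }) {
  const tokensPerRequest = promptTokens + maxOutputTokens;
  const byTokens = Math.floor(tpmLimit / tokensPerRequest);
  return {
    tokensPerRequest,
    byRequests: rpmLimit,
    byTokens,
    effective: Math.min(rpmLimit, byTokens),
    bottleneck: byTokens < rpmLimit ? 'TPM' : 'RPM',
  };
}

const plan = sustainableRequestsPerMinute({
  rpmLimit: 50,
  tpmLimit: 30_000,
  promptTokens: 2_100,
  maxOutputTokens: 1_500,
});
console.log(plan);
```

Under these assumed numbers the token limit allows about 8 requests a minute while the request limit allows 50, so TPM is the binding constraint by a wide margin.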
Here's what a 429 Too Many Requests response from OpenAI actually looks like:
// Simplified shape of the error object you'll receive
{
  status: 429,
  error: {
    message: "Rate limit reached for gpt-4o in organization org-xxx on tokens per min (TPM): Limit 30000, Used 29847, Requested 1250. Please try again in 2.12s.",
    type: "rate_limit_error",
    code: "rate_limit_exceeded"
  },
  headers: {
    "retry-after": "2", // seconds to wait (not always accurate)
    "x-ratelimit-limit-tokens": "30000",
    "x-ratelimit-remaining-tokens": "153",
    "x-ratelimit-reset-tokens": "2.12s"
  }
}
That retry-after header is actually useful — but almost every retry tutorial I've seen ignores it and hardcodes a delay instead. We'll use it.
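As a rough sketch of reading those hints, the helper below turns the headers into a single wait time in milliseconds. It assumes `headers` is a plain object with lowercase keys, that `retry-after` is a number of seconds, and that the reset headers use duration strings such as "2.12s", "6m0s", or "17ms". The function names are hypothetical.

```javascript
const UNIT_TO_MS = { h: 3_600_000, m: 60_000, s: 1_000, ms: 1 };

// Parse a duration string like "2.12s" or "6m0s" into milliseconds.
// Returns null when the value is missing or not in the assumed format.
function parseResetDuration(value) {
  if (typeof value !== 'string' || value.trim() === '') return null;
  const text = value.trim();
  const pattern = /(\d+(?:\.\d+)?)(ms|h|m|s)/g;
  let totalMs = 0;
  let consumed = '';
  for (const [whole, amount, unit] of text.matchAll(pattern)) {
    totalMs += parseFloat(amount) * UNIT_TO_MS[unit];
    consumed += whole;
  }
  // Reject strings with anything left over, e.g. "soon" or "2 seconds"
  return consumed === text ? Math.round(totalMs) : null;
}

// Take the longest wait suggested by any of the rate limit headers.
function suggestedWaitMs(headers = {}) {
  const retryAfterSeconds = parseFloat(headers['retry-after']);
  const candidates = [
    Number.isFinite(retryAfterSeconds) ? retryAfterSeconds * 1000 : null,
    parseResetDuration(headers['x-ratelimit-reset-tokens']),
    parseResetDuration(headers['x-ratelimit-reset-requests']),
  ].filter((ms) => ms !== null);
  return candidates.length > 0 ? Math.max(...candidates) : 0;
}

console.log(suggestedWaitMs({ 'retry-after': '2', 'x-ratelimit-reset-tokens': '2.12s' }));
```

Taking the maximum is a design choice: when the whole-second `retry-after` value and the finer-grained reset header disagree, waiting for the longer one costs little and avoids an immediate second 429.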
The Three Patterns (And When Each One Breaks)
There's no single right answer here. There are three approaches, each suited to different workloads. I've used all three in PostAll at different points.
Pattern 1: Exponential Backoff with Jitter
Best for: Low-volume, latency-sensitive requests. User-facing generation where someone is waiting.
The idea: when you hit a 429, wait, then retry. Each failure doubles the wait time, plus a random jitter to prevent thundering herd problems (multiple clients all retrying at the exact same moment).
async function callWithBackoff(apiCallFn, options = {}) {
  const {
    maxRetries = 5,       // total attempts, including the first call
    baseDelay = 1000,     // 1 second
    maxDelay = 60000,     // 60 seconds cap
    jitterFactor = 0.3,   // ±30% randomization
  } = options;
  let lastError;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await apiCallFn();
    } catch (error) {
      // Only retry on rate limit errors; fail fast on auth/invalid request
      if (error.status !== 429) throw error;
      lastError = error;
      // No point sleeping when there are no attempts left
      if (attempt === maxRetries - 1) break;
      // Use the API's suggested retry time if available
      const retryAfterMs = parseRetryAfter(getHeader(error.headers, 'retry-after'));
      const exponentialDelay = Math.min(
        baseDelay * Math.pow(2, attempt),
        maxDelay
      );
      // Take the larger of: what the API says vs our exponential backoff
      const baseWait = Math.max(retryAfterMs, exponentialDelay);
      // Add jitter: randomize ±jitterFactor around baseWait,
      // but never wait less than the API asked for
      const jitter = baseWait * jitterFactor * (Math.random() * 2 - 1);
      const waitMs = Math.max(retryAfterMs, Math.round(baseWait + jitter));
      console.log(`Rate limited. Attempt ${attempt + 1}/${maxRetries}. Waiting ${waitMs}ms...`);
      await sleep(waitMs);
    }
  }
  throw new Error(`Failed after ${maxRetries} attempts: ${lastError?.message}`);
}

// Works with a plain headers object or a fetch-style Headers instance
function getHeader(headers, name) {
  if (!headers) return undefined;
  return typeof headers.get === 'function' ? headers.get(name) : headers[name];
}

function parseRetryAfter(header) {
  if (!header) return 0;
  const seconds = parseFloat(header);
  return isNaN(seconds) ? 0 : seconds * 1000;
}

const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));
Where this breaks: At any meaningful volume, backoff becomes a waiting game. If 20 concurrent requests all hit a rate limit and start backing off with doubling delays, your total job time explodes. I tested this on a 100-article batch — the backoff alone added 23 minutes. That's when you need Pattern 2.
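To see how quickly doubling delays add up, the sketch below lists the un-jittered wait that follows each consecutive 429, using the same 1-second base and 60-second cap as the defaults above. It ignores `retry-after` and jitter, and the helper name is made up for illustration.

```javascript
// Hypothetical helper: the delay after the 1st, 2nd, ... consecutive 429.
function backoffDelays(consecutiveFailures, { baseDelay = 1000, maxDelay = 60_000 } = {}) {
  const delays = [];
  for (let attempt = 0; attempt < consecutiveFailures; attempt++) {
    delays.push(Math.min(baseDelay * 2 ** attempt, maxDelay));
  }
  const totalMs = delays.reduce((sum, ms) => sum + ms, 0);
  return { delays, totalMs };
}

console.log(backoffDelays(5)); // delays: 1s, 2s, 4s, 8s, 16s
console.log(backoffDelays(8)); // the cap kicks in from the 7th failure onward
```

Five consecutive failures cost 31 seconds of waiting for one request, and eight cost just over three minutes. Multiply that by every request in a batch that is backing off at the same time and the overhead becomes the dominant cost.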
Pattern 2: Token Bucket Rate Limiter
Best for: Bulk jobs where you control the request rate. The right pattern for any batch processing over ~50 requests.
Instead of sending requests and handling rejection, a token bucket prevents you from exceeding your limits in the first place. You get a "bucket" of capacity that refills at the rate your API tier allows. Before each request, you check whether there's enough capacity left for it. If not, you wait until some frees up. The version below tracks usage over a rolling 60-second window, which approximates the same behavior.
class TokenBucketRateLimiter {
  constructor({ requestsPerMinute, tokensPerMinute }) {
    this.rpmLimit = requestsPerMinute;
    this.tpmLimit = tokensPerMinute;
    // Track usage in a rolling 60-second window
    // (strictly a sliding-window log rather than a refilling bucket)
    this.requestTimestamps = [];
    this.tokenUsageLog = []; // [{ timestamp, tokens }]
  }

  async waitForCapacity(estimatedTokens) {
    // A request larger than the whole budget would otherwise wait forever
    if (estimatedTokens >= this.tpmLimit * 0.9) {
      throw new Error(`Request needs ~${estimatedTokens} tokens, which exceeds the per-minute budget`);
    }
    while (true) {
      const now = Date.now();
      const windowStart = now - 60_000; // 60-second rolling window
      // Prune entries outside the window
      this.requestTimestamps = this.requestTimestamps.filter(t => t > windowStart);
      this.tokenUsageLog = this.tokenUsageLog.filter(e => e.timestamp > windowStart);
      const currentRPM = this.requestTimestamps.length;
      const currentTPM = this.tokenUsageLog.reduce((sum, e) => sum + e.tokens, 0);
      const rpmOk = currentRPM < this.rpmLimit * 0.9; // 90% of limit, to leave headroom
      const tpmOk = (currentTPM + estimatedTokens) < this.tpmLimit * 0.9;
      if (rpmOk && tpmOk) {
        // Record this request before returning
        this.requestTimestamps.push(now);
        this.tokenUsageLog.push({ timestamp: now, tokens: estimatedTokens });
        return;
      }
      // Wait 500ms and check again (sleep() is the helper from Pattern 1)
      // In practice, you'll rarely wait more than a couple seconds
      await sleep(500);
    }
  }
}

// Usage
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const limiter = new TokenBucketRateLimiter({
  requestsPerMinute: 50,   // your actual API tier limit
  tokensPerMinute: 30_000, // your actual API tier limit
});

async function generateArticle(prompt, estimatedTokens) {
  await limiter.waitForCapacity(estimatedTokens);
  return openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
    max_tokens: 1500,
  });
}
One thing to get right: estimating tokens before the request. You don't know exact token count until after, but a rough estimate is fine. I use Math.ceil(prompt.length / 3.5) as a heuristic — it's close enough for rate limiting purposes.
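That heuristic can be wrapped in a small function, sketched here. The 3.5 characters-per-token ratio is the one mentioned above; adding the full `max_tokens` output budget to the estimate is a conservative assumption of this sketch, and the function name is hypothetical.

```javascript
// Rough pre-request estimate: prompt characters / 3.5, plus the output budget.
// Assumption: reserving the whole max_tokens budget is safer than ignoring it.
function estimateRequestTokens(prompt, maxOutputTokens = 1500, charsPerToken = 3.5) {
  const promptTokens = Math.ceil(prompt.length / charsPerToken);
  return promptTokens + maxOutputTokens;
}

const samplePrompt = 'x'.repeat(7000); // stand-in for a ~7,000-character prompt
console.log(estimateRequestTokens(samplePrompt));

// With the limiter above:
// await limiter.waitForCapacity(estimateRequestTokens(prompt));
```

Overestimating only slows the batch slightly, while underestimating is what produces a 429, so rounding up is the safer direction.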
Where this breaks: Single-process only. If you're running multiple workers or serverless functions, each has its own limiter with no shared state. At PostAll, we hit this when we scaled to parallel workers — each one thought it had full capacity. We were effectively multiplying our request rate by our worker count.
The fix for multi-process: move rate limiting state to Redis.
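One way that shared state could look is sketched below. It is a fixed-window counter rather than the rolling window used above, chosen because it needs only atomic increments. The `redis` argument is assumed to be an ioredis-style client exposing `incrby`, `decrby`, and `expire`; the function name and key prefix are made up for illustration.

```javascript
// Fixed-window token budget shared by all workers through Redis.
// Assumption: `redis` behaves like an ioredis client (incrby/decrby/expire).
function createSharedTokenLimiter(redis, { tokensPerMinute, keyPrefix = 'llm:tpm', now = Date.now }) {
  return {
    // Resolves to true if `tokens` fit into the current minute's shared budget.
    async tryAcquire(tokens) {
      const minute = Math.floor(now() / 60_000);
      const key = `${keyPrefix}:${minute}`;
      // INCRBY is atomic, so concurrent workers cannot both claim the same capacity
      const used = await redis.incrby(key, tokens);
      // Let old windows expire on their own
      await redis.expire(key, 120);
      if (used > tokensPerMinute) {
        // Over budget: hand the reservation back and let the caller wait
        await redis.decrby(key, tokens);
        return false;
      }
      return true;
    },
  };
}

// Usage sketch (requires a running Redis and the ioredis package):
// const limiter = createSharedTokenLimiter(new Redis(), { tokensPerMinute: 27_000 });
// while (!(await limiter.tryAcquire(estimatedTokens))) await sleep(500);
```

A fixed window can let through a burst around the minute boundary, so it makes sense to set `tokensPerMinute` somewhat below the real limit.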
Pattern 3: Persistent Queue with Dead Letter Handling
Best for: Anything that cannot lose work. Long-running batch jobs, user-submitted content, anything where "we'll just retry later" isn't acceptable.
This is the pattern I should have used from the start. The core idea: jobs go into a queue before they ever touch the API. Workers pull from the queue, process, and mark complete. If they fail, they go to a dead letter queue for inspection — not silent failure.
// Using BullMQ (Redis-backed queue, works across multiple workers)
// `openai`, `db`, and `notifySlack` are assumed to be set up elsewhere in your app
import { Queue, Worker } from 'bullmq';
import Redis from 'ioredis';

const connection = new Redis({ maxRetriesPerRequest: null });

// Queue setup
const contentQueue = new Queue('content-generation', { connection });

// Add jobs: this is all you do in your API route or trigger
async function queueArticles(articles) {
  const jobs = articles.map(article => ({
    name: 'generate-article',
    data: { articleId: article.id, prompt: article.prompt },
    opts: {
      attempts: 8, // try each job up to 8 times
      backoff: {
        type: 'exponential',
        delay: 2000, // start at 2s, doubles each time
      },
      removeOnComplete: { count: 100 }, // keep last 100 completed jobs for inspection
      removeOnFail: false, // NEVER auto-remove failed jobs
    },
  }));
  await contentQueue.addBulk(jobs);
}

// Worker: runs separately, can scale horizontally
const worker = new Worker(
  'content-generation',
  async (job) => {
    const { articleId, prompt } = job.data;
    try {
      const result = await openai.chat.completions.create({
        model: "gpt-4o",
        messages: [{ role: "user", content: prompt }],
        max_tokens: 1500,
      });
      const content = result.choices[0].message.content;
      // Save before acknowledging job completion
      await db.articles.update({
        where: { id: articleId },
        data: { content, status: 'completed', completedAt: new Date() },
      });
      return { articleId, tokensUsed: result.usage.total_tokens };
    } catch (error) {
      // Record the error, but don't mark the article failed while retries remain
      await db.articles.update({
        where: { id: articleId },
        data: { lastError: error.message },
      });
      // Re-throw so BullMQ knows to retry
      throw error;
    }
  },
  {
    connection,
    concurrency: 5, // process 5 jobs at once; tune to your rate limits
    limiter: {
      max: 50, // max 50 jobs processed
      duration: 60_000, // per 60 seconds, matching your RPM limit
    },
  }
);

// Dead letter handler: jobs that exhausted all retries stay in BullMQ's failed set
worker.on('failed', async (job, error) => {
  if (!job) return; // the event can fire without a job reference
  if (job.attemptsMade >= job.opts.attempts) {
    console.error(`Job ${job.id} moved to dead letter after ${job.attemptsMade} attempts:`, error.message);
    // Only now mark the article as failed in DB so users can see the final status
    await db.articles.update({
      where: { id: job.data.articleId },
      data: { status: 'failed', lastError: error.message },
    });
    // Alert your team, not just log: this needs human review
    await notifySlack(`❌ Article ${job.data.articleId} failed permanently: ${error.message}`);
  }
});
The dead letter queue is the part most tutorials skip. A job that fails 8 times isn't gone — it's sitting in your failed jobs list in Redis, with the full error trace and all its data intact. You can inspect it, fix whatever caused the failure, and re-queue it manually.
This is what I didn't have the night 488 articles disappeared.
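Re-queueing from the failed list can be scripted once the underlying problem is fixed. The sketch below assumes `queue` is a BullMQ `Queue` and relies on its `getFailed()` method and on `job.retry()`; the function name, the `shouldRetry` filter, and the batch size are illustrative choices, not part of the pipeline described above.

```javascript
// Move selected failed jobs back into the waiting state.
// Assumption: `queue` is a BullMQ Queue (getFailed) holding BullMQ Jobs (retry).
async function requeueFailedJobs(queue, { shouldRetry = () => true, batchSize = 100 } = {}) {
  const failedJobs = await queue.getFailed(0, batchSize - 1);
  let requeued = 0;
  for (const job of failedJobs) {
    // Skip jobs that would fail again for the same reason (bad prompt, bad key, ...)
    if (!shouldRetry(job)) continue;
    await job.retry();
    requeued++;
  }
  return { inspected: failedJobs.length, requeued };
}

// Usage sketch (requires Redis and the queue from the example above):
// const summary = await requeueFailedJobs(contentQueue, {
//   shouldRetry: (job) => /rate limit/i.test(job.failedReason ?? ''),
// });
// console.log(summary);
```

Filtering on the failure reason keeps rate-limit casualties separate from jobs that need a human to fix the input first. It is worth checking how your BullMQ version treats a job's attempt counter on a manual retry.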
What I Wish I'd Known: The Gotchas
1. retry-after headers lie, but only sometimes.
OpenAI's retry-after is usually accurate for TPM limits. For RPM limits, it's sometimes optimistic. I add a 200ms buffer on top of whatever it suggests. Costs almost nothing, prevents a follow-up 429.
2. Token estimation is more important than request estimation.
Most rate limit tutorials focus on RPM. In practice, if you're using GPT-4o with system prompts and few-shot examples, you'll hit TPM first. Count your tokens. The tiktoken library for Python, or js-tiktoken for Node, gives you accurate counts for the prompt text before you send it (the chat format adds a few tokens of overhead per message).
import { encodingForModel } from 'js-tiktoken';

// Building an encoder is expensive, so reuse one per model
const encoders = new Map();

function estimateTokens(text, model = 'gpt-4o') {
  if (!encoders.has(model)) encoders.set(model, encodingForModel(model));
  return encoders.get(model).encode(text).length;
}

// Use this before queueing to set accurate TPM budget
const promptTokens = estimateTokens(systemPrompt + userPrompt);
3. Concurrency is a multiplier on your rate limit consumption.
If your rate limit is 50 RPM and you run 5 concurrent workers each attempting 50 RPM, you're trying to do 250 RPM. The limiter on each worker doesn't know about the others. Either use a shared Redis-based limiter, or divide your limit by your worker count.
4. Model-specific limits catch people off guard.
GPT-4o and GPT-3.5-turbo have completely separate rate limits on the same API key. If you're mixing models in one pipeline, each has its own bucket. Don't share a single rate limiter across them.
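Keeping one limiter per model can be as simple as a lookup keyed by model name, as in the sketch below. The limit numbers are placeholders rather than real tier values, `createLimiterRegistry` is a hypothetical helper, and the factory passed in here is a stand-in for something like the `TokenBucketRateLimiter` from Pattern 2.

```javascript
// Lazily create and reuse one limiter per model.
function createLimiterRegistry(limitsByModel, createLimiter) {
  const limiters = new Map();
  return function limiterFor(model) {
    if (!Object.hasOwn(limitsByModel, model)) {
      throw new Error(`No rate limits configured for model "${model}"`);
    }
    if (!limiters.has(model)) {
      limiters.set(model, createLimiter(limitsByModel[model]));
    }
    return limiters.get(model);
  };
}

// Placeholder limits: replace with the values shown for your own account
const limiterFor = createLimiterRegistry(
  {
    'gpt-4o': { requestsPerMinute: 50, tokensPerMinute: 30_000 },
    'gpt-3.5-turbo': { requestsPerMinute: 100, tokensPerMinute: 60_000 },
  },
  // Stand-in factory; in the pipeline this could be
  // (limits) => new TokenBucketRateLimiter(limits)
  (limits) => ({ limits })
);

console.log(limiterFor('gpt-4o').limits.tokensPerMinute);
```

Throwing on an unknown model is deliberate: silently falling back to another model's limiter would reintroduce the shared-bucket problem.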
5. The OpenAI Tier system changes your limits more than you expect.
You start on Tier 1 (RPM: 500, TPM: 30,000). After $100 of spend, you move to Tier 2 (RPM: 5,000, TPM: 450,000). The jump is significant — but you have to actually spend the money, not just add it as credits. I had $50 of credits sitting unused for two weeks thinking I'd already unlocked Tier 2. I hadn't.
Putting It Together: The Pattern Selector
Here's how I decide which pattern to use:
| Scenario | Pattern |
|---|---|
| Single user waiting for result | Exponential backoff |
| Batch job, < 50 requests | Token bucket |
| Batch job, 50+ requests | Persistent queue |
| Multi-worker deployment | Persistent queue + Redis limiter |
| Can't lose data under any circumstance | Persistent queue + dead letter |
For PostAll, we use all three depending on context. Real-time single-article generation uses backoff. Scheduled batch jobs use the token bucket. User-submitted bulk jobs (the ones people pay for) use the persistent queue. Losing a user's 500-article job is not an option — the queue makes sure we never have to tell someone their work is gone.
The Complete Reference Implementation
The code above is functional, but split across examples. The full reference implementation with all three patterns, proper TypeScript types, and a simple test harness is in the GitHub repo:
github.com/PostAll-platform/rate-limit-patterns
It includes a simulate-429.ts script that mocks rate limit responses so you can test your retry logic without burning real API credits.
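The repo script is not reproduced here. As a rough stand-in for the same idea, the sketch below builds a fake API call that throws a 429-shaped error a fixed number of times before succeeding. The error shape (`status` plus a `retry-after` header) mirrors what the retry code in this article reads, and `createFlaky429` is a made-up name.

```javascript
// Returns an async function that fails with a simulated 429
// `failuresBeforeSuccess` times, then resolves with `result`.
function createFlaky429(failuresBeforeSuccess, result = { ok: true }) {
  let calls = 0;
  const fn = async () => {
    calls++;
    if (calls <= failuresBeforeSuccess) {
      const error = new Error('Rate limit reached (simulated)');
      error.status = 429;
      error.headers = { 'retry-after': '0' }; // keep tests fast
      throw error;
    }
    return result;
  };
  fn.callCount = () => calls;
  return fn;
}

// Usage sketch with the Pattern 1 function from this article:
// const flaky = createFlaky429(2);
// const result = await callWithBackoff(flaky, { baseDelay: 1 });
// flaky.callCount() is then 3: two simulated 429s plus the successful call
```

Passing a tiny `baseDelay` keeps the test run to a few milliseconds while still exercising the retry path.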
Building LLM-powered features in production means building them like the API will fail — because it will, at the worst possible moment. The pattern you choose matters less than the fact that you chose one before it mattered.
What does your retry strategy look like? Have you hit TPM limits specifically, or is RPM the one that gets you? I'm curious whether the TPM-first pattern holds for other use cases or if it's specific to the long-prompt workloads PostAll generates.