Mudassir Khan

Why Rate Limits Kill Your AI Agents in Production (And the Patterns That Actually Work)

LLM API calls fail between 1% and 5% of the time in production. Not from hallucinations. From 429 errors nobody handled.

You've probably seen this: you ship an agent, everything works in staging, prod hits a burst of traffic, the provider throttles you, and suddenly your agent is retrying forever, burning tokens on every attempt, and the cost graph goes vertical. The incident isn't model quality. It's the retry loop you forgot to fence.

I've written about production architecture for agentic systems before. Rate limiting is the piece that bites teams hardest, so let me go deep on it here.

Why rate limits catch most AI teams off guard

Most developers come to LLMs from REST APIs where rate limiting is mostly a nuisance you handle with one retry. With LLMs, the shape of the problem is different.

A single agent request isn't one API call. It's potentially dozens: planner calls, tool calls, summarizer calls, verifier calls. They all share the same rate window. One agent serving 10 simultaneous users can hit 200 to 300 API calls per minute before you realize what's happening.

The other thing that surprises teams: LLM rate limits often count tokens, not just requests. A 50-token request and a 10,000-token request are not equal, but a requests-per-minute counter treats them the same. You can stay under RPM and blow right past TPM.

The hidden cost: retry storms in multiagent systems

The most common production incident isn't a model giving the wrong answer. It's an agent that decides to retry, then retries again, each retry being a full provider call with no delay logic to protect the window.

This is the retry storm. You hit a rate limit, your agent retries immediately (or with a fixed 1 second delay), the retried call also hits the limit, all the retries queue up at the rate window boundary and fire at once, and now you've turned a temporary throttle into a sustained overload.

In multiagent systems it compounds. One orchestrator spawning five subagents, each doing their own uncoordinated retries, can turn a single 429 into 50 retry attempts within the same second. This is one of the core failure patterns I covered in why AI agents fail in production.

Proper rate limit handling can cut redundant API costs by as much as 40%. That's not a rounding error. That's architectural discipline that pays for itself.

Request count vs token count: why you're measuring the wrong thing

Your LLM provider gives you two limits: requests per minute (RPM) and tokens per minute (TPM). Most teams watch RPM. TPM is usually what breaks you.

Here's why: 1 request can be 50 tokens or 10,000 tokens. If you only count requests and stay under RPM, a single heavy prompt (large context, long output) can exhaust your TPM budget while you're still well under RPM. The next 30 requests all get 429s even though you've only made 5 calls that minute.
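To make the arithmetic concrete, here's a small sketch that walks a minute's worth of planned calls and reports which limit breaks first. The limits (500 RPM, 40,000 TPM) and the `firstLimitExceeded` helper are made up for illustration; your real numbers depend on your provider and tier.

```typescript
// Hypothetical limits for illustration; real values depend on your provider tier.
interface Limits {
  rpm: number;
  tpm: number;
}

interface PlannedCall {
  tokens: number;
}

// Walks the calls planned for one minute and returns the first limit
// they would exceed, or null if everything fits.
function firstLimitExceeded(
  calls: PlannedCall[],
  limits: Limits
): 'rpm' | 'tpm' | null {
  let requests = 0;
  let tokens = 0;
  for (const call of calls) {
    requests += 1;
    tokens += call.tokens;
    if (tokens > limits.tpm) return 'tpm';
    if (requests > limits.rpm) return 'rpm';
  }
  return null;
}

const limits: Limits = { rpm: 500, tpm: 40_000 };

// Five heavy calls at 10,000 tokens each: 1% of the RPM budget,
// 125% of the TPM budget.
const heavy: PlannedCall[] = Array.from({ length: 5 }, () => ({
  tokens: 10_000,
}));

console.log(firstLimitExceeded(heavy, limits)); // 'tpm'
```

An RPM dashboard would show this minute as nearly idle, which is why the 429s look like they come out of nowhere.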

The fix is to count tokens on the way out, not after the call fails:

interface RateLimiter {
  requestTokens(estimatedTokens: number): Promise<void>;
  recordUsage(actualTokens: number): void;
}

class TokenBucketLimiter implements RateLimiter {
  private bucketTokens: number;
  private lastRefill: number;

  constructor(
    private tpmLimit: number,
    private refillIntervalMs = 60_000
  ) {
    this.bucketTokens = tpmLimit;
    this.lastRefill = Date.now();
  }

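  async requestTokens(estimatedTokens: number): Promise<void> {
    // A request larger than the whole window can never be admitted.
    if (estimatedTokens > this.tpmLimit) {
      throw new Error(
        `Request needs ${estimatedTokens} tokens; limit is ${this.tpmLimit}`
      );
    }

    this.refillIfNeeded();

    // Loop rather than check once: several callers can wake on the same
    // refill, and only some of them will fit in the new window.
    while (this.bucketTokens < estimatedTokens) {
      const waitMs = Math.max(0, this.msUntilRefill());
      await new Promise(resolve => setTimeout(resolve, waitMs));
      this.refillIfNeeded();
    }

    this.bucketTokens -= estimatedTokens;
  }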

  recordUsage(actualTokens: number): void {
    // Adjust the bucket if the actual usage differed from the estimate.
    // Track pre-reserved vs actual in a real implementation.
    void actualTokens;
  }

  private refillIfNeeded(): void {
    const now = Date.now();
    if (now - this.lastRefill >= this.refillIntervalMs) {
      this.bucketTokens = this.tpmLimit;
      this.lastRefill = now;
    }
  }

  private msUntilRefill(): number {
    return this.refillIntervalMs - (Date.now() - this.lastRefill);
  }
}

Estimate tokens before the call using tiktoken or a rough char/4 heuristic, consume from the bucket, wait if you're over budget. This moves the rate limit behavior from reactive (catch the 429) to proactive (don't send the call that would fail).
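Here's roughly what the char/4 heuristic looks like in code. This is a sketch, not a tokenizer: `estimateTokens` and `ChatMessage` are names I made up, the 4-characters-per-token ratio is only a rough fit for English text, and I'm assuming your provider counts the output budget against TPM too, so the estimate reserves it up front.

```typescript
// Minimal message shape for the sketch; real SDK types carry more fields.
interface ChatMessage {
  role: string;
  content: string;
}

// Rough estimate: ~4 characters per token for English prose.
// Use a real tokenizer when you need accuracy.
function estimateTokens(
  messages: ChatMessage[],
  maxOutputTokens: number
): number {
  const promptChars = messages.reduce(
    (sum, message) => sum + message.content.length,
    0
  );
  const promptTokens = Math.ceil(promptChars / 4);
  // Assumption: completion tokens count toward TPM, so reserve them as well.
  return promptTokens + maxOutputTokens;
}
```

Pass the result to the limiter before the call. Erring high is the safe direction: an overestimate costs you a little throughput, an underestimate costs you a 429.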

The 3-layer architecture that actually works

The teams that handle rate limits cleanly aren't just retrying smarter. They're operating at three layers simultaneously.

[Diagram: the 3-layer architecture. Layer 1 is a token bucket per user and model. Layer 2 is circuit breakers tripped by cost velocity, repeated prompts, and error rate. Layer 3 is a fallback chain from primary model to cheaper model to semantic cache to 503.]

Layer 1: token bucket per (user, model). Limit each user's consumption independently. A single heavy user doesn't starve everyone else. Scope the bucket to both the user and the model so a cheap model and an expensive one don't compete for the same budget.
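A minimal sketch of the per-(user, model) scoping, assuming an in-memory map is good enough (a multi-instance deployment would need shared storage). `SimpleBucket` and `BucketRegistry` are hypothetical names, and the clock is passed in so the behavior is easy to test.

```typescript
// Fixed-window budget for one (user, model) pair.
class SimpleBucket {
  private remaining: number;
  private windowStart: number;
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number, now: number) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.remaining = limit;
    this.windowStart = now;
  }

  // Deducts tokens and returns true if the budget allows it, else false.
  tryConsume(tokens: number, now: number): boolean {
    if (now - this.windowStart >= this.windowMs) {
      this.remaining = this.limit;
      this.windowStart = now;
    }
    if (tokens > this.remaining) return false;
    this.remaining -= tokens;
    return true;
  }
}

// Lazily creates one bucket per (user, model) key.
class BucketRegistry {
  private readonly buckets = new Map<string, SimpleBucket>();
  private readonly limitsByModel: Record<string, number>;
  private readonly windowMs: number;

  constructor(limitsByModel: Record<string, number>, windowMs = 60_000) {
    this.limitsByModel = limitsByModel;
    this.windowMs = windowMs;
  }

  tryConsume(
    userId: string,
    model: string,
    tokens: number,
    now: number = Date.now()
  ): boolean {
    const limit: number | undefined = this.limitsByModel[model];
    if (limit === undefined) {
      throw new Error(`No limit configured for model ${model}`);
    }
    // JSON-encode the pair so ids containing separators cannot collide.
    const key = JSON.stringify([userId, model]);
    let bucket = this.buckets.get(key);
    if (!bucket) {
      bucket = new SimpleBucket(limit, this.windowMs, now);
      this.buckets.set(key, bucket);
    }
    return bucket.tryConsume(tokens, now);
  }
}
```

With this shape, one user draining their expensive-model budget doesn't affect anyone else's budget, or their own budget on the cheaper model.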

Layer 2: circuit breakers. Three signals should trip a circuit breaker:

  • Cost velocity: if a user is burning tokens at 3x their rolling average, something is looping
  • Repeated prompts: the same or very similar prompt fired multiple times in 10 seconds is probably a bug, not intent
  • Error rate: if 20% of a user's calls are 4xx in a 5 minute window, stop sending and surface the error instead of absorbing it silently
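The three signals above can be sketched as one pure function over a per-user usage window. Everything here is illustrative: `UsageWindow` and `tripReasons` are hypothetical, how you collect the counters is up to you, "repeated" is simplified to exact-match text, and the threshold of 3 repeats is my own assumption (the 3x, 10 second, and 20% figures come from the list).

```typescript
interface UsageWindow {
  tokensLastMinute: number;
  rollingAvgTokensPerMinute: number;
  recentPrompts: { text: string; at: number }[]; // `at` is epoch milliseconds
  callsLast5Min: number;
  clientErrorsLast5Min: number; // 4xx responses in the same 5 minute window
}

type TripReason = 'cost-velocity' | 'repeated-prompt' | 'error-rate';

// Returns every signal that says the breaker should trip right now.
function tripReasons(w: UsageWindow, now: number): TripReason[] {
  const reasons: TripReason[] = [];

  // Cost velocity: burning tokens at 3x the rolling average.
  if (
    w.rollingAvgTokensPerMinute > 0 &&
    w.tokensLastMinute >= 3 * w.rollingAvgTokensPerMinute
  ) {
    reasons.push('cost-velocity');
  }

  // Repeated prompts: the same text 3+ times within 10 seconds.
  // (Exact match only; similarity detection is out of scope here.)
  const counts = new Map<string, number>();
  for (const prompt of w.recentPrompts) {
    if (now - prompt.at <= 10_000) {
      counts.set(prompt.text, (counts.get(prompt.text) ?? 0) + 1);
    }
  }
  if (Array.from(counts.values()).some(count => count >= 3)) {
    reasons.push('repeated-prompt');
  }

  // Error rate: 20% or more of calls came back 4xx.
  if (
    w.callsLast5Min > 0 &&
    w.clientErrorsLast5Min / w.callsLast5Min >= 0.2
  ) {
    reasons.push('error-rate');
  }

  return reasons;
}
```

Returning the reasons instead of a boolean means the error you surface to the user or your logs can say why the breaker tripped.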

Layer 3: declarative fallback chain. Primary model → cheaper model (e.g. GPT-4o → GPT-4o mini) → semantic cache (return a stored response for similar queries) → 503. The chain is declarative, not imperative. You configure the fallback in one place and every agent inherits it.
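Here's one way "declarative" can look: the chain is a plain array of steps, and a single runner walks it. This is a sketch under assumptions. `FallbackStep` and `runFallbackChain` are hypothetical names, the two model steps are stubbed to always throw so the demo exercises the fallbacks, and an exact-match `Map` stands in for a real semantic cache.

```typescript
// A step resolves to a response, resolves to undefined ("nothing here"),
// or throws. Either of the last two moves the chain to the next step.
interface FallbackStep {
  name: string;
  run: (prompt: string) => Promise<string | undefined>;
}

interface ChainResult {
  status: 200 | 503;
  servedBy?: string;
  body?: string;
}

async function runFallbackChain(
  steps: FallbackStep[],
  prompt: string
): Promise<ChainResult> {
  for (const step of steps) {
    try {
      const body = await step.run(prompt);
      if (body !== undefined) {
        return { status: 200, servedBy: step.name, body };
      }
    } catch {
      // Fall through to the next step. A fuller version would only fall
      // through on throttling errors and rethrow everything else.
    }
  }
  return { status: 503 };
}

// Stand-in for a semantic cache: exact-match lookup only.
const cache = new Map<string, string>([['hello', 'cached hello']]);

// The chain is data: configure it once and share it across agents.
const chain: FallbackStep[] = [
  { name: 'gpt-4o', run: async () => { throw new Error('429'); } },
  { name: 'gpt-4o-mini', run: async () => { throw new Error('429'); } },
  { name: 'semantic-cache', run: async prompt => cache.get(prompt) },
];
```

Because the order lives in one array, changing the fallback policy is a config change rather than an edit to every agent's error handling.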

TypeScript implementation: exponential backoff with jitter

The reason a fixed retry delay creates storm conditions is that all retried requests fire at the same moment. Jitter desynchronizes them.

interface RetryConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  jitterFactor: number; // 0 to 1, how much randomness to add
}

async function withExponentialBackoff<T>(
  fn: () => Promise<T>,
  config: RetryConfig = {
    maxAttempts: 3,
    baseDelayMs: 1_000,
    maxDelayMs: 30_000,
    jitterFactor: 0.3,
  }
): Promise<T> {
  let lastError: Error;

  for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;

      if (!isRateLimitError(error) || attempt === config.maxAttempts - 1) {
        throw error;
      }

      const exponentialDelay = config.baseDelayMs * Math.pow(2, attempt);
      const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);
      const jitter = cappedDelay * config.jitterFactor * Math.random();
      const finalDelay = cappedDelay + jitter;

      await new Promise(resolve => setTimeout(resolve, finalDelay));
    }
  }

  throw lastError!;
}

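function isRateLimitError(error: unknown): boolean {
  // Prefer a numeric HTTP status when the client library exposes one;
  // matching on the message text is a fallback and easy to fool.
  if (typeof error === 'object' && error !== null && 'status' in error) {
    return (error as { status?: unknown }).status === 429;
  }
  if (error instanceof Error) {
    return (
      error.message.includes('429') ||
      error.message.toLowerCase().includes('rate limit')
    );
  }
  return false;
}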

The key is Math.random() in the jitter calculation. Two simultaneous retries sleep for different durations and arrive at the provider at different moments. At scale this turns a synchronized wave into a spread distribution.

Also worth checking: a 429 response from OpenAI may include a Retry-After header telling you how many seconds to wait. When it's present, parse it and honor it instead of running your own backoff math. When it's absent, fall back to exponential backoff.
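A sketch of the parsing step. The HTTP spec allows Retry-After to be either a number of seconds or an HTTP date, so this handles both. `parseRetryAfterMs` is a hypothetical helper, and how you get at the raw header value depends on your client library.

```typescript
// Converts a Retry-After header value into milliseconds to wait.
// Returns undefined when the header is missing or unparseable, so the
// caller can fall back to its own backoff.
function parseRetryAfterMs(
  headerValue: string | null | undefined,
  now: number = Date.now()
): number | undefined {
  if (!headerValue) return undefined;
  const trimmed = headerValue.trim();

  // Form 1: delay in seconds. The spec says integer; decimals are
  // accepted here for leniency.
  if (/^\d+(\.\d+)?$/.test(trimmed)) {
    return Math.round(Number(trimmed) * 1000);
  }

  // Form 2: an HTTP date. Wait until that moment, never a negative time.
  const retryAt = Date.parse(trimmed);
  if (Number.isNaN(retryAt)) return undefined;
  return Math.max(0, retryAt - now);
}
```

In a retry loop, something like `parseRetryAfterMs(header) ?? backoffDelay` picks the server's number when there is one and your own math when there isn't.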

Wiring in a circuit breaker (code example)

A circuit breaker wraps your LLM client and opens (stops sending) when failures cross a threshold. Here's a minimal implementation that counts consecutive failures:

type CircuitState = 'closed' | 'open' | 'half-open';

class LLMCircuitBreaker {
  private state: CircuitState = 'closed';
  private failureCount = 0;
  private lastFailureTime = 0;
  private readonly failureThreshold = 5;
  private readonly recoveryTimeMs = 60_000;

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailureTime > this.recoveryTimeMs) {
        this.state = 'half-open';
      } else {
        throw new Error(
          'Circuit open: LLM provider is throttling. Try again shortly.'
        );
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    this.failureCount = 0;
    this.state = 'closed';
  }

  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'open';
    }
  }
}

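// Wire the two layers together.
// Assumes the official `openai` npm package, with OPENAI_API_KEY set in the environment.
import OpenAI from 'openai';

const openai = new OpenAI();
const breaker = new LLMCircuitBreaker();
const limiter = new TokenBucketLimiter(100_000);

async function safeLLMCall(prompt: string, estimatedTokens: number) {
  // Reserve budget first; only then let the breaker decide whether to send.
  await limiter.requestTokens(estimatedTokens);
  return breaker.call(() =>
    openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: prompt }],
    })
  );
}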

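The circuit opens after 5 consecutive failures and stays open for 60 seconds. During that window, requests fail fast instead of piling up waiting for a timeout. After 60 seconds it shifts to half-open and lets a test call through: if it succeeds the circuit closes again, and if it fails the circuit reopens and the clock resets. Two caveats about this minimal version: it doesn't cap half-open traffic at a single probe, so concurrent callers can all get through at that moment, and it counts every error as a failure, not just 429s.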
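If you want the half-open state to admit exactly one test call, the breaker needs to remember that a probe is in flight. Here's a sketch of that state machine, written as synchronous allow/record methods with an injectable clock so it's easy to test. `SingleProbeBreaker` is a hypothetical variant, not a drop-in replacement for the class above.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class SingleProbeBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private readonly failureThreshold: number;
  private readonly recoveryTimeMs: number;

  constructor(failureThreshold = 5, recoveryTimeMs = 60_000) {
    this.failureThreshold = failureThreshold;
    this.recoveryTimeMs = recoveryTimeMs;
  }

  // Ask before sending. Returns false when the call should fail fast.
  allowRequest(now: number = Date.now()): boolean {
    if (this.state === 'closed') return true;
    if (this.state === 'open') {
      if (now - this.openedAt < this.recoveryTimeMs) return false;
      this.state = 'half-open';
      this.probeInFlight = false;
    }
    // Half-open: admit one probe and reject everyone else until it reports.
    if (this.probeInFlight) return false;
    this.probeInFlight = true;
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.probeInFlight = false;
    this.state = 'closed';
  }

  recordFailure(now: number = Date.now()): void {
    this.failures += 1;
    this.probeInFlight = false;
    // A failed probe reopens immediately; otherwise wait for the threshold.
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = now;
    }
  }
}
```

The caller checks `allowRequest()` before each provider call and reports the outcome with `recordSuccess()` or `recordFailure()` afterward.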

Put this at the call site of every provider interaction and you've got a fence around every agent that uses it.

FAQ

What causes LLM rate limit errors?

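Usually one of two things: you've hit the provider's RPM or TPM ceiling for your account tier, or you've run out of quota or credit on the account. The error response usually tells you which limit you hit. A request that exceeds the context window is a different failure (OpenAI, for one, reports it as a 400 rather than a 429), and retrying it unchanged won't help.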

How do I handle 429 errors from OpenAI?

A 429 response from OpenAI may include a Retry-After header with how many seconds to wait. When it's there, parse it and sleep for that duration before retrying; the server's number is more reliable than any backoff formula you'll calculate yourself. When it's missing, use exponential backoff with jitter.

What is exponential backoff for APIs?

Instead of retrying immediately or on a fixed interval, each retry waits longer than the previous one: 1 second, then 2 seconds, then 4 seconds, then 8 seconds. Adding random jitter to each wait time prevents all retriers from firing at the same moment and overloading the provider again.
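To see the schedule as numbers, here's a small sketch that computes the delay for a given attempt. `backoffDelayMs` is a hypothetical helper using the same defaults as the retry function earlier in the post (1 second base, 30 second cap, jitter factor 0.3), with the random source passed in so the output can be pinned down.

```typescript
// Delay before retry number `attempt` (0-based):
// exponential growth, capped, plus up to jitterFactor extra.
function backoffDelayMs(
  attempt: number,
  baseMs = 1_000,
  maxMs = 30_000,
  jitterFactor = 0.3,
  random: () => number = Math.random
): number {
  const capped = Math.min(baseMs * Math.pow(2, attempt), maxMs);
  return capped + capped * jitterFactor * random();
}

// With jitter pinned to zero, the bare schedule is visible.
const noJitter = () => 0;
const schedule = [0, 1, 2, 3].map(attempt =>
  backoffDelayMs(attempt, 1_000, 30_000, 0.3, noJitter)
);
console.log(schedule); // [1000, 2000, 4000, 8000]
```

With the default `Math.random`, the first retry lands somewhere between 1,000 and 1,300 ms, so two clients that failed in the same instant no longer retry in the same instant.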

How do I prevent agent retry storms?

Two controls working together. A circuit breaker that opens after N consecutive failures stops new calls from going out while the provider is saturated. A token bucket that estimates usage before sending catches bursts before they hit the API. The combination prevents the feedback loop where retries cause more rate limits cause more retries.


If you're building out the rest of the agent reliability layer, I go deeper into the architectural patterns on my blog.

If you want this wired up on your own stack end to end, agentic AI consulting is exactly the kind of work I take on.


Drop a comment if your rate limit setup looks different. Curious whether people are managing this at the SDK layer or at an API gateway.
