
AwxGlobal

Posted on • Originally published at awx-shredder.fly.dev

What happens when an AI agent hits a rate limit — and how to design around it

Your AI agent is processing customer support tickets at 3 AM. It's been running flawlessly for hours, then suddenly: RateLimitError: You exceeded your current quota. The agent crashes. Thirty tickets sit in limbo. Your on-call phone rings.

This isn't hypothetical. Rate limits and budget exhaustion are distinct failure modes with different blast radii, and most developers conflate them until production teaches them otherwise.

Rate limits vs budget limits: different animals

A rate limit restricts requests per time window — 3,500 requests per minute for GPT-4, for example. Cross it and you get a 429 status code. Wait 60 seconds and you're back in business.

A budget limit is about cumulative spend. Once you've burned through your daily or monthly allocation, you're done until the reset. The API returns 429 with insufficient_quota as the error type, but the fix isn't waiting — it's either increasing your budget or stopping work entirely.

The failure modes differ:

  • Rate limit: Temporary. Backoff and retry works.
  • Budget limit: Terminal for that billing period. Retry loops just burn CPU.

Yet both return 429. Your error handling needs to distinguish them.

Parsing the error correctly

OpenAI's Python SDK raises RateLimitError for both. The distinction lives in the error message or response headers. Here's how to differentiate:

from openai import OpenAI, RateLimitError
import random
import time

client = OpenAI()

class BudgetExhaustedError(Exception):
    """Raised when the billing quota is gone and retrying is pointless."""
    pass

def call_with_smart_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4",
                messages=messages
            )
        except RateLimitError as e:
            error_message = str(e).lower()

            # Budget exhausted - don't retry
            if "quota" in error_message or "insufficient" in error_message:
                print(f"Budget exhausted: {e}")
                # Log to monitoring, alert ops, gracefully degrade
                raise BudgetExhaustedError("Daily budget hit") from e

            # Rate limit - exponential backoff with jitter
            if attempt < max_retries - 1:
                wait_time = (2 ** attempt) + (random.random() * 0.1)
                print(f"Rate limited, backing off {wait_time:.2f}s")
                time.sleep(wait_time)
            else:
                raise

This prevents the classic mistake: retry loops that hammer the API when you're out of budget, racking up failed request logs and wasting cycles.

Backoff strategies that actually work

Exponential backoff with jitter is table stakes. The jitter (random component) prevents thundering herds when multiple agents hit limits simultaneously.

But there's a subtlety: OpenAI returns retry-after headers on rate limit responses. Respect them:

# Inside the retry loop from call_with_smart_retry:
except RateLimitError as e:
    retry_after = e.response.headers.get('retry-after')
    if retry_after:
        wait_time = int(retry_after)
    else:
        wait_time = min((2 ** attempt) + random.random(), 60)
    time.sleep(wait_time)
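Combining the two rules, a small helper keeps the decision in one place (a sketch; the 60-second cap and additive jitter match the earlier example):

```python
import random

def backoff_seconds(attempt, retry_after=None, cap=60):
    """Wait time before a retry: honor the server's retry-after header
    when present, otherwise use capped exponential backoff with jitter."""
    if retry_after is not None:
        return float(retry_after)
    return min((2 ** attempt) + random.random(), cap)
```

The server's hint always wins, because it knows exactly when your window resets; the exponential curve is only a guess.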

Adaptive rate limiting is the next level. Track your request success rate and slow down proactively before hitting limits:

import time

class AdaptiveRateLimiter:
    def __init__(self, initial_rate=10):
        self.requests_per_second = initial_rate
        self.window_start = time.time()
        self.requests_in_window = 0

    def acquire(self):
        now = time.time()
        if now - self.window_start >= 1.0:
            self.window_start = now
            self.requests_in_window = 0

        if self.requests_in_window >= self.requests_per_second:
            sleep_time = 1.0 - (now - self.window_start)
            if sleep_time > 0:
                time.sleep(sleep_time)
            self.window_start = time.time()
            self.requests_in_window = 0

        self.requests_in_window += 1

    def on_rate_limit(self):
        # Reduce rate by 50% when we hit a limit
        self.requests_per_second = max(1, self.requests_per_second * 0.5)

    def on_success(self):
        # Gradually increase rate by 10% on sustained success
        self.requests_per_second = min(100, self.requests_per_second * 1.1)
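A quick sanity check on those constants, no API needed: since 1.1^10 ≈ 2.59, roughly seven to ten successful windows undo one halving, so the limiter backs off sharply but recovers within seconds under sustained success.

```python
rate = 10.0
rate = max(1, rate * 0.5)   # one rate-limit event: 10 -> 5
for _ in range(10):         # ten successful windows
    rate = min(100, rate * 1.1)
print(round(rate, 2))       # about 12.97 -- past the original 10
```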

Queue design for resilient agents

The real solution isn't better retry logic — it's building agents that fail gracefully. Use a persistent queue:

  1. Accept work into a queue (Redis, SQS, PostgreSQL with SKIP LOCKED)
  2. Workers pull from the queue with visibility timeouts
  3. On rate limit: Release the message back to the queue, don't retry immediately
  4. On budget exhaustion: Stop pulling from the queue entirely, alert, and wait for budget reset

This architecture decouples work acceptance from execution. When you hit limits, work queues up instead of erroring out.

# Pseudocode for queue-based processing
while True:
    message = queue.receive(wait_time=20)
    if not message:
        continue

    try:
        result = call_with_smart_retry(message.data)
        message.delete()
    except BudgetExhaustedError:
        message.release()  # Back to queue
        print("Budget exhausted, sleeping until reset")
        time.sleep(3600)  # Check hourly
    except RateLimitError:
        message.release()  # Back to queue with delay
        time.sleep(10)  # Brief pause before next pull
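For the "PostgreSQL with SKIP LOCKED" option mentioned above, claiming a job atomically might look like this (a sketch: the `jobs` table, its columns, and the `conn` object — any DB-API connection, e.g. from psycopg — are assumptions):

```python
# Assumed schema: jobs(id, payload, status), status in ('queued', 'processing', 'done')
CLAIM_SQL = """
UPDATE jobs
   SET status = 'processing'
 WHERE id = (
        SELECT id FROM jobs
         WHERE status = 'queued'
         ORDER BY id
           FOR UPDATE SKIP LOCKED
         LIMIT 1
       )
RETURNING id, payload;
"""

def claim_next_job(conn):
    # SKIP LOCKED makes concurrent workers skip rows another worker has
    # already locked, so each job is claimed by exactly one worker.
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
    conn.commit()
    return row  # None when the queue is empty
```

Releasing a message back to the queue is then just an `UPDATE ... SET status = 'queued'` on the claimed row.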

Hard budget enforcement

If you need guaranteed budget enforcement at the API level rather than in your application logic, AWX Shredder provides a proxy layer that blocks requests the moment an agent exceeds its daily budget. It's OpenAI-compatible and requires only a base URL change: OPENAI_BASE_URL=https://awx-shredder.fly.dev/proxy/v1. This prevents the scenario where your retry logic has bugs or a runaway agent burns through budget before your application-level checks catch it.
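Assuming the standard OpenAI Python SDK, which reads `OPENAI_BASE_URL` from the environment, the switch can also be made in code without touching your shell config:

```python
import os

# All OpenAI() clients created after this point route through the proxy.
os.environ["OPENAI_BASE_URL"] = "https://awx-shredder.fly.dev/proxy/v1"
```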

What to implement today

  1. Add error type detection to your LLM calls. Distinguish rate limits from budget exhaustion.
  2. Implement exponential backoff with jitter, respecting retry-after headers.
  3. Move to queue-based processing if you're doing any multi-request workflows.
  4. Set up monitoring for rate limit and budget exhaustion events. These should page someone.

The difference between a resilient AI agent and a fragile one isn't the model you use — it's how you handle the inevitable moment when the API says "no."
