Originally published at awx-shredder.fly.dev/blog
What happens when an AI agent hits a rate limit — and how to design around it
Your AI agent is processing customer support tickets at 3 AM. It's been running flawlessly for hours, then suddenly: RateLimitError: You exceeded your current quota. The agent crashes. Thirty tickets sit in limbo. Your on-call phone rings.
This isn't a hypothetical. Rate limits and budget exhaustion are distinct failure modes with different blast radii, and most developers conflate them until production teaches them otherwise.
Rate limits vs budget limits: different animals
A rate limit restricts requests per time window — 3,500 requests per minute for GPT-4, for example. Cross it and you get a 429 status code. Wait 60 seconds and you're back in business.
A budget limit is about cumulative spend. Once you've burned through your daily or monthly allocation, you're done until the reset. The API returns 429 with insufficient_quota as the error type, but the fix isn't waiting — it's either increasing your budget or stopping work entirely.
The failure modes differ:
- Rate limit: Temporary. Backoff and retry works.
- Budget limit: Terminal for that billing period. Retry loops just burn CPU.
Yet both return 429. Your error handling needs to distinguish them.
Parsing the error correctly
OpenAI's Python SDK raises RateLimitError for both. The distinction lives in the error message or response headers. Here's how to differentiate:
```python
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

class BudgetExhaustedError(Exception):
    pass

def call_with_smart_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4",
                messages=messages,
            )
        except RateLimitError as e:
            error_message = str(e).lower()
            # Budget exhausted - don't retry
            if "quota" in error_message or "insufficient" in error_message:
                print(f"Budget exhausted: {e}")
                # Log to monitoring, alert ops, gracefully degrade
                raise BudgetExhaustedError("Daily budget hit") from e
            # Rate limit - exponential backoff with jitter
            if attempt < max_retries - 1:
                wait_time = (2 ** attempt) + (random.random() * 0.1)
                print(f"Rate limited, backing off {wait_time:.2f}s")
                time.sleep(wait_time)
            else:
                raise
```
This prevents the classic mistake: retry loops that hammer the API when you're out of budget, racking up failed request logs and wasting cycles.
Backoff strategies that actually work
Exponential backoff with jitter is table stakes. The jitter (random component) prevents thundering herds when multiple agents hit limits simultaneously.
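The jitter used in the retry snippet earlier is a small additive offset. A stronger variant, popularized by AWS as "full jitter", randomizes the entire delay window instead. A minimal sketch:

```python
import random

def full_jitter_delay(attempt, base=1.0, cap=60.0):
    """Sleep anywhere between 0 and the capped exponential delay.

    Spreading retries across the whole window means simultaneous
    clients don't wake up in lockstep and collide again.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

The trade-off is a lower average wait per attempt, but in aggregate it does the most to break up thundering herds.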
But there's a subtlety: OpenAI returns retry-after headers on rate limit responses. Respect them:
```python
except RateLimitError as e:
    retry_after = e.response.headers.get("retry-after")
    if retry_after:
        wait_time = int(retry_after)
    else:
        wait_time = min((2 ** attempt) + random.random(), 60)
    time.sleep(wait_time)
```
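One wrinkle the snippet above glosses over: the HTTP spec allows Retry-After to be either a number of seconds or an HTTP date, so `int(retry_after)` can blow up. A defensive parser (a sketch; the function name and fallback behavior are my own, not from any SDK):

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, attempt=0, cap=60.0):
    """Turn a Retry-After header into seconds to sleep.

    Handles both the delta-seconds form ("5") and the HTTP-date form
    ("Wed, 21 Oct 2025 07:28:00 GMT"). Falls back to jittered
    exponential backoff when the header is absent or unparseable.
    """
    if value:
        try:
            return float(value)  # delta-seconds form
        except ValueError:
            try:
                when = parsedate_to_datetime(value)  # HTTP-date form
                delta = (when - datetime.now(timezone.utc)).total_seconds()
                return max(0.0, delta)
            except (TypeError, ValueError):
                pass  # unparseable header: fall through to backoff
    return min((2 ** attempt) + random.random(), cap)
```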
Adaptive rate limiting is the next level. Track your request success rate and slow down proactively before hitting limits:
```python
import time

class AdaptiveRateLimiter:
    def __init__(self, initial_rate=10):
        self.requests_per_second = initial_rate
        self.window_start = time.time()
        self.requests_in_window = 0

    def acquire(self):
        now = time.time()
        # Start a fresh one-second window when the old one expires
        if now - self.window_start >= 1.0:
            self.window_start = now
            self.requests_in_window = 0
        # Window full: sleep out the remainder, then reset
        if self.requests_in_window >= self.requests_per_second:
            sleep_time = 1.0 - (now - self.window_start)
            if sleep_time > 0:
                time.sleep(sleep_time)
            self.window_start = time.time()
            self.requests_in_window = 0
        self.requests_in_window += 1

    def on_rate_limit(self):
        # Reduce rate by 50% when we hit a limit
        self.requests_per_second = max(1, self.requests_per_second * 0.5)

    def on_success(self):
        # Gradually increase rate by 10% on sustained success
        self.requests_per_second = min(100, self.requests_per_second * 1.1)
```
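To see the adaptation arithmetic in isolation, here is the class trimmed to just the two rate-adjustment methods (the windowing logic in `acquire` is unchanged from above; this copy exists only so the example runs standalone):

```python
class LimiterRates:
    """Just the rate-adjustment logic from AdaptiveRateLimiter."""
    def __init__(self, initial_rate=8):
        self.requests_per_second = initial_rate

    def on_rate_limit(self):
        # Halve the rate when the API pushes back
        self.requests_per_second = max(1, self.requests_per_second * 0.5)

    def on_success(self):
        # Creep the rate back up 10% per success, capped at 100
        self.requests_per_second = min(100, self.requests_per_second * 1.1)

limiter = LimiterRates(initial_rate=8)
limiter.on_rate_limit()   # one 429: rate halves to 4.0
limiter.on_success()      # a success: rate grows 10%
limiter.on_success()      # and again
```

Note the deliberate asymmetry: a single 429 halves the rate, but it takes roughly eight consecutive successes (1.1^8 ≈ 2.14) to earn that halving back. That conservative shape mirrors classic AIMD-style congestion control.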
Queue design for resilient agents
The real solution isn't better retry logic — it's building agents that fail gracefully. Use a persistent queue:
- Accept work into a queue (Redis, SQS, PostgreSQL with SKIP LOCKED)
- Workers pull from the queue with visibility timeouts
- On rate limit: Release the message back to the queue, don't retry immediately
- On budget exhaustion: Stop pulling from the queue entirely, alert, and wait for budget reset
This architecture decouples work acceptance from execution. When you hit limits, work queues up instead of erroring out.
```python
# Pseudocode for queue-based processing
while True:
    message = queue.receive(wait_time=20)
    if not message:
        continue
    try:
        result = call_with_smart_retry(message.data)
        message.delete()
    except BudgetExhaustedError:
        message.release()  # Back to queue
        print("Budget exhausted, sleeping until reset")
        time.sleep(3600)  # Check hourly
    except RateLimitError:
        message.release()  # Back to queue with delay
        time.sleep(10)  # Brief pause before next pull
```
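The receive/delete/release semantics in the loop above can be mimicked in-process for local testing. A toy stand-in for Redis or SQS (illustrative only, not production code; `ToyQueue` is a name I made up):

```python
from collections import deque

class ToyQueue:
    """In-memory stand-in for SQS-style receive/delete/release."""
    def __init__(self, items):
        self._pending = deque(items)
        self._in_flight = None

    def receive(self):
        # Hand out the next message; redeliver the in-flight one
        # until the caller explicitly deletes or releases it
        if self._in_flight is None and self._pending:
            self._in_flight = self._pending.popleft()
        return self._in_flight

    def delete(self):
        self._in_flight = None  # work done, drop the message

    def release(self):
        self._pending.append(self._in_flight)  # back of the queue
        self._in_flight = None

q = ToyQueue(["ticket-1", "ticket-2"])
msg = q.receive()      # "ticket-1"
q.release()            # simulate a rate limit: requeue it
print(q.receive())     # "ticket-2" comes next; ticket-1 waits at the back
```

The key property to preserve in a real backend is that a released message is never lost, only delayed, which is exactly what visibility timeouts give you for free.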
Hard budget enforcement
If you need guaranteed budget enforcement at the API level rather than in your application logic, AWX Shredder provides a proxy layer that blocks requests the moment an agent exceeds its daily budget. It's OpenAI-compatible and requires only a base URL change: OPENAI_BASE_URL=https://awx-shredder.fly.dev/proxy/v1. This prevents the scenario where your retry logic has bugs or a runaway agent burns through budget before your application-level checks catch it.
What to implement today
- Add error type detection to your LLM calls. Distinguish rate limits from budget exhaustion.
- Implement exponential backoff with jitter, respecting retry-after headers.
- Move to queue-based processing if you're doing any multi-request workflows.
- Set up monitoring for rate limit and budget exhaustion events. These should page someone.
The difference between a resilient AI agent and a fragile one isn't the model you use — it's how you handle the inevitable moment when the API says "no."