Rate limits are not API problems. They're fleet architecture problems.
When you're running a single agent, a rate limit is an inconvenience. When you're running a fleet — agents doing site auditing, content publishing, monitoring, data extraction — at 2am, a rate limit is a reliability failure that compounds across every dependent task.
I've been sharing AN Score data on agent-facing APIs, and the most consistent feedback from real fleet operators is this: it's not capability that breaks you; it's failure behavior.
So this post is about designing fleets around rate limits — not fighting them, designing with them.
The Hierarchy of Rate Limit Quality
Before you design your fleet architecture, you need to know what you're working with. APIs break into three tiers based on rate limit handling quality:
Tier 1: Actionable Rate Limits (AN Score 7.5+)
These APIs tell you exactly what's wrong and how long to wait:
- `Retry-After: 30` in the response header
- Error body: `{"error": "rate_limit_exceeded", "retry_after": 30, "limit_type": "tokens_per_minute"}`
- Separate error codes for different limit types (tokens vs. requests vs. concurrency)
Anthropic (8.4), Stripe (8.1), Twilio (8.0), Exa (8.7), Tavily (8.6) all land here.
With Tier 1 APIs, your backoff logic can be precise: parse the Retry-After, branch on limit_type, and schedule the retry exactly when the window opens.
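A minimal sketch of that precise handling. The error body shape follows the example above; the `limit_type` values and the retry plan shape are illustrative, not any specific vendor's API:

```python
import json


def plan_retry(response):
    """Turn a Tier 1 rate limit response into a concrete retry plan.

    Assumes the structured error body shown above; the field names
    (retry_after, limit_type) follow that example.
    """
    body = json.loads(response.text)
    wait = float(response.headers.get("Retry-After", body.get("retry_after", 1)))
    limit_type = body.get("limit_type", "requests_per_minute")
    if limit_type == "tokens_per_minute":
        # Token-window limits: wait out the window, or split the payload
        return {"wait": wait, "action": "retry_or_shrink_payload"}
    if limit_type == "concurrency":
        # Concurrency limits clear as soon as an in-flight call finishes
        return {"wait": min(wait, 5.0), "action": "retry"}
    return {"wait": wait, "action": "retry"}
```

Because the API names the limit type, the branch is a dictionary lookup rather than a guess.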
Tier 2: Informative Rate Limits (AN Score 6.0–7.4)
These APIs surface rate limit information, but inconsistently or incompletely:
- Rate limit headers present but not always populated
- Error codes exist but don't distinguish between limit types
- Retry guidance sometimes in headers, sometimes in body, sometimes absent
OpenAI (6.3), Vercel (7.1), PostHog (6.9) fall here.
With Tier 2 APIs, you need defensive logic: check for headers first, fall back to exponential backoff with jitter when headers are absent, build in overshoot margin.
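That defensive logic can be sketched in one function. The 25% overshoot margin is a hedge I'm assuming here, not a documented recommendation:

```python
import random


def tier2_delay(headers, attempt, margin=1.25, base=1.0, max_delay=60.0):
    """Defensive delay for APIs with inconsistent rate limit headers.

    Uses Retry-After when populated, otherwise exponential backoff with
    full jitter; the margin adds overshoot headroom either way.
    """
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return float(retry_after) * margin  # trust the header, overshoot slightly
        except ValueError:
            pass  # header present but unparseable: fall through to backoff
    delay = min(base * (2 ** attempt), max_delay)
    return random.uniform(0, delay) * margin
```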
Tier 3: Opaque Rate Limits (AN Score < 6.0)
These APIs make rate limiting invisible:
- Generic HTTP 429 with no timing guidance
- Natural language error messages that can't be parsed programmatically
- No distinction between rate limits, quota exhaustion, and auth failures
HubSpot (4.6), Salesforce (4.8), Pipedrive (5.7) live here.
With Tier 3 APIs, your agent can't distinguish "wait 30 seconds" from "you've hit your monthly quota" from "your auth token expired." You need rate sensing built into your architecture, not just your backoff logic.
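One way to sense what an opaque API won't tell you is a heuristic classifier over the status code and error text. The keyword lists below are illustrative guesses that you'd tune against the actual error strings the API returns:

```python
def classify_opaque_error(status, body_text):
    """Heuristic classifier for APIs that don't distinguish failure modes.

    Keyword lists are illustrative; tune them against real error strings,
    and treat 'unknown' as a hard stop rather than retrying blindly.
    """
    text = body_text.lower()
    if status in (401, 403) or "auth" in text or "token expired" in text:
        return "auth_failure"     # re-authenticate; retrying won't help
    if "quota" in text or "monthly" in text or "plan limit" in text:
        return "quota_exhausted"  # nothing resets until the billing period
    if status == 429 or "rate" in text or "too many" in text:
        return "rate_limited"     # back off and retry
    return "unknown"              # surface to a human
```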
Fleet Architecture Patterns
Pattern 1: Per-Agent Rate Budget Allocation
Don't share rate budgets across agents naively. Each agent should have a defined budget for the APIs it touches.
```
Fleet: 10 agents running content publishing
Anthropic limit: 1000 RPM across the account

Bad:  each agent tries to use up to 1000 RPM (contention + 429 storms)
Good: allocate 100 RPM per agent, enforced via a token bucket in the
      orchestration layer
```
A token bucket per agent, rather than a shared pool, is the right abstraction. Shared pools collapse under burst load: when three agents hit their tasks simultaneously, they compete for the same rate budget and all three slow down.
Per-agent allocation makes failure behavior predictable and isolated.
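A minimal per-agent token bucket sketch, sized for the 100 RPM budget above:

```python
import time


class TokenBucket:
    """Per-agent rate budget: `capacity` tokens, refilled at `rate_per_sec`.

    One bucket per agent keeps failure behavior isolated; a shared bucket
    would let one bursty agent starve the others.
    """

    def __init__(self, capacity, rate_per_sec):
        self.capacity = capacity
        self.rate = rate_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_acquire(self, n=1):
        # Refill lazily based on elapsed time, capped at capacity
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


# 100 RPM per agent = capacity 100, refilled at ~1.67 tokens/sec
bucket = TokenBucket(capacity=100, rate_per_sec=100 / 60)
```

When `try_acquire` returns False, the agent defers the call instead of sending it and eating a 429.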
Pattern 2: Exponential Backoff with Jitter
Fixed delays are a trap. If 10 agents all hit a rate limit at the same time and all wait 30 seconds, they'll all retry at the same time and immediately hit the limit again.
The standard pattern:
```python
import random

def backoff_delay(attempt, base=1.0, max_delay=60.0):
    # Exponential with full jitter
    delay = min(base * (2 ** attempt), max_delay)
    return random.uniform(0, delay)
```
For Tier 1 APIs, use the Retry-After header as a floor, not a ceiling:
```python
# Retry-After is delta-seconds here; some APIs send an HTTP date instead
retry_after = int(response.headers.get('Retry-After', 0))
delay = max(retry_after, backoff_delay(attempt))
```
This is the difference between agents that recover and agents that spin.
Pattern 3: Time-Domain Multiplexing for Scheduled Fleets
If your agents run on schedules, stagger them. Agents doing similar work shouldn't all wake up at the same time.
```
Bad:  10 monitoring agents, all running every 15 minutes on the :00
Good: 10 monitoring agents, offset by 90 seconds each
      (agent 1 at :00, agent 2 at :01:30, agent 3 at :03:00, ...)
```
The rate limit math: if each agent makes 10 API calls in its first minute, running all 10 simultaneously means 100 calls in the first minute. Staggering them across 15 minutes means 10 calls/minute — well within any rate budget.
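The offsets are a one-liner to compute. A small sketch, using the 10-agent, 15-minute example above:

```python
def staggered_offsets(n_agents, period_minutes=15):
    """Evenly spread n agents' start times across a shared schedule period.

    Returns each agent's start offset in seconds; with 10 agents on a
    15-minute period, that's one wake-up every 90 seconds.
    """
    step = period_minutes * 60 / n_agents
    return [round(i * step) for i in range(n_agents)]
```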
This requires nothing from the API. It's pure fleet scheduling.
Pattern 4: Dynamic Rate Limit Discovery for Tier 2/3 APIs
For APIs that don't reliably surface rate limit state:
- Track response latency over time. APIs approaching rate limits often slow down before returning 429s. A sudden P95 latency increase is an early warning.
- Monitor X-RateLimit-Remaining headers when present and build adaptive throttling based on remaining headroom.
- Log all 4xx responses with timing. Multiple 429s in a 60-second window = you're at the limit. Single 429 every few hours = bursty but not capped.
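The 429-frequency heuristic in that last point can be sketched as a sliding-window tracker. The 60-second window and 3-hit threshold follow the rule of thumb above and are tunable assumptions:

```python
import time
from collections import deque


class RateSignal:
    """Sliding-window 429 tracker for APIs that hide rate limit state.

    Several 429s inside the window means you're at the limit; isolated
    429s hours apart mean bursty traffic, not a cap.
    """

    def __init__(self, window_sec=60, threshold=3):
        self.window = window_sec
        self.threshold = threshold
        self.hits = deque()

    def record_429(self, now=None):
        now = time.monotonic() if now is None else now
        self.hits.append(now)
        # Drop hits that have aged out of the window
        while self.hits and now - self.hits[0] > self.window:
            self.hits.popleft()

    def at_limit(self):
        return len(self.hits) >= self.threshold
```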
For HubSpot and Salesforce specifically — build a conservative rate governor at the orchestration layer. Don't trust the API to tell you when to slow down.
Handling Dynamic Rate Limits Under Load
Some APIs (Anthropic included) dynamically adjust rate limits under sustained high load, meaning your static rate budget allocation can become invalid mid-run.
The pattern that works:
- Monitor X-RateLimit-Remaining in every response, not just on 429s
- Implement adaptive throttling: if remaining drops below 20% of limit, slow request rate by 50% preemptively
- Build a limit discovery step at fleet startup: make a test call and record the headers to get the current effective limit, not the documented limit
```python
def get_effective_limit(client):
    # Lightweight probe call; use with_raw_response so the rate limit
    # headers are accessible (the parsed Message object doesn't carry them)
    resp = client.messages.with_raw_response.create(
        model="claude-3-haiku-20240307",
        max_tokens=1,
        messages=[{"role": "user", "content": "ping"}],
    )
    return {
        'requests_per_minute': int(resp.headers.get('anthropic-ratelimit-requests-limit', 0)),
        'tokens_per_minute': int(resp.headers.get('anthropic-ratelimit-tokens-limit', 0)),
        'remaining_tokens': int(resp.headers.get('anthropic-ratelimit-tokens-remaining', 0)),
    }
```
Anthropic exposes these headers consistently — part of why it scores 8.4 on execution. You can build real adaptive behavior on top of them.
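The adaptive throttling rule can be sketched directly from those headers. The 20% headroom and 50% slowdown thresholds are the ones from the list above, not vendor recommendations:

```python
def adjust_rate(current_rps, headers, floor_rps=0.1):
    """Preemptively slow down when remaining tokens drop below 20% of the limit.

    Reads the anthropic-ratelimit-* headers; the thresholds (20% headroom,
    halve the rate) follow the pattern described above.
    """
    limit = int(headers.get("anthropic-ratelimit-tokens-limit", 0))
    remaining = int(headers.get("anthropic-ratelimit-tokens-remaining", 0))
    if limit and remaining / limit < 0.20:
        return max(current_rps * 0.5, floor_rps)  # halve the request rate
    return current_rps
```

Call this on every response, not just on 429s, so the fleet slows down before the limit bites.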
The 2am Checklist
Before you deploy a fleet for overnight runs:
- [ ] Can each agent distinguish a rate limit from an auth failure from a downstream error? If not, you'll have phantom failures that look like rate limits.
- [ ] Does your backoff logic use jitter? Fixed delays cause thundering herds.
- [ ] Are your agents isolated or do they share rate budget? Shared budgets collapse under concurrent load.
- [ ] For Tier 3 APIs (HubSpot, Salesforce): do you have a rate governor at the orchestration layer? You can't rely on the API to tell you when to slow down.
- [ ] Do you have retry limits with clean failure states? Infinite retry loops mask the real problem. Set a max retry count, let the task fail cleanly with a detailed error log.
- [ ] Are scheduled agents staggered? Time-domain multiplexing costs nothing and prevents burst contention.
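The retry-limit item above can be sketched as a bounded retry wrapper with a clean terminal failure. In real fleet code you'd catch only retryable errors (rate limits, transient 5xx) rather than all exceptions:

```python
import random
import time


def call_with_retry_cap(fn, max_retries=5, sleep=time.sleep):
    """Bounded retries with full-jitter backoff and a clean terminal failure.

    Catching bare Exception here is a simplification; production code
    should re-raise non-retryable errors immediately.
    """
    last_exc = None
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            sleep(random.uniform(0, min(1.0 * (2 ** attempt), 60.0)))
    # Fail cleanly with a detailed error instead of looping forever
    raise RuntimeError(f"gave up after {max_retries} attempts: {last_exc!r}")
```

The injectable `sleep` keeps the wrapper testable and lets an orchestrator substitute its own scheduler.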
Fleet reliability isn't about making APIs faster. It's about making your agents fail gracefully when APIs are slow.
The AN Score Execution Dimension
The execution dimension (70% of AN Score) directly measures these qualities:
- Error classification — does the API distinguish between rate limits, quota exhaustion, auth failures, and transient errors?
- Retry guidance — does the API tell your agent how long to wait?
- Structured errors — are error responses machine-parseable?
- Idempotency support — can your agent safely retry without duplicate side effects?
The gap between Anthropic (8.4) and HubSpot (4.6) isn't just capability — it's the difference between a fleet that self-heals at 2am and one that requires a human to debug in the morning.
AN Score execution data: Anthropic 8.4, Exa 8.7, Tavily 8.6, Twilio 8.0, Stripe 8.1, OpenAI 6.3, HubSpot 4.6, Salesforce 4.8. Full index: rhumb.dev
Part of the agent infrastructure series: LLM APIs comparison | What breaks at scale