DEV Community

Rhumb

Posted on • Originally published at rhumb.dev

Designing Agent Fleets That Survive Rate Limits: A Production Architecture Guide

Rate limits are not API problems. They're fleet architecture problems.

When you're running a single agent, a rate limit is an inconvenience. When you're running a fleet — agents doing site auditing, content publishing, monitoring, data extraction — at 2am, a rate limit is a reliability failure that compounds across every dependent task.

I've been sharing AN Score data on agent-facing APIs, and the most consistent feedback from real fleet operators is this: it's not the capability that breaks you, it's the failure behavior.

So this post is about designing fleets around rate limits — not fighting them, designing with them.


The Hierarchy of Rate Limit Quality

Before you design your fleet architecture, you need to know what you're working with. APIs break into three tiers based on rate limit handling quality:

Tier 1: Actionable Rate Limits (AN Score 7.5+)

These APIs tell you exactly what's wrong and how long to wait:

  • Retry-After: 30 in the response header
  • Error body: {"error": "rate_limit_exceeded", "retry_after": 30, "limit_type": "tokens_per_minute"}
  • Separate codes for different limit types (tokens vs requests vs concurrency)

Anthropic (8.4), Stripe (8.1), Twilio (8.0), Exa (8.7), Tavily (8.6) all land here.

With Tier 1 APIs, your backoff logic can be precise: parse the Retry-After, branch on limit_type, and schedule the retry exactly when the window opens.
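That precise branch can be sketched as follows. This is a minimal sketch: the field names (`retry_after`, `limit_type`) follow the example error body above and vary by provider.

```python
import json

def plan_retry(status, headers, body_text):
    """Decide how long to wait and why, from a Tier 1 rate limit response.

    Returns None for non-429 responses; otherwise a dict with the exact
    wait time and the limit type to branch on.
    """
    if status != 429:
        return None
    body = json.loads(body_text) if body_text else {}
    # The Retry-After header wins; the structured error body is the fallback.
    wait = float(headers.get("Retry-After", body.get("retry_after", 30)))
    return {"wait_seconds": wait, "limit_type": body.get("limit_type", "unknown")}
```

The scheduler can then sleep exactly `wait_seconds` and, for token-per-minute limits, also shrink the next request's payload.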

Tier 2: Informative Rate Limits (AN Score 6.0–7.4)

These APIs surface rate limit information, but inconsistently or incompletely:

  • Rate limit headers present but not always populated
  • Error codes exist but don't distinguish between limit types
  • Retry guidance sometimes in headers, sometimes in body, sometimes absent

OpenAI (6.3), Vercel (7.1), PostHog (6.9) fall here.

With Tier 2 APIs, you need defensive logic: check for headers first, fall back to exponential backoff with jitter when headers are absent, build in overshoot margin.

Tier 3: Opaque Rate Limits (AN Score < 6.0)

These APIs make rate limiting invisible:

  • Generic HTTP 429 with no timing guidance
  • Natural language error messages that can't be parsed programmatically
  • No distinction between rate limits, quota exhaustion, and auth failures

HubSpot (4.6), Salesforce (4.8), Pipedrive (5.7) live here.

With Tier 3 APIs, your agent can't distinguish "wait 30 seconds" from "you've hit your monthly quota" from "your auth token expired." You need rate sensing built into your architecture, not just your backoff logic.


Fleet Architecture Patterns

Pattern 1: Per-Agent Rate Budget Allocation

Don't share rate budgets across agents naively. Each agent should have a defined budget for the APIs it touches.

Fleet: 10 agents running content publishing
Anthropic limit: 1000 RPM across the account

Bad: Each agent tries to use up to 1000 RPM (contention + 429 storms)
Good: Allocate 100 RPM per agent, enforce via token bucket in orchestration layer

A token bucket per agent, rather than a shared pool, is the right abstraction. Shared pools collapse under burst load — when three agents hit their tasks simultaneously, they compete for the same rate budget and all three slow down.

Per-agent allocation makes failure behavior predictable and isolated.
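The allocation scheme above can be sketched as one token bucket per agent (100 RPM becomes a refill rate of 100/60 tokens per second; the `capacity` burst size is an assumption you'd tune per workload):

```python
import time

class TokenBucket:
    """Per-agent token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n=1):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

# One bucket per agent in the orchestration layer — never shared:
buckets = {f"agent-{i}": TokenBucket(rate=100 / 60, capacity=10) for i in range(10)}
```

An agent that fails `try_acquire` queues its request locally instead of firing it, so one agent's burst never consumes another agent's budget.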

Pattern 2: Exponential Backoff with Jitter

Fixed delays are a trap. If 10 agents all hit a rate limit at the same time and all wait 30 seconds, they'll all retry at the same time and immediately hit the limit again.

The standard pattern:

import random

def backoff_delay(attempt, base=1.0, max_delay=60.0):
    # Exponential with full jitter
    delay = min(base * (2 ** attempt), max_delay)
    return random.uniform(0, delay)

For Tier 1 APIs, use the Retry-After header as a floor, not a ceiling:

header = response.headers.get('Retry-After', '0')
# Retry-After can also be an HTTP-date; treat non-numeric values as absent
retry_after = int(header) if header.isdigit() else 0
delay = max(retry_after, backoff_delay(attempt))

This is the difference between agents that recover and agents that spin.

Pattern 3: Time-Domain Multiplexing for Scheduled Fleets

If your agents run on schedules, stagger them. Agents doing similar work shouldn't all wake up at the same time.

Bad: 10 monitoring agents, all run every 15 minutes on the :00
Good: 10 monitoring agents, offset by 90 seconds each
      (agent 1 at :00, agent 2 at :01:30, agent 3 at :03:00, ...)

The rate limit math: if each agent makes 10 API calls in its first minute, running all 10 simultaneously means 100 calls in the first minute. Staggering them across 15 minutes means 10 calls/minute — well within any rate budget.

This requires nothing from the API. It's pure fleet scheduling.
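The offsets above fall out of one line of arithmetic — spread the agents evenly across one scheduling interval (a sketch; 900 seconds is the 15-minute cycle from the example):

```python
def staggered_start_times(n_agents, interval_seconds=900):
    """Evenly spaced start offsets (in seconds) for n_agents within one cycle.

    With 10 agents on a 15-minute cycle this yields the 90-second
    offsets from the example above.
    """
    offset = interval_seconds / n_agents
    return [round(i * offset) for i in range(n_agents)]
```

Feed each offset into your scheduler (cron with per-agent delay, or a sleep at agent startup) and the burst disappears without touching the API.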

Pattern 4: Dynamic Rate Limit Discovery for Tier 2/3 APIs

For APIs that don't reliably surface rate limit state:

  1. Track response latency over time. APIs approaching rate limits often slow down before returning 429s. A sudden P95 latency increase is an early warning.
  2. Monitor X-RateLimit-Remaining headers when present and build adaptive throttling based on remaining headroom.
  3. Log all 4xx responses with timing. Multiple 429s in a 60-second window = you're at the limit. Single 429 every few hours = bursty but not capped.

For HubSpot and Salesforce specifically — build a conservative rate governor at the orchestration layer. Don't trust the API to tell you when to slow down.
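The 429-window heuristic from step 3 can be sketched as a sliding-window sensor that the governor consults before each request (the 60-second window and threshold of 3 are assumptions to tune per API):

```python
import time
from collections import deque

class RateSensor:
    """Sliding-window 429 counter for Tier 3 APIs.

    Multiple 429s inside the window mean you are at the limit and the
    governor should slow the fleet; isolated 429s are just bursts.
    """

    def __init__(self, window_seconds=60, threshold=3):
        self.window = window_seconds
        self.threshold = threshold
        self.hits = deque()

    def record_429(self, now=None):
        now = time.monotonic() if now is None else now
        self.hits.append(now)
        # Drop 429s that have aged out of the window.
        while self.hits and now - self.hits[0] > self.window:
            self.hits.popleft()

    def at_limit(self):
        return len(self.hits) >= self.threshold
```

Pair this with the latency tracking from step 1 and you get a rate signal the API itself never gave you.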


Handling Dynamic Rate Limits Under Load

Some APIs (Anthropic included) dynamically adjust rate limits under sustained high load, meaning your static rate budget allocation can become invalid mid-run.

The pattern that works:

  1. Monitor X-RateLimit-Remaining in every response, not just on 429s
  2. Implement adaptive throttling: if remaining drops below 20% of limit, slow request rate by 50% preemptively
  3. Build a limit discovery step at fleet startup: make a test call and record the headers to get the current effective limit, not the documented limit

def get_effective_limit(client):
    # Lightweight probe call. messages.create returns a parsed Message,
    # so use with_raw_response to keep the HTTP headers accessible.
    resp = client.messages.with_raw_response.create(
        model="claude-3-haiku-20240307",
        max_tokens=1,
        messages=[{"role": "user", "content": "ping"}]
    )
    headers = resp.headers
    return {
        'requests_per_minute': int(headers.get('anthropic-ratelimit-requests-limit', 0)),
        'tokens_per_minute': int(headers.get('anthropic-ratelimit-tokens-limit', 0)),
        'remaining_tokens': int(headers.get('anthropic-ratelimit-tokens-remaining', 0))
    }

Anthropic exposes these headers consistently — part of why it scores 8.4 on execution. You can build real adaptive behavior on top of them.
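Step 2 above, adaptive throttling on remaining headroom, can be sketched as a delay function the orchestrator applies between requests (assuming the limit headers have already been parsed to integers; the 1-second floor is an assumption):

```python
def adaptive_delay(remaining, limit, base_delay=1.0):
    """Inter-request delay in seconds, given remaining rate-limit headroom.

    Below 20% headroom, double the delay (i.e. halve the request rate)
    preemptively, before any 429 arrives.
    """
    if limit <= 0:
        return base_delay  # headers absent: no adaptive signal, keep the baseline
    if remaining < 0.2 * limit:
        return max(base_delay * 2, 1.0)
    return base_delay
```

Because this reads headroom on every response, a dynamically lowered limit shows up within one request instead of one 429 storm later.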


The 2am Checklist

Before you deploy a fleet for overnight runs:

  • [ ] Can each agent distinguish a rate limit from an auth failure from a downstream error? If not, you'll have phantom failures that look like rate limits.
  • [ ] Does your backoff logic use jitter? Fixed delays cause thundering herds.
  • [ ] Are your agents isolated or do they share rate budget? Shared budgets collapse under concurrent load.
  • [ ] For Tier 3 APIs (HubSpot, Salesforce): do you have a rate governor at the orchestration layer? You can't rely on the API to tell you when to slow down.
  • [ ] Do you have retry limits with clean failure states? Infinite retry loops mask the real problem. Set a max retry count, let the task fail cleanly with a detailed error log.
  • [ ] Are scheduled agents staggered? Time-domain multiplexing costs nothing and prevents burst contention.
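The retry-limit item above can be sketched as a bounded wrapper that combines jittered backoff with a hard cap and a clean, loggable failure (a sketch; `max_retries=5` is an assumed default):

```python
import random
import time

class RetryExhausted(Exception):
    """Raised when an operation still fails after max_retries attempts."""

def with_retries(op, max_retries=5, base=1.0, max_delay=60.0):
    """Run op() with jittered exponential backoff and a hard retry cap.

    The task fails cleanly with the last error attached instead of
    looping forever and masking the real problem.
    """
    last_err = None
    for attempt in range(max_retries):
        try:
            return op()
        except Exception as err:
            last_err = err
            delay = min(base * (2 ** attempt), max_delay)
            time.sleep(random.uniform(0, delay))
    raise RetryExhausted(f"gave up after {max_retries} attempts: {last_err!r}")
```

Catching `RetryExhausted` at the orchestration layer gives you exactly one detailed log line per failed task instead of an infinite retry loop.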

Fleet reliability isn't about making APIs faster. It's about making your agents fail gracefully when APIs are slow.


The AN Score Execution Dimension

The execution dimension (70% of AN Score) directly measures these qualities:

  • Error classification — does the API distinguish between rate limits, quota exhaustion, auth failures, and transient errors?
  • Retry guidance — does the API tell your agent how long to wait?
  • Structured errors — are error responses machine-parseable?
  • Idempotency support — can your agent safely retry without duplicate side effects?

The gap between Anthropic (8.4) and HubSpot (4.6) isn't just capability — it's the difference between a fleet that self-heals at 2am and one that requires a human to debug in the morning.


AN Score execution data: Anthropic 8.4, Exa 8.7, Tavily 8.6, Twilio 8.0, Stripe 8.1, OpenAI 6.3, HubSpot 4.6, Salesforce 4.8. Full index: rhumb.dev

Part of the agent infrastructure series: LLM APIs comparison | What breaks at scale
