I burned my Anthropic org cap and waited 3 days. Then I built llmfleet.

#hermesagent #ai #llm #python

Tuesday afternoon I kicked off a re-grading job. About 18,000 prompts against claude-opus-4-7, eight workers, each one looping messages.create as fast as it could.

Forty minutes in, every call started coming back with a 429 and a header that said anthropic-ratelimit-tokens-remaining: 0. Fine, I thought. Back off. I cut workers to four and waited. Still 429. Cut to two. Still 429.

Then I noticed the cap-clear timestamp was not minutes away. The window was rolling. I had pushed past the daily token budget for the whole org, and a daily window does not reset in five minutes.

I emailed support. They acknowledged Wednesday morning. They cleared the cap Friday afternoon. 72 hours.

I am not going to claim the engineering was elegant after that. I sat there refreshing the dashboard for three days. When the cap finally cleared, I built llmfleet so I would never sit there again.

What it does

llmfleet is a pooled dispatcher for messages.create. You hand it a list of message payloads and a concurrency cap, and it runs them with backpressure that respects two things at once: in-flight request count, and the most recent anthropic-ratelimit-tokens-remaining header.

The Sandler-inspired piece is the negotiation. Instead of a hard semaphore, the pool watches what the API tells it. If the remaining-tokens header drops under a threshold, in-flight slots get held until the window ticks. No frantic 429 retries.

import asyncio
import os
from llmfleet import Fleet

fleet = Fleet(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    max_in_flight=8,
    soft_token_floor=20_000,   # pause new dispatches under this
    hard_token_floor=2_000,    # full stop until next window
)

payloads = [
    {"model": "claude-opus-4-7", "max_tokens": 256,
     "messages": [{"role": "user", "content": prompt}]}
    for prompt in prompts
]

async def run():
    async for result in fleet.dispatch(payloads):
        store(result.payload_id, result.response, result.cost_usd)

asyncio.run(run())

dispatch is an async iterator that yields results in completion order, not submission order. Each result has the original payload id, the response, latency in ms, and a cost estimate.
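For readers who have not built one of these, here is a minimal sketch of how completion-order yielding can work with plain asyncio. It is not the library's code. The function name dispatch_in_completion_order and the call parameter (any coroutine function that takes a payload) are made up for illustration, and a bare semaphore stands in for the real backpressure.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Sequence, Tuple


async def dispatch_in_completion_order(
    payloads: Sequence[dict],
    call: Callable[[dict], Awaitable[object]],  # hypothetical: wraps the API call
    max_in_flight: int,
) -> AsyncIterator[Tuple[int, object]]:
    """Yield (payload_index, response) pairs as each call finishes."""
    sem = asyncio.Semaphore(max_in_flight)

    async def one(idx: int, payload: dict) -> Tuple[int, object]:
        # The semaphore caps how many calls run at once.
        async with sem:
            return idx, await call(payload)

    tasks = [asyncio.create_task(one(i, p)) for i, p in enumerate(payloads)]
    # as_completed hands back awaitables in the order they finish,
    # which is why results arrive out of submission order.
    for finished in asyncio.as_completed(tasks):
        yield await finished
```

Carrying the index alongside the response is what lets the caller match each result back to its payload, which is the role the payload id plays in the real result objects.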

Real numbers I cite when people ask

On a single Anthropic key with no special quotas:

Messages/sec ceiling I see in practice for short prompts (about 400 input tokens, 200 output): around 6.2 req/s sustained before the soft floor kicks in.
Time spent waiting at the soft floor over a 10-minute window: about 11% of wall clock.
Time spent paused at the hard floor: zero, if you set soft_token_floor to about 10% of your tokens-per-minute quota. That is the whole point of the soft floor.

If you have higher tier quotas the numbers shift, but the shape is the same.
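The two floors are easier to reason about as a tiny state function. This is a sketch of the idea, not code from the library: Gate, parse_remaining, gate_state and suggested_soft_floor are illustrative names, and the 200,000 tokens-per-minute quota in the usage lines is a made-up figure.

```python
from enum import Enum
from typing import Mapping, Optional

HEADER = "anthropic-ratelimit-tokens-remaining"


class Gate(Enum):
    OPEN = "open"  # dispatch freely
    SOFT = "soft"  # pause new dispatches, let in-flight calls finish
    HARD = "hard"  # full stop until the next window


def parse_remaining(headers: Mapping[str, str]) -> Optional[int]:
    """Read the remaining-tokens header; None if it is absent or malformed.

    Assumes header names were lower-cased by the caller.
    """
    raw = headers.get(HEADER)
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        return None


def gate_state(remaining: Optional[int], soft_floor: int, hard_floor: int) -> Gate:
    if remaining is None:
        return Gate.OPEN  # assumption: with no signal yet, stay open
    if remaining < hard_floor:
        return Gate.HARD
    if remaining < soft_floor:
        return Gate.SOFT
    return Gate.OPEN


def suggested_soft_floor(tokens_per_minute_quota: int, fraction: float = 0.10) -> int:
    """About 10% of the per-minute token quota, per the rule of thumb above."""
    return round(tokens_per_minute_quota * fraction)


soft = suggested_soft_floor(200_000)  # hypothetical quota
state = gate_state(parse_remaining({HEADER: "15000"}), soft, 2_000)
```

With those inputs the gate lands in the soft state: new dispatches wait, nothing gets retried, and the hard floor is never touched.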

Queue depth math

The naive question is: how big should max_in_flight be?

Sandler's answer is a Little's Law calculation. If your average latency is L seconds and you want throughput R req/s, you need at least R*L concurrent calls in flight to saturate.

For Claude Opus with a 200-token output and typical 4-second responses at 6 req/s, that is 6 * 4 = 24 in-flight. But the Anthropic per-minute request limit on most accounts will choke you before then. So the real max_in_flight is min(R*L, perminute_quota / 60 * L), where perminute_quota is your requests-per-minute limit.

llmfleet does this math for you if you pass tier="default" or whatever your tier is. It logs the chosen ceiling at startup.
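If you want to sanity-check the ceiling by hand, the arithmetic fits in a few lines. This is a sketch of the formula above, not the library's tier lookup. in_flight_ceiling is an illustrative name, and the quota values in the usage lines are invented to show both branches of the min.

```python
import math


def in_flight_ceiling(
    target_rps: float,
    avg_latency_s: float,
    requests_per_minute_quota: float,
) -> int:
    """min(R*L, quota/60 * L), rounded down, never below 1."""
    wanted = target_rps * avg_latency_s  # Little's Law: R * L
    allowed = requests_per_minute_quota / 60 * avg_latency_s  # quota-bound R * L
    return max(1, math.floor(min(wanted, allowed)))


# Hypothetical quotas: one generous, one tight.
roomy = in_flight_ceiling(6, 4.0, 1_000)  # throughput target is the binding limit
tight = in_flight_ceiling(6, 4.0, 200)    # the per-minute quota is the binding limit
```

Rounding down is a deliberate choice here: overshooting the quota-bound figure by one slot is exactly the kind of small excess that ends in a 429.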

A small detail that mattered

The 429 retry that originally got me into this mess was not malicious. It was the SDK doing its default exponential backoff. Every worker was independently backing off and re-firing, which kept the cap pinned at zero for hours after the actual job was idle.

llmfleet disables the SDK's internal retry. The pool owns the retry budget: one shared count across all workers. When a single request fails with a non-retryable error, the pool can decide whether to surface it or move on, and the dispatcher logs the cost of the failed attempt so it does not disappear from your budget tracking.

fleet = Fleet(api_key=...,
              retry_policy=dict(max_attempts=3, base_delay=2.0, max_delay=30.0),
              shared_retry_budget_per_min=20)
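To make "one shared count" concrete, here is a sketch of a per-minute retry budget that every worker draws from, plus a capped exponential delay using the same numbers as the retry_policy above. It is an illustration under my own assumptions, not the library's internals: SharedRetryBudget, try_spend and backoff_delay are hypothetical names, and the sliding 60-second window is one possible reading of "per minute".

```python
import time
from collections import deque
from typing import Callable, Deque


class SharedRetryBudget:
    """Allow at most per_minute retries across all workers in any 60 s span."""

    def __init__(self, per_minute: int, clock: Callable[[], float] = time.monotonic):
        self.per_minute = per_minute
        self._clock = clock
        self._spent: Deque[float] = deque()  # timestamps of granted retries

    def try_spend(self) -> bool:
        now = self._clock()
        # Forget retries that have aged out of the window.
        while self._spent and now - self._spent[0] >= 60.0:
            self._spent.popleft()
        if len(self._spent) >= self.per_minute:
            return False  # budget exhausted: surface the failure instead
        self._spent.append(now)
        return True


def backoff_delay(attempt: int, base_delay: float = 2.0, max_delay: float = 30.0) -> float:
    """Delay before retry number `attempt` (1-based), doubling up to max_delay."""
    return min(max_delay, base_delay * 2 ** (attempt - 1))
```

The difference from per-worker backoff is that a denied try_spend stops the retry entirely. Eight workers can no longer each decide, independently, that now is a fine time to fire again.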

Cost guard

I also added a hard USD cap because I do not trust myself at 2 AM.

fleet = Fleet(api_key=..., max_spend_usd=15.00)

When the running total crosses the cap, no new dispatches go out. In-flight ones still complete. The iterator yields a final BudgetExceeded marker and stops.
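The guard itself is simple bookkeeping. A minimal sketch, assuming cost is estimated from token counts and per-million-token prices that you supply; SpendGuard and estimate_cost_usd are illustrative names, and the prices in the usage lines are placeholders, not Anthropic's actual rates.

```python
from dataclasses import dataclass


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Token counts times per-million-token prices."""
    return (
        input_tokens / 1_000_000 * usd_per_m_input
        + output_tokens / 1_000_000 * usd_per_m_output
    )


@dataclass
class SpendGuard:
    max_spend_usd: float
    spent_usd: float = 0.0

    def record(self, cost_usd: float) -> None:
        # Called for every finished attempt, including failed ones.
        self.spent_usd += cost_usd

    def may_dispatch(self) -> bool:
        # Checked before each new dispatch; in-flight calls are not cancelled.
        return self.spent_usd < self.max_spend_usd


guard = SpendGuard(max_spend_usd=15.00)
# Placeholder prices, purely for illustration.
guard.record(estimate_cost_usd(400, 200, usd_per_m_input=10.0, usd_per_m_output=50.0))
can_continue = guard.may_dispatch()
```

Because in-flight calls still complete, the final total can overshoot the cap by up to the cost of whatever was already running, so set the cap a little under the number that actually hurts.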

What this does not solve

It does not raise your account quota. Three days of waiting was a quota issue, not a code issue. llmfleet keeps you under the line, not over it.
It only talks to Anthropic right now. The interface mirrors messages.create exactly. I could generalize to OpenAI, but I have not yet.
It does not do prompt caching for you. If you want that, look at cachebench. The two compose: caching reduces the tokens you count against the floor.
It does not implement priority lanes. Every payload is FIFO. If you want one job to jump the queue, run two fleets.

The whole library is about 700 lines. The interesting part is the floor logic, not the queue.

Repo: https://github.com/MukundaKatta/llmfleet
PyPI: pip install llmfleet

Part of a small stack of agent-plumbing libs I keep building from real incidents. The unglamorous ones.