Anthropic's Batch API saves 50% on both input and output tokens. I have a hard time thinking of a feature with a better cost-to-effort ratio. Yet almost none of the agents I have built actually use it, because the docs make it look like a tool for offline processing and the SDK shapes it as a one-shot request that you then poll for 90 to 120 seconds.
That framing is wrong for one user. It is correct for a fleet.
If you are running many agents in parallel, each on its own coroutine, each making its own messages.create call, then the right unit of batching is not one user's turn. It is the fleet's turns, pooled together by a layer the user never sees, flushed on a window, and routed back to the awaiting coroutine via a Future. Eran Sandler wrote up why batch is terrible for one agent earlier this year. llmfleet is the inversion: it is what you do when "one agent" turns into "twenty agents."
The problem
The Batch API is shaped wrong for the SDK call site. You either:
- Pretend it does not exist and pay full price on every call.
- Hand-roll a small queue per project, get the polling logic wrong, miss the cost savings on half the requests anyway, and now you have a queue to maintain.
- Push everything through Batch and hate your life when the interactive chat path takes 90 seconds.
What you actually want is a routing decision per call: this one needs to come back fast, send it sync. This one can wait, pool it. Both paths look the same to the caller. The dispatcher figures out which is which.
The shape of the fix
The caller passes a latency_budget_ms and the dispatcher routes.
import asyncio

from anthropic import AsyncAnthropic
from llmfleet import FleetDispatcher, RoutingPolicy

# Placeholder inputs for the example; substitute your own essays.
essays = ["Essay one text", "Essay two text"]


async def main():
    client = AsyncAnthropic()
    policy = RoutingPolicy(
        sync_max_latency_ms=5_000,  # interactive paths stay sync
        batch_window_ms=30_000,     # otherwise pool for 30s
        batch_min_size=10,
        batch_max_size=100,
    )
    async with FleetDispatcher(client, policy=policy) as fleet:
        # Tight latency budget. Routed sync.
        chat = await fleet.submit(
            latency_budget_ms=2_000,
            model="claude-sonnet-4-20250514",
            max_tokens=200,
            messages=[{"role": "user", "content": "Hi"}],
        )
        # Loose latency budget. Pooled into a batch.
        graded = await asyncio.gather(*[
            fleet.submit(
                latency_budget_ms=600_000,
                model="claude-sonnet-4-20250514",
                max_tokens=200,
                messages=[{"role": "user", "content": f"Grade: {essay}"}],
            )
            for essay in essays
        ])


asyncio.run(main())
The interesting line is await fleet.submit(latency_budget_ms=...). The caller does not care whether the call went sync or batched. It just awaits a response. The dispatcher resolved the routing.
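One plausible reading of that routing step, as a minimal sketch: compare the caller's budget against the policy's sync_max_latency_ms. The RoutingPolicySketch class and the choose_route function are hypothetical names made up for illustration, and whether llmfleet uses a plain `<=` comparison or something more involved is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingPolicySketch:
    # Hypothetical stand-in for llmfleet's RoutingPolicy; only the field used here.
    sync_max_latency_ms: int = 5_000


def choose_route(latency_budget_ms: int, policy: RoutingPolicySketch) -> str:
    """Return "sync" for tight budgets and "batch" for patient ones (assumed rule)."""
    if latency_budget_ms <= policy.sync_max_latency_ms:
        return "sync"
    return "batch"


policy = RoutingPolicySketch()
print(choose_route(2_000, policy))    # the chat call from the example above
print(choose_route(600_000, policy))  # one of the grading calls
```

Under this rule the 2-second chat call lands on the sync path and the 10-minute grading calls land in the pool, which matches the comments in the example.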
If you want to force a route:
async def force_routes(fleet, kwargs):
    await fleet.submit_sync(**kwargs)   # always the synchronous path
    await fleet.submit_batch(**kwargs)  # always pooled into a batch
If you want to know what got pooled and what did not:
print(fleet.stats.sync_calls)
print(fleet.stats.batched_calls)
print(fleet.stats.batches_submitted)
That is the whole API.
What it does NOT do
- It does not route across providers or models based on quality. Use a real router for that.
- It does not do cross-process pooling. Fleet is process-local. If you need cross-process, put Redis or SQS in front.
- It does not try to pool tool-call turns where the tool is on the critical path. Pass force_sync=True (via submit_sync) for those.
- It does not retry failed batches by default. If a batch errors out, you get the error per submission.
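Since failures surface per submission, a retry can live on the caller's side. The helper below is a sketch, not part of llmfleet: submit_with_retry is a hypothetical name, it wraps any async callable, and it retries on every exception, which a real version would narrow to errors that are actually retryable.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def submit_with_retry(
    submit: Callable[..., Awaitable[Any]],
    *,
    attempts: int = 3,
    base_delay_s: float = 1.0,
    **kwargs: Any,
) -> Any:
    """Call submit(**kwargs), retrying with exponential backoff on any exception."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await submit(**kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Back off base_delay_s, then 2x, 4x, ... before the next try.
            await asyncio.sleep(base_delay_s * 2 ** attempt)
```

Usage would look like `await submit_with_retry(fleet.submit_batch, model=..., max_tokens=..., messages=...)`, assuming submit_batch raises when its batch fails.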
Inside the lib (one design choice worth showing)
The dispatcher runs a background flusher coroutine. Calls to submit() either run synchronously, or get queued and resolved by the flusher.
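# pseudo-shape of the flush decision (a method on the dispatcher)
import asyncio
from time import monotonic

async def _flusher(self):
    while not self._stopped:
        await asyncio.sleep(self.poll_interval_s)
        now = monotonic()
        ready = (
            # enough items are queued to justify a batch
            len(self._queue) >= self.policy.batch_min_size
            # or the oldest item has waited out the window (ms converted to s)
            or (self._queue and now - self._oldest_queued_at >= self.policy.batch_window_ms / 1000)
        )
        if ready:
            await self._flush_one_batch()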
Two thresholds, one race. The flusher fires when either:
- The queue has filled to batch_min_size, or
- The oldest queued item has been waiting batch_window_ms.
Whichever wins, wins. This is the bit that makes the fleet share batches across independent coroutines without anyone having to coordinate. Each submit() call just appends to the queue and awaits its Future. The flusher resolves the Future when the batch comes back.
The _oldest_queued_at check is what protects the slow path from starvation when traffic is bursty. Without it, a quiet hour after a busy one would leave items in the queue waiting for the next burst to push the count over batch_min_size. With it, every item becomes eligible for a flush within batch_window_ms of arrival, and the flusher picks it up on its next poll.
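To make the queue-plus-Future mechanics concrete, here is a toy pool that runs on its own. It is a sketch under stated assumptions, not llmfleet's code: MiniPool, run_batch, and fake_batch are hypothetical names, the batch call is faked in-process, and unlike the real flusher it drains the whole queue per flush instead of honoring batch_max_size.

```python
import asyncio
from time import monotonic


class MiniPool:
    """Toy pool: submit() queues a payload and awaits a Future the flusher resolves."""

    def __init__(self, run_batch, min_size=3, window_s=0.05, poll_s=0.01):
        self._run_batch = run_batch  # hypothetical: async fn, list of payloads -> list of results
        self._min_size = min_size
        self._window_s = window_s
        self._poll_s = poll_s
        self._queue = []             # (enqueued_at, payload, future), oldest first
        self.batches = 0

    async def submit(self, payload):
        fut = asyncio.get_running_loop().create_future()
        self._queue.append((monotonic(), payload, fut))
        return await fut

    async def run(self):
        while True:
            await asyncio.sleep(self._poll_s)
            if not self._queue:
                continue
            waited = monotonic() - self._queue[0][0]
            if len(self._queue) >= self._min_size or waited >= self._window_s:
                batch, self._queue = self._queue, []
                self.batches += 1
                try:
                    results = await self._run_batch([p for _, p, _ in batch])
                except Exception as exc:
                    # A failed batch fails every submission that was in it.
                    for _, _, fut in batch:
                        if not fut.done():
                            fut.set_exception(exc)
                    continue
                for (_, _, fut), result in zip(batch, results):
                    if not fut.done():
                        fut.set_result(result)


async def demo():
    async def fake_batch(payloads):  # stands in for a real Batch API round trip
        return [p.upper() for p in payloads]

    pool = MiniPool(fake_batch)
    flusher = asyncio.create_task(pool.run())
    burst = await asyncio.gather(*(pool.submit(s) for s in ["a", "b", "c"]))
    lone = await pool.submit("d")    # below min_size, so only the window can flush it
    flusher.cancel()
    return burst, lone, pool.batches


print(asyncio.run(demo()))
```

The burst of three flushes on the size threshold, and the lone fourth item flushes on the window. That second flush is the starvation case: without the window check, "d" would wait forever.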
When this is useful
- You run many concurrent agents and most of them are doing offline work like grading, summarization, or extraction.
- You are running a leaderboard or eval where 1,000 generations need to complete in the next hour, not the next second.
- You have a chat path and a background path in the same process and want one client for both.
- Your input tokens are large (long system prompts, RAG context) and the 50% discount is meaningful.
- You want the routing decision to be a parameter, not a separate code path.
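A quick way to judge whether the discount is meaningful for your workload is to run the arithmetic. The function below is a back-of-envelope sketch: estimated_cost_usd is a hypothetical helper, the per-million-token prices in the example are placeholders rather than quoted rates, and it assumes the 50% discount applies to the whole bill of every batched call.

```python
def estimated_cost_usd(
    input_tokens: int,
    output_tokens: int,
    batched_fraction: float,
    input_usd_per_mtok: float,
    output_usd_per_mtok: float,
    batch_discount: float = 0.5,
) -> float:
    """Blend full-price and discounted spend for a given share of batched traffic."""
    full = (
        input_tokens * input_usd_per_mtok + output_tokens * output_usd_per_mtok
    ) / 1_000_000
    return full * (1 - batched_fraction * batch_discount)


# Placeholder workload: 200M input tokens, 10M output tokens, made-up prices.
all_sync = estimated_cost_usd(200_000_000, 10_000_000, 0.0, 3.0, 15.0)
mostly_batched = estimated_cost_usd(200_000_000, 10_000_000, 0.8, 3.0, 15.0)
print(f"all sync: ${all_sync:.2f}, 80% batched: ${mostly_batched:.2f}")
```

With these placeholder numbers, routing 80% of traffic through batches cuts the bill by 40%. If fleet.stats counts are available, batched_calls divided by total calls gives a rough batched_fraction, though that weights calls equally rather than by token count.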
When this is NOT what you want
- You are running one agent. Use the sync API. Batch is the wrong shape for one user.
- You need sub-second p99 latency on every call. Even with sync routing, the dispatcher adds a tiny overhead.
- You need cross-process pooling. llmfleet is in-process only.
Install
pip install llmfleet
Repo: https://github.com/MukundaKatta/llmfleet
Sibling libraries
| Library | Role |
|---|---|
| cachebench | Per-call cache hit ratio + cost saved |
| agenttap | Wire-level prompt introspection |
| token-budget-py | Shared token/USD budget pool across coroutines |
| agenttrace | Run-level cost + latency aggregation |
| llm-retry | Exponential backoff for LLM calls |
cachebench and llmfleet stack well: pool patient calls into batches, then track cache hit ratio per call so you know when a batch silently missed the cache.
What's next
v0.1.0 is Anthropic-only. OpenAI Batch API and Bedrock async-invoke are on the roadmap. The dispatcher is provider-agnostic; the per-provider client adapter is the missing piece for each new vendor. PRs welcome.
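For anyone considering a PR, the adapter seam might look roughly like this. Everything here is a guess at the shape, not llmfleet's actual interface: BatchAdapter, send_sync, send_batch, EchoAdapter, and dispatch are all hypothetical names, and the fake vendor just echoes the last user message.

```python
import asyncio
from typing import Any, Protocol, Sequence


class BatchAdapter(Protocol):
    """Hypothetical seam between a provider-agnostic dispatcher and one vendor."""

    async def send_sync(self, request: dict[str, Any]) -> Any: ...

    async def send_batch(self, requests: Sequence[dict[str, Any]]) -> list[Any]: ...


class EchoAdapter:
    """Fake vendor used only to show the shape; it echoes the last user message."""

    async def send_sync(self, request):
        return request["messages"][-1]["content"]

    async def send_batch(self, requests):
        return [await self.send_sync(r) for r in requests]


async def dispatch(adapter: BatchAdapter, requests, batched: bool):
    # The dispatcher only talks to the adapter, never to a vendor SDK directly.
    if batched:
        return await adapter.send_batch(requests)
    return [await adapter.send_sync(r) for r in requests]


requests = [{"messages": [{"role": "user", "content": "hi"}]}]
print(asyncio.run(dispatch(EchoAdapter(), requests, batched=True)))
```

A real adapter would translate each request into the vendor's batch format, submit it, poll until the batch ends, and map results back to the original order.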
If you are running more than a handful of concurrent agents and not using Batch, the cheapest 50% you will save this quarter is sitting right there.