Mukunda Rao Katta

Posted on May 25

llmfleet: pool many agents' turns into one Batch API call and save 50 percent

#hermeschallenge #ai #llm #agents

Anthropic's Batch API cuts the price of a request by 50%, on input and output tokens alike. I have a hard time thinking of a feature with a better cost-to-effort ratio. Yet almost none of the agents I have built actually use it, because the docs make it look like a tool for offline processing and the SDK shapes it as a one-shot request that polls for 90 to 120 seconds.

That shape is wrong for one user. It is right for a fleet.

If you are running many agents in parallel, each on its own coroutine, each making its own messages.create call, then the right unit of batching is not one user's turn. It is the fleet's turns, pooled together by a layer the user never sees, flushed on a window, and routed back to the awaiting coroutine via a Future. Eran Sandler wrote up why batch is terrible for one agent earlier this year. llmfleet is the inversion: it is what you do when "one agent" turns into "twenty agents."

The problem

The Batch API is shaped wrong for the SDK call site. You end up doing one of three things:

Pretend it does not exist and pay full input cost.
Hand-roll a small queue per project, get the polling logic wrong, miss the cost savings on half the requests anyway, and now you have a queue to maintain.
Push everything through Batch and hate your life when the interactive chat path takes 90 seconds.

What you actually want is a routing decision per call: this one needs to come back fast, send it sync. This one can wait, pool it. Both paths look the same to the caller. The dispatcher figures out which is which.

The shape of the fix

The caller passes a latency_budget_ms and the dispatcher routes.

import asyncio
from anthropic import AsyncAnthropic
from llmfleet import FleetDispatcher, RoutingPolicy

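async def main():
    # Placeholder inputs so the example runs; swap in your own essays.
    essays = [f"Essay {i} text" for i in range(20)]
    client = AsyncAnthropic()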
    policy = RoutingPolicy(
        sync_max_latency_ms=5_000,    # interactive paths stay sync
        batch_window_ms=30_000,       # otherwise pool for 30s
        batch_min_size=10,
        batch_max_size=100,
    )

    async with FleetDispatcher(client, policy=policy) as fleet:
        # Tight latency budget. Routed sync.
        chat = await fleet.submit(
            latency_budget_ms=2_000,
            model="claude-sonnet-4-20250514",
            max_tokens=200,
            messages=[{"role": "user", "content": "Hi"}],
        )

        # Loose latency budget. Pooled into a batch.
        graded = await asyncio.gather(*[
            fleet.submit(
                latency_budget_ms=600_000,
                model="claude-sonnet-4-20250514",
                max_tokens=200,
                messages=[{"role": "user", "content": f"Grade: {essay}"}],
            )
            for essay in essays
        ])

asyncio.run(main())

The interesting line is await fleet.submit(latency_budget_ms=...). The caller does not care whether the call went sync or batched. It just awaits a response. The dispatcher resolved the routing.
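
The routing rule this implies can be sketched as a pure function. This is my guess at the shape, not llmfleet's actual code: `choose_route` and the `Policy` dataclass are hypothetical names, and the only rule assumed is that a budget at or below sync_max_latency_ms goes sync.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    # Hypothetical stand-in for RoutingPolicy; only the field used here.
    sync_max_latency_ms: int = 5_000


def choose_route(latency_budget_ms: int, policy: Policy) -> str:
    """Return "sync" when the caller cannot wait, otherwise "batch"."""
    if latency_budget_ms <= policy.sync_max_latency_ms:
        return "sync"
    return "batch"
```

With the policy from the example above, the 2_000 ms chat call lands on the sync side and the 600_000 ms grading calls land on the batch side.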

If you want to force a route:

async def force_routes(fleet, kwargs):
    await fleet.submit_sync(**kwargs)
    await fleet.submit_batch(**kwargs)

If you want to know what got pooled and what did not:

print(fleet.stats.sync_calls)
print(fleet.stats.batched_calls)
print(fleet.stats.batches_submitted)

That is the whole API.

What it does NOT do

It does not route across providers or models based on quality. Use a real router for that.
It does not do cross-process pooling. Fleet is process-local. If you need cross-process, put Redis or SQS in front.
It does not try to pool tool-call turns where the tool is on the critical path. Route those through submit_sync.
It does not retry failed batches by default. If a batch errors out, you get the error per submission.
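
Since retries are left to the caller, here is one way to handle them. This is a minimal sketch, assuming `fleet.submit` and `fleet.submit_sync` are coroutines that raise on failure and accept the same keyword arguments. `submit_all_with_fallback` is a hypothetical helper, not part of llmfleet, and the retry-once-on-sync policy is only an illustration.

```python
import asyncio


async def submit_all_with_fallback(fleet, requests):
    """Submit every request; retry each failure once on the sync path.

    `requests` is a list of kwargs dicts. Assumes both fleet methods take
    the same kwargs, which is an assumption about the library.
    """
    results = await asyncio.gather(
        *(fleet.submit(**req) for req in requests),
        return_exceptions=True,  # collect failures as values instead of raising the first
    )
    final = []
    for req, res in zip(requests, results):
        if isinstance(res, Exception):
            # One retry on the sync path; a second failure propagates to the caller.
            res = await fleet.submit_sync(**req)
        final.append(res)
    return final
```

Results come back in the same order as `requests`, so the caller can zip them against its inputs.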

Inside the lib (one design choice worth showing)

The dispatcher runs a background flusher coroutine. Calls to submit() either go straight to the sync Messages API, or get queued and resolved by the flusher.

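# pseudo-shape of the flush decision (a method on the dispatcher)
import asyncio
from time import monotonic

async def _flusher(self):
    while not self._stopped:
        # Wake up periodically and check both thresholds.
        await asyncio.sleep(self.poll_interval_s)
        now = monotonic()
        ready = (
            # Enough items are queued to justify a batch...
            len(self._queue) >= self.policy.batch_min_size
            # ...or the oldest item has waited out the window.
            or (self._queue and now - self._oldest_queued_at >= self.policy.batch_window_ms / 1000)
        )
        if ready:
            await self._flush_one_batch()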

Two thresholds, one race. The flusher fires when either:

The queue has filled to batch_min_size, or
The oldest queued item has been waiting batch_window_ms.

Whichever wins, wins. This is the bit that makes the fleet share batches across independent coroutines without anyone having to coordinate. Each submit() call just appends to the queue and awaits its Future. The flusher resolves the Future when the batch comes back.
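
The queue-plus-Future handoff can be sketched in a few lines. This is a toy, not the library's implementation: `TinyPool` and `run_batch` are hypothetical names, and `run_batch` stands in for whatever submits a batch and returns results in request order.

```python
import asyncio


class TinyPool:
    """Toy in-process pool: submit() enqueues, flush() resolves Futures."""

    def __init__(self, run_batch):
        # run_batch: async callable taking a list of requests and returning
        # a list of results in the same order (stand-in for a Batch API call).
        self._run_batch = run_batch
        self._queue = []  # list of (request, Future) pairs

    async def submit(self, request):
        fut = asyncio.get_running_loop().create_future()
        self._queue.append((request, fut))
        return await fut  # suspends this coroutine until flush() resolves it

    async def flush(self):
        # Swap the queue out first so new submissions join the next batch.
        pending, self._queue = self._queue, []
        if not pending:
            return 0
        try:
            results = await self._run_batch([req for req, _ in pending])
        except Exception as exc:
            for _, fut in pending:
                if not fut.done():
                    fut.set_exception(exc)  # each caller sees the batch error
            return len(pending)
        for (_, fut), result in zip(pending, results):
            if not fut.done():  # skip callers that were cancelled meanwhile
                fut.set_result(result)
        return len(pending)
```

No caller holds a reference to any other caller. The only shared state is the queue, which is why independent coroutines can land in the same batch without coordinating.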

The _oldest_queued_at check is what protects the slow path from starvation when traffic is bursty. Without it, a quiet hour after a busy one would leave items in the queue waiting for the next burst to push the count over batch_min_size. With it, an item waits at most about batch_window_ms, plus one poll interval, before its batch is submitted.
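
To make the quiet-hour case concrete, the same decision can be pulled out as a pure function and checked by hand. `should_flush` is a hypothetical helper written for this post, not a function the library exposes.

```python
def should_flush(queue_len, oldest_queued_at, now, batch_min_size, batch_window_ms):
    """Return True when either flush threshold is met.

    `oldest_queued_at` and `now` are seconds from the same monotonic clock;
    `oldest_queued_at` is ignored when the queue is empty.
    """
    if queue_len == 0:
        return False
    if queue_len >= batch_min_size:
        return True
    return (now - oldest_queued_at) >= batch_window_ms / 1000
```

Three items queued one second ago with batch_min_size=10 do not flush. The same three items queued thirty seconds ago with batch_window_ms=30_000 do, even though the size threshold was never reached.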

When this is useful

You run many concurrent agents and most of them are doing offline work like grading, summarization, or extraction.
You are running a leaderboard or eval where 1,000 generations need to complete in the next hour, not the next second.
You have a chat path and a background path in the same process and want one client for both.
Your input tokens are large (long system prompts, RAG context) and the 50% discount is meaningful.
You want the routing decision to be a parameter, not a separate code path.
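
A back-of-the-envelope way to see whether the discount matters for your workload. This is a sketch under one stated assumption: batched and sync calls cost about the same per call at list price. `estimated_cost` is a hypothetical helper; the 0.5 default is the 50% batch discount discussed above.

```python
def estimated_cost(sync_cost: float, batched_fraction: float, batch_discount: float = 0.5) -> float:
    """Estimate spend when a fraction of calls is routed through batch.

    sync_cost: what the whole workload would cost if every call went sync.
    batched_fraction: share of that cost routed through batch, from 0.0 to 1.0.
    """
    if not 0.0 <= batched_fraction <= 1.0:
        raise ValueError("batched_fraction must be between 0 and 1")
    return sync_cost * (1.0 - batched_fraction * batch_discount)
```

For example, a workload that would cost 100 units fully sync, with 80% of it routed through batch, comes out at 100 × (1 − 0.8 × 0.5) = 60 units. One rough source for the fraction is batched_calls divided by the sum of sync_calls and batched_calls from fleet.stats, which only holds if your calls are of similar size.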

When this is NOT what you want

You are running one agent. Use the sync API. Batch is the wrong shape for one user.
You need sub-second p99 latency on every call. Even with sync routing, the dispatcher adds a tiny overhead.
You need cross-process pooling. llmfleet is in-process only.

Install

pip install llmfleet

Repo: https://github.com/MukundaKatta/llmfleet

Sibling libraries

Library	Role
cachebench	Per-call cache hit ratio + cost saved
agenttap	Wire-level prompt introspection
token-budget-py	Shared token/USD budget pool across coroutines
agenttrace	Run-level cost + latency aggregation
llm-retry	Exponential backoff for LLM calls

cachebench and llmfleet stack well: pool patient calls into batches, then track cache hit ratio per call so you know when a batch silently missed the cache.

What's next

v0.1.0 is Anthropic-only. OpenAI Batch API and Bedrock async-invoke are on the roadmap. The dispatcher is provider-agnostic; the per-provider client adapter is the missing piece for each new vendor. PRs welcome.

If you are running more than a handful of concurrent agents and not using Batch, the cheapest 50% you will save this quarter is sitting right there.
