Synchronous batching is a throughput hack that became a design constraint. Hugging Face's latest work on asynchronous continuous batching shows why the sync/async distinction matters more than the batch size.
Most inference servers treat batching as a queuing problem. Requests pile up, you wait for N items or a timeout, then you process them together. This works until it doesn't—when your tail latency spikes because one long request blocks the entire batch, or when your GPU sits idle waiting for that last straggler to arrive.
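To make the queuing model concrete, here is a minimal sketch of a fixed-window batcher. The queue, the limits, and the `collect_batch` helper are illustrative stand-ins, not any particular server's API:

```python
# A minimal sketch of fixed-window batching: block until the batch
# fills or a timeout fires. The straggler problem lives in the timeout
# branch -- the GPU is idle the whole time we wait.
import queue
import time

MAX_BATCH = 8
TIMEOUT_S = 0.05

def collect_batch(request_queue: queue.Queue) -> list:
    """Block until MAX_BATCH requests arrive or TIMEOUT_S elapses."""
    batch = [request_queue.get()]  # wait for at least one request
    deadline = time.monotonic() + TIMEOUT_S
    while len(batch) < MAX_BATCH:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # timeout: run a partial batch; GPU sat idle meanwhile
    return batch
```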
The move to continuous batching helped. Instead of fixed windows, you could add and evict requests dynamically. But it was still fundamentally synchronous: every forward pass had to wait for the slowest sequence in the batch to complete its decode step. The GPU utilization looked good on dashboards, but the latency distribution told a different story.
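To see exactly where the stall lives, here is a sketch of that synchronous continuous-batching loop; `model`, `pool`, and the sequence objects are hypothetical stand-ins, not any real engine's interface:

```python
# A sketch of synchronous continuous batching. Admission and eviction
# happen dynamically between steps, but each decode step blocks the
# whole loop until the slowest sequence produces its next token.
import time

def sync_continuous_loop(model, pool, max_batch=32):
    active = []
    while True:
        # Admit waiting sequences into free batch slots.
        while len(active) < max_batch and pool.has_waiting():
            active.append(pool.pop_waiting())
        if not active:
            time.sleep(0.001)  # nothing to run; poll the pool
            continue
        # One synchronous decode step for the whole batch: nothing else
        # runs until every sequence has its next token.
        tokens = model.decode_step(active)
        for seq, tok in zip(active, tokens):
            seq.append_token(tok)
        # Finished sequences leave only at this step boundary.
        active = [s for s in active if not s.finished()]
```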
The Async Shift
Asynchronous continuous batching decouples the scheduling loop from the forward pass. Requests enter a pool, the scheduler decides what to run, and the GPU executes independently. This sounds subtle, but it reshapes how you reason about inference throughput.
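A minimal asyncio sketch of the decoupling, assuming hypothetical `pool`, `prepare_batch`, and `gpu_execute` functions. Scheduler and executor run as independent tasks joined by a small queue, so scheduling decisions overlap with, rather than block on, the forward pass:

```python
import asyncio

async def scheduler(pool, batch_queue: asyncio.Queue):
    while True:
        batch = await prepare_batch(pool)  # decide what runs next
        await batch_queue.put(batch)       # hand off; don't wait on the GPU

async def executor(model, pool, batch_queue: asyncio.Queue):
    while True:
        batch = await batch_queue.get()
        outputs = await gpu_execute(model, batch)  # the forward pass
        pool.report(outputs)  # completions feed the next scheduling decision

async def serve(model, pool):
    batch_queue = asyncio.Queue(maxsize=2)  # bounded lookahead
    await asyncio.gather(
        scheduler(pool, batch_queue),
        executor(model, pool, batch_queue),
    )
```

The bounded queue is the key design choice: it lets the scheduler run ahead of the GPU without letting it run arbitrarily far, which keeps scheduling decisions based on reasonably fresh completion information.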
First, you can pipeline. While the GPU is working on step T, the scheduler is already preparing the batch for step T+1. The overhead doesn't disappear, but it overlaps with useful work. On modern GPUs with async copy engines, this matters more than most benchmarks capture.
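The overlap looks roughly like this sketch, assuming a hypothetical `launch` that enqueues work asynchronously and returns a handle (in the style of CUDA stream APIs), plus a CPU-side `prepare` that builds the next batch:

```python
# Double-buffered stepping: while the GPU runs step T, the CPU builds
# the batch for step T+1, and we synchronize only when T's results
# are actually needed.
def pipelined_steps(model, pool, num_steps):
    next_batch = prepare(pool)              # batch for step 0
    for _ in range(num_steps):
        handle = launch(model, next_batch)  # GPU begins step T
        next_batch = prepare(pool)          # CPU builds step T+1 in parallel
        pool.report(handle.wait())          # sync point: collect step T
```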
Second, you can preempt. Not in the OS sense, but in the ability to yank a completed sequence from the batch mid-flight and replace it with a fresh one. The synchronous model forced you to wait for the entire batch to finish before anyone could leave. Async lets you maintain a full batch even when individual sequences have wildly different lengths.
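A sketch of that mid-flight replacement, assuming a hypothetical fixed-size `slots` list where each entry owns its KV-cache blocks. A sequence that finishes frees its slot at the next step boundary, and a waiting request takes it immediately:

```python
# Keep the batch full even when sequence lengths diverge wildly:
# finished sequences vacate their slots and fresh ones join in place.
def refill_slots(slots, pool):
    for i, seq in enumerate(slots):
        if seq is not None and seq.finished():
            pool.complete(seq)   # stream out the final tokens
            slots[i] = None      # slot and its KV blocks are free
        if slots[i] is None and pool.has_waiting():
            slots[i] = pool.pop_waiting()  # fresh sequence joins mid-flight
```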
Why This Matters for Agents
Agent workloads break traditional batching assumptions. Tool calls introduce non-deterministic latency. A request might pause for 500ms waiting for a search result, then resume with a burst of generation. Synchronous batching either holds the slot (wasting GPU memory) or evicts the request (paying recompute costs). Neither is acceptable at scale.
Async batching treats these pauses as first-class citizens. The request steps aside, the GPU keeps working on other sequences, and the scheduler brings it back when the tool responds. The memory stays allocated, but the compute doesn't stall.
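In sketch form, with a hypothetical scheduler that separates "runnable" from "waiting on I/O" states, the suspend/resume cycle might look like this. The sequence keeps its KV cache while suspended; only its compute slot is released:

```python
# Tool-call suspension: the sequence steps out of the batch while the
# tool runs, then rejoins without paying any recompute cost.
async def run_tool_call(seq, scheduler, tool):
    scheduler.suspend(seq)          # leave the batch; KV cache stays resident
    seq.tool_result = await tool.call(seq.pending_call)  # e.g. a ~500ms search
    scheduler.resume(seq)           # rejoin the pool when the result lands
```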
This is particularly relevant for the emerging class of "always-on" agents that maintain long-running sessions. You can't batch these traditionally—they're perpetual. But you can interleave them with short-turnaround requests if your scheduler understands async completion.
The Implementation Reality
Hugging Face's TGI and the independently developed vLLM have both moved toward async scheduling, though the implementations differ. TGI uses a dedicated scheduling thread that runs ahead of the GPU, while vLLM's recent iterations push more of the async logic into the CUDA graph itself. The tradeoffs are familiar: thread overhead versus kernel launch latency, complexity versus control.
What both approaches acknowledge is that the synchronous abstraction was a convenience, not a requirement. The hardware has been capable of async execution for years. The software is catching up.
The Takeaway
If you're running inference at scale, look at your tail latency percentiles, not your average throughput. If p99 is more than 3x your median, you're probably suffering from synchronous batching artifacts. Async continuous batching won't fix everything—memory bandwidth is still a bottleneck, and attention costs don't disappear—but it removes a class of scheduling-induced latency that has no business existing in 2026.
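Checking the heuristic is a one-liner over whatever per-request latency samples your instrumentation already collects (seconds, in this sketch):

```python
import statistics

def sync_batching_suspect(latencies: list[float]) -> bool:
    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    p50, p99 = cuts[49], cuts[98]                  # median and p99
    return p99 > 3 * p50
```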
The best part: for many workloads, this is a software upgrade, not a hardware purchase. Your A100s or H100s get immediately more useful when the scheduler stops waiting for permission to work.