Streaming AI Inference: The Software Fix That Cuts LLM Energy Bills

The loudest part of the LLM energy conversation is about hardware: how many GPUs you need, which data center they sit in, what the power draw looks like on a busy Friday. That framing is incomplete. A 2026 paper on arXiv (2601.22362) reported that request arrival shaping alone (staggering when requests hit the server) cut energy per request by up to 100x relative to a naive baseline, with no model changes and no new hardware. The model was identical. The GPU was identical. Only the scheduling was different.

That result is extreme and depends on a specific workload setup, but the direction it points is real: a large fraction of inference cost and energy comes from how work is scheduled, not just from the weight of the model. This article walks through the four software-side levers that matter most — continuous batching, KV-cache management, speculative decoding, and model routing — and the engineering tradeoffs each one forces you to make.

Continuous Batching: Stop Waiting for the Slowest Request

The static batching model that predates modern inference servers is simple: collect a group of requests, pad their prompts to the same length, run them together, return all results only when the last sequence finishes, and repeat. The problem is the padding and the waiting. If one request in your batch needs 2,000 output tokens and the others need 50, the batch slots held by the short requests do no useful work after token 50, yet the GPU keeps drawing power until the long request finishes. The waste scales with output-length variance, which is exactly what characterizes real traffic: you cannot predict who asks a short question and who asks for a 3,000-word essay.

Continuous batching (described in the Orca paper by Yu et al., 2022, and subsequently implemented in vLLM, TGI, and TensorRT-LLM) replaces batch-level scheduling with iteration-level scheduling. When a sequence in the batch finishes generating, the serving system immediately slots in a new request for the next forward pass rather than waiting for the entire batch to drain. The GPU stays busy on real tokens instead of padding tokens.
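To make the difference concrete, here is a minimal simulation sketch that counts forward passes under both policies. It is a toy model, not how vLLM, TGI, or TensorRT-LLM implement scheduling: the function names are hypothetical, every request is assumed to be queued at time zero, and prefill cost is ignored.

```python
from collections import deque

def static_batching_steps(output_lens, batch_size):
    """Return (forward passes, idle slot-steps) when each batch runs
    until its longest request finishes."""
    steps = wasted = 0
    for i in range(0, len(output_lens), batch_size):
        batch = output_lens[i:i + batch_size]
        longest = max(batch)
        steps += longest
        # Slots of already-finished requests sit idle until the batch drains.
        wasted += sum(longest - n for n in batch)
    return steps, wasted

def continuous_batching_steps(output_lens, batch_size):
    """Return forward passes when a finished sequence's slot is refilled
    before the next iteration (iteration-level scheduling)."""
    waiting = deque(output_lens)
    running = []  # remaining output tokens for each in-flight sequence
    steps = 0
    while waiting or running:
        # Top up free slots before every forward pass.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one pass produces one token per running sequence
        running = [n - 1 for n in running if n > 1]
    return steps

# One long request followed by many short ones, four batch slots.
workload = [2000] + [50] * 119
print(static_batching_steps(workload, 4))
print(continuous_batching_steps(workload, 4))
```

In this toy workload the long request pins three idle slots for most of its lifetime under static batching, while the continuous scheduler keeps those slots working through the short requests.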

The throughput improvement this produces is significant enough that Stripe reportedly achieved a 73% inference cost reduction when migrating to vLLM for a workload running around 50 million daily API calls. That figure comes from third-party reporting and covers a specific migration, so treat it as directional rather than a universal benchmark. What you can say with confidence: for workloads with variable output lengths, continuous batching is the single highest-leverage software change available, and almost every production serving framework now ships it by default.

The tradeoff is latency predictability. When the system is aggressively refilling batches, a new request arriving during a long decode sequence gets queued behind in-flight tokens. Tuning the preemption and priority policies becomes necessary once you care about p95 latency, not just average throughput.

KV-Cache Management: Memory Is the Actual Bottleneck

During autoregressive generation, each new token attends over every previous token. The key and value projections for those previous tokens — the KV cache — get recomputed from scratch every forward pass unless you save them. Caching them is the obvious move, but the memory cost scales linearly with context length, and a naive allocator reserves a contiguous block for the maximum possible sequence length at request start. On long-context workloads, this means 60–80% of your KV-cache memory can be sitting unused, reserved but not touched, blocking other requests from starting.
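The linear scaling is easy to work out by hand. The sketch below computes the cache size for a standard multi-head attention layout; the model shape (32 layers, 32 KV heads, head dimension 128, fp16) and the 4,096-token reservation are illustrative assumptions, not figures from any benchmark cited here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV-cache size for one sequence: keys and values (the factor of 2)
    for every layer, KV head, and token position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical model shape with an fp16 cache (2 bytes per element).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
reserved = kv_cache_bytes(32, 32, 128, seq_len=4096)  # naive max-length reservation
used = kv_cache_bytes(32, 32, 128, seq_len=1000)      # what the request actually needed

print(f"per token: {per_token / 2**20:.2f} MiB")
print(f"reserved:  {reserved / 2**30:.2f} GiB")
print(f"unused:    {1 - used / reserved:.0%} of the reservation")
```

Under these assumptions a request that stops at 1,000 tokens leaves roughly three quarters of its reservation untouched, which is the kind of waste the paragraph above describes.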

PagedAttention, developed at UC Berkeley and shipped in vLLM, applies the same insight as OS virtual memory: allocate KV cache in fixed-size pages and only map pages that are actually in use. Pages for a sequence are allocated incrementally as tokens are generated; only the last partial page wastes space. This shrinks effective KV-cache footprint substantially, which lets more requests run concurrently on the same GPU RAM.
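The allocation idea can be sketched in a few lines. This is a toy block-table allocator written for illustration, not vLLM's implementation; the class and method names are hypothetical, and it tracks only page bookkeeping, not the tensors themselves.

```python
class PagedKVAllocator:
    """Toy paged allocator: each sequence gets fixed-size pages on demand."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # seq_id -> list of physical page ids
        self.seq_lens = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        length = self.seq_lens.get(seq_id, 0)
        # A new page is needed only when the sequence is new or its last page is full.
        if length % self.page_size == 0:
            if not self.free_pages:
                raise MemoryError("no free KV pages; a real server would queue or preempt")
            self.block_tables.setdefault(seq_id, []).append(self.free_pages.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        # Finished sequences return their pages to the pool immediately.
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        # Only the unfilled tail of each sequence's last page is wasted.
        return sum(len(self.block_tables[s]) * self.page_size - n
                   for s, n in self.seq_lens.items())

alloc = PagedKVAllocator(num_pages=8, page_size=16)
for _ in range(50):
    alloc.append_token("a")
for _ in range(20):
    alloc.append_token("b")
print(alloc.block_tables, alloc.wasted_slots())
```

The waste per sequence is bounded by one page minus one token, regardless of how long the sequence could have grown.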

The decode phase of transformer inference is memory-bandwidth-bound, not compute-bound. Each generated token reads all cached key/value pairs from GPU HBM. Techniques that reduce KV-cache size (pruning, quantizing the cache to int8 or fp8, or evicting distant context) cut the memory reads per token and directly reduce the time per output token. Freeing cache memory also lets queued requests on the same GPU start sooner, and it means the serving system can run larger batches without swapping to CPU, which keeps the GPU busy on actual compute.
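A back-of-the-envelope bound shows why cache size matters for decode speed. The sketch assumes a purely bandwidth-bound step in which the weights are read once and each sequence's cache is read once; the byte counts and the 2 TB/s bandwidth figure are hypothetical round numbers, not measurements.

```python
def decode_step_seconds(weight_bytes, kv_bytes_per_seq, batch_size, hbm_bytes_per_sec):
    """Lower bound on one decode step if memory traffic is the only cost:
    weights are read once per step, each sequence's KV cache once."""
    return (weight_bytes + batch_size * kv_bytes_per_seq) / hbm_bytes_per_sec

# Hypothetical numbers: 14 GB of weights, 2 GB of fp16 cache per sequence,
# batch of 8, and 2 TB/s of HBM bandwidth.
fp16_cache = decode_step_seconds(14e9, 2e9, 8, 2e12)
int8_cache = decode_step_seconds(14e9, 1e9, 8, 2e12)  # cache halved by quantization

print(f"fp16 cache: {fp16_cache * 1e3:.1f} ms per step")
print(f"int8 cache: {int8_cache * 1e3:.1f} ms per step")
```

Real kernels overlap compute and memory traffic, so treat this as an estimate of the floor rather than a prediction of measured latency.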

Recent research (arxiv 2603.20397) surveys a range of KV-cache optimization strategies beyond PagedAttention: selective eviction of tokens whose attention scores fall below a threshold, quantizing the cache to lower precision than the weights, and sharing cache blocks across requests that share a prefix (useful for system prompts that repeat across every call). The paper's conclusion is that no single technique wins across all settings — the optimal combination depends on your context length distribution, hardware memory bandwidth, and latency SLO. That means you need to profile your actual traffic rather than copy a configuration from a benchmark.
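Prefix sharing is the easiest of these to illustrate. One common way to detect a shared prefix is to hash the token sequence block by block, chaining each hash to the previous one so that a key identifies a block together with everything before it. The sketch below assumes that scheme; the function names are hypothetical and the details differ between serving frameworks.

```python
import hashlib

def block_keys(token_ids, block_size):
    """Chain-hash each full block of tokens. Two requests produce the same
    key for block i only if all tokens up to the end of block i match."""
    keys, prev = [], b""
    full_blocks = len(token_ids) // block_size
    for i in range(full_blocks):
        block = list(token_ids[i * block_size:(i + 1) * block_size])
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        keys.append(prev)
    return keys

def shared_prefix_blocks(a, b, block_size):
    """Count the leading cache blocks two requests could share."""
    shared = 0
    for key_a, key_b in zip(block_keys(a, block_size), block_keys(b, block_size)):
        if key_a != key_b:
            break
        shared += 1
    return shared

system_prompt = list(range(40))          # stand-in token ids for a repeated system prompt
request_a = system_prompt + [100, 101]
request_b = system_prompt + [200]
print(shared_prefix_blocks(request_a, request_b, block_size=16))
```

Only full blocks are shareable in this sketch, so a 40-token system prompt with 16-token blocks yields two reusable blocks and a private partial block per request.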

Speculative Decoding: Fill GPU Compute You Are Already Paying For

Standard autoregressive generation has a structural inefficiency: the GPU executes one forward pass per output token, but the forward pass for a single token uses a tiny fraction of available compute. The GPU is massively parallel hardware being asked to do a sequential job.

Speculative decoding breaks the sequentiality by using a small, fast draft model to propose multiple tokens ahead, which the full target model then verifies in a single parallel forward pass. If the draft tokens match what the target model would have produced, you get several tokens for the cost of one target-model pass. If some tokens are rejected, you keep everything before the first rejection, take the target model's token at that position, and continue from there. Output quality is unchanged: the verification step guarantees that the result follows the same distribution as running the target model alone.
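The propose-and-verify loop can be sketched with toy models. This version assumes greedy decoding, where verification is a simple equality check; sampling-based variants use a probabilistic acceptance rule instead. The models here are hypothetical stand-in functions, and the verification loop stands in for what a real system does in one batched forward pass.

```python
def speculative_step(target_next, draft_next, context, k):
    """One round of greedy speculative decoding; returns the tokens kept."""
    # The draft model proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # The target model checks each position (one batched pass in practice).
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, drop the rest
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all accepted: the same pass yields a bonus token
    return accepted

def generate(target_next, draft_next, prompt, max_new, k=4):
    """Return (tokens, number of target verification rounds)."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target_next, draft_next, out, k))
        rounds += 1
    return out[:len(prompt) + max_new], rounds

# Toy models over digits: the draft agrees with the target except after a 7.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10

tokens, rounds = generate(target, draft, prompt=[0], max_new=12)
print(tokens, rounds)
```

The generated tokens match what the target would produce alone, but in fewer target rounds than tokens, which is where the latency win comes from when the draft is accurate.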

The practical speedup in production has been documented at 2–3x for latency-sensitive deployments, with NVIDIA reporting up to 3.6x throughput improvement on H200 hardware with appropriate draft model selection. The energy picture is more nuanced. Research published on arxiv (2602.09113) benchmarking speculative decoding energy found that at small batch sizes, speculative decoding can reduce total energy by around 29%, because finishing requests faster lets the GPU return to a lower-power state. At large batch sizes, the overhead of running the draft model and the verification pass can increase total energy even while reducing wall-clock latency.

The practical implication: speculative decoding is most beneficial when you are latency-constrained and batch sizes are modest — interactive chatbots, real-time coding assistants. For high-throughput batch processing where you are filling the GPU anyway, the gain narrows and can invert.

Model Routing: Most Requests Do Not Need Your Best Model

The fourth lever does not require any changes to inference infrastructure — it requires routing logic in front of your inference stack. Most production workloads contain a mix of query complexity. Factual lookups, JSON extraction from structured inputs, short classification tasks, and template fills are handled competently by models several tiers below your frontier model. Routing those requests to a smaller model costs proportionally less compute and energy.

The engineering challenge is that you do not know which requests are simple until after you have answered them. Router approaches fall into two categories. Cascade routing runs the small model first and escalates to the large model if the small model's confidence is below a threshold — this adds latency for the escalated fraction. Direct routing uses a lightweight classifier on the input to predict difficulty and pick a model upfront — faster but the classifier can misroute.
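The two router shapes look like this in outline. Everything below is a hypothetical sketch: the stub models, the confidence scores, the word-count difficulty heuristic, and the thresholds are placeholders for whatever your stack actually exposes.

```python
def cascade_route(query, small_model, large_model, threshold=0.8):
    """Try the small model first; escalate when its confidence is too low."""
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return answer, "small"
    # Escalated queries pay for both models, in latency and in compute.
    return large_model(query)[0], "escalated"

def direct_route(query, difficulty_classifier, small_model, large_model, cutoff=0.5):
    """Predict difficulty from the input alone and call exactly one model."""
    if difficulty_classifier(query) < cutoff:
        return small_model(query)[0], "small"
    return large_model(query)[0], "large"

# Hypothetical stubs: each model returns (answer, confidence).
def small_model(query):
    return f"small:{query}", (0.95 if len(query.split()) <= 5 else 0.4)

def large_model(query):
    return f"large:{query}", 0.99

def difficulty_classifier(query):
    # Placeholder heuristic: longer queries are treated as harder.
    return min(len(query.split()) / 10, 1.0)

easy = "capital of France?"
hard = "explain the tradeoffs between cascade and direct routing in detail"
print(cascade_route(easy, small_model, large_model))
print(cascade_route(hard, small_model, large_model))
print(direct_route(hard, difficulty_classifier, small_model, large_model))
```

The cascade never misroutes silently as long as the confidence signal is trustworthy, while the direct router avoids the double call at the price of depending entirely on the classifier.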

A 2026 survey (arxiv 2603.04445) on dynamic model routing found that in production NER workloads, cascade routing reduced inference cost by around 31% at comparable accuracy. A separate calibrated uncertainty routing approach (UCCI, arxiv 2605.18796) also targeted roughly 31% cost reduction on the same task class. The recurring pattern across papers: you can expect material savings on mixed workloads, but the savings are sensitive to how well your router predicts query difficulty, and a miscalibrated router that over-routes to the large model captures little benefit.
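The sensitivity to escalation rate falls out of simple arithmetic. The sketch below assumes a cascade in which every query pays for the small model and escalated queries additionally pay for the large one; the 1:10 cost ratio and the escalation rates are made-up inputs, not numbers from the papers above.

```python
def cascade_expected_cost(cost_small, cost_large, escalation_rate):
    """Average cost per query: the small model always runs,
    the large model runs only for the escalated fraction."""
    return cost_small + escalation_rate * cost_large

def relative_savings(cost_small, cost_large, escalation_rate):
    """Savings versus sending every query to the large model."""
    cascade = cascade_expected_cost(cost_small, cost_large, escalation_rate)
    return 1 - cascade / cost_large

# Hypothetical costs: the small model is 10x cheaper per query.
for rate in (0.3, 0.6, 0.95):
    print(f"escalation {rate:.0%}: savings {relative_savings(1, 10, rate):+.0%}")
```

With these inputs the cascade stops paying for itself once the escalation rate exceeds 90%, since at that point the small-model call is pure overhead.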

One underappreciated angle: the routing decision interacts with batching. If your router sends small-model traffic to one serving endpoint and large-model traffic to another, each endpoint sees a more homogeneous workload, which makes batch formation more efficient. Mixing model sizes in a single serving pool complicates continuous batching because different model sizes have different memory footprints and latency profiles.

Putting It Together

These techniques compound. Continuous batching raises GPU utilization from whatever baseline you start at. PagedAttention reduces the memory pressure that limits how many requests fit in a batch. Speculative decoding cuts latency per token when batches are small. Model routing shifts a fraction of traffic to cheaper serving endpoints. Together they address the four main sources of inference waste: idle compute between requests, wasted memory from fragmented allocation, sequential compute during generation, and over-provisioned model capacity for simple queries.

None of them are free configuration changes. Continuous batching requires a serving framework that supports iteration-level scheduling. PagedAttention is built into vLLM and SGLang but requires you to manage page eviction policies under memory pressure. Speculative decoding requires a draft model that is fast enough to make the proposal step cheap — draft models are typically 7B parameters or smaller when serving a 70B+ target. Model routing requires labeled evaluation data to validate that the router is not quietly degrading output quality on escalated queries.

The energy story the industry tends to tell focuses on hardware — more efficient chips, better cooling, renewable power. Those matter. But the software-side optimizations described here are available today, on hardware you already run, and the efficiency gap between a naive serving setup and a well-tuned one is not marginal. It is the difference between treating GPU time as a fixed cost and treating it as something you can actually engineer.

Originally published at pickuma.com.