Indra Gusti Prasetya

Originally published at indragustiprasetya.com

Why B200 Token Cost Fell 5x in 2026: It's the Stack

Here is the number that should bother you: an engineer drove a 96-GPU B200 cluster to 1,103,941 tokens per second serving Qwen 3.5 27B, and the win that mattered most was not the silicon. Turn off one software feature, multi-token prediction, and throughput dropped by a third. Same chips, same model, same cluster. A third of the performance lived in a config flag.

We said earlier this year that tokens per watt, not FLOPS, would decide the 2026 GPU and cooling bill. We repeated the headline everyone repeats: served token cost on a B200-class node fell roughly 5x year over year. Then we left it there. We named the axis and never handed anyone the mechanism. So here it is, and the honest version is uncomfortable for anyone who just signed a Blackwell purchase order: the chip is maybe a third of the story. The rest is the serving stack.

The proof is public, and it's specific

The cleanest evidence landed on the Google Cloud community blog. An engineer ran a 12-node, 96-GPU B200 cluster serving Qwen 3.5 27B in FP8 on vLLM v0.18.0 and hit over 1.1 million tokens per second. The throughput is the flashy line. Ignore it. The line that decides budgets is the cost: 0.30 dollars per million tokens self-hosted on one-year committed-use pricing, against 0.67 dollars per million for a comparable hosted Flash-Lite API.

That is the whole argument in two numbers. Self-hosting on tuned open engines came in at less than half the hosted price. And the author is blunt about why. Multi-token prediction (MTP-1) was the single largest throughput lever, hitting a 90 percent acceptance rate and producing about 1.9 tokens per decode step. Switch it off and a third of the throughput vanishes. On a fixed hardware bill, that raises your cost per token by about half.

So when someone tells you Blackwell cut inference cost 5x, the correct response is: Blackwell running what?
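The arithmetic linking throughput to cost is short enough to write down. This is a minimal sketch, not the original author's costing: the function names are ours, and it assumes the 0.30 figure was costed at the quoted peak throughput on a fixed hourly bill, which the writeup may have handled differently.

```python
def cost_per_million_tokens(cluster_usd_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per million served tokens at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return cluster_usd_per_hour / tokens_per_hour * 1_000_000


def implied_hourly_bill(usd_per_million: float, tokens_per_second: float) -> float:
    """Invert the formula: the hourly bill that a dollars-per-million figure implies."""
    return usd_per_million * tokens_per_second * 3600 / 1_000_000


# Figures quoted in the article.
TUNED_TOKENS_PER_SECOND = 1_103_941
TUNED_USD_PER_MILLION = 0.30
HOSTED_USD_PER_MILLION = 0.67

# Assumption: 0.30 was costed at peak throughput and the hourly bill does not change.
hourly_bill = implied_hourly_bill(TUNED_USD_PER_MILLION, TUNED_TOKENS_PER_SECOND)

# Same bill, MTP switched off: throughput falls by a third.
mtp_off_usd_per_million = cost_per_million_tokens(
    hourly_bill, TUNED_TOKENS_PER_SECOND * (1 - 1 / 3)
)

print(f"tuned:   {TUNED_USD_PER_MILLION:.2f} USD per million tokens")
print(f"MTP off: {mtp_off_usd_per_million:.2f} USD per million tokens")
print(f"hosted:  {HOSTED_USD_PER_MILLION:.2f} USD per million tokens")
```

Losing a third of throughput multiplies cost per token by 1.5, whatever the hourly bill is. Swap in your own bill and measured throughput to get your own number.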

Multi-token prediction and its bigger cousin

MTP and speculative decoding are the same trick wearing different clothes. Normally a model produces one token per forward pass, and each pass is expensive. The idea behind both techniques is to guess several tokens cheaply, then verify them in a single pass of the big model. Accepted guesses are free throughput. Rejected ones cost you the verification you would have paid anyway.
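The guess-then-verify loop can be sketched with toy models. Everything here is illustrative: `target_model` and `draft_model` are made-up deterministic functions standing in for real networks, the decoding is greedy, and a real engine scores all drafted positions in one batched forward pass rather than calling the target in a Python loop.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]  # greedy next-token function


def target_model(ctx: List[Token]) -> Token:
    # Toy stand-in for the large model: a fixed deterministic rule.
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_model(ctx: List[Token]) -> Token:
    # Toy stand-in for the cheap draft: agrees with the target except
    # when the context length is a multiple of 5.
    guess = target_model(ctx)
    return (guess + 1) % 50 if len(ctx) % 5 == 0 else guess


def greedy_decode(model: Model, prompt: List[Token], n_new: int) -> List[Token]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(
    target: Model, draft: Model, prompt: List[Token], n_new: int, k: int
) -> Tuple[List[Token], int]:
    """Return (generated tokens, number of target verification passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. Draft k tokens cheaply, each conditioned on the previous guesses.
        ctx = list(out)
        proposal: List[Token] = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. Verify. Counted as one target pass, as in a batched real engine.
        target_passes += 1
        for tok in proposal:
            expected = target(out)
            if tok != expected:
                out.append(expected)  # rejected: keep the target's own token
                break
            out.append(tok)  # accepted guess
        else:
            out.append(target(out))  # all k accepted: one bonus token
    return out[len(prompt):][:n_new], target_passes


prompt = [1, 2, 3]
plain = greedy_decode(target_model, prompt, 20)
spec, passes = speculative_decode(target_model, draft_model, prompt, 20, k=3)
print(spec == plain, passes)
```

Two properties matter. The output is identical to plain greedy decoding with the target, so quality is untouched. And every verification pass emits at least one token, which is why a rejected guess costs no more than the pass you would have run anyway.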

AWS published P-EAGLE on March 13, 2026: parallel speculative decoding in vLLM v0.16.0 and later. On a single B200 serving GPT-OSS 20B it delivered up to 1.69x over vanilla EAGLE-3 at low concurrency, with acceptance length climbing from 3.03 to 3.94 tokens per round on HumanEval at speculation depth K=7.
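For a rough feel of how acceptance rate and speculation depth interact, here is a back-of-envelope sketch. It assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection. That is a simplification we are introducing, not how P-EAGLE's acceptance lengths were measured, and the function name is ours.

```python
def expected_tokens_per_step(acceptance_rate: float, depth: int) -> float:
    """Expected tokens emitted per target pass.

    Assumes independent, identical per-token acceptance. The first token is
    always emitted; draft token i only counts if all earlier drafts were accepted.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return sum(acceptance_rate ** i for i in range(depth + 1))


# MTP-1 in the Qwen run: one drafted token at 90 percent acceptance.
print(round(expected_tokens_per_step(0.90, depth=1), 2))

# How the ceiling moves with depth at a few acceptance rates.
for rate in (0.6, 0.7, 0.8, 0.9):
    print(rate, round(expected_tokens_per_step(rate, depth=7), 2))
```

With one drafted token at 90 percent acceptance the model gives 1 + 0.9 = 1.9 tokens per step, matching the figure reported for the Qwen run. Tokens per step is an upper bound on speedup, not the speedup itself, because drafting and verification are not free.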

Now the catch operators keep walking into. That 1.69x is a low-concurrency number. At concurrency 64 the speedup compresses to 1.05 to 1.25x. The reason is simple once you see it: speculative decoding spends idle compute to verify guesses, and at high batch sizes the GPU has no idle compute left. It is already saturated serving real requests. So the technique that looks spectacular in a single-stream demo can do almost nothing at your actual production batch size. Measure it where you run, not at c=1.

Larger models flip this in your favor. Ege Erdil's "Inference Economics of Language Models" (arXiv:2506.04645, June 2025) models speculative decoding at an 80 percent acceptance rate yielding a 66 percent throughput gain on Llama 3 70B and a doubling on Llama 3.1 405B at fixed cost per token. The bigger the target model, the more the cheap verification pass amortizes against it. If you serve a frontier-scale model, this is not a nice-to-have.

Prefix caching: the free win nobody benchmarks

SGLang's RadixAttention reuses the KV cache for shared prompt prefixes. Think about what your traffic actually looks like. A chat product sends the same system prompt on every turn. A RAG pipeline reuses the same retrieved context across a conversation. Most of those tokens are identical request to request, and a naive engine recomputes them every single time.

On prefix-heavy RAG pipelines the throughput delta over a cold engine runs several-fold. It costs nothing beyond turning it on. The reason teams miss it is structural: synthetic benchmarks fire unique prompts, so prefix caching shows zero benefit on the test and a large benefit in production. If you tune your stack against a benchmark with random prompts, you will leave this on the floor and never know it was there.
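A toy model shows why shared prefixes matter and why random-prompt benchmarks hide the effect. This is not SGLang's RadixAttention implementation: it is a plain token-level trie that only counts which prompt tokens would need fresh prefill, where a real engine stores KV tensors and handles eviction. The class names are ours.

```python
from typing import Dict, List


class PrefixCacheNode:
    def __init__(self) -> None:
        self.children: Dict[int, "PrefixCacheNode"] = {}


class ToyPrefixCache:
    """Token-level trie that counts cache misses instead of storing KV tensors."""

    def __init__(self) -> None:
        self.root = PrefixCacheNode()

    def process(self, tokens: List[int]) -> int:
        """Insert a prompt and return how many of its tokens had to be computed."""
        node = self.root
        computed = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # Miss: from here on, every remaining token is new work.
                child = PrefixCacheNode()
                node.children[tok] = child
                computed += 1
            node = child
        return computed


system_prompt = list(range(100))            # 100 shared tokens (made-up ids)
request_a = system_prompt + [1000, 1001, 1002, 1003, 1004]
request_b = system_prompt + [2000, 2001, 2002, 2003, 2004]

cache = ToyPrefixCache()
print(cache.process(request_a))  # cold cache: all 105 tokens are computed
print(cache.process(request_b))  # warm cache: only the 5 new tokens are computed
```

Replace the two requests with fully random prompts and the second call computes everything again, which is the synthetic-benchmark blind spot.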

This is the cheapest hour of work in the entire stack. Do it first.

Prefill and decode were fighting on the same card

Here is the second-order effect most people never diagnose. Inference has two phases with opposite appetites. Prefill, processing the prompt, is compute-bound. Decode, generating tokens one step at a time, is memory-bandwidth-bound. Put them on the same GPU and they interfere: a big prefill stalls the decode stream, time to first token spikes, and your tail latency falls apart under load even though aggregate throughput looks fine.

Splitting them is worth roughly 2x. LMSYS's January 12, 2026 EPD writeup shows disaggregation roughly doubling throughput at higher request rates and cutting time to first token 6 to 8x under load. SGLang has published 2.7x higher decode throughput on GB200 NVL72 using the same split, with Mooncake or NIXL as the transfer backend.

The 6-to-8x TTFT improvement is the tell. That is not a throughput optimization, it is a latency rescue. If your p99 first-token latency degrades the moment traffic climbs while your decode numbers stay healthy, your prefill is starving your decode, and you can buy 2x before adding a single GPU.
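That signature is easy to check from metrics you probably already collect. The sketch below is one way to do it. The function names are ours, and the thresholds (a 3x blow-up in p99 TTFT, a 15 percent tolerance on decode throughput) are placeholders we picked for illustration, not values from the LMSYS or SGLang writeups.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def prefill_is_starving_decode(
    ttft_low_load_s: Sequence[float],
    ttft_high_load_s: Sequence[float],
    decode_tps_low_load: float,
    decode_tps_high_load: float,
    ttft_blowup: float = 3.0,        # hypothetical threshold
    decode_tolerance: float = 0.15,  # hypothetical threshold
) -> bool:
    """True when p99 TTFT degrades under load while decode throughput holds."""
    p99_low = percentile(ttft_low_load_s, 99)
    p99_high = percentile(ttft_high_load_s, 99)
    ttft_degraded = p99_high >= ttft_blowup * p99_low
    decode_healthy = decode_tps_high_load >= (1 - decode_tolerance) * decode_tps_low_load
    return ttft_degraded and decode_healthy


# Made-up samples: TTFT tail explodes under load, decode barely moves.
low_load = [0.2] * 100
high_load = [0.2] * 90 + [2.0] * 10
print(prefill_is_starving_decode(low_load, high_load, 1000.0, 950.0))
```

If decode throughput falls along with TTFT, the cluster is simply out of capacity and this check returns False. Disaggregation is the fix for the first case, not the second.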

Which engine, and does it even matter

It matters less than the feature set, which is the point most engine-comparison posts bury. On H100 at moderate concurrency SGLang leads vLLM by about 29 percent on standard workloads, roughly 16,200 versus 12,500 tokens per second, with TensorRT-LLM marginally ahead at high concurrency.

Twenty-nine percent is real money. But hold it next to the other numbers in this piece. MTP alone was worth a third. Disaggregation roughly 2x. The gap between two engines is smaller than the gap between one engine with the right features on and the same engine with them off. Picking SGLang over vLLM and then serving with defaults is optimizing the wrong variable.
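Putting the quoted levers on one scale makes the ordering plain. This sketch only restates the article's figures as throughput multipliers. They come from different hardware and workloads, so treat it as a ranking, not something to multiply together.

```python
# Throughput multipliers implied by the figures quoted in the article.
levers = {
    "engine choice (SGLang vs vLLM on H100)": 16_200 / 12_500,
    "MTP on vs off": 1 / (1 - 1 / 3),       # losing a third means on/off = 1.5x
    "prefill-decode disaggregation": 2.0,   # "roughly 2x"
}

for name, multiplier in sorted(levers.items(), key=lambda item: item[1]):
    print(f"{name}: {multiplier:.2f}x")
```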

The build-versus-buy argument just changed

For two years the case for paying an API premium was that providers held secret efficiency you couldn't reproduce. That case is gone. The efficiency is in vLLM and SGLang, both of which you can run yourself, and the economics paper makes the competitive logic plain: a provider that does not run speculative decoding cannot match the latency or the price of one that does. The moat is operational, not architectural. It is reproducible on your own cluster.

This is the line item that scales with everything you do. Inference accounts for 80 to 90 percent of AI compute spend, so a 3x swing in cost per token is not an optimization you slot into next quarter's roadmap. It is the budget.

One honest caveat, because the numbers demand it. Self-hosting crossed under hosted pricing (0.30 against 0.67) only after the stack was tuned. Buy B200 capacity, serve with defaults, and you can land north of the hosted price on hardware you own. The win is conditional. The condition is the work below.

Where to start this week

Work it in this order. Each step is tied to a specific number above, and the order is deliberate: cheapest and safest first.

  1. Measure your own cost per million tokens before touching hardware. Not throughput, cost. You are comparing against 0.67 dollars hosted and a tuned 0.30 self-hosted. If you don't have your own number, you can't tell whether your next move is a config change or a GPU order. Most teams discover the headroom is in the stack.

  2. Turn on prefix caching and continuous batching today. If you serve chat or RAG with shared system prompts, RadixAttention in SGLang is several-fold throughput for zero cost. This is the highest return per hour you will find. It won't show up on a synthetic benchmark, so validate it on replayed production traffic.

  3. Enable MTP or speculative decoding next, and watch acceptance rate, not headline speedup. Target above 70 percent. Below that, your draft model is wrong for your domain; swap or retrain it before you conclude the technique failed. Validate at your real batch size: remember that P-EAGLE's 1.69x at low load compressed to between 1.05x and 1.25x at concurrency 64.

  4. Reach for prefill-decode disaggregation when TTFT degrades under load while decode throughput stays fine. That exact signature means prefill is starving decode. Split them with SGLang plus Mooncake or NIXL and expect roughly 2x, and 6-to-8x better TTFT, before adding a GPU.

  5. Pin your versions to the feature you need. vLLM v0.16.0+ for P-EAGLE parallel speculative decoding, vLLM v0.18.0 with MTP for raw FP8 throughput, SGLang for RadixAttention and the most mature disaggregation. Pick for the capability, not the logo.

  6. Re-run build-versus-buy only with the tuned number in hand. At 0.30 against 0.67 the decision flips, but only after the stack is on. Don't concede the API premium until you've measured your own cost with these features enabled.
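Steps 3 and 6 reduce to two small checks you can wire into a dashboard. This is a sketch under assumptions: the counter names are hypothetical (each engine exposes its own speculative-decoding metrics under its own names), and the 70 percent target and 0.67 hosted price are the article's figures used as defaults.

```python
def acceptance_rate(accepted_draft_tokens: int, proposed_draft_tokens: int) -> float:
    """Share of drafted tokens that the target model accepted."""
    if proposed_draft_tokens <= 0:
        raise ValueError("no draft tokens were proposed")
    return accepted_draft_tokens / proposed_draft_tokens


def speculation_verdict(
    accepted_draft_tokens: int, proposed_draft_tokens: int, target_rate: float = 0.70
) -> str:
    """Step 3: judge the draft model before judging the technique."""
    rate = acceptance_rate(accepted_draft_tokens, proposed_draft_tokens)
    if rate >= target_rate:
        return "keep: acceptance is on target, now validate at production batch size"
    return "swap or retrain the draft model: acceptance is below target"


def self_hosting_wins(measured_usd_per_million: float, hosted_usd_per_million: float = 0.67) -> bool:
    """Step 6: compare only a cost measured with the stack features enabled."""
    return measured_usd_per_million < hosted_usd_per_million


print(speculation_verdict(accepted_draft_tokens=9_000, proposed_draft_tokens=10_000))
print(speculation_verdict(accepted_draft_tokens=5_500, proposed_draft_tokens=10_000))
print(self_hosting_wins(0.30), self_hosting_wins(0.80))
```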

The B200 is necessary. It is nowhere near sufficient. Audit the stack before you sign for the next GPU.
