Tokens per Watt Decides Your 2026 GPU and Cooling

#ai #infrastructure #llm #performance

A single B200 went from costing about 11 cents per million tokens at launch to 2 cents two months later, with no hardware change. Same silicon, same rack, same power draw. The only thing that moved was the serving stack. If your internal chargeback model was set before that happened, you are billing your own product at more than five times what it actually costs to run, and nobody told finance.

I have spent the last year watching platform teams budget AI on the wrong axis. They negotiate the GPU contract on dollars per hour, put peak FLOPS on a slide in the architecture review, and then act surprised when the unit economics of a shipped feature come back underwater. The number that decides whether your inference is solvent is not FLOPS and not the rack lease. It is tokens per watt, and the cost per million tokens that falls out of it. In 2026 those two quietly became the whole decision.

The workload flipped, and so did the constraint

Training owned the narrative for three years. As the main cost center, it is finished. Inference now accounts for roughly 80 to 90 percent of AI compute spend, and inference behaves like a power problem long before it behaves like a silicon problem.

Here is the part that bites teams: almost every serious data center is power-capped, not space-capped or even budget-capped in the short term. You have a megawatt allocation from the utility, it is fixed, and the grid queue to get more is measured in years. Once you accept that, the GPU question inverts. You are not buying the fastest chip. You are buying the chip that converts a fixed megawatt into the most sellable tokens. Throughput per megawatt is the product. Everything else is vanity.
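
To make the inversion concrete, here is a minimal Python sketch of scoring chips against a fixed allocation. The function name, the two chips, and every wattage and throughput figure are hypothetical placeholders, not measurements. `watts_per_gpu` is assumed to be the all-in draw per GPU, including its share of host and networking power.

```python
def tokens_per_second_under_cap(power_cap_w: float,
                                watts_per_gpu: float,
                                tokens_per_s_per_gpu: float) -> float:
    """Fleet throughput when the power cap, not the budget, limits GPU count."""
    gpus_that_fit = int(power_cap_w // watts_per_gpu)  # whole GPUs only
    return gpus_that_fit * tokens_per_s_per_gpu


CAP_W = 1_000_000  # a 1 MW allocation

# Hypothetical chips: B draws more power per GPU but serves more tokens per GPU.
chip_a = tokens_per_second_under_cap(CAP_W, watts_per_gpu=1_000,
                                     tokens_per_s_per_gpu=2_000)
chip_b = tokens_per_second_under_cap(CAP_W, watts_per_gpu=1_600,
                                     tokens_per_s_per_gpu=6_000)

print(f"chip A: {chip_a:,.0f} tokens/s per MW")
print(f"chip B: {chip_b:,.0f} tokens/s per MW")
```

In this made-up case chip B fits fewer GPUs under the cap and still serves more tokens, which is the whole point of ranking on throughput per megawatt rather than per-GPU price.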

The benchmarks finally admit this. NVIDIA's October 2025 InferenceMAX results, run on the open-source SemiAnalysis harness, lead with cost per token and throughput per megawatt instead of TFLOPS. When the vendor stops bragging about FLOPS and starts reporting the metric your CFO already tracks, the framing war is settled.

Blackwell versus Hopper is an efficiency gap, not a speed gap

The marketing wants you to read Blackwell as "faster." Read it as "more tokens per watt," because that is the number that matters when you are power-bound.

NVIDIA reports Blackwell delivers about 10x higher throughput per megawatt than the prior generation on mixture-of-experts models. Stack the full GB300 NVL72 with the Dynamo and TensorRT-LLM serving software and the cited figures climb to as much as 50x throughput per megawatt and 35x lower cost per token versus Hopper. Treat the headline multiples with suspicion: they are best-case and vendor-framed. But even a heavily discounted version of that gap changes which chip wins under a power cap.

The trap is the three-year Hopper commitment somebody is about to sign in mid-2026. Vera Rubin NVL72 was announced at CES for the second half of 2026, with claims of up to 5x inference performance and 10x lower cost per token over Blackwell. If you lock Hopper for three years right before that lands, you will be serving the most expensive tokens in your own fleet by next year while paying off the depreciation on the slowest ones.

Precision is the cheapest lever you own

Before anyone buys a single GPU, the largest free win is numerical precision, and most teams leave it on the table.

Introl's unit-economics breakdown puts Llama 3.1 70B on an 8x H100 node at about $1.90 per million tokens in FP16. Move that same model to FP8 and it drops to roughly $0.95 to $1.10. You roughly halved your cost per token without touching hardware, without renegotiating power, without a procurement cycle. On Blackwell, native FP4 roughly doubles throughput again over FP8 where model quality holds up.

The caveat is real and I will not paper over it: FP4 is not free quality. On some workloads the accuracy hit shows up in eval and you back it out. But "test FP8 and FP4 on your actual traffic" costs an afternoon. "Buy more GPUs" costs a quarter and a power allocation you may not have.
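
One way to structure that afternoon, as a sketch: measure cost and quality at each precision, then take the cheapest one that stays within a quality tolerance you set in advance. The `PrecisionResult` type, the `pick_precision` helper, and all eval scores are hypothetical. The FP16 and FP8 costs echo the figures above, and the FP4 row is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrecisionResult:
    name: str
    cost_per_m_tokens: float  # dollars per million tokens, measured on your traffic
    eval_score: float         # your own quality metric, higher is better


def pick_precision(results, baseline_name, max_quality_drop):
    """Return the cheapest result whose score stays within the allowed drop."""
    baseline = next(r for r in results if r.name == baseline_name)
    floor = baseline.eval_score - max_quality_drop
    acceptable = [r for r in results if r.eval_score >= floor]
    return min(acceptable, key=lambda r: r.cost_per_m_tokens)


results = [
    PrecisionResult("fp16", 1.90, 0.820),  # cost from the example above, score made up
    PrecisionResult("fp8", 1.00, 0.815),   # cost within the quoted range, score made up
    PrecisionResult("fp4", 0.55, 0.780),   # entirely hypothetical
]

choice = pick_precision(results, "fp16", max_quality_drop=0.01)
print(f"serve at {choice.name}: ${choice.cost_per_m_tokens:.2f} per million tokens")
```

With these invented scores FP4 fails the tolerance and FP8 wins. Fixing the tolerance before you look at the costs keeps the decision honest.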

Software gains compound faster than your hardware refresh cycle

This is the finding that should reorganize how you sequence spend. The B200 going from 11 cents to 2 cents per million tokens came from kernel and serving-stack work, not new chips. Speculative decoding alone moved one B200 result from 6,000 to 30,000 tokens per second per GPU, a 5x jump on hardware you already paid for.

Hardware refreshes arrive every 18 to 24 months. Serving-stack improvements ship every few weeks, and they compound on whatever silicon you already own. A team that chases the next GPU generation while running a stale inference engine is buying the expensive lever and ignoring the cheap one. Exhaust the software gains first, every time.
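
A rough way to see why the software lever comes first: at a fixed hourly node cost, cost per token scales inversely with throughput, so every speedup divides straight into the bill. The sketch below assumes independent speedups multiply cleanly, which real stacks only approximate. The function names are illustrative and the 1.3x kernel gain is a made-up placeholder.

```python
from math import prod


def implied_speedup(cost_before: float, cost_after: float) -> float:
    """Throughput gain implied by a cost drop at constant hourly node cost."""
    return cost_before / cost_after


def cost_after_speedups(base_cost_per_m: float, speedups: list[float]) -> float:
    """Cost per million tokens after stacking throughput multipliers.

    Assumes the multipliers are independent, which is optimistic.
    """
    return base_cost_per_m / prod(speedups)


# The B200 figures quoted above: about $0.11 -> $0.02 per million tokens.
print(f"implied speedup: {implied_speedup(0.11, 0.02):.1f}x")

# Speculative decoding example quoted above: 6,000 -> 30,000 tokens/s per GPU.
spec_decode = 30_000 / 6_000

# The 1.3x kernel gain is a hypothetical placeholder, not a measured figure.
stacked = cost_after_speedups(0.11, [spec_decode, 1.3])
print(f"stacked cost: ${stacked:.4f} per million tokens")
```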

Cooling is now part of the GPU decision, not a facilities footnote

This is where my infrastructure bias is load-bearing, because cooling is the part platform teams reliably misfile as "someone else's line item."

Every watt you spend on chillers and fans is a watt you are not spending producing tokens. Fold that directly into the tokens-per-watt number. A liquid-cooled cluster running at PUE 1.10 carries about a 17 percent tokens-per-watt advantage over an air-cooled cluster at PUE 1.55 with identical GPUs. Same chips, same workload, 17 percent more sellable output, purely from where the heat goes. Under a power cap, that 17 percent is product you can ship or product you cannot.
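
Folding PUE in is a one-line division, sketched below with hypothetical function names. PUE is facility power divided by IT power, so tokens per facility watt is tokens per IT watt divided by PUE. One caution: on PUE arithmetic alone, 1.10 against 1.55 works out to a gap of roughly 41 percent, larger than the 17 percent quoted above, so that figure presumably rests on assumptions beyond PUE. Run the calculation with your own measured numbers.

```python
def it_power_fraction(pue: float) -> float:
    """Share of facility power that reaches IT equipment (PUE = facility / IT)."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return 1.0 / pue


def tokens_per_facility_watt(tokens_per_it_watt: float, pue: float) -> float:
    """Discount GPU-level efficiency by the cooling and distribution overhead."""
    return tokens_per_it_watt * it_power_fraction(pue)


# Identical GPUs, so tokens per IT watt is normalized to 1.0 for both sites.
liquid = tokens_per_facility_watt(1.0, pue=1.10)
air = tokens_per_facility_watt(1.0, pue=1.55)

advantage = liquid / air - 1.0
print(f"liquid site: {it_power_fraction(1.10):.1%} of facility power reaches IT")
print(f"air site:    {it_power_fraction(1.55):.1%} of facility power reaches IT")
print(f"tokens-per-facility-watt advantage on PUE alone: {advantage:.1%}")
```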

It is also ceasing to be a choice. Average rack density is about 27 kW in 2026 and heading toward 45 to 100 kW by 2027. At Blackwell and Rubin densities, air cooling is not a worse option; it is not an option. If your facility is air-only and you are planning a 2027 refresh, the cooling retrofit is on the critical path for the GPU decision, not after it.

The ROI numbers are real, and they assume a machine that is never idle

NVIDIA cites a $5 million GB200 NVL72 generating $75 million in token revenue, a 15x return. That number is true and it is also a ceiling under perfect utilization, not a forecast.

The math that justifies high-density capital and the cooling around it only holds if the box is busy. An idle high-density rack is the worst position in the building: you are burning the cooling overhead and the depreciation with none of the token revenue that paid for either. I have seen utilization sit in the 30 percent range for months on "strategic" GPU buys, and at that point the 15x return is a fraction of 15x and the cooling bill did not get the memo. Utilization is not a dashboard nicety here. It is the variable that decides whether the whole purchase was sound.
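
A back-of-envelope version of that haircut, as a sketch: scale the full-utilization revenue by the utilization you actually sustain. It assumes revenue is linear in utilization and ignores power, cooling, and staffing costs, so the real multiple is lower still. The function name is hypothetical.

```python
def revenue_multiple(capex: float,
                     revenue_at_full_utilization: float,
                     utilization: float) -> float:
    """Token revenue per capex dollar, assuming revenue scales linearly
    with utilization and ignoring operating costs."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return revenue_at_full_utilization * utilization / capex


CAPEX = 5_000_000          # the cited GB200 NVL72 price
FULL_REVENUE = 75_000_000  # the cited token revenue at perfect utilization

for utilization in (1.00, 0.70, 0.35):
    multiple = revenue_multiple(CAPEX, FULL_REVENUE, utilization)
    print(f"utilization {utilization:.0%}: {multiple:.2f}x")
```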

What to re-run this week

Each step ties back to a number above. None of it needs a procurement cycle to start.

Re-derive your real cost per million tokens before Friday. Take the node's all-in hourly cost, divide it by the tokens the node actually serves in an hour (measured tokens per second times 3,600, at your actual batch size and precision, not the spec-sheet figure), and multiply by one million. If you priced chargeback before FP8 was standard, expect to be off by 2x or more (the $1.90 to $0.95 swing).
Move to FP8 now, then test FP4. FP8 roughly halves cost per token on the Llama 3.1 70B example. Pilot FP4 on Blackwell where eval quality holds, and back it out where it does not.
Turn on speculative decoding before you cost a hardware upgrade. It carried one B200 from 6,000 to 30,000 tokens per second per GPU. Capture the software 5x before you spend on the hardware 10x.
Score every GPU option on throughput per megawatt under your cap, not dollars per hour. If you are power-bound, that is the metric that maps to product. It is the 10x Blackwell-over-Hopper MoE gap made concrete.
Put PUE inside the tokens-per-watt number. The 17 percent liquid-versus-air advantage is free output, and above 45 kW per rack air cooling is off the menu, so the cooling plan gates the 2027 GPU choice.
Do not sign multi-year Hopper without modeling the Rubin curve. With Vera Rubin NVL72 claiming up to 10x lower cost per token than Blackwell in 2H 2026, build refresh flexibility into the contract or accept that you are buying the most expensive tokens in your fleet.
Attach a utilization floor to every ROI projection. The 15x GB200 return assumes a busy box. Track utilization as a first-class metric, because an idle high-density rack inverts that math fast.
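
The first item on that list is mostly a units exercise, so here is a minimal sketch of it. The helper names, the $36 hourly node cost, and the 10,000 tokens per second are hypothetical placeholders. Substitute your own all-in cost and measured throughput.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million_tokens(node_cost_per_hour: float,
                            tokens_per_second: float) -> float:
    """Dollars per million tokens from all-in hourly cost and measured throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    return node_cost_per_hour / tokens_per_hour * 1_000_000


def chargeback_drift(billed_per_m: float, actual_per_m: float) -> float:
    """How many times the real cost your internal price currently charges."""
    return billed_per_m / actual_per_m


# Hypothetical node: $36 per hour all-in, 10,000 tokens/s measured under real load.
actual = cost_per_million_tokens(node_cost_per_hour=36.0, tokens_per_second=10_000)
print(f"actual cost: ${actual:.2f} per million tokens")

# Hypothetical chargeback rate that was set before the serving stack improved.
drift = chargeback_drift(billed_per_m=1.90, actual_per_m=actual)
print(f"chargeback is {drift:.1f}x the real cost")
```

Measure the throughput input under production batch sizes and precision. A spec-sheet figure in that slot makes the whole calculation decorative.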