DEV Community

plasmon

99.8% of LLM Inference Power Isn't Spent on Computation

When people debate LLM inference bottlenecks, bandwidth and VRAM dominate the conversation. But of the five walls identified by LIMINAL (Davies et al., arXiv:2507.14397), the hardest one to break through is power.

Bandwidth scales by widening the bus (HBM4 did exactly that). Capacity scales by stacking more dies. But power is directly chained to physics. The era when process shrinks automatically reduced power consumption ended around 2006, when Dennard scaling collapsed.

# The collapse of Dennard Scaling
dennard_scaling = {
    "1970-2006": {
        "rule": "Smaller transistors -> lower voltage -> constant power per area",
        "result": "Performance/W improved for free with every node shrink",
        "benefit": "Moore's Law + Dennard's Law in sync -> exponential perf gains",
    },
    "2006-present": {
        "reality": "Voltage can't drop further (subthreshold leakage)",
        "result": "Shrinking transistors no longer reduces per-transistor power",
        "mitigation": "Dark silicon, heterogeneous design, power-constrained design",
    },
}
# Since 2006, chips physically cannot fire all transistors simultaneously.
# Unused sections are intentionally powered off (dark silicon) to manage thermals.

GPU Power Draw Over Time: A One-Way Escalator

Data Center GPUs

# NVIDIA GPU TDP (Thermal Design Power) progression
gpu_tdp = {
    "V100 (2017)":      {"tdp": 300, "unit": "W", "process": "12nm", "hbm": "HBM2"},
    "A100 (2020)":      {"tdp": 400, "unit": "W", "process": "7nm",  "hbm": "HBM2E"},
    "H100 (2022)":      {"tdp": 700, "unit": "W", "process": "4nm",  "hbm": "HBM3"},
    "H200 (2024)":      {"tdp": 700, "unit": "W", "process": "4nm",  "hbm": "HBM3E"},
    "B200 (2025)":      {"tdp": 1000, "unit": "W", "process": "4nm", "hbm": "HBM3E"},
    "Next gen (2026-27)": {"tdp": "1200-1500?", "unit": "W", "process": "3nm", "hbm": "HBM4"},
}

# V100 to B200: process shrank 12nm -> 4nm (3 generations) but TDP rose 300W -> 1000W (3.3x)
# Per-transistor efficiency improved, but transistor-count growth more than canceled it out
# B200's 1000W requires liquid cooling. Air cooling hits its ceiling around 350W.

Is Efficiency Keeping Up?

# Performance/W trend (inference throughput basis)
efficiency_trend = {
    "V100":  {"perf_per_watt": 1.0,  "relative": "baseline"},
    "A100":  {"perf_per_watt": 2.5,  "relative": "2.5x vs V100"},
    "H100":  {"perf_per_watt": 4.2,  "relative": "4.2x vs V100"},
    "B200":  {"perf_per_watt": 6.0,  "relative": "6.0x vs V100 (estimated)"},
}

# Perf/W improved 6x over 8 years
# Absolute performance improved 30-50x over the same period
# Translation: most of the performance gains came from "burning more watts"
# Efficiency gains explain only ~20% of the performance improvement

The implication is blunt: LLM inference performance growth depends on dumping more power in. Efficiency alone doesn't cut it.
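A rough decomposition makes the point concrete: per-GPU throughput is (perf/W) × watts, so the efficiency and power contributions can be separated. The figures below are the estimates from the tables above, so treat this as a back-of-envelope sketch:

```python
# Split per-GPU performance gain into efficiency gain x power gain
# (relative perf/W and TDP figures from the tables above; B200 estimated).
perf_per_watt = {"V100": 1.0, "B200": 6.0}
tdp_watts = {"V100": 300, "B200": 1000}

efficiency_gain = perf_per_watt["B200"] / perf_per_watt["V100"]  # 6.0x
power_gain = tdp_watts["B200"] / tdp_watts["V100"]               # ~3.3x
total_gain = efficiency_gain * power_gain                        # ~20x

print(f"{efficiency_gain:.1f}x efficiency * {power_gain:.1f}x power = {total_gain:.0f}x")
# The gap up to the 30-50x headline figure comes from lower-precision
# formats (FP8/FP4) and better batching, not from perf/W alone.
```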


Power Cost Per Token

Decode Power Breakdown

# Power decomposition for decoding a 70B model (FP16)
decode_power_breakdown = {
    "Weight reads (HBM)": {
        "data": "140 GB per token",
        "HBM_power": "HBM3E: ~20 pJ/bit -> 140e9 * 8 * 20e-12 = 22.4 mJ/token",
        "note": "HBM energy cost scales linearly with bandwidth",
    },
    "Matrix ops (GPU cores)": {
        "ops": "~140 GFLOP per token (weight matmul)",
        "GPU_power": "H100: ~0.3 pJ/FLOP -> 140e9 * 0.3e-12 = 0.042 mJ/token",
        "note": "Actual computation energy is 1/500th of the data reads",
    },
    "KV cache reads": {
        "at_32K": "~8 GB per token (all layers)",
        "power": "8e9 * 8 * 20e-12 = 1.28 mJ/token",
        "note": "Scales linearly with context length",
    },
}

# The stunning ratio:
# Data movement: 23.7 J/token (99.8%)
# Computation:    0.042 J/token (0.2%)
# Nearly ALL power in LLM inference goes to "moving data around"

This ratio breaks most people's intuition. GPUs are thought of as "compute" devices, but during LLM inference, 99.8% of the power goes to everything except computation -- reading data from memory and shuffling it across the chip.
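The breakdown can be reproduced in a few lines. The 20 pJ/bit HBM cost and 0.3 pJ/FLOP compute cost are the same assumed figures used in the table above, not measured values:

```python
# Energy per decoded token for a 70B FP16 model (assumed figures:
# ~20 pJ/bit for HBM3E-class memory, ~0.3 pJ/FLOP for H100-class compute).
HBM_J_PER_BIT = 20e-12
COMPUTE_J_PER_FLOP = 0.3e-12

weight_bytes = 140e9   # 70B params x 2 bytes (FP16)
kv_bytes = 8e9         # KV cache reads at 32K context, all layers
flops = 140e9          # ~2 FLOPs per weight for the matmuls

e_weights = weight_bytes * 8 * HBM_J_PER_BIT   # 22.4 J/token
e_kv = kv_bytes * 8 * HBM_J_PER_BIT            # 1.28 J/token
e_compute = flops * COMPUTE_J_PER_FLOP         # 0.042 J/token

movement = e_weights + e_kv
share = movement / (movement + e_compute)
print(f"data movement: {movement:.2f} J/token ({share:.1%})")
print(f"computation:   {e_compute:.3f} J/token")
```

Sanity check: ~23.7 J/token of data movement at ~30 t/s lands near the 700 W an H100 actually draws, so the orders of magnitude hang together.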

Datacenter-Scale Power Consumption

# Power estimate for a GPT-4 class service
datacenter_power = {
    "assumptions": {
        "model": "~1.8T parameters (MoE, estimated)",
        "queries_per_day": 100_000_000,  # 100M queries/day
        "avg_tokens_per_query": 500,
        "gpu": "H100 (700W)",
        "throughput_per_gpu": "~150 tokens/s (estimated, with batching)",
    },
    "calculation": {
        "total_tokens_per_day": "50B tokens",
        "tokens_per_second": "~578,703 t/s",
        "gpus_needed": "578703 / 150 = ~3,858 GPUs",
        "power": "3858 * 700W = 2.7 MW (GPUs only)",
        "with_cooling_network": "2.7 MW * 1.5 (PUE) = ~4.0 MW",
        "annual_power": "4.0 MW * 8760h = ~35 GWh/year",
    },
    "cost": {
        "electricity": "35 GWh * $0.05/kWh = ~$1.75M/year (electricity only)",
        "per_query": "$1.75M / 365 / 100M = ~$0.000048/query",
        "per_1k_tokens": "~$0.001 (electricity cost portion only)",
    },
}

GPU power alone: 35 GWh per year. That's roughly equivalent to the annual consumption of 3,500 US households. And this is inference only, for one service. Training costs 10x more.
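The same estimate as executable arithmetic. All inputs are the rough assumptions stated in the dict above; the per-1k-token figure is the electricity cost only:

```python
# GPT-4-class service power estimate, same assumptions as above.
queries_per_day = 100e6
tokens_per_query = 500
gpu_tdp_w = 700          # H100
tps_per_gpu = 150        # assumed, with batching
pue = 1.5
usd_per_kwh = 0.05

tokens_per_day = queries_per_day * tokens_per_query   # 50B tokens/day
gpus = tokens_per_day / 86400 / tps_per_gpu           # ~3,858 GPUs
power_mw = gpus * gpu_tdp_w * pue / 1e6               # ~4.05 MW
annual_gwh = power_mw * 8760 / 1000                   # ~35 GWh/year
annual_usd = annual_gwh * 1e6 * usd_per_kwh           # ~$1.8M/year
usd_per_1k_tokens = annual_usd / (tokens_per_day * 365) * 1000

print(f"{power_mw:.2f} MW -> {annual_gwh:.0f} GWh/yr -> ${usd_per_1k_tokens:.5f}/1k tok")
```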


Three Constraints the Power Wall Imposes on LLM Inference

Constraint 1: Scale-Out Hits a Ceiling

# Datacenter power constraints
datacenter_constraints = {
    "typical_rack_power": "20-30 kW per rack",
    "h100_per_node": "8 GPUs (DGX H100 node)",
    "node_power_h100": "8 * 700W + CPU/NVSwitch/NIC = ~10 kW (per node)",
    "b200_per_node": "8 GPUs (DGX B200 node)",
    "node_power_b200": "8 * 1000W + CPU/NVSwitch/NIC = ~14 kW (per node)",

    "problem": "Adding more GPUs doesn't help if the datacenter's power supply is the bottleneck",
    "reality": "Most existing datacenters are 20MW class. New builds take 3-5 years",
    "trend": "Microsoft and OpenAI are planning 1GW-class datacenters (one nuclear reactor's worth)",
}

The power wall puts a hard physics cap on the "just buy more GPUs" strategy.
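To make the ceiling concrete, divide a facility's power budget by per-node draw. The 20 MW figure and node wattages come from the dict above; applying PUE 1.5 to get the IT-load share is my assumption:

```python
# How many GPUs fit behind a fixed grid connection?
facility_w = 20e6        # 20 MW class facility (from above)
it_fraction = 1 / 1.5    # PUE 1.5 -> ~2/3 of power reaches the IT load
node_kw = {"DGX H100": 10, "DGX B200": 14}

gpus = {}
for node, kw in node_kw.items():
    nodes = facility_w * it_fraction / (kw * 1000)
    gpus[node] = int(nodes * 8)  # 8 GPUs per node
    print(f"{node}: ~{gpus[node]} GPUs max")
```

Swapping H100 nodes for B200 nodes in the same facility cuts the deployable GPU count by roughly 30%; each B200 has to earn that back in per-GPU throughput.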

Constraint 2: Thermal Limits on Consumer Devices

# RTX 4060 8GB: power and thermal reality
rtx4060_thermal = {
    "TDP": "115W (laptop variant)",
    "GPU_die_area": "~159 mm2",
    "power_density": "115 / 159 = 0.72 W/mm2",

    "During LLM inference": {
        "typical_power": "60-80W (not full load, but memory access is heavy)",
        "memory_controller": "Sustaining 272 GB/s bandwidth alone costs ~15W",
        "note": "Inference taxes the memory controller harder than the compute units",
    },

    "Practical limits": {
        "thermal_throttling": "Kicks in at high GPU temperatures, drops inference speed",
        "battery_operation": "Not viable (60Wh battery = under 1 hour)",
        "sustained_inference": "Fine with adequate cooling, constrained in thin laptops",
    },
}

Local LLM speed is "bandwidth-bound" -- everyone knows that. What's less discussed is that using bandwidth itself costs power. Maintaining 272 GB/s eats ~15W at the memory controller alone. As models grow and bandwidth demand climbs, power consumption follows proportionally.
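The ~15 W controller figure falls out of a simple bandwidth × energy-per-bit product. The ~7 pJ/bit access cost for GDDR6-class memory is my assumption, chosen to be roughly consistent with the number above:

```python
# Power to sustain memory bandwidth = bits per second x energy per bit.
GDDR6_J_PER_BIT = 7e-12   # assumed access + interface energy (GDDR6-class)

bandwidth = 272e9         # RTX 4060: 272 GB/s
watts = bandwidth * 8 * GDDR6_J_PER_BIT
print(f"~{watts:.0f} W just to keep the bus saturated")
```

Because this is a linear product, doubling the bandwidth a model demands doubles this power term before any computation happens.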

Constraint 3: Tokens Per Watt

# Power efficiency across hardware
tokens_per_watt = {
    "RTX 4060 (Qwen2.5-32B Q4_K_M)": {
        "speed": "10.8 t/s",
        "power": "~70W",
        "efficiency": "10.8 / 70 = 0.154 t/s/W",
    },
    "RTX 4060 (Qwen3.5-4B Q4_K_M)": {
        "speed": "~50 t/s (estimated)",
        "power": "~40W (smaller models draw less power)",
        "efficiency": "50 / 40 = 1.25 t/s/W",
    },
    "M4 Mac mini (Qwen2.5-32B Q4_K_M)": {
        "speed": "~8 t/s",
        "power": "~30W (Apple Silicon efficiency)",
        "efficiency": "8 / 30 = 0.27 t/s/W",
    },
    "H100 (Llama-3-70B FP16, batch=32)": {
        "speed": "~768 t/s (32 parallel × 24 t/s, weight reads shared across batch)",
        "power": "700W",
        "efficiency": "768 / 700 = 1.10 t/s/W",
    },
}

# H100 batched at 1.10 t/s/W — comparable to RTX 4060 small model (1.25)
# On a single request, H100 does ~24 t/s (bandwidth-bound: 3.35TB/s / 140GB) -> 0.034 t/s/W
# A small model on RTX 4060 is 37x more power-efficient than single-request H100

Here's where it gets interesting. For workloads that can't batch -- personal use, real-time conversation -- a small local model beats datacenter GPUs on power efficiency. Qwen3.5 4B on an RTX 4060 at 1.25 t/s/W is 37x more efficient than an H100 serving a single request at 0.034 t/s/W. Even against batched H100 (1.10 t/s/W), the 40W laptop GPU with a 4B model wins. A 700W datacenter GPU losing to a laptop on power efficiency.
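The table condenses into one helper plus the batch size at which the H100 breaks even with the small local model. Speeds and wattages are the table's estimates; linear throughput scaling with batch is an assumption that holds only while decode stays bandwidth-bound:

```python
def tokens_per_watt(tps, watts):
    """Inference power efficiency in tokens per second per watt."""
    return tps / watts

local_small = tokens_per_watt(50, 40)    # 4B model on RTX 4060: 1.25
h100_single = tokens_per_watt(24, 700)   # H100, one request: ~0.034

print(f"local advantage: {local_small / h100_single:.0f}x")

# If H100 throughput scales ~linearly with batch (weight reads amortized),
# it matches the local model's t/s/W at batch ≈ 1.25 * 700 / 24.
breakeven_batch = local_small * 700 / 24
print(f"break-even batch: {breakeven_batch:.0f}")
```

The exact multiple (36x vs 37x) depends only on how far you round the 0.034 figure; the conclusion doesn't move.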


Three Approaches to Attacking the Power Wall

I previously wrote about a three-layer approach to the bandwidth wall (separate article). The power wall has the same structure.

Layer 1: Chip-Level Power Efficiency
  Process shrinks (5nm -> 3nm -> 2nm): 10-20% improvement per generation
  Power delivery improvements (BSPDN backside power): 30% IR drop reduction
    -> smaller voltage margins -> power savings
  -> Reliable but slow. Post-Dennard improvements are incremental.

Layer 2: Architecture-Level Power Efficiency
  Sparse Attention: skip unnecessary ops -> direct power savings
  Quantization (INT8/INT4): fewer bits -> 1/4 to 1/16 compute power
  MoE (Mixture of Experts): top-2-of-8 activation -> memory bandwidth power at 1/4
  -> Software-level, immediately deployable

Layer 3: Changing the Compute Paradigm
  PIM (Processing-In-Memory): eliminate data movement -> attack the 99.8%
  Photonic computing: matrix ops via light interference -> near-zero power
  Analog compute (BrainScaleS-2 etc.): eliminate digital conversion
  -> Research stage, but the only fundamental fix

Effective power efficiency = L1 x L2 x L3
Layer 2 alone with MoE + INT4 quantization:
  Memory bandwidth power reduced to 1/4 (MoE top-2-of-8) x 1/4 (INT4) = 1/16
  A 70B model's effective power drops to 4.4B-model territory

Layer 2 delivers the fastest results. And it's already accessible to local LLM users. If you're running Q4_K_M quantized models, you're already benefiting from Layer 2. Choosing MoE models (Mixtral, DeepSeek-V3) is also a correct move from a power efficiency standpoint.
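The 1/16 claim is just two multiplications on bytes moved per token, with MoE activation ratio and quantization width as the levers:

```python
# Bytes read from memory per decoded token, before and after Layer 2.
# Simplification: treats all 70B params as expert weights; in reality
# attention weights stay dense and MoE doesn't reduce them.
params = 70e9
fp16_bytes = params * 2                 # dense FP16: 140 GB/token

moe_active_fraction = 2 / 8             # top-2-of-8 experts
int4_bytes_per_param = 0.5              # 4-bit weights
layer2_bytes = params * moe_active_fraction * int4_bytes_per_param

print(f"{fp16_bytes / layer2_bytes:.0f}x less data moved per token")  # 16x
```

Since ~99.8% of decode energy is data movement, cutting bytes moved by 16x cuts per-token power by nearly the same factor.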


Local LLMs and Power Efficiency

The Consumer GPU Advantage

# The power efficiency paradox
power_paradox = {
    "conventional_wisdom": "Datacenter GPUs are more power-efficient",
    "reality": {
        "with_batching": "H100 at batch=32: 1.10 t/s/W — comparable to RTX 4060 small model",
        "single_request": "RTX 4060 small model is 37x more efficient (1.25 vs 0.034 t/s/W)",
    },
    "reason": {
        "H100_at_700W": "Most power consumed by idle memory banks and interconnect",
        "RTX_4060_at_40W": "Small model keeps the whole system at high utilization",
    },
    "conclusion": "For personal use, local LLMs are rational from a power perspective too",
}

Datacenter GPUs are designed around the assumption of concurrent multi-request processing. When a single user sends a single request, most of those 700W go to waste. Running a 4B model on an RTX 4060 is far more power-efficient for that use case.

Maximizing Power Efficiency on an RTX 4060 in Practice

1. Prefer smaller models
   -> Qwen3.5 4B (3.4GB, ~40W) delivers 8x the t/s/W of Qwen2.5-32B (18GB, ~70W)
   -> If the task allows it, always reach for the smallest viable model

2. Choose MoE models
   -> Same parameter count, but fewer active parameters means less power draw
   -> Mixtral 8x7B has 3-4x the parameter efficiency of a dense 47B

3. Keep context short
   -> KV cache reads consume power
   -> Use RAG to retrieve only what's needed; don't dump the full document

4. Idle the GPU when inference isn't needed
   -> RTX 4060 idle draw is ~10W
   -> Stopping background inference saves 50-60W instantly

References

  1. "LIMINAL: Exploring The Frontiers of LLM Decode Performance" (2025) arXiv:2507.14397
  2. Dennard, R. H. et al. "Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions" (1974) IEEE JSSC
  3. "The Efficiency Misnomer" -- Patterson et al. (2021) arXiv:2110.11822
  4. NVIDIA B200 specifications (2025) -- TDP 1000W, HBM3E 8TB/s
