DEV Community

plasmon

HBM4 Didn't Break the Memory Wall — It Just Moved It


HBM bandwidth has doubled every generation.

HBM2E (2020): 410 GB/s per stack — 1024-bit, 3.2 Gb/s/pin
HBM3  (2022): 819 GB/s per stack — 1024-bit, 6.4 Gb/s/pin
HBM3E (2024): 1.2 TB/s per stack — 1024-bit, up to 9.8 Gb/s/pin (JEDEC max, varies by vendor)
HBM4  (2026): 2.0 TB/s per stack — 2048-bit, 8.0 Gb/s/pin

Notice anything off?

HBM4's JEDEC base pin speed is 8.0 Gb/s. That's lower than HBM3E's 9.8 Gb/s JEDEC maximum (Samsung's implementation reaches it; SK Hynix's HBM3E runs at 8 Gb/s). Bandwidth doubled, but the base-spec pin speed went down. Almost all of the gain comes from doubling the interface width from 1024 to 2048 bits.

They didn't make the pipe faster. They made it wider. That was HBM4's design decision, and it was driven by physics hitting back.
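The arithmetic is easy to check: per-stack bandwidth is just interface width times pin speed. A minimal sketch (the function name is mine) that reproduces the table above:

```python
# Peak per-stack bandwidth = interface width (bits) * pin speed (Gb/s) / 8 bits per byte
def stack_bandwidth_gbs(width_bits: int, pin_gbps: float) -> float:
    """Peak bandwidth in GB/s for one HBM stack."""
    return width_bits * pin_gbps / 8

print(stack_bandwidth_gbs(1024, 9.8))  # HBM3E: 1254.4 GB/s (~1.2 TB/s)
print(stack_bandwidth_gbs(2048, 8.0))  # HBM4:  2048.0 GB/s (~2.0 TB/s)
```

Slower pins, twice the pins: HBM4 still lands at roughly 2x HBM3E per stack.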


Why Pin Speed Hit a Ceiling

The Signal Integrity Wall

# HBM per-pin speed progression
pin_speed_history = {
    "HBM2E": {"speed": 3.2, "unit": "Gb/s", "year": 2020},
    "HBM3":  {"speed": 6.4, "unit": "Gb/s", "year": 2022},
    "HBM3E": {"speed": 9.8, "unit": "Gb/s", "year": 2024},
    "HBM4":  {"speed": 8.0, "unit": "Gb/s", "year": 2026},
}

# Pin speed growth rate
# HBM2E→HBM3: 2.0x (doubled in 2 years)
# HBM3→HBM3E: 1.53x (1.5x in 2 years)
# HBM3E→HBM4: 0.82x (decline)

Signaling through microbumps above 10 Gb/s gets ugly. TSV (Through-Silicon Via) parasitic capacitance and impedance mismatch cause jitter to blow up. HBM3E's 9.8 Gb/s was already pushing against that physical limit.

SK Hynix's 12-layer HBM4 sample, shipped in March 2025, hit 11.7 Gb/s, meeting NVIDIA's Rubin requirements. The JEDEC base spec landed at 8 Gb/s, but vendor implementations routinely exceed the base spec (the same pattern as HBM3E). Mass-production speeds are expected to land between 8 and 12 Gb/s.

# Why going wider is the safer bet
design_tradeoff = {
    "Increase pin speed": {
        "Upside": "No area increase",
        "Risk": "Jitter blowup, yield loss, power increase",
        "Limit": "~10 Gb/s (TSV parasitic capacitance)",
        "Cost": "Signal integrity circuits get complex fast",
    },
    "Widen the interface": {
        "Upside": "Double bandwidth while keeping signal quality intact",
        "Risk": "Die area increase, packaging cost increase",
        "Limit": "Physical bump pitch (40μm → 36μm → ?)",
        "Cost": "Die area = cost",
    },
}
# HBM4 chose the latter. Safe, but it gets the bandwidth numbers

What 2 TB/s Means for LLM Inference

A100 → H100 → B200 → Next-Gen Bandwidth Progression

# GPU accelerator HBM bandwidth over time
gpu_bandwidth = {
    "A100 80GB":  {"hbm": "HBM2E", "stacks": 5, "bw": "2.0 TB/s", "year": 2020},
    "H100 80GB":  {"hbm": "HBM3",  "stacks": 5, "bw": "3.35 TB/s", "year": 2022},
    "H200":       {"hbm": "HBM3E", "stacks": 6, "bw": "4.8 TB/s", "year": 2024},
    "B200":       {"hbm": "HBM3E", "stacks": 8, "bw": "8.0 TB/s", "year": 2025},
    "Next-gen (est.)": {"hbm": "HBM4", "stacks": 8, "bw": "16 TB/s", "year": "2026-27"},
}

# B200→next-gen: 2x bandwidth
# How much does this actually speed up LLM inference?

Recalculating the Decode Bandwidth Bottleneck

Let's take the bandwidth-bound decode calculation from the KV cache article (a separate post) and re-run it for the HBM4 generation.

# Llama-3-70B decode bandwidth requirements
model_params = {
    "parameters": 70e9,
    "bytes_per_param": 2,  # FP16
    "model_size": 140e9,   # 140 GB
}

# Generating 1 token requires reading the entire model (weight-bound decode)
# Theoretical max decode speed = bandwidth / model size

decode_speed = {
    "A100 (2 TB/s)":      f"{2000 / 140:.0f} t/s",    # 14 t/s
    "H100 (3.35 TB/s)":   f"{3350 / 140:.0f} t/s",    # 24 t/s
    "H200 (4.8 TB/s)":    f"{4800 / 140:.0f} t/s",    # 34 t/s
    "B200 (8 TB/s)":      f"{8000 / 140:.0f} t/s",    # 57 t/s
    "HBM4 gen (16 TB/s)": f"{16000 / 140:.0f} t/s",   # 114 t/s
}

# HBM4 generation: 70B model at 114 t/s
# 8x faster than A100's 14 t/s
# Still nowhere near 10,000 t/s

What LIMINAL Found

A team led by NVIDIA Research (Davies et al., arXiv:2507.14397) built an analytical model called LIMINAL and came to a sobering conclusion:

liminal_findings = {
    "model_accuracy": "7.6% mean absolute error vs. measured",
    "key_conclusion": "Reaching 10,000+ t/s requires fundamental algorithmic breakthroughs, not just hardware scaling",
    "four_barriers": [
        "Compute",
        "Memory capacity",
        "Memory bandwidth",
        "Collective communication",
    ],
    "implication": "HBM4 won't break through the bandwidth wall. It just pushes it back a bit",
}

10,000 t/s means generating a token every 0.1ms. For a 70B model (140GB at FP16), that requires 10,000 × 140 = 1,400 TB/s of bandwidth. Eight HBM4 stacks deliver 16 TB/s — roughly 1/88th of what's needed. Two orders of magnitude short. LIMINAL makes it clear: no amount of HBM generational scaling alone will bridge this gap.
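That back-of-envelope math, as a sketch you can run:

```python
# Bandwidth needed for 10,000 t/s on a 70B FP16 model (weight-bound decode)
model_bytes = 70e9 * 2                           # 140 GB read per token
target_tps = 10_000
required_tbs = target_tps * model_bytes / 1e12   # 1400 TB/s
hbm4_gen_tbs = 16.0                              # eight HBM4 stacks

print(required_tbs)                  # 1400.0 TB/s needed
print(required_tbs / hbm4_gen_tbs)   # 87.5: HBM4 gen delivers ~1/88th of that
```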


What This Means for Consumer GPUs

RTX 4060 → Next-Gen Bandwidth Outlook

# Consumer GPU (GDDR) bandwidth over time
consumer_bandwidth = {
    "RTX 3060":        {"memory": "GDDR6",   "bw": "360 GB/s", "year": 2021},
    "RTX 4060":        {"memory": "GDDR6",   "bw": "272 GB/s", "year": 2023},
    "RTX 5060":        {"memory": "GDDR7",   "bw": "448 GB/s", "year": 2025},
    "RTX 6060 (est.)": {"memory": "GDDR7X?", "bw": "~550-600 GB/s", "year": "2027-28"},
}

# Note: RTX 4060 actually regressed in bandwidth (360→272)
# NVIDIA prioritizes cost over bandwidth on consumer cards
# Even with GDDR7, RTX 5060's 448 GB/s is roughly one HBM2E stack

# The datacenter-consumer bandwidth gap
gap = {
    "2025": f"B200 8 TB/s / RTX 4060 272 GB/s = {8000/272:.0f}x gap",             # ~29x
    "2027": f"HBM4 gen 16 TB/s / RTX 6060 est. 550 GB/s = {16000/550:.0f}x gap",  # ~29x
}
# The gap stays around 30x. Consumer will never catch up to datacenter

What Local LLM Users Should Take Away

# If bandwidth determines decode speed, is local LLM's future bleak?

local_llm_perspective = {
    "The pessimistic view": {
        "fact": "272 GB/s caps Qwen2.5-32B Q4_K_M (~20GB) at ~14 t/s",
        "fact2": "GDDR7 at 448 GB/s only gets you to ~22 t/s",
        "conclusion": "You can't win on bandwidth",
    },
    "The realistic view": {
        "counter1": "Quantization keeps advancing (Q2_K, 1.5-bit) — smaller models need less bandwidth",
        "counter2": "Small models are getting scary good (Qwen3.5 4B-class accuracy gains) — you may not need the big model",
        "counter3": "Multi-model orchestration — use bandwidth efficiently",
        "counter4": "Block selection optimization (software PRISM-like approaches) — read less data per token",
    },
    "core_truth": "You'll lose the bandwidth arms race. Win by not needing the bandwidth in the first place",
}
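The pessimistic numbers above fall straight out of the bandwidth-bound decode ceiling. A tiny estimator (my helper, not from the article; the ~5 GB figure for an aggressively quantized 32B model is an assumption):

```python
def decode_ceiling_tps(bw_gbs: float, model_gb: float) -> float:
    """Upper bound on decode speed: every token reads the full model once."""
    return bw_gbs / model_gb

print(round(decode_ceiling_tps(272, 20)))  # RTX 4060 + 32B Q4_K_M: 14 t/s
print(round(decode_ceiling_tps(448, 20)))  # GDDR7 at 448 GB/s: 22 t/s
print(round(decode_ceiling_tps(272, 5)))   # same 4060, ~5 GB 2-bit model (assumed size): ~54 t/s
```

Which is the realistic view in one line: shrinking the bytes read per token moves the ceiling just as surely as buying bandwidth does.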

Datacenter GPU bandwidth grows at roughly 2x every two years. Consumer GPU bandwidth can actually go backwards when cost is prioritized (RTX 3060 to 4060: 360 to 272 GB/s). This gap is structural — it won't close. HBM4's 2 TB/s per stack is for datacenter in 2026, and it will never come to consumer hardware. HBM requires an interposer and packaging that simply doesn't fit a consumer GPU form factor.

The battlefield for local LLM isn't raw bandwidth. It's maximizing useful work per byte transferred — and that's a software and model design problem.


The Memory Wall Is Being Attacked from Three Directions

Right now, the memory bandwidth bottleneck is being attacked at three layers.

Layer 1: Raw hardware bandwidth increase
  HBM3→HBM3E→HBM4 (2x every 2 years)
  GDDR6→GDDR7 (up to ~1.6x on spec, actual products vary)
  → Reliable, but decelerating at the 10 Gb/s/pin physics wall

Layer 2: Reduce how much you read
  Quantization (FP16→Q4_K_M→Q2_K: 4-8x reduction)
  Sparse Attention (skip most of KV cache — reduction varies by method)
  Block selection (PRISM: 16x reduction)
  → Software-driven, immediate impact

Layer 3: Eliminate reads entirely
  PIM (Processing-In-Memory: compute where the data lives)
  Cache optimization (keep hot patterns in SRAM)
  → Requires hardware changes, but attacks the root cause

Effective bandwidth = L1 × L2 × L3
If HBM4 + Sparse Attention + PIM all come together:
  16 TB/s × 10x reduction (est.) × 2x PIM efficiency (est.) = 320 TB/s effective
  That gives a 70B model 2,285 t/s — 163x over today's A100
  Note: 10x and 2x are rough estimates from combining techniques, not measured values

Still short of 10,000 t/s. But the leap from 14 t/s to 2,000+ t/s is impossible with any single technology — it only becomes visible when you multiply all three layers together.

HBM4 didn't "double bandwidth." It updated Layer 1 of a three-layer stack. The other two layers are software and architecture problems — and that's where local LLM still has room to fight. Even an RTX 4060 at 272 GB/s can multiply its effective bandwidth several times over by maxing out Layer 2. You can lose on raw numbers and still win on how you use what you've got.
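The three-layer multiplication is worth sanity-checking in code. As the article notes, the 10x and 2x factors are rough estimates, not measurements:

```python
# Effective bandwidth = Layer 1 (raw) * Layer 2 (read less) * Layer 3 (PIM)
raw_tbs = 16.0          # Layer 1: eight HBM4 stacks
read_reduction = 10.0   # Layer 2: sparse attention / block selection (rough estimate)
pim_gain = 2.0          # Layer 3: PIM efficiency (rough estimate)

effective_tbs = raw_tbs * read_reduction * pim_gain
tps_70b = effective_tbs * 1000 / 140   # 140 GB FP16 model, TB/s -> t/s

print(effective_tbs)   # 320.0 TB/s effective
print(int(tps_70b))    # 2285 t/s, vs 14 t/s on an A100
```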


References

  1. "LIMINAL: Exploring The Frontiers of LLM Decode Performance" (2025) arXiv:2507.14397
  2. JEDEC JESD270-4 HBM4 Standard (2025)
  3. SK Hynix HBM4 12-layer sample (March 2025) — 11.7 Gb/s, for NVIDIA Rubin
  4. "PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection" (2026) arXiv:2603.21576
