Originally published at qiita.com

The Memory Bandwidth Gap Is 49x and Growing — Why Local LLMs Hit a Ceiling

The Wall I Hit on an RTX 4060 Was a Bandwidth Wall

Running Qwen3.5-9B on an RTX 4060 8GB gets you about 40 tok/s. Perfectly usable for a reasoning model. But scale up the model size and the numbers crater. 27B drops to 15 tok/s. 32B at Q4 quantization barely holds 10 tok/s.

The bottleneck isn't GPU compute. It's memory bandwidth.

LLM inference — especially the token generation phase — is rate-limited by how fast model weights can be read out of VRAM. The RTX 4060's GDDR6 bandwidth is 272 GB/s. A 4.1GB model can theoretically be read 66 times per second, a 9GB model only 30 times, and an 18GB model only 15 times. Real-world numbers can beat these theoretical figures thanks to caching effects, but the fundamental structure doesn't change: bandwidth sets the ceiling.
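A back-of-envelope sketch of that ceiling, using the model sizes and the 272 GB/s figure from above (real runs add KV cache and activation traffic on top of the weight reads):

# Back-of-envelope ceiling: tokens/sec <= memory bandwidth / bytes read per token.
# Assumes every generated token streams the full weight file once.

BANDWIDTH_GBPS = 272  # RTX 4060 GDDR6, GB/s

def max_tokens_per_second(model_size_gb: float, bandwidth_gbps: float = BANDWIDTH_GBPS) -> float:
    """Upper bound on decode speed for a bandwidth-bound model."""
    return bandwidth_gbps / model_size_gb

for size_gb in (4.1, 9.0, 18.0):
    print(f"{size_gb:5.1f} GB model -> <= {max_tokens_per_second(size_gb):5.1f} tok/s")
# 4.1 GB -> ~66 tok/s, 9 GB -> ~30 tok/s, 18 GB -> ~15 tok/s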

The real problem is that this ceiling is moving at completely different speeds for datacenters and consumers.


Datacenter Side: The HBM3→HBM3E→HBM4 Bandwidth Explosion

Here's the datacenter GPU memory bandwidth progression.

[HBM Memory Bandwidth Evolution]

Generation    BW/Stack        Stacks    Total GPU BW      Representative GPU
HBM2e         400 GB/s        5         2.0 TB/s          A100 (80GB)
HBM3          670 GB/s        5         3.35 TB/s         H100 (80GB)
HBM3E         800 GB/s        6         4.8 TB/s          H200 (141GB)
HBM4          2.0+ TB/s       8         ~22 TB/s          Vera Rubin (NVIDIA official)

Growth rate: 1.5-2.0x per generation
~8x over 4 generations

SK Hynix begins mass production of HBM4 in Q3 2026. 16-layer stacking, 48GB per stack, 2+ TB/s bandwidth. Interface width doubles to 2048 bits (from HBM3E's 1024 bits).

When NVIDIA's Vera Rubin generation ships with HBM4, a single GPU gets roughly 22 TB/s of memory bandwidth (NVIDIA official announcement). That's about 6.6x the H100's 3.35 TB/s.


Consumer Side: GDDR6→GDDR7 Bandwidth Gains

Now the consumer GPU picture.

[Consumer GPU Memory Bandwidth Progression]

GPU               Memory Type    Bus Width    Bandwidth    VRAM
RTX 3060          GDDR6          192bit       360 GB/s     12GB
RTX 4060          GDDR6          128bit       272 GB/s     8GB
RTX 4060 Ti       GDDR6          128bit       288 GB/s     8/16GB
RTX 5060 Ti       GDDR7          128bit       448 GB/s     16GB
RTX 5060          GDDR7          128bit       448 GB/s     8GB

Growth rate: 1.2-1.6x per generation

The RTX 5060 Ti adopts GDDR7 and reaches 448 GB/s. A 65% bump over the RTX 4060. Looks like respectable generational progress in isolation.

But the bus width stays at 128 bits. The bandwidth increase comes entirely from higher memory chip speeds (28 Gbps), not from any architectural change. Once GDDR7 chip speeds plateau, bandwidth won't grow unless the bus gets wider. And NVIDIA has been narrowing consumer bus widths, not widening them.


The Gap Isn't Closing. It's Widening.

Line up the numbers and the structure becomes obvious.

[Datacenter vs Consumer Bandwidth Gap]

Year    Datacenter GPU        DC Bandwidth    Consumer GPU    Consumer BW    Gap
2022    A100                  2.0 TB/s        RTX 3060        360 GB/s       5.6x
2023    H100                  3.35 TB/s       RTX 4060        272 GB/s       12.3x
2024    H200                  4.8 TB/s        RTX 4060 Ti     288 GB/s       16.7x
2026    Vera Rubin (HBM4)     ~22 TB/s        RTX 5060 Ti     448 GB/s       ~49x

The gap in 2022 was 5.6x. By 2026 it's roughly 49x (Vera Rubin bandwidth per NVIDIA official announcement).

Pulling in the high-end consumer card (RTX 4090: 1,008 GB/s) doesn't fundamentally change the picture. Even if an RTX 5090 hits ~1.8 TB/s with GDDR7, the gap to HBM4 GPUs is still over 12x. The structural bandwidth gap between HBM and GDDR architectures isn't going anywhere.


Why This Gap Exists

HBM and GDDR Are Physically Different Animals

HBM (High Bandwidth Memory) vertically stacks DRAM dies and connects them with TSVs (Through-Silicon Vias). A single stack has 1024+ bits of bus width.

GDDR connects chips on a PCB to the GPU die via solder bumps. Bus width is constrained by how many traces you can route on a PCB. 128-bit is mainstream, 384-512 bits at the high end.

[Bandwidth = Bus Width × Transfer Speed]

HBM4:  2048bit × pin speed = 2.0+ TB/s (per stack, SK Hynix spec)
GDDR7: 128bit  × 28 Gbps  = 448 GB/s  (all chips combined)

Bus width difference: 16x (2048 vs 128)
Speed difference:     GDDR has higher pin speed
Result:               Bus width physical advantage dominates

HBM loses on raw speed but dominates on bus width — a physical advantage. TSVs pack thousands of vertical interconnects into a few square millimeters of die area. That kind of wiring density is physically impossible with PCB traces.
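The arithmetic behind that block is worth making explicit. A quick sketch, where the GDDR7 figures are the ones quoted above and the HBM4 per-pin rate of roughly 8 Gbps is simply what a 2 TB/s-per-stack spec implies for a 2048-bit interface:

# Bandwidth = bus width (bits) / 8 * per-pin data rate (Gbps) -> GB/s.

def bandwidth_gbps(bus_width_bits: int, pin_speed_gbps: float) -> float:
    return bus_width_bits / 8 * pin_speed_gbps

gddr7 = bandwidth_gbps(128, 28)       # RTX 5060 Ti: 128-bit bus, GDDR7 at 28 Gbps
hbm4_stack = bandwidth_gbps(2048, 8)  # one HBM4 stack; ~8 Gbps/pin implied by the 2 TB/s spec

print(f"GDDR7 (128-bit card): {gddr7:.0f} GB/s")       # 448 GB/s
print(f"HBM4 (single stack):  {hbm4_stack:.0f} GB/s")  # 2048 GB/s
print(f"Bus width ratio: {2048 // 128}x")               # 16x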

Cost Won't Allow It

HBM4 is estimated at roughly $500 per stack (industry analyst estimates). Eight stacks means $4,000 in memory alone — far exceeding the entire price of a consumer GPU.

Even if you put a single HBM stack on a consumer card, a $500 memory module on a $299 GPU doesn't make a viable product. GDDR chips cost $5-15 each. The cost structures are fundamentally different.


Is CXL the Savior?

CXL (Compute Express Link) 3.0 has been getting attention as a datacenter memory pooling technology. Based on PCIe 6.0, it offers 128 GB/s unidirectional read bandwidth (x16 lanes) and enables memory sharing and tiering.

But CXL doesn't solve the LLM inference bandwidth problem.

The reason is straightforward: CXL's 128 GB/s of read bandwidth is roughly 1/16th of a single HBM4 stack's 2 TB/s. Reading model weights from CXL-attached memory therefore runs at about 1/16th the speed of GPU-HBM direct access, and for LLM inference that means token generation slows down by the same factor.
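To put that ratio in token terms, here is the same bandwidth-divided-by-size estimate applied to a large model served either from HBM or over CXL. The 400GB model size is purely an illustrative stand-in for a frontier-scale checkpoint:

# Decode ceiling when weights stream over different links (GB/s divided by GB per token).

MODEL_GB = 400  # hypothetical frontier-scale checkpoint, for illustration only

for name, bw_gbps in [("HBM4 (per GPU, ~22 TB/s)", 22_000),
                      ("HBM4 (single stack)", 2_000),
                      ("CXL 3.0 x16 read", 128)]:
    print(f"{name:28s} -> <= {bw_gbps / MODEL_GB:6.1f} tok/s")
# CXL tops out around 0.3 tok/s for a model this size: capacity without bandwidth.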

CXL's value is in memory capacity expansion, not bandwidth expansion. It can give you the space to hold enormous models, but inference speed is governed by bandwidth. This is the core of SemiAnalysis's argument in "CXL Is Dead In The AI Era."


What Local LLM Users Can Actually Do

The bandwidth gap is rooted in physics and cost structures. No individual is going to close it. But there are concrete ways to maximize efficiency within bandwidth constraints.

1. Quantization Is a Bandwidth Hack

Q4_K_M quantization shrinks a 16-bit model to roughly 1/4 its size. For a bandwidth-bound workload, that's effectively the same as getting 4x the bandwidth.

[Quantization vs Effective Bandwidth]

Quantization    Model Size    Theoretical Speed at 272 GB/s
FP16            18.0 GB       15.1 tok/s
Q8_0             9.0 GB       30.2 tok/s
Q4_K_M           4.5 GB       60.4 tok/s
Q3_K_S           3.4 GB       80.0 tok/s

Q4 effectively quadruples your bandwidth.
Accuracy loss varies by model and benchmark, but the 4x bandwidth trade-off is rational.
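Those numbers fall straight out of the same division. A small sketch that reproduces them and expresses quantization as an effective-bandwidth multiplier (the file sizes are approximate for a roughly 9B-parameter model; real quantized files vary a bit by format):

# Quantization as an effective-bandwidth multiplier at a fixed 272 GB/s.

BANDWIDTH_GBPS = 272
FP16_SIZE_GB = 18.0

for quant, size_gb in [("FP16", 18.0), ("Q8_0", 9.0), ("Q4_K_M", 4.5), ("Q3_K_S", 3.4)]:
    speed = BANDWIDTH_GBPS / size_gb                        # theoretical tok/s ceiling
    effective_bw = BANDWIDTH_GBPS * FP16_SIZE_GB / size_gb  # "as if" bandwidth relative to FP16
    print(f"{quant:7s} {size_gb:5.1f} GB  {speed:5.1f} tok/s  ~{effective_bw / 1000:.2f} TB/s effective")
# Q4_K_M: 60.4 tok/s, ~1.09 TB/s effective; a 4x multiplier from quantization alone.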

2. KV Cache Saves Bandwidth

Not everything has to be recomputed on every token generation step. The KV cache stores the keys and values already computed for earlier tokens, so the attention layers only have to produce them for the newest token instead of re-deriving them for the entire context.

This, together with the GPU's on-chip caches, is why llama.cpp benchmarks on an RTX 4060 sometimes land above the naive weights-size-divided-by-bandwidth estimate: effective bandwidth can be higher than the raw spec. Flash Attention pushes this cache efficiency even further.
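For a sense of what the cache costs in return, here is a rough footprint estimate. The layer count, KV head count, and head dimension are illustrative values for a 9B-class model with grouped-query attention, not the exact config of any specific model:

# Rough KV cache footprint: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value, per token.
# Illustrative config for a 9B-class model with grouped-query attention.

layers, kv_heads, head_dim = 36, 8, 128
bytes_per_value = 2  # FP16 cache; many runtimes can also quantize this to 8 or 4 bits

per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
for context in (4_096, 32_768):
    print(f"{context:6d} tokens -> {per_token_bytes * context / 2**30:.2f} GiB of KV cache")
# ~0.56 GiB at 4K context, ~4.5 GiB at 32K: past K/V are read back each step instead of recomputed.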

3. Speculative Decoding Improves Bandwidth Utilization

A small draft model generates multiple candidate tokens ahead of time, and the main model verifies them all in a single forward pass. One read of the main model's weights now covers several tokens instead of one, so bandwidth utilization goes up.

Using a 1.5B draft model for Qwen3.5-9B on an RTX 4060, you can expect 1.3-1.5x speedups depending on acceptance rates. Same bandwidth, higher throughput.
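The usual idealized model from the speculative decoding literature makes the bandwidth argument concrete: with draft length k and per-token acceptance probability a, one full-model pass accepts (1 - a^(k+1)) / (1 - a) tokens on average. A sketch, ignoring the draft model's own runtime (so real speedups come in lower):

# Expected tokens accepted per verification pass under the idealized speculative decoding model.

def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    return (1 - accept_prob ** (draft_len + 1)) / (1 - accept_prob)

for a in (0.6, 0.7, 0.8):
    print(f"accept={a:.1f}, k=4 -> ~{expected_tokens_per_pass(a, 4):.2f} tokens per full-model pass")
# One pass over the 9B weights now yields 2-3+ tokens instead of 1, so bandwidth utilization improves.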

4. Model Selection Is the Biggest Lever

When bandwidth is limited, choosing the highest-quality model that fits within your bandwidth budget becomes the single most important decision.

[RTX 4060 8GB — Optimal Model Selection Within Bandwidth Budget]

Use Case             Recommended Model            Reason
General chat         Qwen3.5-9B Q4_K_M            Best intelligence within bandwidth
Code generation      DeepSeek-Coder-V2-Lite        16B MoE, 2.4B active params
Translation          NLLB-200-3.3B                 Purpose-built = best bandwidth efficiency
RAG                  BGE-M3 + Qwen2.5-7B           Embeddings are fine at small scale

Under bandwidth constraints, using multiple specialized models beats throwing everything at one monolithic model.
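One way to make that selection mechanical is to treat it as a constraint problem: the model has to fit in VRAM with headroom for the KV cache, and its bandwidth ceiling has to clear your target decode speed. A minimal sketch with made-up candidate sizes, not measured file sizes:

# Pick the largest quantized model that still clears a target decode speed on a given card.

BANDWIDTH_GBPS = 272   # RTX 4060
VRAM_GB = 8
TARGET_TOKS = 20       # minimum acceptable tok/s for the use case

candidates = [
    ("9B  Q4_K_M",  4.5),
    ("14B Q4_K_M",  8.0),
    ("32B Q4_K_M", 18.0),
]

viable = [(name, size) for name, size in candidates
          if size <= VRAM_GB * 0.9                       # leave headroom for the KV cache
          and BANDWIDTH_GBPS / size >= TARGET_TOKS]       # bandwidth ceiling clears the target

best = max(viable, key=lambda c: c[1]) if viable else None
print("Best fit:", best)  # ('9B  Q4_K_M', 4.5) under these assumptions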


The Bandwidth Gap Isn't All Bad

After all this talk about the growing divide, there's one counterintuitive fact worth noting.

The datacenter bandwidth explosion raises the quality of local LLMs too. HBM4-generation bandwidth enables training of increasingly large models. Quantize those trained models and bring them to local hardware, and even an 8GB GPU benefits.

As the bandwidth gap widens, the division of labor between training and inference becomes more rational. Datacenters train massive models. Individuals quantize and run them locally. This structure is actually healthy.

Where it hurts is use cases that demand real-time bandwidth during inference — interactive code completion, real-time translation, long-context batch processing. There, local bandwidth constraints directly cap the quality of experience.

The RTX 5060 Ti's 448 GB/s gives you roughly 100 tok/s on a Q4-quantized 9B model. Plenty for everyday conversations. But running frontier models locally? The bandwidth is off by an order of magnitude. Accept that reality, then squeeze maximum efficiency out of quantization and model selection. That's the individual-scale optimal play for 2026.

