The Wall I Hit on an RTX 4060 Was a Bandwidth Wall
Running Qwen3.5-9B on an RTX 4060 8GB gets you about 40 tok/s. Perfectly usable for a reasoning model. But scale up the model size and the numbers crater. 27B drops to 15 tok/s. 32B at Q4 quantization barely holds 10 tok/s.
The bottleneck isn't GPU compute. It's memory bandwidth.
LLM inference — especially the token generation phase — is rate-limited by how fast model weights can be read out of VRAM. The RTX 4060's GDDR6 bandwidth is 272 GB/s. A 4.1GB model can theoretically be read 66 times per second, but a 9GB model only 30 times, and an 18GB model only 15 times. Real-world numbers sometimes beat the theoretical figure thanks to caching effects, but the fundamental structure — bandwidth sets the ceiling — doesn't change.
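The arithmetic behind those ceilings is a single division; here is a minimal sketch using the bandwidth and model sizes quoted above:

```python
# Rough ceiling on decode speed: every generated token streams the full
# set of (quantized) weights out of VRAM once, so
#   max tok/s ~= memory_bandwidth / model_size.
# Real runs land near, not exactly on, these values.

BANDWIDTH_GBPS = 272  # RTX 4060 GDDR6, GB/s

def max_tokens_per_sec(model_size_gb: float, bw_gbps: float = BANDWIDTH_GBPS) -> float:
    """Bandwidth-bound upper bound on tokens/second during decode."""
    return bw_gbps / model_size_gb

for size in (4.1, 9.0, 18.0):
    print(f"{size:>5.1f} GB model -> {max_tokens_per_sec(size):5.1f} tok/s ceiling")
```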
The real problem is that this ceiling is moving at completely different speeds for datacenters and consumers.
Datacenter Side: The HBM3→HBM3E→HBM4 Bandwidth Explosion
Here's the datacenter GPU memory bandwidth progression.
[HBM Memory Bandwidth Evolution]

| Generation | BW/Stack | Stacks | Total GPU BW | Representative GPU |
| --- | --- | --- | --- | --- |
| HBM2e | 400 GB/s | 5 | 2.0 TB/s | A100 (80GB) |
| HBM3 | 670 GB/s | 5 | 3.35 TB/s | H100 (80GB) |
| HBM3E | 800 GB/s | 6 | 4.8 TB/s | H200 (141GB) |
| HBM4 | 2.0+ TB/s | 8 | ~22 TB/s | Vera Rubin (NVIDIA official) |

Growth rate: 1.5-2.0x per generation, ~8x over 4 generations.
SK Hynix begins mass production of HBM4 in Q3 2026. 16-layer stacking, 48GB per stack, 2+ TB/s bandwidth. Interface width doubles to 2048 bits (from HBM3E's 1024 bits).
When NVIDIA's Vera Rubin generation ships with HBM4, a single GPU gets roughly 22 TB/s of memory bandwidth (NVIDIA official announcement). That's about 6.6x the H100's 3.35 TB/s.
Consumer Side: GDDR6→GDDR7 Bandwidth Gains
Now the consumer GPU picture.
[Consumer GPU Memory Bandwidth Progression]

| GPU | Memory Type | Bus Width | Bandwidth | VRAM |
| --- | --- | --- | --- | --- |
| RTX 3060 | GDDR6 | 192-bit | 360 GB/s | 12GB |
| RTX 4060 | GDDR6 | 128-bit | 272 GB/s | 8GB |
| RTX 4060 Ti | GDDR6 | 128-bit | 288 GB/s | 8/16GB |
| RTX 5060 Ti | GDDR7 | 128-bit | 448 GB/s | 16GB |
| RTX 5060 | GDDR7 | 128-bit | 448 GB/s | 8GB |

Growth rate: 1.2-1.6x per generation.
The RTX 5060 Ti adopts GDDR7 and reaches 448 GB/s. A 65% bump over the RTX 4060. Looks like respectable generational progress in isolation.
But the bus width stays at 128 bits. The bandwidth increase comes entirely from higher memory chip speeds (28 Gbps), not from any architectural change. Once GDDR7 chip speeds plateau, bandwidth won't grow unless the bus gets wider. And NVIDIA has been narrowing consumer bus widths, not widening them.
The Gap Isn't Closing. It's Widening.
Line up the numbers and the structure becomes obvious.
[Datacenter vs Consumer Bandwidth Gap]

| Generation | DC Bandwidth | Consumer BW | Gap |
| --- | --- | --- | --- |
| 2022 (A100) | 2.0 TB/s | 360 GB/s | 5.6x |
| 2023 (H100) | 3.35 TB/s | 272 GB/s | 12.3x |
| 2024 (H200) | 4.8 TB/s | 288 GB/s | 16.7x |
| 2026 (HBM4) | ~22 TB/s | 448 GB/s | ~49x |
The gap in 2022 was 5.6x. By 2026 it's roughly 49x (Vera Rubin bandwidth per NVIDIA official announcement).
Pulling in the high-end consumer card (RTX 4090: 1,008 GB/s) doesn't fundamentally change the picture. Even if an RTX 5090 hits ~1.8 TB/s with GDDR7, the gap to HBM4 GPUs is still over 12x. The structural bandwidth gap between HBM and GDDR architectures isn't going anywhere.
Why This Gap Exists
HBM and GDDR Are Physically Different Animals
HBM (High Bandwidth Memory) vertically stacks DRAM dies and connects them with TSVs (Through-Silicon Vias). A single stack has 1024+ bits of bus width.
GDDR connects chips on a PCB to the GPU die via solder bumps. Bus width is constrained by how many traces you can route on a PCB. 128-bit is mainstream, 384-512 bits at the high end.
[Bandwidth = Bus Width × Transfer Speed]

HBM4: 2048-bit × pin speed = 2.0+ TB/s per stack (SK Hynix spec)
GDDR7: 128-bit × 28 Gbps = 448 GB/s (all chips combined)
Bus width: 16x in HBM's favor (2048 vs 128)
Pin speed: in GDDR's favor
Result: the physical bus-width advantage dominates
HBM loses on raw speed but dominates on bus width — a physical advantage. TSVs pack thousands of vertical interconnects into a few square millimeters of die area. That kind of wiring density is physically impossible with PCB traces.
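The formula in code form. The HBM4 per-pin speed below is a back-of-envelope assumption (not a published figure) chosen so that a 2048-bit stack lands near the quoted 2+ TB/s:

```python
# Peak bandwidth = bus_width_bits * pin_speed_gbps / 8 (bits -> bytes).
# GDDR wins on per-pin speed; HBM wins on width, and width dominates.

def peak_bandwidth_gbps(bus_width_bits: int, pin_speed_gbps: float) -> float:
    """Peak memory bandwidth in GB/s for a given interface."""
    return bus_width_bits * pin_speed_gbps / 8

gddr7 = peak_bandwidth_gbps(128, 28)       # RTX 5060 Ti: 448 GB/s total
hbm4_stack = peak_bandwidth_gbps(2048, 8)  # ~2 TB/s for ONE stack (assumed 8 Gbps/pin)
print(gddr7, hbm4_stack)
```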
Cost Won't Allow It
HBM4 is estimated at roughly $500 per stack (industry analyst estimates). Eight stacks means $4,000 in memory alone — far exceeding the entire price of a consumer GPU.
Even if you put a single HBM stack on a consumer card, a $500 memory module on a $299 GPU doesn't make a viable product. GDDR chips cost $5-15 each. The cost structures are fundamentally different.
Is CXL the Savior?
CXL (Compute Express Link) 3.0 has been getting attention as a datacenter memory pooling technology. Based on PCIe 6.0, it offers 128 GB/s unidirectional read bandwidth (x16 lanes) and enables memory sharing and tiering.
But CXL doesn't solve the LLM inference bandwidth problem.
The reason is straightforward: CXL's 128 GB/s read bandwidth is 1/16th of HBM4's 2 TB/s. Reading model weights out of CXL-attached memory runs at 1/16th the speed of direct GPU-HBM access, so for LLM inference, token generation slows to roughly 1/16th as well.
CXL's value is in memory capacity expansion, not bandwidth expansion. It can give you the space to hold enormous models, but inference speed is governed by bandwidth. This is the core of SemiAnalysis's argument in "CXL Is Dead In The AI Era."
What Local LLM Users Can Actually Do
The bandwidth gap is rooted in physics and cost structures. No individual is going to close it. But there are concrete ways to maximize efficiency within bandwidth constraints.
1. Quantization Is a Bandwidth Hack
Q4_K_M quantization shrinks a 16-bit model to roughly a quarter of its size. For a bandwidth-bound decode, that's mathematically equivalent to getting 4x the bandwidth.
[Quantization vs Effective Bandwidth]

| Quantization | Model Size | Theoretical Speed at 272 GB/s |
| --- | --- | --- |
| FP16 | 18.0 GB | 15.1 tok/s |
| Q8_0 | 9.0 GB | 30.2 tok/s |
| Q4_K_M | 4.5 GB | 60.4 tok/s |
| Q3_K_S | 3.4 GB | 80.0 tok/s |
Q4 effectively quadruples your bandwidth.
Accuracy loss varies by model and benchmark, but the 4x bandwidth trade-off is rational.
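The theoretical speeds in the table are nothing more than bandwidth divided by file size; a quick sketch with the same numbers:

```python
# Decode-speed ceiling at 272 GB/s for each quantized size of the same
# hypothetical 18 GB FP16 model (sizes taken from the table above).

BW = 272  # GB/s, RTX 4060

quants = {"FP16": 18.0, "Q8_0": 9.0, "Q4_K_M": 4.5, "Q3_K_S": 3.4}

for name, size_gb in quants.items():
    print(f"{name:7s} {size_gb:5.1f} GB  {BW / size_gb:5.1f} tok/s")
```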
2. KV Cache Saves Bandwidth
Not everything has to be recomputed on every token generation step. The KV cache stores the attention keys and values of tokens already processed, so the attention layers never reprocess the prompt and earlier output from scratch for each new token.
This is why llama.cpp benchmarks on an RTX 4060 sometimes exceed the theoretical bandwidth limit: data that stays resident in the GPU's on-chip caches doesn't have to be re-fetched from VRAM, which raises effective bandwidth. Flash Attention pushes this cache efficiency even further.
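The cache itself isn't free, though: it costs VRAM that scales with context length. The standard per-token size formula is sketched below; the layer/head/dimension config is an illustrative guess for a 9B-class model with grouped-query attention, not any model's official spec:

```python
# KV cache size: two tensors (K and V) per layer, each holding
# n_kv_heads * head_dim values per token.

def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache footprint in bytes for a given context length."""
    return n_tokens * 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical config: 36 layers, 8 KV heads, head_dim 128, FP16 cache
mb = kv_cache_bytes(8192, 36, 8, 128) / 2**20
print(f"8K-token context: {mb:.0f} MiB of KV cache")  # -> 1152 MiB
```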
3. Speculative Decoding Improves Bandwidth Utilization
A small draft model generates multiple candidate tokens ahead of time, and the main model verifies them all in a single batch. One read of the weights now validates several tokens instead of producing one, so bandwidth utilization goes up.
Using a 1.5B draft model for Qwen3.5-9B on an RTX 4060, you can expect 1.3-1.5x speedups depending on acceptance rates. Same bandwidth, higher throughput.
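Under the usual simplifying assumption that each draft token is accepted independently with probability a, the expected number of tokens produced per verification pass has a closed form; a sketch:

```python
# Expected tokens accepted per target-model pass with k draft tokens,
# each accepted independently with probability a:
#   E[tokens] = (1 - a**(k+1)) / (1 - a)
# Real speedup is lower: the draft model itself costs compute and memory.

def expected_tokens_per_pass(a: float, k: int) -> float:
    """Expected tokens per verification pass (independence assumption)."""
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.6, 0.7, 0.8):
    print(f"acceptance {a:.1f}, k=4 -> {expected_tokens_per_pass(a, 4):.2f} tokens/pass")
```

Higher acceptance rates pay off superlinearly, which is why a draft model from the same family (sharing the tokenizer and training data) matters more than raw draft speed.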
4. Model Selection Is the Biggest Lever
When bandwidth is limited, choosing the highest-quality model that fits within your bandwidth budget becomes the single most important decision.
[RTX 4060 8GB — Optimal Model Selection Within Bandwidth Budget]

| Use Case | Recommended Model | Reason |
| --- | --- | --- |
| General chat | Qwen3.5-9B Q4_K_M | Best intelligence within bandwidth |
| Code generation | DeepSeek-Coder-V2-Lite | 16B MoE, 2.4B active params |
| Translation | NLLB-200-3.3B | Purpose-built = best bandwidth efficiency |
| RAG | BGE-M3 + Qwen2.5-7B | Embeddings are fine at small scale |
Under bandwidth constraints, using multiple specialized models beats throwing everything at one monolithic model.
The Bandwidth Gap Isn't All Bad
After all this talk about the growing divide, there's one counterintuitive fact worth noting.
The datacenter bandwidth explosion raises the quality of local LLMs too. HBM4-generation bandwidth enables training of increasingly large models. Quantize those trained models and bring them to local hardware, and even an 8GB GPU benefits.
As the bandwidth gap widens, the division of labor between training and inference becomes more rational. Datacenters train massive models. Individuals quantize and run them locally. This structure is actually healthy.
Where it hurts is use cases that demand real-time bandwidth during inference — interactive code completion, real-time translation, long-context batch processing. There, local bandwidth constraints directly cap the quality of experience.
The RTX 5060 Ti's 448 GB/s gives you roughly 100 tok/s on a Q4-quantized 9B model. Plenty for everyday conversations. But running frontier models locally? The bandwidth is off by an order of magnitude. Accept that reality, then squeeze maximum efficiency out of quantization and model selection. That's the individual-scale optimal play for 2026.
References
- SK Hynix HBM4 — 16-layer, 48GB, 2+ TB/s — HBM4 mass production starting Q3 2026
- The State of HBM4 at CES 2026 — 2048-bit IF
- RTX 5060 Ti GDDR7 448 GB/s — 128bit bus, 28 Gbps GDDR7
- CXL Is Dead In The AI Era — SemiAnalysis CXL analysis
- Scaling the Memory Wall: HBM Roadmap — HBM evolution roadmap