Adding More VRAM Won't Fix It — The Physics That HBM, CXL, and Unified Memory Can't Escape
The RTX 4060's 8GB of VRAM tops out around 7B models. Even when the RTX 5060 Ti doubles that to 16GB, a full 70B won't fit. "If VRAM's not enough, just add more": the idea is sound, but the execution runs into three distinct physical tradeoffs.
HBM, CXL, Unified Memory. These three technologies attack the VRAM wall from different angles. Where each one sits on the triangle of bandwidth, capacity, and cost fundamentally changes how LLM inference performs.
The Memory Trilemma: Bandwidth, Capacity, Cost
# Physical tradeoffs across memory technologies
memory_trilemma = {
"HBM3E": {
"bandwidth": "4.8 TB/s (H200, 6 stacks)",
"capacity": "141 GB (H200)",
"cost_per_GB": "~$10-15/GB (HBM3E, 2025 market price)",
"interface": "TSV (Through-Silicon Via), 1024-bit wide per stack",
"physics": "Vertically stacked via through-silicon vias. High bandwidth, but eats die area",
},
"GDDR6 (RTX 4060)": {
"bandwidth": "272 GB/s",
"capacity": "8 GB",
"cost_per_GB": "~$2.5-4/GB (GDDR6 spot price, 2025)",
"interface": "128-bit bus, 2125 MHz (17 Gbps effective)",
"physics": "Solder-bonded on PCB. Cheap, but bandwidth-limited",
},
"CXL 3.1": {
"bandwidth": "64 GB/s per link (x16 PCIe 6.0, unidirectional)",
"capacity": "Theoretically TB-class (memory pooling)",
"cost_per_GB": "~$3-5/GB (DDR5-based)",
"interface": "PCIe 6.0 physical layer (64 GT/s) + CXL protocol",
"physics": "Reuses existing PCIe infrastructure. 1/75 the bandwidth of HBM3E",
},
"Unified Memory (M4 Max)": {
"bandwidth": "546 GB/s (LPDDR5X)",
"capacity": "128 GB",
"cost_per_GB": "Depends on Apple pricing (LPDDR5X itself ~$3-5/GB, but SoC integration makes direct comparison impossible)",
"interface": "LPDDR5X, 512-bit bus",
"physics": "CPU/GPU/NPU share one memory pool. Shared bandwidth = contention",
},
}
These three technologies occupy different vertices of the bandwidth-capacity-cost triangle. HBM chose bandwidth, CXL chose capacity, Unified Memory chose balance. None of them can claim all three.
HBM: King of Bandwidth, Slave to Capacity
# Physical constraints of HBM
hbm_constraints = {
"bandwidth_source": {
"TSV_per_stack": "~5,000+ through-silicon vias",
"bus_width": "1024 bit per stack",
"stacks": "H200: 6 stacks → 6144 bit total bus",
"result": "4.8 TB/s — 18x GDDR",
},
"capacity_wall": {
"die_per_stack": "Current: 8-Hi (8 layers stacked), next-gen: 12-Hi/16-Hi",
"die_size": "24 Gbit per die (3GB) for HBM3E",
"8Hi_capacity": "8 × 3GB = 24 GB per stack",
"12Hi_capacity": "12 × 3GB = 36 GB per stack",
"total_H200": "6 stacks × 24GB = 144 GB raw (NVIDIA-rated 141 GB, some reserved)",
"cost": "One HBM3E stack: estimated $240-360 (24GB x $10-15/GB, 2025 market price)",
},
"area_problem": {
"interposer_area": "Each HBM stack occupies ~100 mm2 of interposer area",
"GPU_die + 6_stacks": "GPU ~800 mm2 + HBM ~600 mm2 = ~1400 mm2 interposer",
"CoWoS_reticle_limit": "~1700 mm2 (TSMC lithography limit)",
"implication": "Fitting 8+ HBM stacks exceeds the reticle limit → chiplet design required",
},
}
# HBM has the best bandwidth, but capacity is physically capped by layer count × stack count × interposer area
# "Just add more HBM" → the interposer doesn't have room
The reason HBM can't "just be scaled up" is area. The GPU die and HBM stacks must sit side by side on an interposer, and the CoWoS reticle limit (~1700 mm²) is the ceiling. The H200 is already close to that limit.
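To make that area math concrete, here's a minimal sketch using the figures quoted above (~800 mm² GPU die, ~100 mm² per HBM stack, ~1700 mm² CoWoS limit). The helper and the loop are illustrative only; real packages also need interposer routing margin.
# Rough CoWoS area budget: does the GPU die plus N HBM stacks fit under the reticle?
# Figures are the ones quoted above; routing margin is ignored for simplicity.
GPU_DIE_MM2 = 800
HBM_STACK_MM2 = 100
COWOS_LIMIT_MM2 = 1700

def interposer_area(hbm_stacks: int) -> int:
    """Interposer area claimed by the GPU die plus its HBM stacks."""
    return GPU_DIE_MM2 + hbm_stacks * HBM_STACK_MM2

for stacks in (6, 8, 10):
    area = interposer_area(stacks)
    verdict = "fits" if area <= COWOS_LIMIT_MM2 else "exceeds reticle -> chiplet design"
    print(f"{stacks} stacks: {area} mm² ({verdict})")
# 6 stacks: 1400 mm² (fits)    <- H200-class, already near the limit
# 8 stacks: 1600 mm² (fits)    <- only with zero routing margin
# 10 stacks: 1800 mm² (exceeds reticle -> chiplet design)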
Impact on LLM inference:
# How HBM's capacity ceiling affects LLM inference
hbm_llm_impact = {
"H200 (141GB HBM3E)": {
"max_model_fp16": "~70B parameters (140GB)",
"max_model_q4": "~280B parameters (70GB) + KV cache headroom",
"70B_kv_cache_room": "141 - 140 = 1 GB → even 32K context is tight",
"solution": "Quantization or Tensor Parallelism (multi-GPU)",
},
"RTX 4060 (8GB GDDR6)": {
"max_model_q4": "~13B parameters (7.2GB usable)",
"bandwidth": "272 GB/s → 7B Q4_K_M at ~32 t/s",
"problem": "13B+ requires CPU offload → 1/10 speed",
},
"RTX 5060 (expected 16GB GDDR7)": {
"bandwidth": "448 GB/s (RTX 5060 Ti confirmed; RTX 5060 non-Ti TBD)",
"max_model_q4": "~30B parameters",
"implication": "2x capacity ≠ 2x model size (KV cache eats the difference)",
},
}
Doubling VRAM doesn't double the model size you can run, and the KV cache is the reason. A 70B model's FP16 KV cache at 32K context runs roughly 8-10GB depending on the exact attention configuration, so the "leftover" VRAM gets consumed by KV cache rather than extra weights.
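For a sense of where that figure comes from, here's a back-of-the-envelope KV cache calculator. The attention dimensions (80 layers, 8 KV heads via GQA, head dimension 128, FP16 cache) are assumed Llama-70B-class values, not numbers taken from the tables above.
# KV cache bytes = 2 (K and V) × layers × kv_heads × head_dim × bytes_per_elem × tokens.
# Assumed Llama-70B-class dims: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gb(ctx):5.1f} GB of KV cache")
#    8192 tokens ->   2.7 GB
#   32768 tokens ->  10.7 GB   <- why "141 GB minus 140 GB of weights" doesn't work
#  131072 tokens ->  42.9 GB   <- the regime where an overflow tier becomes essential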
CXL: Capacity Unleashed, Bandwidth Sacrificed
CXL (Compute Express Link) is a memory expansion protocol built on the PCIe physical layer.
# CXL bandwidth and capacity
cxl_specs = {
"CXL 3.1 (2024)": {
"physical_layer": "PCIe 6.0 (64 GT/s)",
"bandwidth_x16": "64 GB/s (unidirectional, PCIe 6.0 x16)",
"latency": "~170-400 ns (measured, varies by device/config; 2-4x local DDR5)",
"capacity": "Theoretically unlimited (memory pooling + switching)",
"target": "Servers / data centers",
},
"bandwidth_comparison": {
"HBM3E (H200)": "4,800 GB/s",
"GDDR6_RTX4060": "272 GB/s",
"CXL_3.1_x16": "64 GB/s",
"ratio": "CXL is 1/75 of HBM3E, 1/4 of GDDR6",
},
}
# What happens when you run LLM inference over CXL bandwidth
cxl_inference = {
"7B_Q4_K_M (4.7GB weights)": {
"reads_per_token": "4.7 GB",
"cxl_speed": "64 / 4.7 = ~13.6 t/s",
"hbm3e_speed": "4800 / 4.7 = ~1021 t/s",
"gddr6_speed": "272 / 4.7 = ~57.9 t/s → measured 32 t/s (55% efficiency)",
"verdict": "Weight reads from CXL are barely usable",
},
"70B_Q4_K_M (40GB weights)": {
"reads_per_token": "40 GB",
"cxl_speed": "64 / 40 = 1.6 t/s",
"verdict": "Reading speed. Unusable",
},
}
Loading 70B Q4 weights over CXL's 64 GB/s gives you 1.6 t/s. That's about human reading speed.
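Those token rates are just bandwidth divided by the bytes read per token. Here's a minimal roofline-style sketch of that ceiling; it ignores compute, overlap, and protocol overhead, which is why measured numbers (like the RTX 4060's 32 t/s) land below it.
# Bandwidth-bound decode ceiling: each generated token re-reads all weights once,
# so tokens/s <= bandwidth / bytes_of_weights. Compute, overlap, and overhead ignored;
# the 70B column only matters where 40GB of weights actually fits in that tier.
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    return bandwidth_gb_s / weights_gb

tiers = {"HBM3E (H200)": 4800, "GDDR6 (RTX 4060)": 272, "CXL 3.1 x8": 64, "NVMe SSD": 7}
for name, bw in tiers.items():
    print(f"{name:17} 7B Q4 (4.7GB): {decode_ceiling_tps(bw, 4.7):7.1f} t/s | "
          f"70B Q4 (40GB): {decode_ceiling_tps(bw, 40):6.1f} t/s")
# HBM3E (H200)      7B Q4 (4.7GB):  1021.3 t/s | 70B Q4 (40GB):  120.0 t/s
# GDDR6 (RTX 4060)  7B Q4 (4.7GB):    57.9 t/s | 70B Q4 (40GB):    6.8 t/s
# CXL 3.1 x8        7B Q4 (4.7GB):    13.6 t/s | 70B Q4 (40GB):    1.6 t/s
# NVMe SSD          7B Q4 (4.7GB):     1.5 t/s | 70B Q4 (40GB):    0.2 t/s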
But CXL's real value isn't as a place to store weights.
# The right way to use CXL: tiered memory architecture
cxl_tiered_architecture = {
"Tier 0 (GPU SRAM)": {
"purpose": "Activations, work buffers",
"capacity": "24 MB (RTX 4060 L2)",
"bandwidth": "~4 TB/s (on-chip)",
},
"Tier 1 (HBM/GDDR)": {
"purpose": "Model weights, active KV cache",
"capacity": "8-141 GB",
"bandwidth": "272-4800 GB/s",
},
"Tier 2 (CXL Memory)": {
"purpose": "KV cache overflow, inactive layers",
"capacity": "TB-class",
"bandwidth": "64 GB/s",
"latency": "170-400 ns",
},
"Tier 3 (NVMe SSD)": {
"purpose": "Persistent model storage, swap",
"capacity": "TB-class",
"bandwidth": "7 GB/s (PCIe 4.0 x4)",
"latency": "~10,000 ns",
},
}
# CXL fills the gap between Tier 1 and Tier 3
# As a KV cache overflow target, it's 9x faster than NVMe
# The correct split: weights in VRAM, stale KV cache entries in CXL
CXL's essence isn't "VRAM replacement" — it's "a new tier between VRAM and NVMe." If you evict stale KV cache tokens (the early portion of a 128K context) to CXL memory, VRAM only needs to hold the recent attention window. That's a viable architecture.
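Here's a sketch of what that split could look like. Everything in it is hypothetical (the class name, the window and block sizes, the byte payloads); it only illustrates the policy of keeping the recent attention window hot in VRAM and demoting older KV blocks to the CXL tier.
# Hypothetical two-tier KV cache: keep the most recent `window_tokens` worth of KV
# blocks in VRAM (Tier 1) and demote older blocks to CXL memory (Tier 2).
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, window_tokens: int, block_tokens: int = 256):
        self.window_blocks = window_tokens // block_tokens
        self.vram: OrderedDict[int, bytes] = OrderedDict()   # hot tier, 272-4800 GB/s
        self.cxl: dict[int, bytes] = {}                      # overflow tier, ~64 GB/s

    def append(self, block_id: int, kv_block: bytes) -> None:
        """Store a new KV block in VRAM, evicting the oldest block to CXL if needed."""
        self.vram[block_id] = kv_block
        while len(self.vram) > self.window_blocks:
            old_id, old_block = self.vram.popitem(last=False)  # oldest first
            self.cxl[old_id] = old_block                       # demote, don't discard

    def read(self, block_id: int) -> bytes:
        """Recent blocks come from VRAM; stale ones take the slow path over CXL."""
        return self.vram.get(block_id) or self.cxl[block_id]

cache = TieredKVCache(window_tokens=8_192)
for i in range(512):                      # 512 blocks × 256 tokens = a 128K context
    cache.append(i, b"\x00" * 1024)       # placeholder payload, not real tensors
print(len(cache.vram), "blocks hot in VRAM,", len(cache.cxl), "blocks parked in CXL")
# 32 blocks hot in VRAM, 480 blocks parked in CXL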
This tiering is orthogonal to techniques like optical memory readout (physically reducing KV cache transfer volume) or KV cache quantization (numerically reducing data volume). They compose.
Unified Memory: The Balance Trap
Apple Silicon's Unified Memory lets the CPU, GPU, and NPU share a single physical memory pool.
# Apple Unified Memory in practice
unified_memory = {
"M4 Max": {
"capacity": "128 GB",
"bandwidth": "546 GB/s (LPDDR5X)",
"bus_width": "512-bit",
"shared_by": "CPU (12 cores) + GPU (40 cores) + NPU (16 cores) + media engine",
},
"M4 (base)": {
"capacity": "16-32 GB",
"bandwidth": "120 GB/s",
"bus_width": "128-bit",
"note": "Less than half of RTX 4060 (272 GB/s)",
},
}
# Reality of LLM inference
unified_memory_llm = {
"M4 Max 128GB": {
"advantage": "70B Q4_K_M (40GB) fits entirely without GPU memory management",
"70B_speed": "546 / 40 = 13.7 t/s (theoretical ceiling) → measured 8-10 t/s",
"reason_for_gap": "Bandwidth shared with CPU/NPU/IO. GPU doesn't get exclusive access",
},
"M4 32GB": {
"32B_Q4_speed": "120 / 18 = 6.7 t/s (theoretical) → measured 4-5 t/s",
"note": "RTX 4060 has exclusive 272 GB/s GDDR6 → 10.8 t/s on the same model",
},
"bandwidth_contention": {
"cause": "CPU still accesses memory during GPU inference → they fight for bandwidth",
"OS_overhead": "macOS memory management, UI rendering consume bandwidth in the background",
"worst_case": "Running inference while Safari has a heavy page open → noticeable speed drop",
},
}
Unified Memory's advantage is eliminating GPU memory management overhead. No cudaMalloc/cudaMemcpy. The data is already there. Zero copy cost.
But bandwidth is a shared resource — you can't monopolize it. The RTX 4060's GDDR6 gives 272 GB/s effectively exclusive to the GPU. The base M4 splits 120 GB/s across the entire system.
# Bandwidth efficiency comparison
bandwidth_efficiency = {
"RTX 4060 (8GB GDDR6)": {
"total_bw": 272,
"gpu_share": "~95% (only DisplayPort output competing)",
"effective_for_llm": "~258 GB/s",
"7B_Q4_speed": "258 / 4.7 = 54.9 t/s (theoretical) → 32 t/s (58% effective)",
},
"M4 Max (128GB LPDDR5X)": {
"total_bw": 546,
"gpu_share": "Majority during inference (contention with CPU/NPU/IO)",
"effective_for_llm": "~400 GB/s (back-calculated: 8-10 t/s x 40GB = 320-400 GB/s)",
"70B_Q4_speed": "400 / 40 = 10 t/s → measured 8-10 t/s (roughly matches)",
},
"M4 base (16GB LPDDR5X)": {
"total_bw": 120,
"gpu_share": "Shared with entire system during inference",
"effective_for_llm": "~78 GB/s (back-calculated: 14-16 t/s x 4.7GB = 66-75 GB/s)",
"7B_Q4_speed": "78 / 4.7 = 16.6 t/s → measured 14-16 t/s",
},
}
# RTX 4060: Lower bandwidth but GPU-exclusive → fast on small models
# M4 Max: Higher bandwidth but shared → fits large models at the cost of per-bandwidth efficiency
# M4 base: Mediocre bandwidth and capacity → loses to RTX 4060 for LLM workloads
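The "effective_for_llm" figures above come from running the decode equation in reverse: measured tokens per second times the bytes read per token. A small sketch of that back-calculation, using the measured rates quoted above:
# Back-calculate effective memory bandwidth from measured decode speed:
# effective_GB_per_s ≈ measured_tokens_per_s × GB_read_per_token (the weight size).
def effective_bandwidth(measured_tps: float, weights_gb: float) -> float:
    return measured_tps * weights_gb

cases = [
    ("RTX 4060, 7B Q4", 32, 4.7, 272),    # measured t/s, weights GB, nominal BW
    ("M4 Max, 70B Q4",   9, 40.0, 546),
    ("M4 base, 7B Q4",  15, 4.7, 120),
]
for name, tps, gb, nominal in cases:
    eff = effective_bandwidth(tps, gb)
    print(f"{name:18} ~{eff:5.0f} GB/s effective of {nominal} GB/s nominal "
          f"({eff / nominal:.0%})")
# RTX 4060, 7B Q4    ~  150 GB/s effective of 272 GB/s nominal (55%)
# M4 Max, 70B Q4     ~  360 GB/s effective of 546 GB/s nominal (66%)
# M4 base, 7B Q4     ~   70 GB/s effective of 120 GB/s nominal (59%)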
Comparing the Three Approaches
Technology      Bandwidth     Capacity    Cost         Role in LLM Inference
─────────────────────────────────────────────────────────────────────────────
HBM3E           4,800 GB/s    141 GB      $10-15/GB    Read weights + KV at full speed
GDDR6           272 GB/s      8-24 GB     $2.5-4/GB    Run small models fast
CXL 3.1         64 GB/s       TB-class    $3-5/GB      KV cache overflow tier
Unified (Max)   546 GB/s      128 GB      Apple-set    Fit large models with zero-copy
NVMe SSD        7 GB/s        TB-class    $0.1/GB      Persistent model storage
# Optimal use case for each technology
optimal_scenarios = {
"HBM (H100/H200)": {
"scenario": "Batch inference, concurrent request processing",
"why": "Bandwidth is amortized across multiple requests, making per-request cost efficient",
"limitation": "For a single request, most of 700W is wasted",
},
"GDDR (RTX 4060/5060)": {
"scenario": "Personal use, single request, small-to-mid models",
"why": "Exclusive GPU bandwidth maximizes efficiency. 32 t/s at ~70W (0.46 t/s/W). Beats H100 single-request power efficiency for small models",
"limitation": "Capacity wall. 8GB means 7B is the ceiling",
},
"CXL": {
"scenario": "Ultra-long context inference (128K+), shared memory pools",
"why": "Solves VRAM exhaustion when KV cache balloons to tens of GB with long contexts",
"limitation": "1/75 bandwidth. Too slow for weight storage",
"timeline": "Server-side 2025-26, consumer 2028+",
},
"Unified Memory (Apple)": {
"scenario": "Running large models with minimal setup. Development and experimentation",
"why": "70B Q4 runs without any memory management. Ease of setup",
"limitation": "Shared bandwidth means lower speed efficiency vs exclusive GDDR. Hard to share with gaming workloads",
},
}
Practical Implications for 8GB VRAM Users
# Strategies for breaking through the memory wall with 8GB VRAM today
practical_8gb = {
"Layer 1: Quantization (immediate impact)": {
"method": "Q4_K_M quantization",
"effect": "7B model weights: 14GB → 4.7GB (3x capacity efficiency)",
"how": "Standard support in llama.cpp / Ollama",
},
"Layer 2: KV cache quantization (experimental)": {
"method": "--cache-type-k q4_0 --cache-type-v q8_0",
"effect": "KV cache at 1/3 of FP16 → enables longer contexts",
"how": "llama.cpp launch flags",
},
"Layer 3: CPU offload (bandwidth tradeoff)": {
"method": "--n-gpu-layers to partially load onto GPU",
"effect": "32B models run (slow, but they run)",
"speed": "10.8 t/s (32B on RTX 4060, optimal offload)",
"bandwidth_bottleneck": "CPU↔GPU via PCIe 4.0 x8 = 16 GB/s",
},
"Layer 4: CXL (future)": {
"method": "CXL memory modules",
"effect": "Add memory via PCIe → Tier 2 storage for KV cache",
"timeline": "Consumer availability 2028+",
"note": "Similar in principle to today's CPU offload (PCIe 16 GB/s), but CXL allows memory-semantic access (load/store, directly addressable by GPU)",
},
}
# What you can do today: combine Layers 1-3
# Q4 quantization + KV cache Q4 + optimal GPU offload = 32B model × 32K context on 8GB
# Future: CXL adds Layer 4, making 128K+ contexts realistic
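As a rough sanity check on that claim, here's an 8GB budget for a 32B model at 32K context. The model dimensions (64 layers, 8 KV heads, head dim 128) are assumed Qwen-32B-class values and the ~1GB runtime overhead is a guess, so treat the resulting layer split as an estimate, not a recipe.
# Rough VRAM budget: 32B-class model (assumed: 64 layers, 8 KV heads, head_dim 128),
# Q4_K_M weights ≈ 19 GB, KV cache quantized to ~1/3 of FP16, ~1 GB runtime overhead.
VRAM_GB     = 8.0
WEIGHTS_GB  = 19.0        # 32B at Q4_K_M (assumed)
LAYERS      = 64
OVERHEAD_GB = 1.0         # CUDA context, activations, scratch buffers (guess)

def kv_gb(tokens, layers=LAYERS, kv_heads=8, head_dim=128, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

kv_fp16 = kv_gb(32_768)                  # ~8.6 GB at FP16, alone it nearly fills VRAM
kv_q    = kv_fp16 / 3                    # ~2.9 GB with q4_0/q8_0 cache types
room_for_weights = VRAM_GB - OVERHEAD_GB - kv_q
gpu_layers = int(room_for_weights / (WEIGHTS_GB / LAYERS))
print(f"KV cache: {kv_fp16:.1f} GB fp16 -> {kv_q:.1f} GB quantized")
print(f"--n-gpu-layers ≈ {gpu_layers} of {LAYERS}; the rest stays in system RAM")
# KV cache: 8.6 GB fp16 -> 2.9 GB quantized
# --n-gpu-layers ≈ 13 of 64; the rest stays in system RAM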
Here's the key insight: the "memory expansion" CXL promises travels over fundamentally the same PCIe-class bus as today's CPU offload, so it sits under the same order of bandwidth ceiling (tens of GB/s, not hundreds). CXL's advantage is memory semantics (load/store access where the GPU can address memory directly), not a bandwidth breakthrough.
The Physics That Decides Memory's Future
Question: "Does adding more VRAM solve the problem?"
Answer: Only partially.
The bandwidth-capacity-cost triangle is governed by physics,
and no technology can claim all three vertices.
HBM chose bandwidth, sacrificing capacity and cost.
CXL chose capacity, sacrificing bandwidth.
Unified Memory chose balance, sacrificing exclusive bandwidth.
GDDR chose exclusive bandwidth, sacrificing capacity.
The optimal answer for LLM inference isn't "pick one technology" —
it's combining multiple technologies in a tiered hierarchy.
Best strategy for today's RTX 4060:
Weights → VRAM (Q4 quantization to fit 7-13B entirely)
KV cache → VRAM (Q4/Q8 quantization to save capacity)
Extra layers → RAM (CPU offload, PCIe bandwidth)
Persistent storage → NVMe SSD
Best strategy for future CXL-equipped consumer PCs:
Weights → VRAM (Q4 quantization)
Active KV → VRAM
Stale KV → CXL memory (64 GB/s is fast enough for this)
Persistent storage → NVMe SSD
The memory wall isn't something you break through —
it's something you route around with tiers.
References
- CXL Consortium — "Compute Express Link Specification 3.1" (2024)
- Samsung — "CMM-D: CXL Memory Module for Data Centers" (2024)
- SK hynix — HBM3E specifications, 12-Hi stack architecture
- NVIDIA H200 specifications — 141GB HBM3E, 4.8 TB/s
- Apple M4 Max specifications — 128GB Unified Memory, 546 GB/s
- "Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023) arXiv:2309.06180