A deep dive into memory fragmentation, paged memory management, and why PagedAttention can deliver up to 24× higher throughput than conventional KV cache implementations.
Every token you generate during LLM inference silently eats GPU memory. With traditional KV caching, a significant portion of that memory is wasted — never used, never reclaimed. vLLM’s PagedAttention changed that by borrowing a decades-old idea from operating systems. Here’s exactly how it works and why it matters.
Table of Contents
- What Is a KV Cache and Why Does It Exist?
- The Problem: Traditional KV Cache and Memory Fragmentation
- Inspiration from OS Virtual Memory — The vLLM Insight
- PagedAttention: How It Works
- Memory Fragmentation: Before vs After
- Throughput Gains: Numbers and Benchmarks
- Trade-offs and Limitations
- Who Should Care About This?
- Key Takeaways
Section 1 — What Is a KV Cache and Why Does It Exist?
What to cover:
- During autoregressive decoding, each new token attends to all previous tokens
- Without caching: recompute keys/values for all past tokens at every step → O(n²) compute cost
- With KV cache: store key/value tensors per layer per token in GPU memory → amortize the attention cost
- KV cache size formula: 2 × num_layers × num_heads × head_dim × seq_len × batch_size × dtype_bytes
Code snippet to include:
# Rough KV cache size estimation
num_layers = 32 # LLaMA-2 7B
num_heads = 32
head_dim = 128 # typically d_model / num_heads
seq_len = 2048
batch_size = 8
dtype_bytes = 2 # float16
kv_cache_bytes = (
2 * num_layers * num_heads * head_dim * seq_len * batch_size * dtype_bytes
)
print(f"KV cache: {kv_cache_bytes / 1e9:.2f} GB")
# Output: KV cache: 8.59 GB ← just for 8 sequences at 2K context
Key insight to land:
The KV cache is not a nice-to-have — it’s essential for serving LLMs at any reasonable speed. But the way it was traditionally allocated is deeply wasteful.
Section 2 — The Problem: Traditional KV Cache and Memory Fragmentation
What to cover:
Pre-allocation problem:
- Traditional frameworks pre-allocate a contiguous block of GPU memory per sequence based on the maximum possible sequence length
- A request with max_seq_len=2048 gets 2048 slots — even if it only generates 50 tokens
- Wasted memory = (max_seq_len - actual_len) × per_token_kv_size
Three types of fragmentation:
Real-world consequence:
- GPU shows 80% memory used → only 30–40% is actually holding useful KV data
- Batching becomes impossible because you can’t fit more sequences even though memory “exists”
- High latency under load → queuing, not computation, becomes your bottleneck
Diagram suggestion:
Section 3 — Inspiration from OS Virtual Memory: The vLLM Insight
What to cover:
- The vLLM team (Kwon et al., 2023) drew a direct parallel to how OS manages RAM via paging
- In OS paging: physical RAM is divided into fixed-size “frames”; processes get “pages” mapped to frames via a page table
- Physical memory can be non-contiguous; the virtual address space appears contiguous to the process
- Key OS insight: don’t allocate memory until you actually need it (demand paging)
The analogy table:
Quote worth referencing:
The vLLM paper describes PagedAttention as managing the KV cache “like virtual memory in an OS, with paging.” — Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention_, 2023._
Section 4 — PagedAttention: How It Works
What to cover:
Block-based KV cache:
- GPU memory is divided into fixed-size KV blocks (e.g., 16 tokens per block)
- Each block holds key/value tensors for exactly block_size tokens across all layers
- A block table maps each sequence to a list of physical block indices — like a page table
Allocation lifecycle:
- Request arrives → zero blocks allocated
- First block_size tokens generated → one block allocated
- Next block_size tokens → second block allocated (can be physically non-contiguous)
- Request completes → blocks freed immediately, returned to pool
Prompt sharing (copy-on-write):
- Multiple requests with identical prefixes (e.g., a system prompt) can share the same physical KV blocks
- On divergence (beam search branch, parallel sampling), blocks are copied — copy-on-write semantics
- Massive memory savings in high-traffic deployments with shared system prompts
Code snippet (simplified block table concept):
from dataclasses import dataclass, field
from typing import List, Optional
BLOCK_SIZE = 16 # tokens per block
@dataclass
class KVBlock:
block_id: int
token_count: int = 0
is_shared: bool = False
@dataclass
class SequenceState:
seq_id: int
block_table: List[int] = field(default_factory=list) # physical block IDs
total_tokens: int = 0
class PagedKVManager:
def __init__ (self, total_blocks: int):
self.free_blocks = list(range(total_blocks))
self.blocks = {i: KVBlock(block_id=i) for i in range(total_blocks)}
def allocate_block(self) -> Optional[int]:
if not self.free_blocks:
return None # OOM — trigger preemption
return self.free_blocks.pop()
def append_token(self, seq: SequenceState) -> bool:
"""Allocate a new block if the current one is full."""
needs_new_block = (
not seq.block_table or
self.blocks[seq.block_table[-1]].token_count == BLOCK_SIZE
)
if needs_new_block:
block_id = self.allocate_block()
if block_id is None:
return False # signal OOM
seq.block_table.append(block_id)
current_block = self.blocks[seq.block_table[-1]]
current_block.token_count += 1
seq.total_tokens += 1
return True
def free_sequence(self, seq: SequenceState):
for block_id in seq.block_table:
self.blocks[block_id].token_count = 0
self.free_blocks.append(block_id)
seq.block_table.clear()
Section 5 — Memory Fragmentation: Before vs After
What to cover:
Traditional system waste breakdown (typical):
- Internal fragmentation: ~20–30% (unused reserved slots within sequences)
- External fragmentation: ~10–15% (gaps between sequence allocations)
- Effective utilization: often 60–70% at best
PagedAttention waste profile:
- Internal fragmentation: only within the last block of each sequence → max block_size - 1 tokens wasted per sequence
- External fragmentation: near zero — all blocks are equal-sized, pool is homogeneous
- Effective utilization: typically 90–96%
Fragmentation comparison diagram:

Fragmentation comparision diagram
Section 6 — Throughput Gains: Numbers and Benchmarks
What to cover:
From the vLLM paper (Kwon et al., 2023):
- vLLM achieves 2–4× higher throughput vs HuggingFace Transformers on the same hardware
- vs Orca (a continuous batching baseline without PagedAttention): up to 1.7× higher throughput
- At high request rates with long sequences, gains can reach up to 24× over naive implementations
Why throughput improves:
- Higher GPU memory utilization → more sequences in a batch simultaneously
- Less queuing → GPU stays busy rather than stalling waiting for memory
- Prompt sharing → system-prompt KV computed once, shared across N requests
- Faster preemption → on OOM, swap individual blocks not entire sequences
Throughput table (representative, based on vLLM paper figures):
Note: Actual numbers vary by model size, GPU type, sequence length distribution, and request arrival pattern. Always benchmark on your workload.
Code snippet — measuring throughput with vLLM:
from vllm import LLM, SamplingParams
import time
llm = LLM(
model="meta-llama/Llama-2-7b-chat-hf",
gpu_memory_utilization=0.90, # PagedAttention uses 90% of GPU VRAM
max_num_batched_tokens=8192,
)
prompts = ["Explain quantum entanglement in simple terms."] * 100
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
start = time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.time() - start
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Throughput: {len(prompts)/elapsed:.2f} req/s")
print(f"Token throughput: {total_tokens/elapsed:.0f} tok/s")
Section 7 — Trade-offs and Limitations
What to cover (balanced, Medium readers appreciate honesty):
What you gain:
- Dramatically higher memory utilization
- Much better batch concurrency → lower latency at scale
- Prefix caching out of the box in modern vLLM versions
What you trade or accept:
- Block table overhead : small CPU-side memory + lookup cost per attention step
- Block size tuning : too small → excessive block table management; too large → internal fragmentation creeps back
- Attention kernel complexity : standard Flash Attention doesn’t natively support paged KV; vLLM ships custom CUDA kernels
- Not a magic fix for compute : PagedAttention solves the memory bottleneck — if you’re compute-bound (long context, large model), gains are smaller
- Prefill is unaffected : PagedAttention mainly helps the decode phase; prefill still processes the full prompt sequentially
Block size sensitivity:
# Approximate internal fragmentation vs block size
def max_waste_fraction(block_size, avg_seq_len):
# Worst case: last block is almost empty
max_wasted_per_seq = block_size - 1
return max_wasted_per_seq / avg_seq_len
for bs in [8, 16, 32, 64]:
waste = max_waste_fraction(bs, avg_seq_len=256)
print(f"block_size={bs:3d}: max waste = {waste*100:.1f}%")
# block_size= 8: max waste = 2.7%
# block_size= 16: max waste = 5.9%
# block_size= 32: max waste = 12.1%
# block_size= 64: max waste = 24.6%
vLLM defaults to block_size=16 as a practical sweet spot.
Section 8 — Who Should Care About This?
Target reader callouts:
If you’re serving LLMs in production:
- This is the single most impactful infrastructure change for throughput under concurrent load
- Switch to vLLM or an equivalent (SGLang, TensorRT-LLM with paged cache support)
If you’re building on top of HuggingFace Transformers:
- Great for research/single-user; not designed for high-concurrency serving
- The KV cache in model.generate() uses contiguous pre-allocation — fine for one request, fragmentation-heavy at scale
If you’re doing fine-tuning or evaluation:
- PagedAttention is a serving optimization; during training, you typically don’t cache KV across steps the same way
- Still useful to understand for when you deploy your fine-tuned model
If you’re reading vLLM/SGLang source code:
- BlockSpaceManager, BlockAllocator, and Scheduler are the core classes to study
- The block table is maintained per-sequence in the scheduler, not inside the model forward pass
Section 9 — Key Takeaways
- Traditional KV cache pre-allocates contiguous memory per sequence based on max length → 30–40% typical waste
- Memory fragmentation — internal, external, and reservation — is the root cause of low GPU utilization and poor batching in naive LLM serving
- PagedAttention borrows OS virtual memory paging : fixed-size blocks, on-demand allocation, block tables instead of contiguous buffers
- Result: 90–96% memory utilization , enabling more sequences per batch and dramatically higher throughput
- vLLM demonstrated 2–24× throughput gains depending on workload and baseline compared
- Key trade-offs : custom CUDA kernels required, block size tuning matters, compute-bound workloads see smaller benefits
- Prompt sharing (copy-on-write) is an underrated bonus: shared prefixes are computed once, referenced many times




Top comments (0)