Kotcherla Murali Krishna

Posted on Jun 9

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

#ai #tensorflow #llm #python

A deep dive into memory fragmentation, paged memory management, and why PagedAttention can deliver up to 24× higher throughput than conventional KV cache implementations.

Every token you generate during LLM inference silently eats GPU memory. With traditional KV caching, a significant portion of that memory is wasted — never used, never reclaimed. vLLM’s PagedAttention changed that by borrowing a decades-old idea from operating systems. Here’s exactly how it works and why it matters.

What Is a KV Cache and Why Does It Exist?
The Problem: Traditional KV Cache and Memory Fragmentation
Inspiration from OS Virtual Memory — The vLLM Insight
PagedAttention: How It Works
Memory Fragmentation: Before vs After
Throughput Gains: Numbers and Benchmarks
Trade-offs and Limitations
Who Should Care About This?
Key Takeaways

Section 1 — What Is a KV Cache and Why Does It Exist?

What to cover:

During autoregressive decoding, each new token attends to all previous tokens
Without caching: recompute keys/values for all past tokens at every step → O(n²) compute cost
With KV cache: store key/value tensors per layer per token in GPU memory → amortize the attention cost
KV cache size formula: 2 × num_layers × num_heads × head_dim × seq_len × batch_size × dtype_bytes

Code snippet to include:

# Rough KV cache size estimation
num_layers = 32 # LLaMA-2 7B
num_heads = 32
head_dim = 128 # typically d_model / num_heads
seq_len = 2048
batch_size = 8
dtype_bytes = 2 # float16

kv_cache_bytes = (
    2 * num_layers * num_heads * head_dim * seq_len * batch_size * dtype_bytes
)
print(f"KV cache: {kv_cache_bytes / 1e9:.2f} GB")
# Output: KV cache: 8.59 GB ← just for 8 sequences at 2K context

Key insight to land:

The KV cache is not a nice-to-have — it’s essential for serving LLMs at any reasonable speed. But the way it was traditionally allocated is deeply wasteful.

Section 2 — The Problem: Traditional KV Cache and Memory Fragmentation

What to cover:

Pre-allocation problem:

Traditional frameworks pre-allocate a contiguous block of GPU memory per sequence based on the maximum possible sequence length
A request with max_seq_len=2048 gets 2048 slots — even if it only generates 50 tokens
Wasted memory = (max_seq_len - actual_len) × per_token_kv_size

Three types of fragmentation:

Three types of fragmentation

Real-world consequence:

GPU shows 80% memory used → only 30–40% is actually holding useful KV data
Batching becomes impossible because you can’t fit more sequences even though memory “exists”
High latency under load → queuing, not computation, becomes your bottleneck

Diagram suggestion:

Traditional KV Cache

Section 3 — Inspiration from OS Virtual Memory: The vLLM Insight

What to cover:

The vLLM team (Kwon et al., 2023) drew a direct parallel to how OS manages RAM via paging
In OS paging: physical RAM is divided into fixed-size “frames”; processes get “pages” mapped to frames via a page table
Physical memory can be non-contiguous; the virtual address space appears contiguous to the process
Key OS insight: don’t allocate memory until you actually need it (demand paging)

The analogy table:

The analogy table

Quote worth referencing:

The vLLM paper describes PagedAttention as managing the KV cache “like virtual memory in an OS, with paging.” — Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention_, 2023._

Section 4 — PagedAttention: How It Works

What to cover:

Block-based KV cache:

GPU memory is divided into fixed-size KV blocks (e.g., 16 tokens per block)
Each block holds key/value tensors for exactly block_size tokens across all layers
A block table maps each sequence to a list of physical block indices — like a page table

Allocation lifecycle:

Request arrives → zero blocks allocated
First block_size tokens generated → one block allocated
Next block_size tokens → second block allocated (can be physically non-contiguous)
Request completes → blocks freed immediately, returned to pool

Prompt sharing (copy-on-write):

Multiple requests with identical prefixes (e.g., a system prompt) can share the same physical KV blocks
On divergence (beam search branch, parallel sampling), blocks are copied — copy-on-write semantics
Massive memory savings in high-traffic deployments with shared system prompts

Code snippet (simplified block table concept):

from dataclasses import dataclass, field
from typing import List, Optional

BLOCK_SIZE = 16 # tokens per block

@dataclass
class KVBlock:
    block_id: int
    token_count: int = 0
    is_shared: bool = False

@dataclass
class SequenceState:
    seq_id: int
    block_table: List[int] = field(default_factory=list) # physical block IDs
    total_tokens: int = 0

class PagedKVManager:
    def __init__ (self, total_blocks: int):
        self.free_blocks = list(range(total_blocks))
        self.blocks = {i: KVBlock(block_id=i) for i in range(total_blocks)}

    def allocate_block(self) -> Optional[int]:
        if not self.free_blocks:
            return None # OOM — trigger preemption
        return self.free_blocks.pop()

    def append_token(self, seq: SequenceState) -> bool:
        """Allocate a new block if the current one is full."""
        needs_new_block = (
            not seq.block_table or
            self.blocks[seq.block_table[-1]].token_count == BLOCK_SIZE
        )
        if needs_new_block:
            block_id = self.allocate_block()
            if block_id is None:
                return False # signal OOM
            seq.block_table.append(block_id)

        current_block = self.blocks[seq.block_table[-1]]
        current_block.token_count += 1
        seq.total_tokens += 1
        return True

    def free_sequence(self, seq: SequenceState):
        for block_id in seq.block_table:
            self.blocks[block_id].token_count = 0
            self.free_blocks.append(block_id)
        seq.block_table.clear()

Section 5 — Memory Fragmentation: Before vs After

What to cover:

Traditional system waste breakdown (typical):

Internal fragmentation: ~20–30% (unused reserved slots within sequences)
External fragmentation: ~10–15% (gaps between sequence allocations)
Effective utilization: often 60–70% at best

PagedAttention waste profile:

Internal fragmentation: only within the last block of each sequence → max block_size - 1 tokens wasted per sequence
External fragmentation: near zero — all blocks are equal-sized, pool is homogeneous
Effective utilization: typically 90–96%

Fragmentation comparison diagram:

Fragmentation comparision diagram

Section 6 — Throughput Gains: Numbers and Benchmarks

What to cover:

From the vLLM paper (Kwon et al., 2023):

vLLM achieves 2–4× higher throughput vs HuggingFace Transformers on the same hardware
vs Orca (a continuous batching baseline without PagedAttention): up to 1.7× higher throughput
At high request rates with long sequences, gains can reach up to 24× over naive implementations

Why throughput improves:

Higher GPU memory utilization → more sequences in a batch simultaneously
Less queuing → GPU stays busy rather than stalling waiting for memory
Prompt sharing → system-prompt KV computed once, shared across N requests
Faster preemption → on OOM, swap individual blocks not entire sequences

Throughput table (representative, based on vLLM paper figures):

Throughput table

Note: Actual numbers vary by model size, GPU type, sequence length distribution, and request arrival pattern. Always benchmark on your workload.

Code snippet — measuring throughput with vLLM:

from vllm import LLM, SamplingParams
import time

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    gpu_memory_utilization=0.90, # PagedAttention uses 90% of GPU VRAM
    max_num_batched_tokens=8192,
)

prompts = ["Explain quantum entanglement in simple terms."] * 100
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

start = time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.time() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Throughput: {len(prompts)/elapsed:.2f} req/s")
print(f"Token throughput: {total_tokens/elapsed:.0f} tok/s")

Section 7 — Trade-offs and Limitations

What to cover (balanced, Medium readers appreciate honesty):

What you gain:

Dramatically higher memory utilization
Much better batch concurrency → lower latency at scale
Prefix caching out of the box in modern vLLM versions

What you trade or accept:

Block table overhead : small CPU-side memory + lookup cost per attention step
Block size tuning : too small → excessive block table management; too large → internal fragmentation creeps back
Attention kernel complexity : standard Flash Attention doesn’t natively support paged KV; vLLM ships custom CUDA kernels
Not a magic fix for compute : PagedAttention solves the memory bottleneck — if you’re compute-bound (long context, large model), gains are smaller
Prefill is unaffected : PagedAttention mainly helps the decode phase; prefill still processes the full prompt sequentially

Block size sensitivity:

# Approximate internal fragmentation vs block size
def max_waste_fraction(block_size, avg_seq_len):
    # Worst case: last block is almost empty
    max_wasted_per_seq = block_size - 1
    return max_wasted_per_seq / avg_seq_len

for bs in [8, 16, 32, 64]:
    waste = max_waste_fraction(bs, avg_seq_len=256)
    print(f"block_size={bs:3d}: max waste = {waste*100:.1f}%")

# block_size= 8: max waste = 2.7%
# block_size= 16: max waste = 5.9%
# block_size= 32: max waste = 12.1%
# block_size= 64: max waste = 24.6%

vLLM defaults to block_size=16 as a practical sweet spot.

Section 8 — Who Should Care About This?

Target reader callouts:

If you’re serving LLMs in production:

This is the single most impactful infrastructure change for throughput under concurrent load
Switch to vLLM or an equivalent (SGLang, TensorRT-LLM with paged cache support)

If you’re building on top of HuggingFace Transformers:

Great for research/single-user; not designed for high-concurrency serving
The KV cache in model.generate() uses contiguous pre-allocation — fine for one request, fragmentation-heavy at scale

If you’re doing fine-tuning or evaluation:

PagedAttention is a serving optimization; during training, you typically don’t cache KV across steps the same way
Still useful to understand for when you deploy your fine-tuned model

If you’re reading vLLM/SGLang source code:

BlockSpaceManager, BlockAllocator, and Scheduler are the core classes to study
The block table is maintained per-sequence in the scheduler, not inside the model forward pass

Section 9 — Key Takeaways

Traditional KV cache pre-allocates contiguous memory per sequence based on max length → 30–40% typical waste
Memory fragmentation — internal, external, and reservation — is the root cause of low GPU utilization and poor batching in naive LLM serving
PagedAttention borrows OS virtual memory paging : fixed-size blocks, on-demand allocation, block tables instead of contiguous buffers
Result: 90–96% memory utilization , enabling more sequences per batch and dramatically higher throughput
vLLM demonstrated 2–24× throughput gains depending on workload and baseline compared
Key trade-offs : custom CUDA kernels required, block size tuning matters, compute-bound workloads see smaller benefits
Prompt sharing (copy-on-write) is an underrated bonus: shared prefixes are computed once, referenced many times

DEV Community

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

Table of Contents

Section 1 — What Is a KV Cache and Why Does It Exist?

What to cover:

Code snippet to include:

Key insight to land:

Section 2 — The Problem: Traditional KV Cache and Memory Fragmentation

What to cover:

Three types of fragmentation:

Diagram suggestion:

Section 3 — Inspiration from OS Virtual Memory: The vLLM Insight

What to cover:

The analogy table:

Quote worth referencing:

Section 4 — PagedAttention: How It Works

What to cover:

Section 5 — Memory Fragmentation: Before vs After

What to cover:

Fragmentation comparison diagram:

Section 6 — Throughput Gains: Numbers and Benchmarks

What to cover:

Throughput table (representative, based on vLLM paper figures):

Code snippet — measuring throughput with vLLM:

Section 7 — Trade-offs and Limitations

What to cover (balanced, Medium readers appreciate honesty):

Block size sensitivity:

Section 8 — Who Should Care About This?

Target reader callouts:

Section 9 — Key Takeaways

Top comments (0)