Why Memory Fragmentation Kills LLM Serving Throughput
Here's a number that should make you uncomfortable: existing LLM serving systems waste 60-80% of their KV cache memory to fragmentation and over-reservation, leaving only roughly 20-40% of it holding actual token states. That's not a typo. When you're paying $3/hour for an A100, a meaningful slice of that bill goes to memory blocks that are allocated but can never be used.
The PagedAttention paper from Kwon et al. (SOSP 2023) tackles this directly. The headline claim, up to 24x throughput over HuggingFace Transformers, sounds like marketing hype until you understand what's actually happening under the hood.
The KV Cache Problem Nobody Talks About
When a Transformer generates tokens, it needs to store the Key and Value vectors from all previous positions. For a 13B parameter model like LLaMA-13B, each token requires about 800KB of KV cache storage. A single sequence with 2048 tokens needs 1.6GB just for KV cache—and that's per sequence.
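The arithmetic behind those numbers is worth seeing explicitly. Here's a back-of-the-envelope sketch, assuming typical 13B-class hyperparameters (40 layers, 5120 hidden size, fp16 weights); the exact figures depend on the model config:

```python
# Back-of-the-envelope KV cache sizing for a LLaMA-13B-class model.
# Assumed hyperparameters (not from the article): 40 layers, hidden
# size 5120, fp16 activations (2 bytes per element).

def kv_cache_bytes_per_token(num_layers=40, hidden_size=5120, dtype_bytes=2):
    # Each layer stores one Key vector and one Value vector per token,
    # hence the factor of 2.
    return 2 * num_layers * hidden_size * dtype_bytes

per_token = kv_cache_bytes_per_token()
print(f"{per_token / 1024:.0f} KB per token")                # 800 KB
print(f"{2048 * per_token / 1024**3:.2f} GiB per 2048-token sequence")
```

Run it and you recover the article's figures: 800 KB per token, about 1.6 GB for a single full-length sequence.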