PagedAttention: How GPUs handle memory

#ai #python #machinelearning #gpu

Imagine you're running a high-end restaurant. When the first few guests arrive, service is lightning fast. But as the room fills up and guests start ordering complex, multi-course meals, the kitchen starts to lag.

The problem isn't that the chefs aren't skilled—it's that the counter space is cluttered. To keep track of a guest's order, the chef reserves a massive, dedicated tray for every single table, regardless of whether they ordered a 12-course tasting menu or just a glass of water. Most of those trays are empty, but they're taking up all the room in the kitchen, preventing new orders from being started.

In LLM systems, this "clutter" manifests as a spike in Time to First Token (TTFT). Now imagine it for a product with millions of users, this latency is a deal-breaker.

To understand the fix, we have to look at how LLMs actually "think" during inference.

The Two Phases of Response
When you hit enter on a long prompt, the system enters the Prefill Phase. The model processes your entire input to build a mathematical representation of your request. This is compute-bound; it's all about raw processing power.

Then comes the Decoding Phase. Because LLMs are autoregressive—generating one token at a time—the model must constantly look back at everything it has already processed to decide what comes next. This is memory-bound.

The KV Cache Problem
To avoid recalculating the entire conversation for every single new word, we use a KV Cache. It stores the "keys" and "values" of previous tokens in the GPU's VRAM.

Traditionally, the system reserves a contiguous block of VRAM based on the model's maximum context length. If the limit is 2048 tokens, it reserves space for 2048, even if you only typed "Hello."

This leads to Internal Fragmentation (wasted space inside the reserved block) and External Fragmentation (no single contiguous block large enough for a new request), severely limiting how many users a GPU can serve simultaneously.

Enter PagedAttention
Think of PagedAttention like a library archive. Instead of giving every researcher a massive, empty 10-volume set of binders and telling them to fill it linearly, the librarian gives them a "Table of Contents." As the researcher finds more information, the librarian gives them single, standardized index cards (pages) and stores them wherever there is a free slot in the cabinet. The researcher doesn't care where the cards are; they just use the Table of Contents to find them.

Why "Paged Attention" and not "Paged Allocation"?
While the mechanism is indeed about memory allocation, the reason it's called Paged Attention is that it fundamentally changes how the Attention mechanism accesses data.

In a standard Transformer, the attention mechanism expects the KV cache to be a continuous tensor (a long, unbroken strip of data). PagedAttention modifies the actual attention kernel to allow it to "hop" across non-contiguous memory blocks. It isn't just moving data around; it's changing how the model attends to its memory.

In high-scale systems, the bottleneck is rarely the algorithm itself, but how the algorithm manages the underlying hardware resources. By decoupling logical memory from physical storage, we transform a "demo" into a production-grade service.

DEV Community

PagedAttention: How GPUs handle memory

Top comments (0)