Sparse KV Caches Cut Attention Scaling

#ai #machinelearning #abotwrotethis

Sparse key‑value caches turn the quadratic cost of softmax attention into one that grows near‑linearly with sequence length. Because each query attends only to a small top‑k subset of blockwise KV memories, the per‑query attention work stops scaling with the full context. That shifts the scalability curve for ultra‑long sequences and makes context windows of hundreds of thousands of tokens practical on a single GPU.
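To make the idea concrete, here is a minimal pure-Python sketch of top-k block selection for a single query. It is not the MSA kernel: the function name `block_topk_attention`, the mean-pooled key used to score each block, and the default `block_size` and `top_k` values are all assumptions chosen for illustration.

```python
import math


def block_topk_attention(q, keys, values, block_size=4, top_k=2):
    """Attend one query to only the top_k highest-scoring KV blocks.

    q: list of d floats. keys / values: lists of N vectors (N >= 1) that are
    already causally visible to the query. Scoring a block by its mean key
    is an illustrative assumption, not the selector from the paper.
    """
    d = len(q)
    n = len(keys)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # 1. Split the cache into contiguous blocks and score each block
    #    with one cheap summary vector (here: the mean of its keys).
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    def block_score(idx):
        mean_key = [sum(keys[i][j] for i in idx) / len(idx) for j in range(d)]
        return dot(q, mean_key)

    scores = [block_score(b) for b in blocks]

    # 2. Keep the top_k blocks. Ranking raw scores needs no exponentials.
    chosen = sorted(range(len(blocks)), key=lambda b: scores[b], reverse=True)[:top_k]
    selected = [i for b in sorted(chosen) for i in blocks[b]]

    # 3. Ordinary softmax attention, restricted to the selected tokens.
    logits = [dot(q, keys[i]) / math.sqrt(d) for i in selected]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    out = [
        sum(w * values[i][j] for w, i in zip(weights, selected)) / z
        for j in range(len(values[0]))
    ]
    return out, selected
```

In this sketch, step 3 touches at most `top_k * block_size` tokens however long the cache is, while step 1 still reads one summary per block.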

Before this work, the dominant recipe was dense attention, whose O(N²) memory and FLOP budget caps context windows at a few thousand tokens. Grouped Query Attention (GQA) improved cache reuse but still required each group to scan all KV blocks, leaving the quadratic term intact. Neither approach could keep per‑query compute constant as the window grew, forcing a trade‑off between length and latency.
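A rough back-of-the-envelope cost model shows why the dense term dominates at long contexts. The model below counts query-key score computations per query and assumes a selector that scores one summary per block before attending to `top_k` blocks. The function names and the example block size and `top_k` are hypothetical, not the paper's configuration, so the resulting ratio should not be compared with the reported figures.

```python
def dense_scores_per_query(n_tokens):
    # Dense attention compares the query with every cached key.
    return n_tokens


def sparse_scores_per_query(n_tokens, block_size, top_k):
    # Assumed selector: one score per block summary, then full attention
    # over the tokens of the top_k selected blocks.
    n_blocks = -(-n_tokens // block_size)  # ceiling division
    attended = min(n_tokens, top_k * block_size)
    return n_blocks + attended


if __name__ == "__main__":
    for n in (8_000, 128_000, 1_000_000):
        dense = dense_scores_per_query(n)
        sparse = sparse_scores_per_query(n, block_size=64, top_k=16)
        print(f"N={n:>9,}  dense={dense:>9,}  sparse={sparse:>7,}  ratio={dense / sparse:.1f}x")
```

Under these assumptions the attended-token term stays fixed at `top_k * block_size`, and only the much smaller block-scoring term grows with the context.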

MSA cuts per‑token attention compute by 28.4× at a one‑million‑token context. The authors report, “On a 109B‑parameter model with native multimodal training, MSA performs on par with GQA while reducing per‑token attention compute by 28.4× at 1M context” [1]. The reduction deepens with length: “As shown in Figure 4, MSA reduces per‑token attention FLOPs substantially relative to GQA in our setting, with the reduction increasing at longer contexts” [1].

KV memory usage drops by up to 50 % and perplexity remains indistinguishable from the dense baseline. The README notes that the sparse branch “reduces KV memory by up to 50 % while preserving perplexity,” confirming that the savings are not paid for with accuracy loss [1]. Halving the KV footprint directly translates into larger windows on the same hardware.
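The link between KV footprint and window size is simple arithmetic. The sketch below uses the standard per-token KV cache size (one key and one value vector per KV head per layer) with a made-up model shape and memory budget. None of these numbers come from the paper, and the sketch says nothing about how the 50 % reduction is achieved.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2: one key vector and one value vector per head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_context_tokens(budget_bytes, bytes_per_token):
    # How many tokens of KV cache fit in the given memory budget.
    return budget_bytes // bytes_per_token


if __name__ == "__main__":
    # Hypothetical shape: 48 layers, 8 KV heads, head_dim 128, fp16 cache.
    per_token = kv_bytes_per_token(n_layers=48, n_kv_heads=8, head_dim=128)
    budget = 40 * 1024**3  # assume 40 GiB of GPU memory left for the cache

    full = max_context_tokens(budget, per_token)
    halved = max_context_tokens(budget, per_token // 2)
    print(f"{per_token} bytes/token -> {full:,} tokens")
    print(f"{per_token // 2} bytes/token -> {halved:,} tokens")
```

With a fixed budget, halving the bytes stored per token roughly doubles the number of tokens that fit.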

Wall‑clock speedups are equally dramatic: prefill runs 14.2× faster and decoding 7.6× faster on an H800 GPU. The paper’s benchmark suite shows “14.2x prefill and 7.6x decoding wall‑clock speedups” when the co‑designed kernel is used [1]. These gains come from the exp‑free Top‑k selector and block‑granular tensor‑core utilization.

The evidence comes from a single 109B‑parameter multimodal model and from kernels that are tightly coupled to MiniMax’s codebase, so portability to other model families or hardware generations is not yet proven. Moreover, blockwise sparsity assumes that most relevant tokens lie within a fixed causal horizon; tasks that need global attention may still suffer. The paper itself cautions that “the selected blocks contain at most causally visible tokens, the per‑query attention cost is reduced … which is fixed as the sequence length increases” [1], hinting at potential blind spots for non‑causal patterns.

If these sparsity tricks hold across model sizes, developers could double or triple context windows on commodity GPUs without adding memory or compute. Benchmarks that previously capped at 8k tokens should be rerun with MSA enabled, and production pipelines could then raise their max‑length defaults, unlocking new use cases such as repository‑scale code analysis and persistent conversational memory.

References

[1] MiniMax Sparse Attention
