pueding

Posted on Jun 12 • Originally published at learnaivisually.com

MiniMax M3 Ships Open-Weight 1M Context: MiniMax Sparse Attention (MSA)

#ai #llm #machinelearning #tutorial

What: The MiniMax M3 release — an open-weight model with a 1M-token context and 59% on SWE-Bench Pro — is built on MiniMax Sparse Attention (MSA), a block-sparse attention that gathers only the slices of the cached past each token actually needs.

Why: At a million tokens, ordinary attention's cost grows with the square of the length. MSA reportedly cuts per-token compute about 20× and delivers >9× faster prefill and >15× faster decode — the kind of serving efficiency that helps a 1M-context model ship with open weights.

vs prior: Where dense (full) attention compares every token against every earlier token, MSA partitions the KV cache into blocks and selects only the relevant ones — and its authors say it partitions more precisely than earlier sparse schemes like DSA or MoBA, while matching full attention on the vast majority of capabilities.

Think of it as a librarian who fetches only the few relevant shelves

                       ONE QUERY
                           │
              ┌────────────┴────────────┐
              │                         │
      ┌───────▼───────┐         ┌───────▼───────┐
      │ DENSE (full)  │         │  MSA (MiniMax)│
      │ attention     │         │  block gather │
      └───────┬───────┘         └───────┬───────┘
              │                         │
     re-reads every book        glances at the labels,
     in the whole library       pulls a few shelves
     for every question         that bear on the query
              │                         │
              ▼                         ▼
       ✗ cost grows with         ✓ ~20x less compute
         the square of             at 1M tokens —
         the length                same answer

query token = a reader asking one question
KV cache = the whole library of everything read so far
KV block = one labeled shelf of related notes
dense attention = re-reading every book for every question
MSA block gather = pulling only the handful of shelves that matter

Quick glossary

MSA (MiniMax Sparse Attention) — MiniMax M3's attention mechanism. Instead of every token attending to every earlier token, it cuts the cached past into blocks, scores which blocks matter for each query, and computes attention over only the selected few.

KV cache — The stored Key and Value vectors for every token already processed, so the model never recomputes the past. It grows with context length — at 1M tokens it is enormous. Background: KV Cache → Memory Cost.

Dense (full) attention — The standard mechanism: each query compares against all earlier keys, so the work scales with the square of the sequence length (O(n²)). See Attention → Computing Attention Scores.

Block-sparse attention — Skipping most of the attention matrix on purpose. The keys are grouped into contiguous blocks; a lightweight selector keeps only the blocks a query needs and ignores the rest, so the model computes far fewer comparisons without switching to a different model class.

KV-outer gather Q — MiniMax's name for MSA's memory access pattern: for each query (Q), the engine gathers the selected outer KV blocks from cache before computing attention. It is a gather (indexed, non-contiguous) access pattern, not a dense sweep.

Prefill vs decode — Prefill reads the whole prompt in parallel; decode emits one token at a time. MiniMax reports separate speedups for each (>9× prefill, >15× decode) because they stress the hardware differently.

DSA / MoBA — Earlier block-sparse attention schemes (DeepSeek's sparse attention and Mixture-of-Block-Attention). MiniMax says MSA partitions the KV cache more precisely than both, keeping quality closer to full attention.

SWE-Bench Pro — A hard software-engineering benchmark: the model must resolve real GitHub issues end to end. M3 reportedly scores 59%, putting an open-weight model in frontier coding territory.

The news. On June 1, 2026, MiniMax released M3, an open-weight model that pairs frontier-level coding (59% on SWE-Bench Pro), a 1M-token context window, and native multimodality. The headline architecture change is MiniMax Sparse Attention (MSA), a block-sparse attention mechanism that the team reports cuts per-token compute about 20× at one million tokens.

Picture the metaphor for a moment. A reader walks into a vast library — every note the model has ever taken sits on the shelves — and asks one question. The lazy approach is to re-read every book in the building before answering. That always works, but the effort grows brutally: double the library and you roughly quadruple the reading, because each new question also has to consider every new book. A good librarian doesn't do that. The notes are filed onto labeled shelves, the librarian glances at the labels, and pulls only the handful of shelves that actually bear on the question. Same answer, a fraction of the walking.

That is exactly the trade dense attention makes — and exactly the one MSA refuses. In a standard transformer, every token has to compare itself against every earlier token, so the attention work scales with the square of the sequence length. At a few thousand tokens nobody notices; at a million it dominates everything else the model does.
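To make the "square of the length" claim concrete, here is a minimal sketch that counts query-key comparisons for causal dense attention. It assumes the usual convention that each token also attends to itself; the function name is made up for this example.

```python
def dense_causal_comparisons(n_tokens: int) -> int:
    """Query-key comparisons for causal dense attention over n_tokens.

    Token i (1-indexed) compares against itself and the i - 1 earlier
    tokens, so the total is 1 + 2 + ... + n = n * (n + 1) / 2.
    """
    return n_tokens * (n_tokens + 1) // 2


if __name__ == "__main__":
    for n in (4_000, 8_000, 1_000_000):
        print(f"{n:>9,} tokens -> {dense_causal_comparisons(n):,} comparisons")
```

Doubling the length from 4,000 to 8,000 tokens multiplies the count by almost exactly four, which is the librarian's "double the library, quadruple the reading" in numbers.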

MiniMax Sparse Attention replaces the full sweep with a gather. It cuts the cached past into blocks — think of each block as one labeled shelf — scores which blocks are relevant to the current query, and computes attention over only the selected blocks. MiniMax calls the resulting memory pattern a "KV-outer gather Q": for each query, the engine gathers the chosen KV blocks instead of streaming the whole cache. The team reports this partitions the cache more precisely than earlier block-sparse schemes like DSA or MoBA, which is why M3 holds quality — it matches full attention on the vast majority of capabilities while skipping most of the comparisons.
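The partition, score, gather steps described above can be sketched for a single query in plain Python. This is a toy illustration, not MiniMax's implementation: the selector here scores each block by the dot product of the query with the block's mean key, which is an assumption made for the example, and `block_sparse_attend`, `block_size`, and `top_k` are hypothetical names.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Plain softmax attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def block_sparse_attend(query, keys, values, block_size, top_k):
    """Attend over only the top_k highest-scoring blocks of the cache.

    Assumption: a block's score is the dot product of the query with the
    mean of the block's keys. This is a stand-in selector for illustration.
    """
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(query, mean_key), start))
    # Keep the top_k blocks, then restore cache order before gathering.
    chosen = sorted(start for _, start in sorted(scored, reverse=True)[:top_k])
    idx = [i for start in chosen
           for i in range(start, min(start + block_size, len(keys)))]
    gathered_keys = [keys[i] for i in idx]
    gathered_values = [values[i] for i in idx]
    return attend(query, gathered_keys, gathered_values), idx
```

With `top_k` equal to the number of blocks, the function reduces to dense attention over the whole cache; shrinking `top_k` is what trades a little exactness for far fewer comparisons.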

Where the ~20× actually comes from

Hold the setup fixed and walk the arithmetic. Picture a query at the one-millionth token. Dense attention compares it against all 1,000,000 cached keys. MSA first groups those keys into blocks — say 128 keys each, so roughly 7,800 blocks (illustrative) — scores them, and keeps only the ones that matter. If the selector keeps about 5% of blocks (illustrative), the query now touches ~50,000 keys instead of 1,000,000 — a 20× drop in per-token comparisons, which lines up with the ~20× per-token compute cut MiniMax reports at 1M context. The savings show up twice in serving: >9× faster prefill (the prompt is read in one parallel pass) and >15× faster decode (each new token now gathers a few blocks instead of the whole cache).
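The same arithmetic as a few lines of Python, using the illustrative numbers from the paragraph above (128-key blocks, 5% of blocks kept). Neither value is a published MSA setting, and `keys_touched` is a name invented for this sketch.

```python
import math


def keys_touched(context_len, block_size, keep_fraction):
    """Return (total blocks, kept blocks, keys read) for one query."""
    n_blocks = math.ceil(context_len / block_size)
    kept_blocks = round(n_blocks * keep_fraction)
    kept_keys = min(kept_blocks * block_size, context_len)
    return n_blocks, kept_blocks, kept_keys


context_len = 1_000_000
n_blocks, kept_blocks, kept_keys = keys_touched(context_len, 128, 0.05)
reduction = context_len / kept_keys

print(f"{n_blocks:,} blocks, {kept_blocks} kept, {kept_keys:,} keys read")
print(f"about {reduction:.1f}x fewer comparisons than dense")
```

Rounding to whole blocks gives 7,813 blocks, 391 kept, and 50,048 keys read, so the reduction comes out at roughly 20×, in line with the back-of-envelope figure.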

How the attention variants compare

Approach	What each query looks at	Cost vs context length	Note
Dense (full) attention	every earlier token	grows with the square (n²)	the baseline; exact but expensive
Sliding-window	a fixed nearby window	linear, but drops far context	cheap; loses long-range recall
DSA / MoBA (block selection)	top-scored blocks	sub-quadratic	prior block-sparse schemes
MSA (MiniMax)	top-scored KV blocks, gathered	~20× less per-token compute at 1M (MiniMax; setup-dependent)	"partitions more precisely than DSA / MoBA"
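One way to read the table is to ask how many cached keys a single query reads under each approach. The sketch below assumes a 4,096-token window and a fixed budget of 391 blocks of 128 keys; all three numbers are illustrative choices for this example, not settings from MiniMax, DSA, or MoBA.

```python
def keys_per_query(position, approach, window=4_096, block_size=128, top_blocks=391):
    """Cached keys one query reads at a given position in the sequence."""
    if approach == "dense":
        return position
    if approach == "sliding_window":
        return min(position, window)
    if approach == "block_select":
        # Assumes a fixed block budget per query, whatever the context length.
        return min(position, top_blocks * block_size)
    raise ValueError(f"unknown approach: {approach}")


if __name__ == "__main__":
    approaches = ("dense", "sliding_window", "block_select")
    for position in (10_000, 100_000, 1_000_000):
        row = {a: keys_per_query(position, a) for a in approaches}
        print(f"{position:>9,}: {row}")
```

Under these assumptions the sliding window is cheapest but never sees anything beyond its window, while block selection reads about as many keys as dense attention at short lengths and only pulls ahead once the context outgrows its block budget.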

A caveat worth keeping: the ~20× compute, >9× prefill, and >15× decode figures are MiniMax's own numbers at the 1M-context operating point, and sparse-attention speedups are setup-dependent — block size, how many blocks the selector keeps, sequence length, and the hardware all move them. The qualitative win (gather a few shelves, not the whole library) is the durable lesson; the exact multiplier is a reported headline, not a guarantee at every length.
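A rough way to see that setup dependence is to sweep block size and kept fraction while also charging for the selector. The sketch assumes the selector spends one comparison per block (for example against a per-block summary key); that cost model is a simplification chosen for this example, it counts comparisons rather than wall-clock time, and it says nothing about how MSA's selector actually works.

```python
import math


def comparison_reduction(context_len, block_size, keep_fraction):
    """Dense comparisons divided by sparse comparisons for one query.

    Assumption: scoring costs one comparison per block, paid before the
    kept blocks are gathered and attended over.
    """
    n_blocks = math.ceil(context_len / block_size)
    kept_keys = min(round(n_blocks * keep_fraction) * block_size, context_len)
    return context_len / (n_blocks + kept_keys)


if __name__ == "__main__":
    for block_size in (64, 128, 256):
        for keep_fraction in (0.02, 0.05, 0.10):
            r = comparison_reduction(1_000_000, block_size, keep_fraction)
            print(f"block={block_size:>3} keep={keep_fraction:.0%} -> {r:.1f}x")
```

Even in this toy model the multiplier moves a lot: smaller blocks raise the scoring overhead, and keeping fewer blocks raises the reduction, which is why a single headline number only describes one operating point.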

Goes deeper in: LLM Internals → Self-Attention → Computing Attention Scores

Related explainers

IO-optimal approximate attention — near-linear IO — a different route to sub-quadratic attention: cut memory traffic rather than select blocks
Tangram — per-head KV cache budgets — another way to shrink long-context attention cost, by sizing each head's KV budget instead of selecting blocks
Parallax — local linear attention — the linear-attention alternative to the block-sparse approach MSA takes

FAQ

What is MiniMax Sparse Attention (MSA)?

MSA is the attention mechanism inside MiniMax's open-weight M3 model. Instead of having every token attend to every earlier token (dense attention, whose cost grows with the square of the sequence length), MSA partitions the KV cache into blocks, scores which blocks are relevant to each query, and computes attention over only the selected few — a "KV-outer gather Q" access pattern. MiniMax reports it cuts per-token compute about 20× at a 1M-token context while matching full attention on most capabilities.

Why does MSA matter?

Long context is the binding constraint for modern LLMs: at a million tokens, dense attention dominates both compute and memory bandwidth. By gathering only the relevant KV blocks, MSA reportedly delivers more than 9× faster prefill and more than 15× faster decode at 1M context, and runs over 4× faster than Flash-Sparse-Attention. That kind of serving efficiency is what helps make a frontier-coding (59% SWE-Bench Pro), 1M-context model practical to ship with open weights.

How does MSA relate to DSA, MoBA, and the KV cache?

MSA, DSA, and MoBA are all sparse attention schemes that select a subset of the KV cache to attend to, rather than the whole thing. MiniMax says MSA partitions the cache more precisely than DSA (DeepSeek Sparse Attention) or MoBA (Mixture of Block Attention), which it credits for keeping quality closer to full attention. MSA sits one layer above the KV cache itself: the cache stores every token's Key and Value vectors, and MSA decides which blocks of that cache each query is allowed to read.

