What "Subquadratic Attention" Actually Means

#ai #llm #explained #transformers

SubQ launched on May 5, 2026 with a 12-million-token context window and a claim worth slowing down on: the first commercial frontier LLM that isn't built on quadratic attention. The phrase has been on every feed since. Most of the posts about it don't define what subquadratic actually means, or how SubQ's approach differs from Mamba and RWKV, both of which have been chasing the same goal for two years.

TL;DR

Quadratic attention means doubling the context quadruples the compute. That is the wall every long-context model hits.
There are three ways the field has tried to break the wall: fixed-pattern sparse attention (Longformer, BigBird), state-space recurrence (Mamba, RWKV), and learned-sparse attention. SubQ's SSA is the third.
A 12M context window is roughly 60x the 200K windows that were common a year ago, and about 94x the 128K window of GPT-4o-class models. The honest first use cases are whole codebases, long agent traces, and document review, not "fits an entire book".

Why the wall exists

Standard transformer attention computes a similarity score between every pair of tokens in the input. For an input of n tokens, that's n × n comparisons. At n = 1,000 that's a million. At n = 1,000,000 it's a trillion. Memory bandwidth saturates before raw FLOPs do, which is why FlashAttention helped — it tiled the work to be cache-friendly. But FlashAttention didn't change the n² part of the equation. The compute still scales quadratically; it just runs faster on the same hardware.
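To make the "every pair of tokens" part concrete, here is a minimal pure-Python sketch of the attention weight computation. It is a toy, not any production kernel: it skips the value vectors, causal masking, and multiple heads, and the function name `attention_weights` is made up for illustration. The point is only that the result has one entry per (query, key) pair.

```python
import math

def attention_weights(queries, keys):
    """Return the full n x n matrix of softmax attention weights."""
    dim = len(queries[0])
    weights = []
    for q in queries:  # n queries ...
        # ... each scored against all n keys (scaled dot product)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Four toy "tokens" with 2-dimensional embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
w = attention_weights(tokens, tokens)
pair_count = sum(len(row) for row in w)
print(pair_count)  # 4 tokens -> 16 weights
```

Four tokens give 16 weights; eight would give 64. FlashAttention reorganizes how those entries are computed and stored, but every one of them is still computed.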

A quick numeric example, since it's easier to feel the scaling than describe it:

# Naive cost model. SSA's real cost is between O(n) and O(n log n).
def quadratic_ops(n: int) -> int:
    return n * n

def linear_ops(n: int) -> int:
    return n

# Going from a 200K context to a 12M context:
print(quadratic_ops(200_000))    # 40,000,000,000      -> 40 billion
print(quadratic_ops(12_000_000)) # 144,000,000,000,000 -> 144 trillion (3,600x more)
print(linear_ops(12_000_000))    # 12,000,000          -> 60x linear_ops(200_000), not 3,600x

That gap — 3,600x versus 60x — is the entire business case for subquadratic attention.
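The comment in the snippet above puts SSA's real cost somewhere between O(n) and O(n log n). A small extension of the same naive cost model shows where the n log n end of that range lands. The helper names `growth` and `n_log_n_ops` are mine, and this counts abstract operations, not measured FLOPs or dollars.

```python
import math

def growth(cost, small, large):
    """How many times more work `large` needs than `small` under a cost model."""
    return cost(large) / cost(small)

def n_log_n_ops(n):
    return n * math.log2(n)

small, large = 200_000, 12_000_000
linear_growth = growth(lambda n: n, small, large)
nlogn_growth = growth(n_log_n_ops, small, large)
quadratic_growth = growth(lambda n: n * n, small, large)

print(f"linear:    {linear_growth:,.0f}x")
print(f"n log n:   {nlogn_growth:,.0f}x")
print(f"quadratic: {quadratic_growth:,.0f}x")
```

Under this model the n log n case grows by about 80x, so even the pessimistic end of the subquadratic range stays far closer to 60x than to 3,600x.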

The three ways to break the wall

Fixed-pattern sparse attention

Don't compute every pair. Pick a sparsity pattern ahead of time: a local window plus a few global tokens (Longformer), block-sparse (BigBird), strided (Sparse Transformers). This works because most token pairs in real text don't influence each other meaningfully. The cost: you have to pick the structure before training, and you're guessing about future inputs.
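A rough sketch of the "local window plus a few global tokens" idea, counting how many (query, key) pairs survive. This is a simplified illustration in the spirit of Longformer, not its actual implementation; `allowed_keys`, `kept_pairs`, and the window size are assumptions chosen for the demo.

```python
def allowed_keys(q, n, window, global_tokens):
    """Key positions that query position q may attend to."""
    if q in global_tokens:
        return set(range(n))  # a global token attends to everything
    local = range(max(0, q - window), min(n, q + window + 1))
    return set(local) | set(global_tokens)  # neighbours plus the global tokens

def kept_pairs(n, window, global_tokens):
    """Total number of (query, key) pairs the pattern keeps."""
    return sum(len(allowed_keys(q, n, window, global_tokens)) for q in range(n))

n = 1_000
sparse = kept_pairs(n, window=2, global_tokens={0})
dense = n * n
print(sparse, dense)

# Doubling the sequence roughly doubles the kept pairs instead of quadrupling them.
doubled = kept_pairs(2 * n, window=2, global_tokens={0})
print(doubled / sparse)
```

Note that the pattern depends only on positions, never on what the tokens say. That is the "guessing about future inputs" cost in code form.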

State-space models

Replace attention entirely. Compress the past into a fixed-size hidden state that updates one token at a time. Mamba, Mamba-2, RWKV, and RetNet all sit here. Linear in context length by construction. The cost: a fixed state is lossy. On published needle-in-a-haystack benchmarks, recall on specific facts buried 100K tokens back is consistently weaker than full attention.
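Here is a deliberately crude sketch of the fixed-state idea: an exponential moving average standing in for the recurrence. Real state-space models such as Mamba use learned, input-dependent updates, so treat this only as an illustration of two properties: the state never grows with the sequence, and old information fades. The function name and the `decay` value are assumptions.

```python
def run_recurrence(inputs, decay=0.9):
    """Fold a sequence into a fixed-size state, one token at a time."""
    state = [0.0] * len(inputs[0])
    for x in inputs:
        # One constant-cost update per token: total work is linear in length.
        state = [decay * s + (1 - decay) * xi for s, xi in zip(state, x)]
    return state

needle = [1.0, 0.0, 0.0, 0.0]   # the fact we want to remember
filler = [0.0, 0.0, 0.0, 0.0]   # 100 tokens of unrelated context
state = run_recurrence([needle] + [filler] * 100)

print(len(state))   # state size is unchanged by sequence length
print(state[0])     # what is left of the needle after 100 steps
```

After 100 filler tokens, the needle's trace in the state has shrunk by several orders of magnitude. Learned models are much better at deciding what to keep, but the fixed budget is the same constraint.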

Learned-sparse attention

Keep the attention operation. But learn which pairs are worth computing for a given input, per query, at inference time. SubQ's SSA (Subquadratic Selective Attention) sits here. The sparsity pattern isn't baked into the architecture; it's chosen dynamically. On the RULER 128K benchmark, Subquadratic reports 95% accuracy at a cost of about $8 per run, compared to roughly $2,600 for Claude Opus on the same workload. By those figures, that is the same accuracy band at roughly 300x lower cost per run.

The framing matters: SSA is closer in spirit to "let the model decide what to ignore" than to "ignore on a fixed schedule" or "don't do attention at all".
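To show what "chosen dynamically" can look like, here is a toy per-query top-k attention. This is not SubQ's SSA, whose selection mechanism isn't described in the launch material; `topk_attention` is a hypothetical helper. It also scores every key before choosing, so this version is not subquadratic on its own. A real system would need a cheaper way to pick candidates.

```python
import math

def topk_attention(query, keys, values, k):
    """Attend only to the k keys that score highest for this query."""
    dim = len(query)
    scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in keys]
    chosen = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

    # Softmax over the chosen keys only; everything else is ignored.
    peak = max(scores[i] for i in chosen)
    exps = {i: math.exp(scores[i] - peak) for i in chosen}
    total = sum(exps.values())

    out = [0.0] * len(values[0])
    for i in chosen:
        weight = exps[i] / total
        for d, v in enumerate(values[i]):
            out[d] += weight * v
    return chosen, out

keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]

chosen_a, out_a = topk_attention([1.0, 0.0], keys, values, k=2)
chosen_b, out_b = topk_attention([0.0, 1.0], keys, values, k=2)
print(sorted(chosen_a), sorted(chosen_b))  # different queries keep different keys
```

The two queries keep different subsets of keys. That is the contrast with a fixed pattern, where the kept positions would be the same no matter what the query contained.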

What 12M context buys a builder

Not "feed the whole book in at once". That's the marketing version. The honest first use cases:

Whole codebases. A medium-sized repo is 2–5M tokens with comments and tests. It now fits without a retrieval layer.
Long agent traces. A four-hour agent run with tool outputs is easily 1–3M tokens. You can replay the entire trace as context for the next step instead of summarizing.
Document review at scale. A merger data room is 8–15M tokens. At the lower end of that range you can ask one question against the whole set without chunking and re-ranking; the largest data rooms still exceed a 12M window.
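A quick sanity check of those ranges against the window, as a sketch. The token counts are just the endpoints quoted above, and `fits_in_window` with its `reserved_for_output` parameter is a hypothetical helper, not part of any SubQ API.

```python
CONTEXT_WINDOW = 12_000_000

def fits_in_window(total_tokens, window=CONTEXT_WINDOW, reserved_for_output=0):
    """True if the input, plus any tokens held back for the reply, fits."""
    return total_tokens + reserved_for_output <= window

workloads = {
    "medium repo (upper end)": 5_000_000,
    "four-hour agent trace (upper end)": 3_000_000,
    "data room (lower end)": 8_000_000,
    "data room (upper end)": 15_000_000,
}

verdicts = {name: fits_in_window(tokens) for name, tokens in workloads.items()}
for name, ok in verdicts.items():
    print(f"{name}: {'fits' if ok else 'needs splitting'}")
```

Everything fits except the largest data room, which would still need some form of splitting or retrieval.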

What 12M context doesn't buy you is better reasoning over those tokens. A longer window means the model sees more; whether it can actually find and use the relevant fact is a different problem. RULER measures exactly this, and 95% at 128K is impressive but not 100%.

Caveats and open questions

The "1,000x compute reduction" number at full 12M context is the company's claim, not independently reproduced. Researchers quoted in VentureBeat's coverage are asking for third-party benchmarks before believing it. Wait for a paper.
SSA at 12M tokens is private beta. Public API pricing isn't set, and the published RULER number is at 128K, not 12M.
We don't know how SSA holds up on tasks where every token matters — formal verification, long code traces with subtle dependencies, multi-step proofs. Linear-or-near-linear models historically lose accuracy on these. SubQ's whole pitch is "we keep accuracy", but the public evidence is one benchmark at one context length.

If the numbers hold up under outside scrutiny, "1M context" stops being a flagship feature and becomes the floor. Either way, knowing why the wall exists and which family of fixes a given vendor is using is now table stakes.

Read The New Stack's writeup and SiliconANGLE's launch coverage for the announcement details.