If you've ever shipped an LLM-powered feature that needed to reason over a real codebase, a real contract, or a real research corpus, you already know the shape of the problem. The model technically accepts a million tokens of context. In practice, the answers get worse as the context gets longer, and your infra bill gets worse faster than that.
SubQ is built around SSA — Subquadratic Sparse Attention — a linearly scaling attention mechanism designed for long-context retrieval, reasoning, and software engineering workloads. The technical results are strong on their own merits: 52.2× prefill speedup at 1M tokens, RULER 95.0%, MRCR v2 65.9%, SWE-Bench Verified 81.8%.
But the more interesting question is what happens to the industry if results like these stop being a one-off. The valuations, pricing, and competitive narrative around the major labs have been priced as if compute is the moat — as if maximizing token use and burning more dollars per call is the cost of doing business at the frontier. SSA is one of the first credible signals that this might not be true for much longer. And if it isn't, the OpenAIs and Anthropics of today look less like permanent fixtures and more like the Friendsters and MySpaces of the next platform shift.
## The problem isn't "missing context." It's fragmented context.
The hard problems enterprise AI needs to solve are long-context problems. Codebases, contracts, enterprise corpora, databases, spreadsheets, research collections, and long-running agent sessions rarely fail because the answer is absent. They fail because the relevant evidence is distributed across a large body of context, referenced indirectly, and only meaningful when multiple pieces are held in view at once.
If you build with these systems, this list will look familiar:
- a codebase where a function is defined in one module, called in dozens of others, and constrained by tests elsewhere
- a contract where an obligation depends on a definition, an exception, and a referenced clause several pages apart
- a research workflow where a conclusion depends on reconciling evidence across many papers
- a long-running coding task where prior planning decisions, intermediate edits, review notes, and regressions all matter
These aren't lookup problems. They're multi-hop reasoning problems over fragmented corpora. And the workarounds we've been using — chunking, RAG, agentic decomposition, recursive summarization — all have the same shape. They preserve some signal and lose some signal. RAG keeps semantic similarity but loses position, hierarchy, neighboring context, and reference structure. Agentic workflows decompose tasks into smaller calls but compound errors across steps and bake hand-authored orchestration policy into the system. The bitter lesson keeps showing up: scaffolding that works today doesn't generalize tomorrow.
SSA is an attempt to remove more of the reason that scaffolding is necessary in the first place.
## Why dense attention is the bottleneck
Attention is a retrieval operation built into the model. Each token acts as a query, compares itself against every other token, scores their relevance, and aggregates their information into its next representation. Powerful, because every token gets access to the full context. Expensive, for the exact same reason — every query compares against every key, and the cost grows quadratically with sequence length.
At small contexts this is fine. At hundreds of thousands to millions of tokens, it becomes the dominant constraint. Doubling context doesn't double cost; it quadruples it.
And here's the part that should bother any engineer: most of that work is wasted. In trained models, the vast majority of attention weights are near zero. The model performs the full all-pairs comparison, but only a small fraction of those interactions meaningfully influence the output. Dense attention isn't just quadratic — it's wastefully quadratic.
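To make the scaling concrete, here's a minimal dense-attention sketch in PyTorch (an illustration, not SubQ's or anyone's production kernel). The n × n score matrix is where the quadratic cost lives; the FLOP printout shows that doubling the sequence quadruples that term. The sparsity check at the end only shows how you'd measure near-zero weights; with random inputs it mostly reflects softmax over many positions, whereas the claim above is about trained models.

```python
# Minimal dense attention, for illustration only.
import torch
import torch.nn.functional as F

def dense_attention(q, k, v):
    """q, k, v: [n, d]. Every query scores every key: the n x n matrix below
    is where the quadratic cost lives, in both FLOPs and memory traffic."""
    scores = q @ k.T / q.shape[-1] ** 0.5   # [n, n]
    weights = F.softmax(scores, dim=-1)     # [n, n]
    return weights @ v, weights

n, d = 4096, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
_, w = dense_attention(q, k, v)

# Rough FLOP count for the score matmul alone: 2 * n^2 * d.
# Doubling n quadruples this term, which is the scaling described above.
print(f"score FLOPs at n={n}: {2 * n * n * d:,}")
print(f"score FLOPs at n={2 * n}: {2 * (2 * n) ** 2 * d:,}")

# One way to see how concentrated attention is: the fraction of weights below
# a small threshold. With random inputs this only illustrates the measurement;
# the "near zero" claim in the text is about trained models.
print((w < 1e-3).float().mean().item())
```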
FlashAttention made this much more practical at today's context lengths by avoiding materialization of the full attention matrix and optimizing memory movement. That's a real win. But it doesn't change the underlying scaling. The number of comparisons is still the same. The model still does quadratic work; it just does that work more efficiently.
System-level workarounds — retrieval pipelines, context compaction, recursive decomposition, agentic orchestration — make dense-attention systems usable. None of them change the scaling law. They route around the limitation. The quadratic cost is the boundary they're routing around.
## What prior efficient architectures gave up
The field has spent years trying to make attention cheaper. The hard part isn't reducing cost. It's reducing cost without breaking retrieval. Every prior approach traded something away.
Fixed-pattern sparse attention — sliding windows, strided patterns, dilated masks — gets subquadratic scaling by deciding in advance which positions a token can attend to. The routing decision is positional, not content-aware. The model decides where to look before it knows what it's looking for. When the relevant information falls outside the pattern, it's invisible.
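A toy sliding-window mask makes the limitation visible (a generic illustration, not any specific model's mask):

```python
# Fixed-pattern sparsity: each token sees only a positional window, decided
# before the content of the sequence is known.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)   # causal window of `window` positions

print(sliding_window_mask(seq_len=8, window=3).int())
# Whatever sits at position 0 is invisible to position 6 here, no matter how
# relevant it is. The routing is positional, not content-aware.
```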
State space models and recurrent alternatives drop the all-pairs comparison entirely, replacing it with a compressed state that evolves across the sequence. Linear scaling by construction — but the state has fixed capacity. Information gets summarized, blurred, or discarded as the sequence grows. Great at gist and structure, weaker at retrieving a specific fact introduced arbitrarily far back.
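For contrast, here's a minimal linear recurrence in the spirit of state space models (shapes and parameterization are illustrative only). The fixed-size state is the point: an early fact has to survive every subsequent update to be recalled later.

```python
import torch

def recurrent_scan(x, A, B, C):
    """x: [seq_len, d_in]. The state h has a fixed size no matter how long
    the sequence is, so capacity does not grow with context."""
    h = torch.zeros(A.shape[0])
    ys = []
    for x_t in x:              # linear in sequence length by construction
        h = A @ h + B @ x_t    # older information is blended or overwritten
        ys.append(C @ h)
    return torch.stack(ys)

d_in, d_state, d_out, seq_len = 16, 32, 16, 1024
A = torch.randn(d_state, d_state) * 0.01   # small scale keeps the scan stable
B = torch.randn(d_state, d_in)
C = torch.randn(d_out, d_state)
print(recurrent_scan(torch.randn(seq_len, d_in), A, B, C).shape)  # [1024, 16]
```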
Hybrid architectures combine both ideas: efficient layers do most of the compute, dense attention layers preserve retrieval. Works in practice, but the dense layers stay load-bearing. As context grows, their quadratic cost dominates again. The benefit is a constant factor, not an asymptotic one.
DeepSeek Sparse Attention offloads attention's quadratic cost to a lightning indexer that selects, per query, which keys to attend to. The indexer is itself quadratic — it scores every query against every key with small constants but the same O(n²) scaling. The complexity has been moved, not removed.
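A rough sketch of that indexer idea (my reading of the description above, not DeepSeek's code): the per-comparison cost is small because the indexer dimensions are small, but the number of comparisons is still n².

```python
import torch

def indexer_topk(q_small, k_small, keep=2048):
    """q_small, k_small: [n, d_small] with d_small much smaller than d_model."""
    scores = q_small @ k_small.T      # [n, n]: cheap per entry, still O(n^2)
    return scores.topk(min(keep, scores.shape[-1]), dim=-1).indices

# Full attention then runs only over the `keep` selected keys per query, so
# attention itself is linear in n. The quadratic term has moved into the
# indexer rather than disappearing.
```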
The pattern is consistent. Fixed sparsity gives up content-dependent routing. Recurrent models give up exact retrieval. Hybrids reintroduce the original cost. DeepSeek-style indexers stay quadratic and become cost-prohibitive at scale.
The open problem isn't "make attention faster." It's: build a mechanism that's efficient, content-dependent, and capable of retrieving from arbitrary positions across long context.
## How SSA works
SSA changes how attention work is allocated. The core idea is content-dependent selection: for each query, the model selects which parts of the sequence are worth attending to, and computes attention exactly over those positions.
Dense attention assumes every pair might matter, so it evaluates all of them. In practice, almost none do. SSA drops that assumption. It doesn't approximate attention — it restricts attention to the positions that actually carry signal, and skips the rest.
That gives SSA three properties that matter together:
- Linear scaling in compute and memory. Attention cost grows with the number of selected positions, not the full sequence. Long context becomes economically usable.
- Content-dependent routing. The model decides where to look based on meaning, not position. Relevant information can be retrieved regardless of where it appears.
- Sparse retrieval from arbitrary positions. Unlike recurrent or compressed approaches, SSA preserves the ability to recover specific information introduced far earlier in the sequence.
The practical distinction matters: SSA is not just a faster implementation of dense attention. It reduces the amount of attention work the model performs. That reduction is what shows up as speed.
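The post doesn't publish SSA's actual selection mechanism, so the sketch below only illustrates the general shape it describes: pick positions per query based on content, then run exact attention over just those positions. The block-level routing, the pooled summaries, and every name here are stand-ins, not SubQ's design, and a genuinely linear selector would also have to be cheaper than the coarse scoring pass shown.

```python
import torch
import torch.nn.functional as F

def sparse_attention_sketch(q, k, v, block_size=64, top_blocks=8):
    """q, k, v: [n, d]. Causal masking omitted to keep the sketch short."""
    n, d = k.shape
    n_blocks = n // block_size
    k_trim = k[: n_blocks * block_size]

    # 1. Content-dependent routing: score each query against a pooled summary
    #    of each key block. This stands in for SSA's (unpublished) selection
    #    step; it is much cheaper than all-pairs scoring, but a truly linear
    #    selector would need to be cheaper still.
    summaries = k_trim.reshape(n_blocks, block_size, d).mean(dim=1)  # [n_blocks, d]
    picks = (q @ summaries.T).topk(min(top_blocks, n_blocks), dim=-1).indices

    # 2. Exact attention, restricted to the selected positions. Per-query cost
    #    depends on top_blocks * block_size, not on n.
    offsets = torch.arange(block_size)
    out = torch.empty_like(q)
    for i in range(q.shape[0]):
        idx = (picks[i][:, None] * block_size + offsets).reshape(-1)
        att = F.softmax(q[i] @ k[idx].T / d ** 0.5, dim=-1)
        out[i] = att @ v[idx]
    return out

n, d = 8192, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
print(sparse_attention_sketch(q, k, v).shape)   # [8192, 64]
```

The structural point is step 2: its cost depends on how much gets selected, not on how long the sequence is. That's where the claimed linear scaling would come from.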
Measured in wall-clock input processing time on B200s, SSA achieves the following speedups over standard attention with FlashAttention-2 (FlashAttention-3 did not produce a speedup over FA-2 on B200s):
| Context length | Prefill speedup vs. dense attention (FA-2) |
|---|---|
| 128K | 7.2× |
| 256K | 13.2× |
| 512K | 23.0× |
| 1M | 52.2× |
This is the throughput gap that matters in production. Dense attention falls further behind as context grows, so SSA's advantage is largest exactly where long-context workloads become most valuable.
## Training SSA for long-context behavior
Architecture is necessary but not sufficient. A model can have a long context window and still fail to use it well. SSA was trained to make long-context use reliable, not just possible.
The training pipeline is three stages:
- Pre-training establishes base language modeling capability and the long-context representations the selection mechanism uses.
- Supervised fine-tuning shapes behavior toward instruction following, structured reasoning, and the code generation patterns enterprise workloads need.
- Reinforcement learning targets the behaviors that are hardest to induce through supervised examples: reliable long-context retrieval, and coding behavior that uses the available context aggressively instead of defaulting to local reasoning.
That last stage is the one developers should care about. Long-context failures often look plausible. A model answers from nearby context because nearby evidence is easier to use, even when the decisive evidence is much earlier. It produces a locally correct patch that violates an interface defined elsewhere. It summarizes a prior decision instead of preserving the exact constraint that should govern a later step. SSA's RL stage is designed around exactly those failure modes.
Training data emphasizes long-form sources with high information density and cross-reference structure — the kind of data that forces the selection mechanism to learn routing over large positional distances. The goal isn't benchmark memorization. It's teaching the model to attend to what matters regardless of where it sits.
## Why the training infrastructure matters too
Long-context training isn't only a modeling problem. It's a systems problem that only shows up at scale. At million-token sequence lengths, failure modes that are invisible at shorter contexts become binding — memory pressure, sequence partitioning across devices, gradient instability, numerical precision, kernel efficiency. These determine whether training runs at all.
The SSA training stack runs stably at 1M tokens and beyond, maintains linear memory scaling across the training pipeline, and uses distributed sequence parallelism to shard sequences across devices when they exceed single-device limits.
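A toy view of what sequence sharding means in practice (illustrative only; SubQ's training stack details aren't published): each device owns a contiguous slice of the token dimension, so per-device activation memory stays roughly flat as the total sequence grows.

```python
def shard_sequence(num_tokens: int, num_devices: int):
    """Return (rank, start, end) token ranges for each device."""
    per_device = -(-num_tokens // num_devices)   # ceiling division
    return [(rank, rank * per_device, min((rank + 1) * per_device, num_tokens))
            for rank in range(num_devices)]

# e.g. a 1M-token sequence across 8 devices -> 131,072 tokens per device
print(shard_sequence(1_048_576, 8))
```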
The consequence isn't just that long-context training becomes possible. It becomes iterable.
Under dense attention, long-context experiments are expensive enough that they get treated as reserved runs. With SSA's linear scaling, they become routine. More ablations, more evaluations, faster feedback, targeted fixes on the behaviors that actually matter at long context.
That's the deeper implication. SSA doesn't only reduce the cost of inference. It reduces the cost of learning long-context behavior in the first place — and that's the thing that compounds for developers downstream.
## Evaluating functional context, not nominal context
An advertised context window doesn't tell you how much context a model can use. The real question is whether the model can retrieve, connect, and reason over evidence distributed across that window.
SubQ is evaluated across two axes:
- Deployment viability — compute reduction and wall-clock speed
- Retrieval capability — RULER, MRCR v2, and SWE-Bench Verified
More general benchmarks will be published in the upcoming model card. Needle-in-a-Haystack tests exact retrieval of a single target. RULER extends that to multi-hop retrieval, aggregation, variable tracking, and selective filtering. MRCR v2 goes further: the model must locate and integrate multiple pieces of evidence distributed across the context, where the relevant set isn't given in advance. That's closer to the shape of real work — finding one fact isn't enough; the model has to determine which pieces matter and combine them into a coherent answer.
## Results

### Compute and speed
SSA's linear scaling means doubling context length doubles attention compute, rather than quadrupling it. At 1M tokens, that's a 62.5× attention FLOP reduction relative to standard quadratic attention.
| Context length | Attention FLOP reduction vs. standard attention |
|---|---|
| 128K | 8× |
| 1M | 62.5× |
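One way to sanity-check these two rows (an inference from the published numbers, not a stated spec): under a simple cost model where dense attention does roughly n² · d work and SSA does roughly n · k · d, the reduction factor is about n / k.

```python
for n, reduction in [(128 * 1024, 8), (1024 * 1024, 62.5)]:
    print(f"n = {n:>9,}: implied k ~ {n / reduction:,.0f} attended positions per query")
# Both rows imply an effective budget of roughly 16K positions per query,
# i.e. the published reductions are consistent with attention work that grows
# linearly with context rather than quadratically.
```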
Wall-clock speed is the more product-relevant result: a 52.2× prefill speedup over dense attention at 1M tokens. That's the difference between a long-context system that behaves like an interactive tool and one that feels like an offline batch job.
| Context length | Prefill speedup vs. dense attention (FA-2) |
|---|---|
| 128K | 7.2× |
| 256K | 13.2× |
| 512K | 23.0× |
| 1M | 52.2× |
### RULER
RULER tests retrieval and reasoning beyond simple needle lookup — multi-hop retrieval, aggregation, variable tracking, selective filtering.
| Model | RULER @ 128K |
|---|---|
| SSA / SubQ | 95.0% |
| Opus 4.6 | 94.8% |
For real workflows this matters because multi-hop tasks compound. A missed reference early in the chain can corrupt every conclusion downstream.
### MRCR v2
MRCR v2 is the most demanding retrieval benchmark in this set. It evaluates the ability to locate and integrate multiple non-adjacent pieces of evidence across long context.
| Model | MRCR v2 score |
|---|---|
| SSA / SubQ | 65.9% |
| Opus 4.6 | 78.3% |
| GPT 5.5 | 74.0% |
| GPT 5.4 | 36.6% |
| Opus 4.7 | 32.2% |
| Gemini 3.1 Pro | 26.3% |
SubQ lands at 65.9% — solidly in the range of frontier dense models, well ahead of GPT 5.4, Opus 4.7, and Gemini 3.1 Pro. That's the clearest evidence for the gap between nominal and functional context. A model can accept a long input and still fail to reason reliably over that input. MRCR v2 surfaces the gap because it requires retrieval and combination, not just token processing.
### SWE-Bench Verified
SWE-Bench Verified is an end-to-end software engineering benchmark on real GitHub issues. Not a pure retrieval test — it asks whether the model can use codebase understanding to localize bugs, reason about implementation constraints, and produce patches.
| Model | SWE-Bench Verified |
|---|---|
| SSA / SubQ | 81.8% |
| Opus 4.7 | 87.6% |
| Opus 4.6 | 80.8% |
| Gemini 3.1 Pro | 80.6% |
| GPT 5.4 | not reported |
| GPT 5.5 | not reported |
Sitting at 81.8% — ahead of Opus 4.6 and Gemini 3.1 Pro on a real-world coding benchmark while running on a subquadratic architecture — is the result that should land hardest for developers. This is the workload most of us actually care about.
## The part nobody priced in
Step back from the architecture for a second and look at what the current AI industry is actually selling.
The valuations, the capex, the data center buildouts, the multi-year compute contracts — all of it is underwritten by an assumption that frontier intelligence requires frontier-scale spend. Long context costs a lot. Reasoning costs a lot. Agents cost a lot. The premise running through every pitch deck and earnings call is that the labs with the most GPUs win, and the rest of the market pays for tokens at whatever margin those labs choose.
SSA is one architecture, on one model, with one set of benchmarks. But the result it points at is uncomfortable for that premise: the dominant cost of long-context inference may not be a law of physics — it may be an artifact of dense attention. A 52.2× prefill speedup at 1M tokens isn't a 10% efficiency gain. It is the kind of step-change that, if it generalizes, rewrites the unit economics of the entire industry.
If you don't have to maximize tokens consumed and dollars burned to get frontier-quality long-context behavior, a lot of the moat narrative collapses with it.
## Why the incumbents look more fragile than they're priced
The Friendster and MySpace comparison isn't snark — it's a specific lesson. Both had network effects. Both had brand. Both had scale advantages that looked durable right up until a better-architected product showed up and the users moved over a weekend. The moat people talked about (network effects, switching costs) turned out to be much weaker than the moat that actually mattered (a better product on a better stack).
The current frontier labs have a similar mismatch:
- API-level switching cost is near zero. Most production code paths abstract the model behind a thin client, so swapping providers is a config change, not a migration (see the sketch after this list).
- Compute scarcity is the moat people brag about. It is also the moat that subquadratic architectures attack first. If a challenger can match frontier quality at a fraction of the FLOPs, the capex advantage flips into a capex liability — billions of dollars of GPU contracts depreciating against a more efficient successor.
- Pricing power assumes scarcity. Today's per-token prices for long context look reasonable because the underlying compute is genuinely expensive. Drop the cost of a 1M-token prefill by 50× and the same prices start looking like rent extraction, not value capture.
- Brand isn't a defense once parity exists. "Nobody got fired for buying OpenAI" works until a model with comparable benchmarks costs an order of magnitude less to serve. Then it works against them, the same way "nobody got fired for choosing IBM" did.
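A toy illustration of the first point (names like `call_llm` and `PROVIDERS` are made up for this example, not any real SDK): when the model sits behind a thin client like this, switching providers is a dictionary entry and an environment variable, not a rewrite.

```python
PROVIDERS = {
    "incumbent": {"base_url": "https://api.incumbent.example/v1", "model": "frontier-dense"},
    "challenger": {"base_url": "https://api.challenger.example/v1", "model": "subq-ssa"},
}

def call_llm(prompt: str, provider: str = "incumbent") -> dict:
    cfg = PROVIDERS[provider]
    # Real code would POST to cfg["base_url"]; the point is that swapping
    # providers means editing this dict plus an environment variable.
    return {"url": cfg["base_url"], "model": cfg["model"], "prompt": prompt}
```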
This isn't a prediction that any specific lab disappears. Anthropic, OpenAI, and Google have real assets — distribution, talent, training data, alignment research, regulatory relationships. Those don't evaporate. But the valuations and the pricing power are built on the assumption that frontier compute is a stable moat, and that assumption depends on dense attention staying expensive.
SSA is one of the first credible signals that it might not.
## What developers should actually take away
Strip out the industry analysis and the practical takeaways for anyone building on top of these systems are pretty clean:
- Long context as a product surface is about to get a lot cheaper and a lot better. If you've been deferring long-context features because the economics didn't pencil, the economics are about to pencil.
- A nominal context window has never told you what a model can actually use. RULER 95.0% and MRCR v2 65.9% on a subquadratic architecture are evidence that the gap between marketing tokens and functional tokens is closing.
- Less hand-authored scaffolding. Chunking, recursive summarization, and bespoke orchestration are workarounds for an attention bottleneck. As that bottleneck loosens, the scaffolding becomes a maintenance burden rather than an asset.
- Watch where the open and challenger labs go next. Efficient architectures disproportionately benefit teams that don't already own a hyperscaler-sized GPU fleet. The next frontier-quality model that runs cheaply on commodity infra is the one to track.
- Don't lock into long-term commitments priced on dense-attention economics. Multi-year contracts written against today's per-token costs are the riskiest thing on the table if a successor architecture cuts those costs by an order of magnitude.
SSA on its own is one paper, one architecture, one set of numbers. The reason it's worth paying attention to is what it implies if the result is real and replicable: the AI bubble's tightest correlation — bigger spend, better model — gets a lot weaker. That's good for developers, good for customers, and meaningfully bad for any incumbent whose story to investors depends on the old curve holding.
The Friendsters and MySpaces of this cycle won't lose because their products got worse. They'll lose because someone shows up with a better-architected stack at a fraction of the cost, and the switching cost turns out to have been a config flag the whole time.
Worth watching.