Andrew Kew
12 million tokens, linear cost: Subquadratic's bet against the attention tax

The quadratic attention problem has quietly shaped everything you've built with LLMs. RAG pipelines, agentic decomposition, hybrid architectures — these aren't the natural shape of AI systems. They're workarounds. Doubling the context quadruples the compute, so everyone stopped at a million tokens and engineered around the rest.

Subquadratic, a Miami-based startup with 11 PhD researchers on staff, launched its first model this week and says it's done with workarounds. Its new architecture — Subquadratic Selective Attention (SSA) — claims linear scaling in both compute and memory with respect to context length. The result: a 12-million-token context window, available via API today.

"For prompt A, words one and six are going to be important to each other. For prompt B, maybe it's words two and three. It's different for every single input." — Alex Whedon, CTO

What actually changed

The quadratic bottleneck comes from dense attention: with 1,000 tokens, every token attends to every other — 1,000² comparisons. Prior attempts to escape it (Longformer's sparse attention, state-space models like Mamba, DeepSeek's NSA) either process only the token pairs that matter or replace attention outright. The catch with the sparse approaches: figuring out which pairs matter usually requires... quadratic work.
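
To see where the quadratic term lives, here's dense scaled dot-product attention in plain NumPy. Nothing here is vendor-specific; it's the textbook formulation, and the (n, n) score matrix is the whole problem:

```python
import numpy as np

def dense_attention(Q, K, V):
    """Vanilla scaled dot-product attention over n tokens.
    The (n, n) scores matrix means compute and memory both grow as n^2."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n): the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d)

n, d = 1_000, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = dense_attention(Q, K, V)  # materializes 1,000^2 = 1,000,000 scores
```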

SSA's claim is that it does content-dependent selection — picking relevant positions based on what the query and keys actually contain — without the selection step going quadratic. That's the crux.
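
Subquadratic hasn't published how SSA's selection works, so the sketch below is only an illustration of what "content-dependent selection" means, loosely in the spirit of block-selection schemes like NSA: pool keys into per-block summaries, let each query pick its top-k blocks by content, and attend only inside those. Every name here is hypothetical, and note the caveat in the docstring: scoring the summaries still costs O(n²/block), which shrinks the constant rather than breaking the quadratic. Making that selection step itself linear is precisely what SSA claims to have solved.

```python
import numpy as np

def selective_attention_sketch(Q, K, V, block=64, top_k=4):
    """Illustrative content-dependent sparse attention (NOT SSA itself).

    1. Pool keys into per-block summaries.
    2. Each query scores the summaries and keeps its top_k blocks.
    3. Full attention runs only inside the selected blocks.

    Caveat: step 2 still costs O(n * n/block) here; SSA's claim is
    that the selection step itself can be made linear.
    """
    n, d = Q.shape
    n_blocks = n // block
    K_blocks = K[: n_blocks * block].reshape(n_blocks, block, d)
    V_blocks = V[: n_blocks * block].reshape(n_blocks, block, d)
    summaries = K_blocks.mean(axis=1)                  # (n_blocks, d)

    block_scores = Q @ summaries.T                     # (n, n_blocks): content-dependent
    picked = np.argsort(-block_scores, axis=-1)[:, :top_k]

    out = np.zeros_like(Q)
    for i in range(n):                                 # sparse attention per query
        Ki = K_blocks[picked[i]].reshape(-1, d)        # (top_k * block, d)
        Vi = V_blocks[picked[i]].reshape(-1, d)
        s = Ki @ Q[i] / np.sqrt(d)
        w = np.exp(s - s.max())
        out[i] = (w / w.sum()) @ Vi
    return out

n, d = 1_024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = selective_attention_sketch(Q, K, V)  # each query touches 4*64 = 256 keys, not 1,024
```

A production version would also need causal masking and learned rather than mean pooling; both are omitted here to keep the shape of the idea visible.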

The benchmarks:

  • MRCR v2: 83.0% — vs. GPT-5.5 at 74.0% and Claude Opus 4.6 at 32.2%
  • Needle-in-a-haystack at 12M tokens: 92.1% — no frontier model operates at this length
  • RULER at 128K: 97.1% vs. Opus 4.6's 94.8%
  • SWE-Bench Verified: 82.4%, edging Opus 4.6 (81.4%) and Gemini 3.1 Pro (80.6%)
  • Speed: 7.2× faster at 128K, 52× faster at 1M vs. dense attention
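
One quick arithmetic check on those speed figures: if dense attention costs roughly n² and a linear scheme roughly n, the speedup over dense should itself grow linearly with context length. Extrapolating the reported 7.2× at 128K out to 1M (7.8× more context) predicts about 56×, close to the claimed 52×, so the two numbers are at least internally consistent with linear scaling:

```python
# Cost model: dense ~ n^2, linear attention ~ n, so speedup(n) grows ~ n.
speedup_128k = 7.2
scale = 1_000_000 / 128_000      # 7.8125x more context
print(speedup_128k * scale)      # 56.25 -- in the same ballpark as the claimed 52x
```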

Why this is interesting (and why to stay cautious)

If SSA's scaling claim holds, the implications go beyond "bigger context window." It changes the ROI of RAG and agentic decomposition at scale. Right now, those patterns exist because the alternative — throwing everything in context — is economically untenable. Linear scaling changes that calculus.

The cautionary note writes itself, though. Magic.dev announced a 100M-token context window in 2024, raised over $500M on the strength of it, and as of early 2026 there is no public evidence of that model in real-world use outside Magic itself.

Subquadratic's caveats are also worth reading: each benchmark was run once due to inference costs, the SWE-Bench margin is "harness as much as model" by the team's own admission, and the model is smaller than what the frontier labs ship. The company has raised $29M at a $500M valuation — not nothing, but not Magic.dev territory yet.

What to do

  • Building RAG or long-context retrieval? SSA is the architecture to watch. Get on the API beta and run your own evals against your actual use case (a minimal harness sketch follows this list).
  • Using the OpenAI or Anthropic API? No urgency to switch, but benchmark MRCR v2 on your retrieval tasks — if you're hitting the 74% ceiling, this gap is real.
  • Shipping a coding agent? SubQ Code (CLI agent) is available in beta. Worth a test on your harness.
  • Evaluating long-context models? 12M is a genuinely new operating range. Needle-in-a-haystack at that length hasn't been tested at the frontier before.
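
If you want to run those evals yourself, a needle-in-a-haystack harness is only a few dozen lines. The sketch below assumes an OpenAI-compatible chat endpoint; the base URL, model id, and API key are placeholders, since Subquadratic's actual API surface isn't described in the source.

```python
import random
from openai import OpenAI  # works against any OpenAI-compatible endpoint

# Placeholders: Subquadratic's real endpoint and model id are not public in the source.
client = OpenAI(base_url="https://api.subquadratic.example/v1", api_key="YOUR_KEY")
MODEL = "ssa-12m-beta"  # hypothetical model id

def needle_test(context_tokens, filler, needle, question, answer):
    """Bury one fact at a random depth in ~context_tokens of filler,
    then check whether the model retrieves it."""
    n_words = int(context_tokens * 0.75)  # crude words-per-token estimate
    words = (filler.split() * (n_words // len(filler.split()) + 1))[:n_words]
    words.insert(random.randint(0, len(words)), needle)
    prompt = " ".join(words) + "\n\n" + question

    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return answer in resp.choices[0].message.content

ok = needle_test(
    context_tokens=1_000_000,
    filler="The quick brown fox jumps over the lazy dog.",
    needle="The vault code for project Osprey is 4417.",
    question="What is the vault code for project Osprey?",
    answer="4417",
)
print("retrieved" if ok else "missed")
```

Run it at several depths and context lengths and average the results; single-shot numbers at these scales are noisy, which is exactly the single-run caveat Subquadratic flags about its own benchmarks.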

A 50M-token context window is targeted for Q4. The technical case is real. The category's track record is the rest of the story.

Source: The New Stack — Subquadratic debuts a 12-million-token window

✏️ Drafted with KewBot (AI), edited and approved by Drew.
