Jangwook Kim

Posted on May 25 • Originally published at effloow.com

SubQ: Sub-Quadratic LLM with 12M Token Context — 2026 Guide

#subq #subquadratic #longcontextllm #sparseattention

Why This Matters for Developers

On May 5, 2026, a Miami-based AI startup called Subquadratic emerged from stealth with a model that challenges the most persistent bottleneck in large language model deployment: the quadratic cost of attention.

The product is SubQ — positioned as the first fully sub-quadratic frontier LLM — with a 12-million-token context window and benchmarks claiming 1,000× lower compute versus frontier models at maximum context length. Alongside the model, the company launched three developer-facing products: a production API, a CLI coding agent, and a long-context search tool.

If the architecture works as claimed, SubQ could change what is possible in a single API call: complete enterprise codebases, months of commit history, full document sets — all loaded at once, no chunking required. That is the promise. This guide covers what is verifiable today, what remains unproven, and how to position yourself as a developer ahead of wider availability.

The Context Window Problem, Explained

A standard transformer-based LLM performs attention by computing relevance scores between every pair of tokens in the input. For a context of n tokens, that is roughly n² comparisons. Double the context and the attention compute quadruples. This is why:

A model with a 200K window is dramatically more expensive to run than one with 8K
Providers charge significantly more per token at long contexts
Most "million-token" models are prohibitively expensive at scale
Developers resort to RAG, chunking, and summarization workarounds

The workarounds work — but they introduce retrieval errors, context fragmentation, and architectural complexity. The industry has been waiting for a fundamentally different attention mechanism.
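To make the scaling concrete, here is a minimal sketch that counts query-key comparisons for a few context lengths. It models only the n² attention term and ignores everything else in a real model (feed-forward layers, causal masking, kernel-level optimizations), so the ratios are illustrative rather than a cost forecast. The function name `attention_pairs` is made up for this example.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key score computations in dense attention (n * n)."""
    return n_tokens * n_tokens


# Compare each context length against an 8K-token baseline.
baseline = attention_pairs(8_000)
for n in (8_000, 200_000, 1_000_000, 12_000_000):
    ratio = attention_pairs(n) / baseline
    print(f"{n:>12,} tokens: {ratio:>12,.0f}x the pairwise work of 8K")
```

Under this simplified model, going from 8K to 200K tokens multiplies the pairwise work by 625, which is why long-context pricing climbs so steeply.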

How SubQ's SSA Architecture Works

Subquadratic's approach is called SSA — Subquadratic Selective Attention. The core idea:

Standard transformers compute dense attention: each query token scores against every key token, producing an n × n matrix. Most of those scores are near-zero — the model wastes compute confirming that token 5 is irrelevant to token 4,000,000.

SSA changes the routing. For each query token, the model first selects a small content-based subset of positions to attend to, then computes exact attention only over that subset. This is not fixed-pattern sparsity (like sliding windows or block attention) — the selection is dynamic, based on what the content actually requires. The result is near-linear scaling in both compute and memory as context length grows.
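The select-then-attend idea can be sketched in a few lines of plain Python. This is not Subquadratic's implementation, which is unpublished; it is a generic top-k sparse attention for a single query, and every function name here is hypothetical. Note that the selector below scans every key, so it is not sub-quadratic by itself. How SSA performs selection cheaply is exactly the part that has not been disclosed.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense_attention(query, keys, values):
    """Exact softmax attention of one query over every key."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def select_positions(query, keys, k):
    """Hypothetical selector: keep the k keys with the highest dot product.

    This scan touches every key, so it is NOT sub-quadratic on its own.
    """
    ranked = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]), reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, k):
    """Exact softmax attention restricted to the selected subset."""
    idx = select_positions(query, keys, k)
    return dense_attention(query, [keys[i] for i in idx], [values[i] for i in idx])


keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
query = [4.0, 0.0]
print("dense :", dense_attention(query, keys, values))
print("sparse:", sparse_attention(query, keys, values, k=2))
```

The attention inside the subset is exact; the approximation, if any, comes from which positions the selector leaves out.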

According to Subquadratic's published numbers, SSA runs 52× faster than FlashAttention at 1M tokens while using 63% less compute. At the full 12M-token window, the efficiency gap widens to approximately 1,000× versus standard dense attention.

Important caveat: No independent technical report or peer-reviewed paper has been published as of May 2026. The company website lists "paper coming soon." Every architectural claim here comes from Subquadratic's own blog post and marketing materials.

SubQ Products: What's Available Today

Three products launched with the May announcement, all in private beta requiring waitlist approval:

SubQ API

The developer-facing API exposes SubQ's model via OpenAI-compatible endpoints. The production tier currently supports up to 1 million tokens per context. The full 12M-token window is available in private preview for enterprise partners.

If you already have an OpenAI-compatible stack, the integration path is a base URL swap:

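from openai import OpenAI

# Point the standard OpenAI client at SubQ's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="YOUR_SUBQ_KEY",  # placeholder; substitute your own key
    base_url="https://api.subq.ai/v1",
)

# Send one chat completion request; the prompt carries the full context.
response = client.chat.completions.create(
    model="subq-1m-preview",
    messages=[
        {"role": "user", "content": "Analyze this entire codebase: ..."}
    ],
)
print(response.choices[0].message.content)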

Public pricing has not been announced. The company's cost claims — 50× cheaper than frontier models at 1M tokens — are vendor-reported and have not been independently validated.

SubQ Code

SubQ Code is a CLI coding agent designed to load entire repositories into a single context window. Unlike agents that chunk codebases or use RAG to retrieve relevant files, SubQ Code passes the full repo in one call — preserving architectural relationships, cross-module dependencies, and historical patterns that would otherwise be lost.

Target workflows include:

Large-scale refactoring: Understanding call graphs across hundreds of files before proposing changes
Codebase onboarding: Asking architecture questions against an entire unfamiliar repository
Technical debt analysis: Cross-module pattern detection without context fragmentation
CI/CD integration: Automated code review that sees the whole picture

SubQ Code is currently in private beta with access available by request.
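SubQ Code's internals are not public, but the "whole repo in one call" idea is easy to prototype against any long-context model. The sketch below is an assumption-laden illustration: `pack_repository`, the `### FILE:` header format, and the character limit (a crude stand-in for a token budget) are all invented for this example.

```python
from pathlib import Path


def pack_repository(root: str, suffixes=(".py", ".md"), max_chars: int = 4_000_000) -> str:
    """Concatenate matching files under root into one prompt string.

    Each file is preceded by a header with its relative path so the model
    can tell which module a snippet belongs to. Raises ValueError if the
    packed text exceeds max_chars.
    """
    root_path = Path(root)
    parts = []
    for path in sorted(root_path.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root_path).as_posix()
            text = path.read_text(encoding="utf-8", errors="replace")
            parts.append(f"### FILE: {rel}\n{text}\n")
    packed = "\n".join(parts)
    if len(packed) > max_chars:
        raise ValueError(
            f"packed repo is {len(packed):,} chars, over the {max_chars:,} limit"
        )
    return packed
```

Calling `pack_repository("path/to/repo")` returns a single string you can place in the user message of a chat request. Failing loudly when the limit is exceeded is a deliberate choice here: silent truncation would bring back the context fragmentation this approach is meant to avoid.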

SubQ Search

SubQ Search is a free-tier research tool built on the same model, targeted at long-document analysis tasks: legal document review, scientific paper synthesis, multi-quarter financial analysis. No API access required — it is a web interface that accepts large document uploads and questions.

Benchmark Performance

Subquadratic has published results on three benchmarks. These are vendor-reported numbers. Independent reproduction has not occurred as of this writing, and a LayerLens evaluation is pending publication.

Benchmark	SubQ	Claude Opus 4.6	Cost / notes
RULER 128K	95.0%	94.8%	$8 (SubQ) vs $2,600
MRCR v2 (1M tokens)	65.9%	Not reported	Not comparable
SWE-Bench Verified	81.8%	80.8%	Not available
Compute at 12M tokens	~1,000× less	Baseline	Vendor claim only

The RULER 128K result is notable: near-identical accuracy to Claude Opus 4.6 at roughly 325× lower cost in that test run ($8 versus $2,600). The SWE-Bench Verified score of 81.8% would place SubQ competitively against current frontier coding models, but only if independently confirmed.

One figure worth noting critically: Subquadratic's internal results show a 17-point gap between their research (83.0) and production (65.9) MRCR scores. That internal variance is larger than most differences between frontier models. It suggests production performance may differ meaningfully from research conditions, and warrants scrutiny before architectural commitments.
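The two headline ratios in this section are simple arithmetic on the vendor-reported figures, and it is worth checking them yourself rather than taking rounded numbers on faith. The variable names below are just labels for this example; the inputs are the numbers quoted above.

```python
# Vendor-reported cost of the RULER 128K run, in USD.
subq_cost, opus_cost = 8, 2_600
cost_ratio = opus_cost / subq_cost
print(f"RULER 128K cost ratio: {cost_ratio:.0f}x")

# Vendor-reported MRCR scores: research setting vs. production model.
research_mrcr, production_mrcr = 83.0, 65.9
gap = research_mrcr - production_mrcr
print(f"MRCR research-to-production gap: {gap:.1f} points")
```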

What the Skeptics Are Saying

VentureBeat reported that AI researchers are demanding independent proof of Subquadratic's efficiency claims. The criticisms cluster around three points:

No technical paper. Standard practice for a novel architecture is a pre-print or peer-reviewed paper. "Coming soon" after a fundraising announcement is a yellow flag, though not unusual for early-stage infrastructure companies.

The Magic.dev precedent. In 2024, Magic.dev raised significant capital on claims of a coding model with a 100-million-token context window. As of mid-2026, no independent validation has materialized. Researchers cite this as a cautionary case.

Narrow benchmark scope. The published results cover long-context recall (RULER, MRCR) and coding (SWE-Bench). General reasoning, math, multilingual performance, and safety evaluations are absent. A model can excel on retrieval benchmarks while underperforming on tasks that require deep reasoning across the full context.

The assessment from several ML commentators: "the marketing is ahead of the evidence." The technical architecture is plausible — sparse attention mechanisms are well-studied — but the specific scaling claims need external reproduction before they can be treated as engineering facts.

Strengths

Real company, real funding, real products in beta; not vaporware
OpenAI-compatible API means minimal integration friction
SSA is a theoretically sound approach to quadratic scaling
12M context (and 50M target for Q4 2026) opens genuinely new use cases
SubQ Search is free and usable without waitlist access


Weaknesses

No independent benchmark reproduction as of May 2026
No published technical paper (claimed "coming soon")
Production API gated at 1M tokens, not 12M
17-point internal gap between research and production MRCR scores
No evaluations on reasoning, math, safety, or multilingual tasks

Getting Access: The Waitlist Path

All three SubQ products require applying for early access at subq.ai. The company is prioritizing enterprise use cases with large context demands: legal tech, financial analysis, large-scale code review, and scientific research.

For developers who want to experiment now, the practical path:

Request access at subq.ai — specify your use case and expected context sizes
Try SubQ Search — the free research tool is the fastest way to experience the model without an API key
Prepare your stack — since the API is OpenAI-compatible, no architectural changes are needed once access arrives; the integration will be a base URL and key swap

Common Mistakes to Avoid

Treating vendor benchmarks as engineering specs. The RULER 128K numbers are interesting; they should not drive architecture decisions until independently reproduced. Design your system to be model-agnostic at the API boundary.
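One way to stay model-agnostic at the API boundary is to keep provider details in data rather than in code. The sketch below uses only the standard library and builds, but does not send, an OpenAI-style chat request. The SubQ base URL and model name are taken from the example earlier in this article; the environment variable names, the `Provider` class, and the assumption that SubQ serves the conventional `/chat/completions` path are illustrative, not confirmed.

```python
import json
import os
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    base_url: str
    model: str
    api_key_env: str  # name of the environment variable holding the key


PROVIDERS = {
    # Values from this article's example; endpoint path is an assumption.
    "subq": Provider("subq", "https://api.subq.ai/v1", "subq-1m-preview", "SUBQ_API_KEY"),
    "openai": Provider("openai", "https://api.openai.com/v1", "gpt-4o", "OPENAI_API_KEY"),
}


def build_chat_request(provider: Provider, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps(
        {"model": provider.model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{provider.base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get(provider.api_key_env, '')}",
        },
        method="POST",
    )


request = build_chat_request(PROVIDERS["subq"], "Summarize this repository.")
print(request.full_url)
```

Switching providers then means changing a dictionary key, which makes it cheap to run the same workload against two models and compare.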

Assuming 12M tokens is the default. The production API currently caps at 1M tokens. The 12M window is in private preview for enterprise accounts. Plan your context budget accordingly.
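A rough pre-flight check can catch oversized prompts before they reach the API. The sketch below assumes about four characters per token, a common rule of thumb for English text; SubQ's tokenizer is not public, so treat the estimate as a coarse guard rather than an exact count. The limits come from the figures in this article, and the function names and output reserve are hypothetical.

```python
PRODUCTION_LIMIT = 1_000_000   # tokens, production tier per this article
PREVIEW_LIMIT = 12_000_000     # tokens, enterprise private preview


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Coarse token estimate; real tokenizers will differ."""
    return int(len(text) / chars_per_token) + 1


def fits(text: str, limit: int = PRODUCTION_LIMIT, reserve_for_output: int = 8_000) -> bool:
    """True if the estimated prompt plus an output reserve fits the limit."""
    return estimate_tokens(text) + reserve_for_output <= limit


prompt = "x" * 4_000_000  # about 1M tokens under the 4-chars heuristic
print("fits production tier:", fits(prompt))
print("fits 12M preview    :", fits(prompt, limit=PREVIEW_LIMIT))
```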

Expecting general-purpose parity with frontier models. SubQ's benchmarks are concentrated on long-context retrieval and coding. If your application requires deep multi-step reasoning, complex math, or broad multilingual support, you need to evaluate those dimensions independently.

Conflating linear attention with SSA. Other models use linear attention approximations that sacrifice quality for speed. Subquadratic's claim is that SSA computes exact attention over a content-selected sparse subset, preserving quality while achieving near-linear scaling. These are different mechanisms; "sub-quadratic" is not a synonym for "approximate."

What to Watch For

Several signals will clarify SubQ's actual position over the next few months:

The LayerLens evaluation: Subquadratic has partnered with LayerLens for independent benchmarking, with results to be published at stratix.layerlens.ai. This is the most important external validation pending.
The technical paper: A peer-reviewed or pre-print paper explaining SSA in detail will be the architectural credibility threshold for most ML engineers.
Production API expansion: If Subquadratic opens the full 12M window to standard API customers, independently measurable performance data will emerge quickly from the community.
50M token roadmap: The Q4 2026 target for a 50M token context would extend the lead significantly — but is contingent on resolving current production limitations first.

FAQ

Q: Can I use SubQ with my existing LangChain or LlamaIndex setup?

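Most likely, yes. Because SubQ uses OpenAI-compatible endpoints, frameworks that support a custom base URL should work with little or no modification. LangChain's ChatOpenAI accepts a base_url parameter, and LlamaIndex's OpenAI class accepts api_base; for model names LlamaIndex does not recognize, its OpenAILike class is the usual route.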

Q: Is SubQ a fine-tune or a new architecture?

It is a new architecture, according to the company. SSA is not a fine-tuned variant of GPT or LLaMA: Subquadratic says it redesigned the attention mechanism itself so that cost scales near-linearly with context length. The weights are proprietary and not released.

Q: How does 12M tokens compare to other frontier models in 2026?

At this writing, the highest publicly available context windows from major providers are in the range of 1M–2M tokens (Gemini 2.5 Ultra, Claude Opus 4.6). A genuine 12M context window would be 6–12× larger than any current production offering — which is why the claim is receiving both excitement and skepticism.

Q: Should I use SubQ for production workloads today?

Not yet. The API is in private beta, pricing is undisclosed, and no independent benchmark reproduction exists. Use it for experimentation and benchmarking your specific use cases. Commit to production use only after independent validation arrives and the API exits beta.
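For that kind of experimentation, a small needle-in-a-haystack probe is a reasonable first test of long-context recall. The sketch below is a simplified stand-in, not RULER or MRCR: it buries one fact at a chosen depth in synthetic filler and checks whether a reply contains it. All names, the filler text, and the scoring rule are invented for this example; you would send `prompt` through whichever client you use and pass the reply to `score`.

```python
import random


def build_needle_prompt(n_filler: int, depth: float, seed: int = 0):
    """Return (prompt, secret) with the secret buried at a relative depth.

    depth=0.0 places the needle at the start, 1.0 at the end.
    """
    rng = random.Random(seed)
    secret = str(rng.randint(100_000, 999_999))
    lines = [f"Log entry {i}: service nominal." for i in range(n_filler)]
    lines.insert(int(depth * n_filler), f"The access code for the vault is {secret}.")
    question = "What is the access code for the vault? Reply with the number only."
    return "\n".join(lines) + "\n\n" + question, secret


def score(reply: str, secret: str) -> bool:
    """Lenient check: the reply counts as correct if it contains the secret."""
    return secret in reply


prompt, secret = build_needle_prompt(n_filler=1_000, depth=0.5)
print(len(prompt), "characters in prompt")
```

Sweeping `n_filler` and `depth` gives a quick picture of where recall starts to degrade for your own data sizes, which matters more than any single published score.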

Key Takeaways

SubQ represents a meaningful architectural bet: that sparse, content-dependent attention can deliver long-context performance at a fraction of current costs. The architecture is theoretically sound. The company is real, funded, and shipping products. The benchmark numbers — if they hold under independent scrutiny — point to a step-change in what is economically practical at million-token scales.

But "if they hold" is doing real work in that sentence. No independent reproduction, no technical paper, and a 17-point internal performance gap mean that the honest developer position in May 2026 is: watch carefully, experiment early, do not commit architecturally until external validation arrives.

Get on the waitlist. Try SubQ Search. Keep your API integration model-agnostic. When the LayerLens results and the technical paper land, you will be positioned to evaluate quickly.

Bottom Line

SubQ is the most technically interesting long-context LLM launch of 2026 — and the one most in need of independent verification before it earns a place in production stacks. The architecture is plausible, the benchmarks are promising, and the products are real. Get on the waitlist, experiment with SubQ Search, and wait for the LayerLens evaluation before making any architectural bets.
