Qwen3-Coder-Next: 80B total, 3B active, 70.6 on SWE-Bench

#opensource #llm #agents #ai

Qwen3-Coder-Next activates about 3 billion parameters per token. It scores 70.6 on SWE-Bench Verified with the SWE-Agent scaffold. Both numbers hold at the same time, and how such a small active footprint reaches that score is where the interesting architectural ideas live.

TL;DR

80B total, 3B active. A sparse Mixture-of-Experts (MoE) router picks 10 of 512 experts per token. The remaining ~77B parameters sit idle for that token.
Hybrid attention. 48 layers arranged as twelve repeats of a 4-layer block: 3 Gated DeltaNet layers (linear attention) followed by 1 standard Gated Attention layer. The cheap layers carry the long-context bandwidth; the expensive layer rebuilds the global picture.
Apache 2.0 weights, 262K native context. Coding-tuned variant of the Qwen3-Next-80B-A3B base. The model card is on Hugging Face.

Background

SWE-Bench Verified is a curated subset of real GitHub issues from popular Python repos. The agent reads the issue, edits the repo, and a hidden test suite runs. Pass rate is the score. 70.6 is in striking distance of the best closed-source frontier models on this benchmark — and Qwen3-Coder-Next gets there with weights you can download.

The two ideas that get it there — linear-time attention and sparse expert routing — are usually discussed separately. The interesting thing about this model is how they compose.

The two architectural ideas, separately

Gated DeltaNet, in one paragraph

Standard attention is O(L²) in sequence length L because every token has to look at every previous token. Gated DeltaNet is a linear attention variant: each head maintains a small fixed-size matrix as recurrent state, and new tokens update that state via a delta rule. Per-token cost is O(1) regardless of how far back the context goes.

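# Conceptual shape only; actual kernels are fused.
# Standard attention: O(L) work per new token.
attn_out = softmax(q_t @ K.T / sqrt(d)) @ V   # K, V grow with L

# Gated DeltaNet: O(1) work per new token.
# alpha: decay gate, beta: write strength, state: fixed [d_k, d_v] matrix.
pred_v = k_t @ state                          # value the state currently returns for k_t
state = alpha * state + beta * outer(k_t, v_t - alpha * pred_v)  # delta rule: write the correction
out_t = q_t @ state                           # state is fixed-size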

The honest tradeoff: a fixed-size state has to compress the entire history, so linear attention has measurably worse recall on long-range precise lookups. You buy throughput; you pay a small recall tax.
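
To make the delta rule concrete, here is a minimal pure-Python sketch of one recurrent head. The names `gated_delta_step`, `alpha` (decay gate), and `beta` (write strength) are illustrative assumptions, not identifiers from the Qwen code, and real kernels are fused and batched. It shows that the state stays a fixed `d_k x d_v` matrix however many tokens arrive, and that writing a new value under an existing key overwrites the old one instead of adding to it.

```python
def vec_mat(vec, mat):
    """Row vector [d_k] times matrix [d_k][d_v] -> list of length d_v."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]


def gated_delta_step(state, k, v, alpha=1.0, beta=1.0):
    """One token of a gated delta-rule update on a fixed-size state.

    state: [d_k][d_v] matrix, k: key (assumed unit-norm), v: value.
    alpha decays the old state; beta scales the write.
    """
    predicted = vec_mat(k, state)  # what the state currently returns for k
    correction = [v_j - alpha * p_j for v_j, p_j in zip(v, predicted)]
    return [[alpha * state[i][j] + beta * k[i] * correction[j]
             for j in range(len(v))]
            for i in range(len(k))]


# Toy run: d_k = d_v = 2, two writes under the same key.
state = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
state = gated_delta_step(state, key, [1.0, 2.0])
state = gated_delta_step(state, key, [5.0, 6.0])  # overwrites, does not accumulate
readout = vec_mat(key, state)  # query with the same key
print(readout)  # [5.0, 6.0]
```

A plain additive update (state plus the outer product of key and value) would return `[6.0, 8.0]` for the same query, which is the blending that the delta rule is designed to avoid.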

The 10-of-512 MoE router

The feed-forward block at each MoE layer is replaced by 512 expert MLPs plus 1 shared expert. A small router network reads the token's hidden state, picks the 10 best experts, runs them, and weights their outputs. The shared expert always runs.

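# Per token, per MoE layer:
scores = router(hidden)                      # shape: [512]
top_k_scores, top_k_ids = topk(scores, 10)   # torch.topk-style: returns (values, indices)
top_k_weights = softmax(top_k_scores)        # normalize over the 10 chosen (one common choice)
expert_out = sum(top_k_weights[i] * experts[top_k_ids[i]](hidden)
                 for i in range(10))
out = expert_out + shared_expert(hidden)     # shared expert always runs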

Active parameters are what's billed per token: the router, the 10 chosen experts, the shared expert, and the rest of the layer. Total parameters are what's billed in GPU memory. For Qwen3-Coder-Next that's 3B active vs 80B total. Active sets inference FLOPs. Total sets the capacity ceiling.
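
The routing step can be made runnable at toy scale. The sketch below uses 4 experts and top-2 instead of 512 and top-10, scalar-scaling functions in place of expert MLPs, and a softmax over the selected logits as the weighting. All of these are simplifying assumptions, and `route_token` is a hypothetical name rather than anything from the model's code.

```python
import math


def route_token(hidden, router_weights, experts, shared_expert, top_k):
    """Route one token: pick top_k experts, mix their outputs, add the shared expert."""
    # One router logit per expert (dot product with the hidden state).
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]

    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(logits[i] for i in top_ids)
    exps = [math.exp(logits[i] - peak) for i in top_ids]
    weights = [e / sum(exps) for e in exps]

    out = list(shared_expert(hidden))  # the shared expert always runs
    for weight, idx in zip(weights, top_ids):
        expert_out = experts[idx](hidden)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out, top_ids, weights


# Toy setup: expert i multiplies the hidden state by (i + 1).
experts = [lambda h, s=i + 1: [s * x for x in h] for i in range(4)]
shared_expert = lambda h: list(h)
router_weights = [[float(i), 0.0] for i in range(4)]  # logits [0, 1, 2, 3] for hidden [1, 0]

out, top_ids, weights = route_token([1.0, 0.0], router_weights, experts,
                                    shared_expert, top_k=2)
print(top_ids)  # [3, 2]
```

Only the two selected experts and the shared expert are ever called for this token; the other two are never evaluated, which is the toy version of the active/total split.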

Why stacking them works for code

Most coding tasks have the same shape. The model needs the whole repo in context, but it only emits a few hundred tokens of patch. The interesting bits are concentrated: function signatures, import graphs, the file where the bug actually lives.

The 3:1 hybrid layout matches that shape. Three Gated DeltaNet layers cheaply scroll the long context into compressed state. One full-attention layer reassembles a precise global picture. Repeat twelve times. Meanwhile the MoE router picks code-specialized experts on demand — different experts likely fire for Python parsing vs. shell commands vs. SQL fragments — without paying the dense-model price.

Per 4-layer block:
[Gated DeltaNet → MoE]  ← cheap, recurrent state
[Gated DeltaNet → MoE]  ← cheap, recurrent state
[Gated DeltaNet → MoE]  ← cheap, recurrent state
[Gated Attention → MoE] ← full O(L) attention, global picture
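
The layout above is easy to generate programmatically. This minimal sketch builds the 48-layer schedule from the 3:1 pattern; the function and label names are made up for illustration and do not come from the model's config files.

```python
def build_layer_schedule(num_blocks=12, linear_per_block=3):
    """Return the attention type of each layer, in order."""
    schedule = []
    for _ in range(num_blocks):
        schedule.extend(["gated_deltanet"] * linear_per_block)
        schedule.append("gated_attention")
    return schedule


schedule = build_layer_schedule()
full_attention_layers = [i for i, kind in enumerate(schedule)
                         if kind == "gated_attention"]

print(len(schedule))                     # 48
print(schedule.count("gated_deltanet"))  # 36
print(full_attention_layers[:3])         # [3, 7, 11]
```

Under this layout only the 12 full-attention layers need a key/value cache that grows with context length; the 36 Gated DeltaNet layers carry a fixed-size state.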

The end-to-end numbers from the model card: 70.6 on SWE-Bench Verified, 44.3 on SWE-Bench Pro, 36.2 on Terminal-Bench 2.0. Pro is harder than Verified, and Terminal-Bench tests shell sessions rather than repo patches, so the drop is what you'd expect.

What it changes for builders

This is the first model we've looked at where the "small-active, large-total" framing translates cleanly to single-workstation deployment for autonomous coding agents. With 3B active, inference cost looks like a small dense model; with 80B total, capacity looks like a mid-size dense model; with 262K context, you can fit most real repos. Apache 2.0 means you can ship it inside a product.
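
As a rough sanity check on the single-workstation claim, the weight footprint is just total parameters times bytes per parameter. The sketch below uses the article's round figures (80B total, 3B active); the function name and the list of precisions are illustrative, and the estimate ignores KV cache, recurrent state, activations, and quantization overhead.

```python
TOTAL_PARAMS = 80e9   # round figure from the article
ACTIVE_PARAMS = 3e9   # round figure from the article


def weight_memory_gb(num_params, bits_per_param):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


footprints = {bits: weight_memory_gb(TOTAL_PARAMS, bits) for bits in (16, 8, 4)}
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

for bits, gb in footprints.items():
    print(f"{bits}-bit weights: ~{gb:.0f} GB")
print(f"active fraction per token: {active_fraction:.2%}")
```

On these assumptions all 80B weights have to be resident (about 160 GB at 16-bit, 40 GB at 4-bit) even though only about 3.75% of them are used for any one token, so memory, not compute, is the number to check against your hardware.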

The practical move for a learner: pull the weights, wire them into the SWE-Agent scaffold, and run on a few of your own GitHub issues. The benchmark number isn't the point. The point is feeling how the active/total split affects latency vs. quality on your actual code.

Caveats and open questions

Benchmark vs. reality. The 70.6 on SWE-Bench Verified is measured with the SWE-Agent scaffold doing the planning, tool calls, and retries. Don't expect the same pass rate from the raw model on a private repo with no scaffold.
Recall tax of linear attention. Long-range, precise needle-in-a-haystack retrieval is where Gated DeltaNet underperforms standard attention. Worth probing on your codebase.
Routing stability under fine-tuning. Sparse MoE routers can collapse to a handful of experts after light fine-tuning, especially with small batch sizes. If you fine-tune, watch the expert-utilization histograms.
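
For the routing-stability caveat, one simple way to watch utilization is to count how often each expert is selected and summarize the histogram with a normalized entropy. This is a sketch under assumptions: `routed_ids` stands for whatever per-token expert indices your training loop can log, and the helper names are hypothetical.

```python
import math
from collections import Counter


def expert_utilization(routed_ids, num_experts):
    """Fraction of all routing slots that went to each expert."""
    counts = Counter(idx for token in routed_ids for idx in token)
    total = sum(counts.values())
    return [counts.get(e, 0) / total for e in range(num_experts)]


def normalized_entropy(utilization):
    """1.0 for perfectly even load, approaching 0.0 as routing collapses."""
    entropy = -sum(p * math.log(p) for p in utilization if p > 0)
    return entropy / math.log(len(utilization))


# Toy logs: 4 experts, top-2 routing, two tokens each.
balanced = [[0, 1], [2, 3]]
collapsed = [[0, 1], [0, 1]]

balanced_score = normalized_entropy(expert_utilization(balanced, 4))
collapsed_score = normalized_entropy(expert_utilization(collapsed, 4))
print(round(balanced_score, 3), round(collapsed_score, 3))  # 1.0 0.5
```

Tracking this one number per MoE layer before and after fine-tuning gives an early warning; a steady drift downward is the cue to look at the full histogram.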

Full architecture detail and config files live on the Qwen3-Coder-Next Hugging Face card; the base model card has the same 48-layer hybrid layout if you want to compare the coding fine-tune to the general-purpose variant.