DEV Community

David

qwen3.6 scores 73.4 on SWE-bench with only 3B active parameters. here's why that matters.

Alibaba just released the Qwen3.6 family, and the first model is already turning heads. Qwen3.6-35B-A3B is a Mixture-of-Experts model with 35 billion total parameters — but only 3 billion are active at inference time.

That means it runs on an 8GB GPU. And it just scored 73.4 on SWE-bench Verified.

For context, Gemma4-31B — a dense model using all 31 billion parameters for every single token — scores 17.4 on the same benchmark. Qwen3.6 uses a tenth of the compute and scores four times higher.

the architecture is genuinely different

Most MoE models just slap a router on top of a standard transformer. Qwen3.6 does something more interesting.

Three out of every four layers use Gated DeltaNet — a linear attention mechanism that's significantly cheaper than standard attention. Only every fourth layer uses full Gated Attention with KV cache. This hybrid layout means you get near-full-attention quality at a fraction of the memory cost, especially on long contexts.
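The 3:1 layout is easy to picture. Here's a toy sketch of the alternating pattern (the 48-layer depth and the layer names are assumptions for illustration, not published config values):

```python
def layer_kinds(num_layers: int) -> list[str]:
    """Every fourth layer uses full gated attention (with KV cache);
    the other three use linear-attention Gated DeltaNet blocks."""
    return [
        "full_attention" if (i + 1) % 4 == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]

kinds = layer_kinds(48)
print(kinds.count("gated_deltanet"), kinds.count("full_attention"))  # 36 12
```

Only the 12 full-attention layers here would carry a KV cache, which is why long-context memory cost drops so sharply versus a standard transformer.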

The expert setup: 256 total experts, 8 routed + 1 shared active per token. That's where the 35B→3B compression comes from. Each token only touches the experts it needs.
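To make the "8 routed + 1 shared" idea concrete, here's a minimal softmax-over-top-k router — a common MoE routing scheme, used as an illustrative stand-in since the post doesn't describe Qwen3.6's exact router internals:

```python
import math
import random

NUM_EXPERTS, TOP_K = 256, 8  # from the post: 256 experts, 8 routed per token

def route(router_logits):
    """Pick the top-k experts for one token and softmax-normalize
    their weights. The shared expert is handled outside the router:
    it is always active, making 9 experts touched per token."""
    topk = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:TOP_K]
    m = max(router_logits[i] for i in topk)
    exps = [math.exp(router_logits[i] - m) for i in topk]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(topk, exps)]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
routed = route(logits)  # 8 (expert_id, weight) pairs
print(len(routed))      # 8
```

Each token's output is then a weighted sum over just those 8 expert FFNs plus the shared one — the other 247 experts do no work for that token.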

And it has vision built in. Not bolted on — the model is natively multimodal (Image-Text-to-Text). MMMU score of 81.7, RealWorldQA at 85.3.

the benchmarks that matter

I'm not going to dump every number. Here are the ones that actually tell you something:

SWE-bench Verified: 73.4 — this is the "can you autonomously fix real GitHub issues" test. The model reads the issue, understands the codebase, writes a fix, and runs the tests. 73.4 means it successfully fixes nearly three out of four real-world bugs thrown at it. Its predecessor (Qwen3.5-35B-A3B) scored 70.0. Gemma4-31B scored 17.4.

Terminal-Bench 2.0: 51.5 — agentic terminal coding. Can the model operate a terminal to solve coding tasks? Qwen3.6 beats its predecessor (40.5), the dense Qwen3.5-27B (41.6), and Gemma4-31B (42.9). An 11-point jump over the previous version is massive.

QwenWebBench: 1397 Elo — frontend artifact generation. The predecessor scored 978. A 400+ Elo jump in one generation. For chess players: that's going from a club player to a titled player.

GPQA Diamond: 86.0 — graduate-level science reasoning. This benchmark is built from physics, chemistry, and biology questions hard enough that PhD-level experts working outside their own subfield get only around a third of them right, even with web access. 86.0 is competitive with models many times this size.

MCPMark: 37.0 — general agent benchmark testing MCP (Model Context Protocol) tool use. Predecessor scored 27.0. Gemma4-31B scored 36.3. This model was clearly trained with agentic tool calling in mind.

what 3B active parameters actually means for your hardware

Here's the thing people keep getting wrong about MoE models. The total parameter count (35B) sets the model's knowledge capacity — how much it "knows" — and the size of the weights. The active parameter count (3B) sets the per-token compute, which is what determines inference speed.

So while the full set of weights has to live somewhere (the file on disk contains all 256 experts), each token's forward pass only computes through 9 of them. MoE-aware runtimes like llama.cpp exploit this by keeping inactive expert tensors in system RAM and running only the hot path on the GPU, which is how the VRAM figures below stay so low.
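The "tenth of the compute" claim from earlier is just arithmetic. A common back-of-the-envelope rule puts a decoder's per-token forward-pass FLOPs at roughly 2 × active parameters — treat the result as a ratio, not a measurement:

```python
# Rough per-token forward-pass compute: FLOPs ~= 2 * active params.
ACTIVE_MOE = 3e9     # Qwen3.6-35B-A3B: 3B parameters active per token
ACTIVE_DENSE = 31e9  # Gemma4-31B: every parameter active per token

ratio = (2 * ACTIVE_MOE) / (2 * ACTIVE_DENSE)
print(f"per-token compute: {ratio:.1%} of the dense model")  # 9.7%
```

That ratio is also why the speed profile feels like a 3B dense model even though the weights on disk are an order of magnitude larger.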

Practical VRAM requirements:

  • Q4_K_M quantized: ~6-8 GB — runs on an RTX 3060 12GB at 30+ tok/s
  • Q8_0 quantized: ~12-14 GB — RTX 4070 territory
  • FP8 official: ~35 GB — RTX 4090 or A6000
  • FP16 full: ~70 GB — multi-GPU

If you can run a 7B model, you can run this. The speed profile is similar to a 3B dense model, but the output quality is closer to a 30B+ dense model.

the real competition

Qwen3.6's real competition isn't Gemma4-31B. It's the proprietary models.

73.4 on SWE-bench Verified puts it in the same ballpark as frontier closed-source models — except this one is Apache 2.0 licensed, runs on consumer hardware, and never sends your code to anyone's server.

For coding specifically, the combination of high SWE-bench scores + strong terminal/agent capabilities + MCP support makes this arguably the best local coding model per compute dollar right now.

how to actually run it

The model just dropped, so GGUF quantizations are still rolling out; check HuggingFace for the latest.

Once GGUFs land, ollama run qwen3.6:35b-a3b should work.
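Once it's pulled, you can also hit it programmatically through Ollama's local REST API. A minimal sketch — the endpoint and fields are Ollama's standard `/api/generate` interface, but the model tag is an assumption until the GGUFs actually land:

```python
import json
import urllib.request

def build_request(prompt: str, model: str = "qwen3.6:35b-a3b"):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Write a Python function that reverses a linked list.")
# Uncomment with Ollama running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["response"])
```

Setting `"stream": False` returns the whole completion in one JSON object instead of line-delimited chunks, which is simpler for scripting.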

For a full desktop setup with model management, vision support, and a built-in coding agent, Locally Uncensored just shipped v2.3.3 with day-0 Qwen3.6 support. Open source, AGPL-3.0.

the bottom line

3B active parameters scoring 73.4 on SWE-bench is the kind of efficiency gain that changes what's possible on consumer hardware. A year ago you needed a 70B+ dense model or API access for this level of coding capability. Now it runs on a gaming laptop.

Apache 2.0. No strings attached.


Locally Uncensored is an open-source desktop app for running AI models locally — chat, coding agents, image gen, video gen. AGPL-3.0.
