Qwen just released Qwen3.6-35B-A3B — the first model in their 3.6 series. It's a Mixture-of-Experts model with 35 billion total parameters but only 3 billion active at inference time.
Translation: big-model quality at small-model speed. And this time it has vision built in.
## Why this model matters
The numbers speak for themselves:
- 73.4 on SWE-bench Verified — this is an agentic coding benchmark where the model autonomously fixes real GitHub issues. For reference, Gemma4-31B (a dense model with all 31B params active) scores 17.4. Qwen3.6 scores 4x higher with 10x fewer active parameters.
- 51.5 on Terminal-Bench 2.0 — agentic terminal coding. It beats Qwen3.5-27B (41.6), its own predecessor Qwen3.5-35B-A3B (40.5), and even Gemma4-31B (42.9).
- 1397 Elo on QwenWebBench — frontend artifact generation. The predecessor scored 978. That's a 400+ Elo jump in one generation.
- 86.0 on GPQA Diamond — graduate-level science reasoning. Competitive with models many times its size.
- Vision support — handles image-text-to-text tasks natively. MMMU score of 81.7, RealWorldQA at 85.3.
The full benchmark picture:
| Benchmark | Qwen3.6-35B-A3B | Qwen3.5-35B-A3B | Gemma4-31B | Qwen3.5-27B |
|---|---|---|---|---|
| SWE-bench Verified | 73.4 | 70.0 | 17.4 | 51.2 |
| Terminal-Bench 2.0 | 51.5 | 40.5 | 42.9 | 41.6 |
| SWE-bench Multilingual | 75.0 | 67.2 | 69.3 | 60.3 |
| QwenWebBench (Elo) | 1397 | 978 | 1178 | 1197 |
| NL2Repo | 29.4 | 20.5 | — | 27.3 |
| MCPMark | 37.0 | 27.0 | 36.3 | 15.5 |
| GPQA Diamond | 86.0 | 84.2 | 84.3 | 85.5 |
| MMMU | 81.7 | 81.4 | 80.4 | 82.3 |
## What's under the hood
This isn't just a bigger Qwen3.5. The architecture got meaningful upgrades:
**Gated DeltaNet attention** — 3 out of every 4 layers use linear attention (Gated DeltaNet); only every 4th layer uses full Gated Attention. This makes the model far more memory-efficient at long contexts.
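A rough sketch of why that matters: a standard attention layer's KV cache grows linearly with context length, while a linear-attention layer carries a fixed-size recurrent state no matter how long the context gets. The head counts and dimensions below are illustrative round numbers, not Qwen3.6's published config:

```python
# Per-layer inference memory at long context (illustrative dims, fp16 weights).

def full_attn_kv_bytes(ctx_len, n_kv_heads=8, head_dim=128, bytes_per=2):
    # Standard attention caches K and V for every past token: O(ctx_len).
    return 2 * ctx_len * n_kv_heads * head_dim * bytes_per

def linear_attn_state_bytes(n_heads=16, head_dim=128, bytes_per=2):
    # A linear-attention layer keeps one d x d state matrix per head,
    # independent of context length: O(1).
    return n_heads * head_dim * head_dim * bytes_per

ctx = 262_144  # the advertised native context
print(f"full attention KV cache @ 262K: {full_attn_kv_bytes(ctx) / 2**20:.1f} MiB")
print(f"linear attention state:         {linear_attn_state_bytes() / 2**20:.1f} MiB")
```

With 3 of every 4 layers paying only the fixed cost, the long-context cache bill shrinks dramatically compared to an all-full-attention stack.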
**256 experts, 9 active** — 8 routed experts plus 1 shared expert are active per token. That's where the "35B total, 3B active" split comes from: most of the model sits idle while only the relevant experts fire.
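A minimal sketch of how top-k expert routing works in any MoE layer (a generic illustration, not Qwen's actual router code): a learned gate scores all experts for each token, only the top scorers run, and their outputs are blended by the softmaxed gate weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, D = 256, 8, 16  # 8 routed experts per token, toy hidden dim

def moe_layer(x, gate_w, expert_ws):
    """One token through a toy MoE layer: only the top-k experts execute."""
    scores = x @ gate_w                # (N_EXPERTS,) router logits
    top = np.argsort(scores)[-TOP_K:]  # indices of the k highest-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()           # softmax over the selected experts only
    # Only these k expert FFNs run; the other 248 stay idle.
    # (Qwen's 1 shared expert would simply run unconditionally on top of this.)
    return sum(w * (x @ expert_ws[i]) for w, i in zip(weights, top))

x = rng.standard_normal(D)
gate_w = rng.standard_normal((D, N_EXPERTS))
expert_ws = rng.standard_normal((N_EXPERTS, D, D))
y = moe_layer(x, gate_w, expert_ws)
print(y.shape)  # (16,)
```

The output has the same shape as a dense FFN would produce, but the compute (and memory bandwidth) per token scales with k, not with the total expert count.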
**Vision encoder built in** — it's a true multimodal model (Image-Text-to-Text), not a text model with a bolted-on adapter.
**Thinking Preservation** — a new feature that retains reasoning context from previous messages, cutting overhead in iterative coding sessions.
**262K native context** — extensible beyond that.
**Apache 2.0 license** — fully open, commercial use allowed.
## Hardware requirements
The beauty of MoE: your hardware only needs to handle the active parameters, not the total count.
| Setup | VRAM needed | Expected speed |
|---|---|---|
| Q4_K_M quant | ~6-8 GB | 30+ tok/s on RTX 3060 12GB |
| Q8_0 quant | ~12-14 GB | 20+ tok/s on RTX 4070 |
| FP8 (official) | ~35 GB | RTX 4090 or A6000 |
| FP16 full | ~70 GB | Multi-GPU setup |
If you can run a 7B model, you can run this. The 3B active parameter count is the number that matters for speed.
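You can sanity-check those footprints yourself: the weight file size is total parameters times bits per weight, since every expert has to live somewhere (VRAM or system RAM), while only the active slice is read per token. A back-of-envelope sketch — the bits-per-weight values are approximate quant averages, and KV cache plus runtime overhead are ignored:

```python
TOTAL_PARAMS = 35e9   # all experts must be resident somewhere
ACTIVE_PARAMS = 3e9   # only these are read per token -> governs speed

def gib(params, bits_per_weight):
    """Weight footprint in GiB for a given parameter count and quant width."""
    return params * bits_per_weight / 8 / 2**30

for name, bits in [("Q4_K_M", 4.5), ("Q8_0", 8.5), ("FP8", 8.0), ("FP16", 16.0)]:
    print(f"{name:>7}: {gib(TOTAL_PARAMS, bits):5.1f} GiB total, "
          f"~{gib(ACTIVE_PARAMS, bits):4.1f} GiB touched per token")
```

This is also why the VRAM figures in the table can sit below the full file size: runtimes like llama.cpp can keep the hot shared weights on the GPU and offload cold experts to system RAM, and decoding speed stays high because only the active slice moves per token.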
## How to run it

### Option 1: Ollama (easiest)

```shell
ollama run qwen3.6:35b-a3b
```

Wait for GGUFs to appear — usually within hours of release. Check HuggingFace for the latest quantized versions.
### Option 2: vLLM

```shell
pip install vllm
vllm serve Qwen/Qwen3.6-35B-A3B --tensor-parallel-size 1
```
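`vllm serve` exposes an OpenAI-compatible endpoint (by default `http://localhost:8000/v1`), so any OpenAI-style client works against it. A minimal sketch of the chat request — the payload builder is just illustrative, and the actual POST is commented out so it only runs once the server is up:

```python
import json
# import requests  # uncomment to actually send the request

def build_chat_request(prompt, model="Qwen/Qwen3.6-35B-A3B", max_tokens=512):
    """Assemble an OpenAI-style chat completion payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Write a Python function that reverses a linked list.")
print(json.dumps(payload, indent=2))
# resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
# print(resp.json()["choices"][0]["message"]["content"])
```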
### Option 3: Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B")

# Quick smoke test once the weights are downloaded
messages = [{"role": "user", "content": "Explain MoE routing in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
print(tokenizer.decode(model.generate(inputs, max_new_tokens=128)[0], skip_special_tokens=True))
```
### Option 4: Locally Uncensored (full GUI + model management)
If you want a clean desktop app that handles downloading, model management, and chatting in one place:
- Grab Locally Uncensored — it's open source (AGPL-3.0)
- v2.3.3 just shipped with day-0 Qwen3.6 support
- Download the model directly from the app, pick your quantization, and start chatting
- Vision works out of the box — drag and drop images into the chat
- The new Codex mode with live streaming is particularly nice for coding tasks with this model
LU also has agent mode with 13 tools, remote access from your phone, and a bunch of other stuff that pairs well with an agentic model like this one.
## Who should care
- Local AI coders — if you use AI for coding and want to run it locally, this is now the best MoE option. 73.4 SWE-bench with 3B active params is absurd.
- Privacy-focused devs — Apache 2.0, runs on consumer hardware, no data leaves your machine.
- Multimodal users — built-in vision means one model for text AND image understanding.
- Anyone running Qwen3.5-35B-A3B — this is a straight upgrade. Same architecture class, better everything.
## The bottom line
Qwen3.6-35B-A3B is what happens when you optimize MoE properly. 3B active parameters shouldn't be this good, but here we are. The coding benchmarks in particular are hard to argue with — 73.4 on SWE-bench Verified puts it in the same league as much larger, closed-source models.
Weights are on HuggingFace. FP8 variant here. GGUFs incoming.
Locally Uncensored is an open-source desktop app for running AI models locally with full privacy. Handles model downloads, chat, coding agents, image generation, and more. AGPL-3.0 licensed.