David

How to run Qwen3.6-35B-A3B locally — the coding MoE that beats models 10x its active size

Qwen just released Qwen3.6-35B-A3B — the first model in their 3.6 series. It's a Mixture-of-Experts model with 35 billion total parameters but only 3 billion active at inference time.

Translation: big-model quality at small-model speed. And this time it has vision built in.

Why this model matters

The numbers speak for themselves:

  • 73.4 on SWE-bench Verified — this is an agentic coding benchmark where the model autonomously fixes real GitHub issues. For reference, Gemma4-31B (a dense model with all 31B params active) scores 17.4. Qwen3.6 scores 4x higher with 10x fewer active parameters.
  • 51.5 on Terminal-Bench 2.0 — agentic terminal coding. It beats Qwen3.5-27B (41.6), its own predecessor Qwen3.5-35B-A3B (40.5), and even Gemma4-31B (42.9).
  • 1397 Elo on QwenWebBench — frontend artifact generation. The predecessor scored 978. That's a 400+ Elo jump in one generation.
  • 86.0 on GPQA Diamond — graduate-level science reasoning. Competitive with models many times its size.
  • Vision support — handles image-text-to-text tasks natively. MMMU score of 81.7, RealWorldQA at 85.3.

The full benchmark picture:

Benchmark               Qwen3.6-35B-A3B   Qwen3.5-35B-A3B   Gemma4-31B   Qwen3.5-27B
SWE-bench Verified      73.4              70.0              17.4         51.2
Terminal-Bench 2.0      51.5              40.5              42.9        41.6
SWE-bench Multilingual  75.0              67.2              69.3        60.3
QwenWebBench (Elo)      1397              978               1178        1197
NL2Repo                 29.4              20.5              27.3
MCPMark                 37.0              27.0              36.3        15.5
GPQA Diamond            86.0              84.2              84.3        85.5
MMMU                    81.7              81.4              80.4        82.3

What's under the hood

This isn't just a bigger Qwen3.5. The architecture got meaningful upgrades:

Gated DeltaNet attention — 3 out of every 4 layers use linear attention (Gated DeltaNet) instead of standard attention. Only every 4th layer uses full Gated Attention. This makes it much more memory-efficient for long contexts.
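As a rough sketch, that 3:1 layout can be pictured as a repeating schedule. The 48-layer depth below is an assumption for illustration, not a stated spec:

```python
# Sketch of the attention schedule: three linear-attention (Gated DeltaNet)
# layers, then one full Gated Attention layer, repeating.
# The 48-layer depth is assumed for illustration only.
n_layers = 48
schedule = [
    "gated_attention" if (i + 1) % 4 == 0 else "gated_deltanet"
    for i in range(n_layers)
]

print(schedule[:4])                       # three gated_deltanet, then gated_attention
print(schedule.count("gated_deltanet"))   # 36 — linear attention in 3 of every 4 layers
```

Because the linear-attention layers keep constant-size state instead of a growing KV cache, memory for long contexts is dominated by only the full-attention layers.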

256 experts, 9 active — 8 routed + 1 shared expert active per token. That's where the "35B total, 3B active" comes from. Most of the model sits idle while only the relevant experts fire.
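To make "8 routed + 1 shared" concrete, here's a minimal top-k routing sketch in plain Python. This is purely illustrative; the real router is a learned layer inside the network, and the toy logits below stand in for its per-token output:

```python
import math

def route_token(router_logits, top_k=8):
    """Pick the top_k routed experts for one token and renormalize
    their softmax weights. Illustrative sketch, not Qwen's router."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top_k experts; their weights are renormalized to sum to 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    z = sum(probs[i] for i in top)
    return [(i, probs[i] / z) for i in top]

# Stand-in logits over 256 routed experts (real logits come from the network).
logits = [((i * 37) % 256) / 256.0 for i in range(256)]
routed = route_token(logits)
print(f"{len(routed)} routed + 1 shared = {len(routed) + 1} experts active")
# → 8 routed + 1 shared = 9 experts active
```

The shared expert always fires, so every token pays for 9 experts' worth of compute while 248 routed experts sit idle.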

Vision encoder built in — it's a true multimodal model (Image-Text-to-Text), not a text model with a bolted-on adapter.

Thinking Preservation — new feature that retains reasoning context from previous messages. Less overhead for iterative coding sessions.

262K native context — extensible beyond that.

Apache 2.0 license — fully open, commercial use allowed.

Hardware requirements

The beauty of MoE: per-token compute only touches the active parameters, so generation speed scales with 3B, not 35B. The full weights still have to live somewhere, but runtimes like llama.cpp can keep inactive experts in system RAM, which is how the low-VRAM setups below stay practical.

Setup           VRAM needed   Expected speed
Q4_K_M quant    ~6-8 GB       30+ tok/s on RTX 3060 12GB
Q8_0 quant      ~12-14 GB     20+ tok/s on RTX 4070
FP8 (official)  ~35 GB        RTX 4090 or A6000
FP16 full       ~70 GB        Multi-GPU setup

If you can run a 7B dense model comfortably, this will feel just as fast. The 3B active parameter count is the number that matters for tokens per second; the 35B total mostly determines how much combined RAM and VRAM you need.
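The back-of-envelope math behind that: token generation is mostly memory-bandwidth bound, so the bytes of weights read per token (the active 3B, not the full 35B) set the speed ceiling. The bandwidth figure below is a rough ballpark for an RTX 3060-class card, not a measured number:

```python
def decode_ceiling_tok_s(active_params_b, bits_per_weight, mem_bw_gb_s):
    """Upper bound on decode speed: memory bandwidth divided by the
    bytes of weights read per generated token. Ignores KV cache,
    activations, and kernel overhead, so real speeds are well below this."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bw_gb_s * 1e9 / bytes_per_token

# Assumed figures: 3B active params at Q4_K_M (~4.5 effective bits/weight),
# ~360 GB/s memory bandwidth (roughly RTX 3060 class).
ceiling = decode_ceiling_tok_s(3, 4.5, 360)
print(f"theoretical ceiling: ~{ceiling:.0f} tok/s")
```

Even after real-world overhead eats most of that headroom, there is plenty of room left for the 30+ tok/s figure in the table; a dense 35B model would have roughly a tenth of the ceiling on the same card.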

How to run it

Option 1: Ollama (easiest)

ollama run qwen3.6:35b-a3b

Wait for GGUFs to appear — usually within hours of release. Check HuggingFace for the latest quantized versions.

Option 2: vLLM

pip install vllm
vllm serve Qwen/Qwen3.6-35B-A3B --tensor-parallel-size 1
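vLLM serves an OpenAI-compatible API (on port 8000 by default), so any OpenAI client works against it. A minimal chat request body looks like this; it's only built and printed here, and you'd POST it to http://localhost:8000/v1/chat/completions once the server is up:

```python
import json

# Request payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# The prompt and sampling settings are just examples.
payload = {
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [
        {"role": "user", "content": "Refactor this recursive function to be iterative."}
    ],
    "temperature": 0.7,
    "max_tokens": 1024,
}
print(json.dumps(payload, indent=2))
```

With the server running, `curl -X POST http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d @payload.json` returns the completion.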

Option 3: Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B")

# Format the prompt with the model's chat template, then generate.
messages = [{"role": "user", "content": "Write a binary search in Python."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Option 4: Locally Uncensored (full GUI + model management)

If you want a clean desktop app that handles downloading, model management, and chatting in one place:

  1. Grab Locally Uncensored — it's open source (AGPL-3.0)
  2. v2.3.3 just shipped with day-0 Qwen3.6 support
  3. Download the model directly from the app, pick your quantization, and start chatting
  4. Vision works out of the box — drag and drop images into the chat
  5. The new Codex mode with live streaming is particularly nice for coding tasks with this model

LU also has agent mode with 13 tools, remote access from your phone, and a bunch of other stuff that pairs well with an agentic model like this one.

Who should care

  • Local AI coders — if you use AI for coding and want to run it locally, this is now the best MoE option. 73.4 SWE-bench with 3B active params is absurd.
  • Privacy-focused devs — Apache 2.0, runs on consumer hardware, no data leaves your machine.
  • Multimodal users — built-in vision means one model for text AND image understanding.
  • Anyone running Qwen3.5-35B-A3B — this is a straight upgrade. Same architecture class, better everything.

The bottom line

Qwen3.6-35B-A3B is what happens when you optimize MoE properly. 3B active parameters shouldn't be this good, but here we are. The coding benchmarks in particular are hard to argue with — 73.4 on SWE-bench Verified puts it in the same league as much larger, closed-source models.

Weights are on HuggingFace, including an official FP8 variant. GGUFs incoming.


Locally Uncensored is an open-source desktop app for running AI models locally with full privacy. Handles model downloads, chat, coding agents, image generation, and more. AGPL-3.0 licensed.
