David

How to run Qwen3.6-35B-A3B locally — the coding MoE that beats models 10x its active size

Qwen just released Qwen3.6-35B-A3B — the first model in their 3.6 series. It's a Mixture-of-Experts model with 35 billion total parameters but only 3 billion active at inference time.

Translation: big-model quality at small-model speed. And this time it has vision built in.

Why this model matters

The numbers speak for themselves:

  • 73.4 on SWE-bench Verified — this is an agentic coding benchmark where the model autonomously fixes real GitHub issues. For reference, Gemma4-31B (a dense model with all 31B params active) scores 17.4. Qwen3.6 scores 4x higher with 10x fewer active parameters.
  • 51.5 on Terminal-Bench 2.0 — agentic terminal coding. It beats Qwen3.5-27B (41.6), its own predecessor Qwen3.5-35B-A3B (40.5), and even Gemma4-31B (42.9).
  • 1397 Elo on QwenWebBench — frontend artifact generation. The predecessor scored 978. That's a 400+ Elo jump in one generation.
  • 86.0 on GPQA Diamond — graduate-level science reasoning. Competitive with models many times its size.
  • Vision support — handles image-text-to-text tasks natively. MMMU score of 81.7, RealWorldQA at 85.3.

The full benchmark picture:

Benchmark               Qwen3.6-35B-A3B   Qwen3.5-35B-A3B   Gemma4-31B   Qwen3.5-27B
SWE-bench Verified      73.4              70.0              17.4         51.2
Terminal-Bench 2.0      51.5              40.5              42.9        41.6
SWE-bench Multilingual  75.0              67.2              69.3        60.3
QwenWebBench (Elo)      1397              978               1178        1197
NL2Repo                 29.4              20.5              27.3
MCPMark                 37.0              27.0              36.3        15.5
GPQA Diamond            86.0              84.2              84.3        85.5
MMMU                    81.7              81.4              80.4        82.3

What's under the hood

This isn't just a bigger Qwen3.5. The architecture got meaningful upgrades:

Gated DeltaNet attention — 3 out of every 4 layers use linear attention (Gated DeltaNet) instead of standard attention. Only every 4th layer uses full Gated Attention. This makes it much more memory-efficient for long contexts.
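As a rough sketch, that 3:1 layout can be pictured as a repeating schedule. The 48-layer depth below is an assumption for illustration, not a stated spec:

```python
# Sketch of the attention schedule: three linear-attention (Gated DeltaNet)
# layers, then one full Gated Attention layer, repeating.
# The 48-layer depth is assumed for illustration only.
n_layers = 48
schedule = [
    "gated_attention" if (i + 1) % 4 == 0 else "gated_deltanet"
    for i in range(n_layers)
]

print(schedule[:4])                       # three gated_deltanet, then gated_attention
print(schedule.count("gated_deltanet"))   # 36 — linear attention in 3 of every 4 layers
```

Because the linear-attention layers keep constant-size state instead of a growing KV cache, memory for long contexts is dominated by only the full-attention layers.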

256 experts, 9 active — 8 routed + 1 shared expert active per token. That's where the "35B total, 3B active" comes from. Most of the model sits idle while only the relevant experts fire.
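To make "8 routed + 1 shared" concrete, here's a minimal top-k routing sketch in plain Python. This is purely illustrative; the real router is a learned layer inside the network, and the toy logits below stand in for its per-token output:

```python
import math

def route_token(router_logits, top_k=8):
    """Pick the top_k routed experts for one token and renormalize
    their softmax weights. Illustrative sketch, not Qwen's router."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top_k experts; their weights are renormalized to sum to 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    z = sum(probs[i] for i in top)
    return [(i, probs[i] / z) for i in top]

# Stand-in logits over 256 routed experts (real logits come from the network).
logits = [((i * 37) % 256) / 256.0 for i in range(256)]
routed = route_token(logits)
print(f"{len(routed)} routed + 1 shared = {len(routed) + 1} experts active")
# → 8 routed + 1 shared = 9 experts active
```

The shared expert always fires, so every token pays for 9 experts' worth of compute while 248 routed experts sit idle.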

Vision encoder built in — it's a true multimodal model (Image-Text-to-Text), not a text model with a bolted-on adapter.

Thinking Preservation — new feature that retains reasoning context from previous messages. Less overhead for iterative coding sessions.

262K native context — extensible beyond that.

Apache 2.0 license — fully open, commercial use allowed.

Hardware requirements

The beauty of MoE: per-token compute only touches the active parameters, so generation speed scales with 3B, not 35B. The full weights still have to live somewhere, but runtimes like llama.cpp can keep inactive experts in system RAM, which is how the low-VRAM setups below stay practical.

Setup           VRAM needed   Expected speed
Q4_K_M quant    ~6-8 GB       30+ tok/s on RTX 3060 12GB
Q8_0 quant      ~12-14 GB     20+ tok/s on RTX 4070
FP8 (official)  ~35 GB        RTX 4090 or A6000
FP16 full       ~70 GB        Multi-GPU setup

If you can run a 7B dense model comfortably, this will feel just as fast. The 3B active parameter count is the number that matters for tokens per second; the 35B total mostly determines how much combined RAM and VRAM you need.
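The back-of-envelope math behind that: token generation is mostly memory-bandwidth bound, so the bytes of weights read per token (the active 3B, not the full 35B) set the speed ceiling. The bandwidth figure below is a rough ballpark for an RTX 3060-class card, not a measured number:

```python
def decode_ceiling_tok_s(active_params_b, bits_per_weight, mem_bw_gb_s):
    """Upper bound on decode speed: memory bandwidth divided by the
    bytes of weights read per generated token. Ignores KV cache,
    activations, and kernel overhead, so real speeds are well below this."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bw_gb_s * 1e9 / bytes_per_token

# Assumed figures: 3B active params at Q4_K_M (~4.5 effective bits/weight),
# ~360 GB/s memory bandwidth (roughly RTX 3060 class).
ceiling = decode_ceiling_tok_s(3, 4.5, 360)
print(f"theoretical ceiling: ~{ceiling:.0f} tok/s")
```

Even after real-world overhead eats most of that headroom, there is plenty of room left for the 30+ tok/s figure in the table; a dense 35B model would have roughly a tenth of the ceiling on the same card.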

How to run it

Option 1: Ollama (easiest)

ollama run qwen3.6:35b-a3b

Wait for GGUFs to appear — usually within hours of release. Check HuggingFace for the latest quantized versions.

Option 2: vLLM

pip install vllm
vllm serve Qwen/Qwen3.6-35B-A3B --tensor-parallel-size 1
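vLLM serves an OpenAI-compatible API (on port 8000 by default), so any OpenAI client works against it. A minimal chat request body looks like this; it's only built and printed here, and you'd POST it to http://localhost:8000/v1/chat/completions once the server is up:

```python
import json

# Request payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# The prompt and sampling settings are just examples.
payload = {
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [
        {"role": "user", "content": "Refactor this recursive function to be iterative."}
    ],
    "temperature": 0.7,
    "max_tokens": 1024,
}
print(json.dumps(payload, indent=2))
```

With the server running, `curl -X POST http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d @payload.json` returns the completion.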

Option 3: Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B")

# Format the prompt with the model's chat template, then generate.
messages = [{"role": "user", "content": "Write a binary search in Python."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Option 4: Locally Uncensored (full GUI + model management)

If you want a clean desktop app that handles downloading, model management, and chatting in one place:

  1. Grab Locally Uncensored — it's open source (AGPL-3.0)
  2. v2.3.3 just shipped with day-0 Qwen3.6 support
  3. Download the model directly from the app, pick your quantization, and start chatting
  4. Vision works out of the box — drag and drop images into the chat
  5. The new Codex mode with live streaming is particularly nice for coding tasks with this model

LU also has agent mode with 13 tools, remote access from your phone, and a bunch of other stuff that pairs well with an agentic model like this one.

Who should care

  • Local AI coders — if you use AI for coding and want to run it locally, this is now the best MoE option. 73.4 SWE-bench with 3B active params is absurd.
  • Privacy-focused devs — Apache 2.0, runs on consumer hardware, no data leaves your machine.
  • Multimodal users — built-in vision means one model for text AND image understanding.
  • Anyone running Qwen3.5-35B-A3B — this is a straight upgrade. Same architecture class, better everything.

The bottom line

Qwen3.6-35B-A3B is what happens when you optimize MoE properly. 3B active parameters shouldn't be this good, but here we are. The coding benchmarks in particular are hard to argue with — 73.4 on SWE-bench Verified puts it in the same league as much larger, closed-source models.

Weights are on HuggingFace, including an official FP8 variant. GGUFs incoming.


Locally Uncensored is an open-source desktop app for running AI models locally with full privacy. Handles model downloads, chat, coding agents, image generation, and more. AGPL-3.0 licensed.
