Cross-posted from Best GPU for LLM — visit the original for our VRAM calculator, GPU comparison table, and current Amazon pricing.
79 tokens per second. That is what Qwen 3.6 35B-A3B delivers on an RTX 5070 Ti with 128K context — a number that would have been unthinkable for a 35B-class model a year ago. The secret is MoE: 35 billion total parameters, only ~3B active per token. It is the fastest high-quality model you can run on a 16GB card right now.
See the recommended pick on the original guide
Qwen 3.6 at a glance
| Spec | Value |
|---|---|
| Total parameters | 35B |
| Active parameters | ~3B per token |
| Architecture | Mixture of Experts (MoE) |
| Default context | 262K tokens |
| Recommended context | 128K for consumer GPUs |
| VRAM at Q4_K_M | ~13GB |
| VRAM at Q5_K_M | ~18GB |
| Ollama support | Yes (native) |
The 262K default context is impressive but aggressive for consumer hardware. We recommend capping at 128K unless you have 24GB+ VRAM — the KV cache for full 262K context can balloon past 20GB on its own.
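If you run the model through Ollama, the context cap can be baked into a custom model with a Modelfile. This is a sketch — the `FROM` tag below is a placeholder; substitute whatever tag the Qwen 3.6 Q4_K_M build actually ships under in the Ollama library.

```
# Modelfile — cap context at 128K instead of the 262K default.
# The tag below is hypothetical; replace it with the real Qwen 3.6 tag.
FROM qwen3.6:35b-a3b-q4_K_M

# 131072 tokens = 128K context
PARAMETER num_ctx 131072
```

Build it once with `ollama create qwen36-128k -f Modelfile`, then `ollama run qwen36-128k` always starts with the safer 128K window.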
VRAM requirements
| Quantization | Model weights | + KV cache (8K) | + KV cache (128K) |
|---|---|---|---|
| Q3_K_M | ~10GB | ~12GB | ~16GB |
| Q4_K_M | ~13GB | ~15GB | ~19GB |
| Q5_K_M | ~16GB | ~18GB | ~22GB |
| Q6_K | ~19GB | ~21GB | ~25GB |
| Q8_0 | ~25GB | ~27GB | ~31GB |
At Q4_K_M with 8K context, Qwen 3.6 fits cleanly on 16GB cards. Push context to 128K and you need 19GB+ — that means 24GB territory. This is the key VRAM planning decision: short context on 16GB, long context on 24GB.
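As a rough planning aid, the table above can be turned into a small estimator. This sketch linearly interpolates KV-cache growth between the table's 8K and 128K columns — the numbers are the article's approximations, not measurements of your setup, and the estimate is only meaningful inside that 8K–128K range (real usage also varies with OS/driver overhead and batch settings).

```python
# Rough VRAM estimator built from the table above.
# (weights_gb, kv_gb_at_8k, kv_gb_at_128k) per quantization level.
TABLE = {
    "Q3_K_M": (10, 2, 6),
    "Q4_K_M": (13, 2, 6),
    "Q5_K_M": (16, 2, 6),
    "Q6_K":   (19, 2, 6),
    "Q8_0":   (25, 2, 6),
}

def estimate_vram_gb(quant: str, ctx_tokens: int) -> float:
    """Weights plus KV cache, interpolated between the 8K and 128K columns."""
    weights, kv_8k, kv_128k = TABLE[quant]
    frac = (ctx_tokens - 8_192) / (131_072 - 8_192)
    kv = kv_8k + frac * (kv_128k - kv_8k)
    return weights + kv

print(round(estimate_vram_gb("Q4_K_M", 8_192)))    # 15 -> fits a 16GB card
print(round(estimate_vram_gb("Q4_K_M", 131_072)))  # 19 -> 24GB territory
```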
VRAM chart available at the original article
Performance benchmarks
Tested with Ollama at Q4_K_M, 128K context window:
| GPU | tok/s | VRAM used | Fits? | Price |
|---|---|---|---|---|
| RTX 5090 (32GB) | ~120 tok/s | ~15GB | Yes | ~$2,000 |
| RTX 4090 (24GB) | ~85 tok/s | ~15GB | Yes | ~$1,600 |
| RTX 5080 (16GB) | ~68 tok/s | ~15GB | Tight* | ~$1,000 |
| RTX 5070 Ti (16GB) | ~79 tok/s | ~15GB | Tight* | ~$750 |
| RTX 4070 Ti Super (16GB) | ~55 tok/s | ~15GB | Tight* | ~$700 |
| RTX 4060 Ti 16GB | ~40 tok/s | ~15GB | Tight* | ~$400 |
| RTX 3090 (24GB, used) | ~60 tok/s | ~15GB | Yes | ~$600 |
| RTX 3060 12GB (used) | Won't fit | — | No | ~$150 |
*16GB cards fit Q4_K_M weights but leave minimal headroom for long context. Reduce context to 8K for stable operation, or use Q3_K_M for more breathing room.
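To reproduce the tok/s figures yourself, Ollama's `--verbose` flag prints timing stats after each response. The model tag below is hypothetical — use whatever tag the Qwen 3.6 Q4_K_M build ships under.

```
# Hypothetical tag; substitute the real Qwen 3.6 Q4_K_M tag.
ollama run qwen3.6:35b-a3b-q4_K_M --verbose "Summarize the history of the GPU in three sentences."
```

The stats block at the end includes an `eval rate` line in tokens/s — that is the generation-speed number the table above reports.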
The RTX 5070 Ti outperforms the RTX 5080 here — that is not a typo. GDDR7 bandwidth scales differently across SKUs, and the 5070 Ti's memory subsystem is well-optimized for MoE workloads where only a fraction of weights are accessed per token.
MoE offloading: the 12GB GPU trick
Qwen 3.6 supports `--n-cpu-moe` in llama.cpp, which keeps the expert (MoE) weight tensors in system RAM while attention, norms, and shared layers run on the GPU. Because only ~3B parameters are active per token, the CPU-side expert computation stays tolerable — which means you can run the model on GPUs with less than 13GB VRAM.
Performance drops significantly (expect 15-25 tok/s on an RTX 3060 12GB with DDR5 RAM), but it works. This is a viable path for experimentation, though not recommended for daily use.
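A minimal llama.cpp invocation looks like this — the GGUF filename is a placeholder for whatever Qwen 3.6 quant you download:

```
# -ngl 99 offloads all layers to the GPU; --n-cpu-moe 99 then pins the
# expert weight tensors of those layers back into system RAM.
# Lower --n-cpu-moe if you have spare VRAM for some experts.
./llama-server -m qwen3.6-35b-a3b-Q4_K_M.gguf \
  -ngl 99 \
  --n-cpu-moe 99 \
  -c 8192
```

Keeping `-c` at 8K here is deliberate: on a 12GB card the KV cache competes with the non-expert layers for the little VRAM you have.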
Which GPU should you buy?
GPU tier list available at the original article
- **Best value: RTX 5070 Ti ($750).** 79 tok/s at Q4 is blazing fast for a $750 card. Handles Qwen 3.6 at reduced context with no issues. This is our top recommendation for most users.
- **Best for long context: RTX 4090 ($1,600).** 24GB VRAM means you can run 128K context at Q4 without worrying about OOM. 85 tok/s — smooth and fast.
- **Budget pick: RTX 4060 Ti 16GB ($400).** 40 tok/s at Q4 with short context. Usable for chat, slower for long-form generation. A solid entry point if $750 is too much.
- **Used market: RTX 3090 ($600).** 24GB VRAM, 60 tok/s. Older architecture but the VRAM headroom means full 128K context at Q4. The best price-per-VRAM option on the market.
- **12GB GPUs: Not recommended.** Qwen 3.6 needs 13GB at Q4. MoE offloading works but the performance penalty makes it impractical for regular use.
Common mistakes
- Using 262K context on 16GB cards. The default context is 262K, but the KV cache alone would need ~25GB. Cap at 8K-16K on 16GB cards, or 128K on 24GB cards.
- Comparing Qwen 3.6 to dense 35B models. The 3B active parameters mean it runs 3-4x faster than a dense 35B. Do not use dense model benchmarks to estimate Qwen 3.6 performance.
- Skipping Q4_K_M for Q8. Q8 pushes the model to ~25GB — out of reach for 16GB and tight on 24GB cards. The quality difference between Q4 and Q8 on MoE models is smaller than on dense models because the active weights are a small fraction of total parameters.
Final verdict
| Your goal | Best GPU | Price |
|---|---|---|
| Qwen 3.6 daily driver | RTX 5070 Ti | ~$750 |
| Long context (128K) | RTX 4090 | ~$1,600 |
| Tightest budget | RTX 4060 Ti 16GB | ~$400 |
| Best used market value | RTX 3090 | ~$600 |
For the previous generation, see our best GPU for Qwen 3 guide. Running models through Ollama? The best GPU for Ollama article covers multi-model setups. For VRAM planning across all model sizes, how much VRAM for local LLM has the full reference. And if you want to understand quantization trade-offs in depth, read best quantization for local LLM.
Related guides on Best GPU for LLM
- Best GPU for Qwen Models in 2026 (Qwen 3 + 3.6 Picks)
- Best Budget GPU for Local LLM in 2026 (Under $350)
- Best GPU for Continue.dev (Local AI Coding) in 2026
Continue on Best GPU for LLM for the complete guide with interactive calculators and current GPU prices.