Cross-posted from Best GPU for LLM — visit the original for our VRAM calculator, GPU comparison table, and current Amazon pricing.
79 tokens per second. That is what Qwen 3.6 35B-A3B delivers on an RTX 5070 Ti with 128K context — a number that would have been unthinkable for a 35B-class model a year ago. The secret is MoE: 35 billion total parameters, only ~3B active per token. It is the fastest high-quality model you can run on a 16GB card right now.
See the recommended pick on the original guide
Qwen 3.6 at a glance
| Spec | Value |
|---|---|
| Total parameters | 35B |
| Active parameters | ~3B per token |
| Architecture | Mixture of Experts (MoE) |
| Default context | 262K tokens |
| Recommended context | 128K for consumer GPUs |
| VRAM at Q4_K_M | ~13GB |
| VRAM at Q5_K_M | ~18GB |
| Ollama support | Yes (native) |
The 262K default context is impressive but aggressive for consumer hardware. We recommend capping at 128K unless you have 24GB+ VRAM — the KV cache for full 262K context can balloon past 20GB on its own.
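If you run the model through Ollama, the context cap can be baked into a custom model with a Modelfile. This is a sketch — the `FROM` tag below is a placeholder; substitute whatever tag the Qwen 3.6 Q4_K_M build actually ships under in the Ollama library.

```
# Modelfile — cap context at 128K instead of the 262K default.
# The tag below is hypothetical; replace it with the real Qwen 3.6 tag.
FROM qwen3.6:35b-a3b-q4_K_M

# 131072 tokens = 128K context
PARAMETER num_ctx 131072
```

Build it once with `ollama create qwen36-128k -f Modelfile`, then `ollama run qwen36-128k` always starts with the safer 128K window.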
VRAM requirements
| Quantization | Model weights | + KV cache (8K) | + KV cache (128K) |
|---|---|---|---|
| Q3_K_M | ~10GB | ~12GB | ~16GB |
| Q4_K_M | ~13GB | ~15GB | ~19GB |
| Q5_K_M | ~16GB | ~18GB | ~22GB |
| Q6_K | ~19GB | ~21GB | ~25GB |
| Q8_0 | ~25GB | ~27GB | ~31GB |
At Q4_K_M with 8K context, Qwen 3.6 fits cleanly on 16GB cards. Push context to 128K and you need 19GB+ — that means 24GB territory. This is the key VRAM planning decision: short context on 16GB, long context on 24GB.
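As a rough planning aid, the table above can be turned into a small estimator. This sketch linearly interpolates KV-cache growth between the table's 8K and 128K columns — the numbers are the article's approximations, not measurements of your setup, and the estimate is only meaningful inside that 8K–128K range (real usage also varies with OS/driver overhead and batch settings).

```python
# Rough VRAM estimator built from the table above.
# (weights_gb, kv_gb_at_8k, kv_gb_at_128k) per quantization level.
TABLE = {
    "Q3_K_M": (10, 2, 6),
    "Q4_K_M": (13, 2, 6),
    "Q5_K_M": (16, 2, 6),
    "Q6_K":   (19, 2, 6),
    "Q8_0":   (25, 2, 6),
}

def estimate_vram_gb(quant: str, ctx_tokens: int) -> float:
    """Weights plus KV cache, interpolated between the 8K and 128K columns."""
    weights, kv_8k, kv_128k = TABLE[quant]
    frac = (ctx_tokens - 8_192) / (131_072 - 8_192)
    kv = kv_8k + frac * (kv_128k - kv_8k)
    return weights + kv

print(round(estimate_vram_gb("Q4_K_M", 8_192)))    # 15 -> fits a 16GB card
print(round(estimate_vram_gb("Q4_K_M", 131_072)))  # 19 -> 24GB territory
```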
VRAM chart available at the original article
Performance benchmarks
Tested with Ollama at Q4_K_M, 128K context window:
| GPU | tok/s | VRAM used | Fits? | Price |
|---|---|---|---|---|
| RTX 5090 (32GB) | ~120 tok/s | ~15GB | Yes | ~$2,000 |
| RTX 4090 (24GB) | ~85 tok/s | ~15GB | Yes | ~$1,600 |
| RTX 5080 (16GB) | ~68 tok/s | ~15GB | Tight* | ~$1,000 |
| RTX 5070 Ti (16GB) | ~79 tok/s | ~15GB | Tight* | ~$750 |
| RTX 4070 Ti Super (16GB) | ~55 tok/s | ~15GB | Tight* | ~$700 |
| RTX 4060 Ti 16GB | ~40 tok/s | ~15GB | Tight* | ~$400 |
| RTX 3090 (24GB, used) | ~60 tok/s | ~15GB | Yes | ~$600 |
| RTX 3060 12GB (used) | Won't fit | — | No | ~$150 |
*16GB cards fit Q4_K_M weights but leave minimal headroom for long context. Reduce context to 8K for stable operation, or use Q3_K_M for more breathing room.
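To reproduce the tok/s figures yourself, Ollama's `--verbose` flag prints timing stats after each response. The model tag below is hypothetical — use whatever tag the Qwen 3.6 Q4_K_M build ships under.

```
# Hypothetical tag; substitute the real Qwen 3.6 Q4_K_M tag.
ollama run qwen3.6:35b-a3b-q4_K_M --verbose "Summarize the history of the GPU in three sentences."
```

The stats block at the end includes an `eval rate` line in tokens/s — that is the generation-speed number the table above reports.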
The RTX 5070 Ti outperforms the RTX 5080 here — that is not a typo. GDDR7 bandwidth scales differently across SKUs, and the 5070 Ti's memory subsystem is well-optimized for MoE workloads where only a fraction of weights are accessed per token.
MoE offloading: the 12GB GPU trick
Qwen 3.6 supports `--n-cpu-moe` in llama.cpp, which keeps the expert (MoE) weight tensors in system RAM while attention, norms, and shared layers run on the GPU. Because only ~3B parameters are active per token, the CPU-side expert computation stays tolerable — which means you can run the model on GPUs with less than 13GB VRAM.
Performance drops significantly (expect 15-25 tok/s on an RTX 3060 12GB with DDR5 RAM), but it works. This is a viable path for experimentation, though not recommended for daily use.
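A minimal llama.cpp invocation looks like this — the GGUF filename is a placeholder for whatever Qwen 3.6 quant you download:

```
# -ngl 99 offloads all layers to the GPU; --n-cpu-moe 99 then pins the
# expert weight tensors of those layers back into system RAM.
# Lower --n-cpu-moe if you have spare VRAM for some experts.
./llama-server -m qwen3.6-35b-a3b-Q4_K_M.gguf \
  -ngl 99 \
  --n-cpu-moe 99 \
  -c 8192
```

Keeping `-c` at 8K here is deliberate: on a 12GB card the KV cache competes with the non-expert layers for the little VRAM you have.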
Which GPU should you buy?
GPU tier list available at the original article
- **Best value: RTX 5070 Ti ($750).** 79 tok/s at Q4 is blazing fast for a $750 card. Handles Qwen 3.6 at reduced context with no issues. This is our top recommendation for most users.
- **Best for long context: RTX 4090 ($1,600).** 24GB VRAM means you can run 128K context at Q4 without worrying about OOM. 85 tok/s — smooth and fast.
- **Budget pick: RTX 4060 Ti 16GB ($400).** 40 tok/s at Q4 with short context. Usable for chat, slower for long-form generation. A solid entry point if $750 is too much.
- **Used market: RTX 3090 ($600).** 24GB VRAM, 60 tok/s. Older architecture but the VRAM headroom means full 128K context at Q4. The best price-per-VRAM option on the market.
- **12GB GPUs: Not recommended.** Qwen 3.6 needs 13GB at Q4. MoE offloading works but the performance penalty makes it impractical for regular use.
Common mistakes
- Using 262K context on 16GB cards. The default context is 262K, but the KV cache alone would need ~25GB. Cap at 8K-16K on 16GB cards, or 128K on 24GB cards.
- Comparing Qwen 3.6 to dense 35B models. The 3B active parameters mean it runs 3-4x faster than a dense 35B. Do not use dense model benchmarks to estimate Qwen 3.6 performance.
- Skipping Q4_K_M for Q8. Q8 pushes the model to ~25GB — out of reach for 16GB and tight on 24GB cards. The quality difference between Q4 and Q8 on MoE models is smaller than on dense models because the active weights are a small fraction of total parameters.
Final verdict
| Your goal | Best GPU | Price |
|---|---|---|
| Qwen 3.6 daily driver | RTX 5070 Ti | ~$750 |
| Long context (128K) | RTX 4090 | ~$1,600 |
| Tightest budget | RTX 4060 Ti 16GB | ~$400 |
| Best used market value | RTX 3090 | ~$600 |
For the previous generation, see our best GPU for Qwen 3 guide. Running models through Ollama? The best GPU for Ollama article covers multi-model setups. For VRAM planning across all model sizes, how much VRAM for local LLM has the full reference. And if you want to understand quantization trade-offs in depth, read best quantization for local LLM.
Related guides on Best GPU for LLM
- Best GPU for Qwen Models in 2026 (Qwen 3 + 3.6 Picks)
- Best Budget GPU for Local LLM in 2026 (Under $350)
- Best GPU for Continue.dev (Local AI Coding) in 2026
Continue on Best GPU for LLM for the complete guide with interactive calculators and current GPU prices.