Been running Qwen3-35B-A3B (MoE) with llama.cpp's Multi-Token Prediction
(MTP / speculative decoding) on an RTX 5090 under WSL2. Results surprised me:
| Model | Speed |
|---|---|
| Ollama stock (35B MoE) | 171 tok/s |
| 27B Dense + MTP | 104 tok/s |
| 35B MoE + MTP | 267 tok/s ← this |
For context: Claude Haiku runs ~150 tok/s via API, billed at $150/MTok.
This setup runs on electricity only.
The interesting finding is that MoE and speculative decoding have unusual
synergy. With a dense model, MTP gave a modest speedup (or none).
With MoE, it nearly doubled throughput.
My hypothesis: MoE's sparse activation pattern leaves compute headroom that
speculative decoding can exploit. The draft tokens are cheap to verify because
most experts stay inactive during verification passes.
Setup:
- RTX 5090, WSL2 (Ubuntu 24)
- llama.cpp with MTP draft, n-max 2
- Qwen3-35B-A3B-Instruct Q4_K_XL
- ctx 65536, OpenAI-compatible API on localhost
Happy to share the exact llama-server launch flags if anyone wants to reproduce.
Top comments (0)