267 tok/s local inference on RTX 5090 – llama.cpp MTP + Qwen3-35B-A3B MoE

#llm #machinelearning #llama #gpu

I've been running Qwen3-35B-A3B (MoE) with llama.cpp's Multi-Token Prediction
(MTP, a form of speculative decoding) on an RTX 5090 under WSL2. The results surprised me:

Configuration	Throughput
Ollama stock (35B MoE)	171 tok/s
27B Dense + MTP	104 tok/s
35B MoE + MTP	267 tok/s ← this

For context: Claude Haiku runs ~150 tok/s via API and is billed per token (on the order of a few dollars per MTok of output).
This setup runs on electricity only.
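
To put a rough number on "electricity only", here is a back-of-the-envelope sketch. Only the 267 tok/s figure comes from the table above. The 575 W draw is the RTX 5090's rated board power, used here as an assumed worst case, and the $0.15/kWh price is a hypothetical placeholder; swap in your own measurements.

```python
# Rough electricity cost per million generated tokens.
# Everything except the throughput is an assumption, not a measurement.

def cost_per_mtok(tok_per_s: float, watts: float, usd_per_kwh: float) -> float:
    """Return the electricity cost in USD to generate one million tokens."""
    seconds = 1_000_000 / tok_per_s      # time to produce 1M tokens
    kwh = watts * seconds / 3_600_000    # watt-seconds -> kilowatt-hours
    return kwh * usd_per_kwh

if __name__ == "__main__":
    # 267 tok/s: from the table. 575 W: assumed full board power.
    # 0.15 USD/kWh: hypothetical electricity price.
    print(f"${cost_per_mtok(267, 575, 0.15):.3f} per MTok")
```

Under those assumptions it works out to roughly nine cents per million tokens, ignoring hardware cost, idle power, and the rest of the machine.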

The interesting finding is that MoE and speculative decoding seem to have unusual
synergy. With a dense model, MTP gave a modest speedup (or none).
With MoE, throughput went from 171 to 267 tok/s, about 1.56x over the stock Ollama baseline.
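
For anyone who wants to check the ratios, a small sketch that computes them from the table above. One caveat on reading the numbers: the MoE baseline is stock Ollama rather than llama.cpp without MTP, so that ratio mixes runtime differences with the MTP effect.

```python
# Throughput figures (tok/s) copied from the table above.
results = {
    "Ollama stock (35B MoE)": 171,
    "27B Dense + MTP": 104,
    "35B MoE + MTP": 267,
}

def speedup(new: float, baseline: float) -> float:
    """Ratio of two throughputs; values above 1.0 mean `new` is faster."""
    return new / baseline

if __name__ == "__main__":
    best = results["35B MoE + MTP"]
    for name, tps in results.items():
        ratio = speedup(best, tps)
        print(f"{name:24s} {tps:4d} tok/s   35B MoE + MTP is {ratio:.2f}x this")
```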

My hypothesis: MoE's sparse activation pattern leaves compute headroom that
speculative decoding can exploit. The draft tokens are cheap to verify because
most experts stay inactive during verification passes.
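
One way to reason about this hypothesis is the standard toy model of speculative decoding. It is a sketch, not how llama.cpp implements MTP, and every parameter is hypothetical: `alpha` is the chance each drafted token is accepted (assumed independent), `k` is the number of drafted tokens, `draft_cost` is the cost of drafting one token relative to a normal decode step, and `verify_cost` is the cost of one verification pass relative to a normal decode step. In this model, the hypothesis amounts to `verify_cost` staying close to 1 for the MoE model.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass when k tokens are drafted
    and each is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def modeled_speedup(alpha: float, k: int,
                    draft_cost: float, verify_cost: float = 1.0) -> float:
    """Tokens per unit of work, relative to plain decoding (1 token per unit)."""
    return expected_tokens(alpha, k) / (k * draft_cost + verify_cost)

if __name__ == "__main__":
    # Hypothetical values, chosen only to show how verify_cost moves the result.
    for verify_cost in (1.0, 1.5, 2.0):
        s = modeled_speedup(alpha=0.8, k=3, draft_cost=0.05,
                            verify_cost=verify_cost)
        print(f"verify_cost={verify_cost:.1f} -> modeled speedup {s:.2f}x")
```

The point of the model is that the speedup falls quickly as `verify_cost` grows, so measuring how long a multi-token verification pass takes on each model would be a direct test of the hypothesis.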

Setup: