DEV Community

gen
gen

Posted on

267 tok/s local inference on RTX 5090 – llama.cpp MTP + Qwen3-35B-A3B MoE

Been running Qwen3-35B-A3B (MoE) with llama.cpp's Multi-Token Prediction
(MTP / speculative decoding) on an RTX 5090 under WSL2. Results surprised me:

Model Speed
Ollama stock (35B MoE) 171 tok/s
27B Dense + MTP 104 tok/s
35B MoE + MTP 267 tok/s ← this

For context: Claude Haiku runs ~150 tok/s via API, billed at $150/MTok.
This setup runs on electricity only.

The interesting finding is that MoE and speculative decoding have unusual
synergy. With a dense model, MTP gave a modest speedup (or none).
With MoE, it nearly doubled throughput.

My hypothesis: MoE's sparse activation pattern leaves compute headroom that
speculative decoding can exploit. The draft tokens are cheap to verify because
most experts stay inactive during verification passes.

Setup:

  • RTX 5090, WSL2 (Ubuntu 24)
  • llama.cpp with MTP draft, n-max 2
  • Qwen3-35B-A3B-Instruct Q4_K_XL
  • ctx 65536, OpenAI-compatible API on localhost

Happy to share the exact llama-server launch flags if anyone wants to reproduce.

Top comments (0)