I streamed Mixtral 8x7B from NVMe on a $0.40/hour VM and got 3.32 tps — here's how
Most people assume running Mixtral 8x7B requires an A100 with 80GB of VRAM. That's $2-3/hour minimum and most teams don't have access to it. I spent the last several months building MER, a Rust inference engine that takes a different approach: stream experts directly from NVMe on demand, cache the hot ones in RAM, and let the model be bigger than your memory.
This is the first real benchmark on actual Mixtral 8x7B Instruct weights. Here's what happened.
The architecture in one paragraph
MER uses O_DIRECT pread to stream per-expert weight files from NVMe into an LRU cache. A first-order Markov chain prefetcher learns routing patterns and pre-loads experts before they're needed. The expert dispatch uses AVX512 SwiGLU kernels on CPU with a GPU path via wgpu/CUDA in progress. The whole thing is Rust, BSL 1.1 licensed, and public on GitHub.
The benchmark setup
- Model: Mixtral 8x7B Instruct, Q4_0 quantization, 34.6 GiB extracted experts
- Hardware: GCP g2-standard-8, 1x NVIDIA L4 GPU (CPU inference only — GPU path coming), local NVMe SSD
- Two runs: 500 tokens (burst) and 10,000 tokens (sustained)
- 16 cache slots, O_DIRECT enabled, seed 12648430 (reproducible)
The numbers
500 token burst vs F32 baseline
| Metric | F32 baseline | Q4_0 burst |
|---|---|---|
| Sustained tps | 0.28 | 3.32 |
| Cache hit rate | 4.90% | 15.90% |
| I/O latency p50 | 2,654ms | 172ms |
| Compute p50 | 668ms | 111ms |
| NVMe throughput | 413 MiB/s | 622 MiB/s |
| Run duration | ~21 min | 2.5 min |
10,000 token sustained run
| Metric | Value |
|---|---|
| Sustained tps | 2.69 |
| Hit rate (steady state) | 15.56% |
| Prefetches completed | 6,661 |
| Total NVMe read | 2.17 TiB |
| I/O latency p50 | 281ms |
| Compute p50 | 111ms |
The hit rate stabilized at 15.56% across both runs confirming this is the true steady-state ceiling for 16 cache slots out of 256 experts. The compute p50 held rock solid at 111ms throughout — the variance is entirely in I/O, which is expected on virtualized storage.
What the numbers mean
The 12x tps improvement from F32 to Q4_0 comes entirely from smaller expert files — each expert shrinks from 672MB to 99MB, so each NVMe read completes 7x faster. The cache hit rate of 15.9% means roughly 1 in 6 expert loads was served from RAM with zero NVMe cost. The prefetcher completed 6,661 prefetches over 10k tokens — real work being done by the prediction system.
The I/O share of 69% means the engine is genuinely NVMe-bound as designed. On a real PCIe 4.0 NVMe doing 7 GB/s instead of GCP's local SSD, I/O latency would drop significantly, pushing tps to 6-7 on CPU alone. With the GPU inference path wired, the theoretical ceiling on this same hardware is around 20 tps.
What broke during benchmarking and how we fixed it
This wasn't a smooth run. Four bugs surfaced during testing:
Prefetch threshold miscalibration. The Markov prefetcher had a probability threshold calibrated for small expert pools. With 256 experts it was mathematically impossible to clear, so zero prefetches ever fired. Fixed by scaling the threshold relative to 1/num_experts.
Buffer pool semaphore decoupled from headroom. The semaphore allowed prefetch tasks to consume all available pool buffers and starve foreground fetches. Fixed by clamping permits to pool headroom minus one reserved slot for the critical path.
Spin-wait starvation. The foreground fetch used a yield_now() spin loop that gave up in microseconds, losing the race against prefetch tasks holding buffers for multi-millisecond NVMe reads. Fixed by replacing the spin with a proper async wait on the pool's Notify.
Q4_0 converter layout bug. The GGUF converter's native pass-through only handled interleaved tensor layout. TheBloke's GGUFs use per-expert layout, causing silent fallback to F32. Fixed by extending the converter to handle both layouts.
All fixes are in the public repo with full PR history.
What's next
The GPU inference path is the biggest pending item. The wgpu/CUDA backend exists and dispatches through the Backend trait, but needs the Q4_0 expert matmul wired through. On this same L4 VM with GPU compute, ~45ms I/O plus ~5ms GPU compute = roughly 20 tps theoretical ceiling.
Native Q4_K_M support is also pending, currently falls back to F32 dequant because the mixed dtype situation isn't handled yet.
The honest position
MER doesn't beat vLLM on an A100. It runs Mixtral 8x7B on hardware where it doesn't fit, at $0.40/hour, at 3.32 tps burst and 2.69 tps sustained today, with a clear path to 20+ tps once the GPU path is wired.
If you're a solo developer, a startup with GPU budget constraints, or building data-sovereign infrastructure that can't send data to cloud APIs, that's a different tradeoff than most tools offer.
Repo: https://github.com/randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER
If you work on inference infrastructure and want to talk, open an issue or find me on LinkedIn.
Top comments (0)