DEV Community

Randy AP

I built a Rust inference engine that streams MoE expert weights from NVMe SSDs, no GPU required

 Most people trying to run Mixtral or DeepSeek-V3 locally hit the same wall: they don't have 80GB of VRAM. The common answer is "get better hardware." I wanted to see if there was another way.
The idea is straightforward, and it builds on Apple's research paper "LLM in a flash: Efficient Large Language Model Inference with Limited Memory". NVMe SSDs have gotten fast enough (PCIe Gen5 arrays are hitting ~56 GB/s) that you can treat them as a first-class memory tier for LLM inference instead of just storage. For Mixture-of-Experts models specifically, this is interesting because at any given token step only 2 of the 8 experts in each layer are active. On Mixtral 8x7B at roughly 4 bits per weight, that's ~6GB of active weights, not the full ~24GB.
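To make the footprint arithmetic concrete, here is a minimal sketch. It assumes Mistral's published parameter counts for Mixtral 8x7B (about 46.7B total, about 12.9B used per token) and a flat 4 bits per weight with no allowance for quantization scales. The function name `weights_gb` is made up for this example and is not from the repo.

```rust
/// Approximate size in GB of `params_billions` billion parameters
/// stored at `bits_per_weight` bits each.
/// Assumption: ignores per-block scale overhead of real quant formats.
pub fn weights_gb(params_billions: f64, bits_per_weight: f64) -> f64 {
    params_billions * bits_per_weight / 8.0
}

/// Fraction of the model that has to be resident for a single token.
pub fn active_fraction(active_billions: f64, total_billions: f64) -> f64 {
    active_billions / total_billions
}
```

Calling `weights_gb(12.9, 4.0)` gives about 6.45 and `weights_gb(46.7, 4.0)` gives about 23.35, which lines up with the ~6GB and ~24GB figures above.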
Micro-Expert-Router is the result. It's a Rust inference engine that streams MoE expert weights directly from NVMe via io_uring with O_DIRECT, routes tokens through real SwiGLU FFN kernels, and exposes an OpenAI-compatible HTTP API with SSE streaming.
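For readers who haven't seen a SwiGLU FFN written out, here is a rough sketch of what one expert computes: `down(silu(gate(x)) * up(x))`. This is plain unquantized f32 with naive loops, not the repo's SIMD kernels, and the names `swiglu_ffn`, `matvec`, and the row-major weight layout are assumptions for illustration.

```rust
/// SiLU (swish) activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Row-major matrix-vector product; `w` holds `rows` rows of `x.len()` columns.
fn matvec(w: &[f32], rows: usize, x: &[f32]) -> Vec<f32> {
    let cols = x.len();
    assert_eq!(w.len(), rows * cols, "weight shape mismatch");
    (0..rows)
        .map(|r| {
            w[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

/// One expert's feed-forward pass.
/// `w_gate` and `w_up` are `hidden x dim`, `w_down` is `dim x hidden`.
pub fn swiglu_ffn(
    x: &[f32],
    w_gate: &[f32],
    w_up: &[f32],
    w_down: &[f32],
    hidden: usize,
) -> Vec<f32> {
    let gate = matvec(w_gate, hidden, x);
    let up = matvec(w_up, hidden, x);
    // Element-wise gating: silu(gate) * up.
    let act: Vec<f32> = gate.iter().zip(&up).map(|(g, u)| silu(*g) * u).collect();
    matvec(w_down, x.len(), &act)
}
```

In an MoE layer the router picks the top experts for a token, runs something like this for each one, and sums the outputs weighted by the router scores.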
What's in it:

SSD-streamed expert loading via io_uring fixed buffers and O_DIRECT pread
Multi-tier expert cache: SSD → RAM (LRU with pinning) → VRAM
Q4_0, Q4K, Q8_0, F16 quantization with AVX2/AVX-512/AMX dispatch
Speculative decoding with a draft engine tied to the main model embeddings
Continuous batching with weighted round-robin admission
SafeTensors loader, SIGHUP hot reload, TUI dashboard, Helm chart
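The "LRU with pinning" policy of the RAM tier can be sketched roughly like this. It is a simplified illustration, not the repo's code: `ExpertCache`, `Entry`, the `u32` expert ids, and counting capacity in experts rather than bytes are all assumptions, and the eviction scan is linear for brevity.

```rust
use std::collections::HashMap;

/// One cached expert (hypothetical layout).
struct Entry {
    weights: Vec<u8>,
    last_used: u64,
    pinned: bool,
}

/// RAM-tier cache holding at most `capacity` experts.
pub struct ExpertCache {
    capacity: usize,
    tick: u64,
    entries: HashMap<u32, Entry>,
}

impl ExpertCache {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, tick: 0, entries: HashMap::new() }
    }

    pub fn contains(&self, id: u32) -> bool {
        self.entries.contains_key(&id)
    }

    /// Returns the expert's bytes on a hit and marks it most recently used.
    pub fn get(&mut self, id: u32) -> Option<&[u8]> {
        self.tick += 1;
        let entry = self.entries.get_mut(&id)?;
        entry.last_used = self.tick;
        Some(entry.weights.as_slice())
    }

    /// Inserts an expert, evicting the least recently used unpinned one if full.
    /// Returns false when the cache is full and every entry is pinned.
    pub fn insert(&mut self, id: u32, weights: Vec<u8>, pinned: bool) -> bool {
        if !self.entries.contains_key(&id) && self.entries.len() >= self.capacity {
            let victim = self
                .entries
                .iter()
                .filter(|(_, e)| !e.pinned)
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| *k);
            match victim {
                Some(k) => {
                    self.entries.remove(&k);
                }
                None => return false,
            }
        }
        self.tick += 1;
        self.entries.insert(id, Entry { weights, last_used: self.tick, pinned });
        true
    }
}
```

Pinning is what keeps always-needed weights (for example shared layers) from being pushed out by a burst of rarely used experts. A miss in this tier is what triggers the SSD read.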

Honest disclaimer on the numbers:
I don't have the hardware to run full benchmarks yet. The telemetry figures in the repo (11–15 tokens/sec across edge workstation, sovereign box, and RPC sharded cluster topologies) are theoretical ceilings derived from active weight footprint and raw NVMe sequential bandwidth at an 80% cache hit rate, not measured results. Cold I/O latency projections range from 108ms on a quad Gen5 U.2 array up to 1010ms on a PCIe Gen4 M.2 drive. The closest published prior art is Apple's "LLM in a flash" paper; this project is an attempt at an open-source, runnable implementation of that idea.
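As a sanity check on the cold I/O figures, here is the back-of-the-envelope version: time to read the active weights once at sequential bandwidth. This is my own simplified model, not the formula used in the repo. It assumes ~6GB of active weights and ignores queue depth, alignment, and filesystem overhead, and the function names are made up for the example.

```rust
/// Milliseconds to stream `gigabytes` once at `gb_per_sec` sequential bandwidth.
pub fn cold_read_ms(gigabytes: f64, gb_per_sec: f64) -> f64 {
    gigabytes / gb_per_sec * 1000.0
}

/// Bandwidth in GB/s implied by reading `gigabytes` in `millis` milliseconds.
pub fn implied_gb_per_sec(gigabytes: f64, millis: f64) -> f64 {
    gigabytes / (millis / 1000.0)
}
```

With these assumptions, `cold_read_ms(6.0, 56.0)` comes out at about 107ms, close to the 108ms projection for the Gen5 array, and `implied_gb_per_sec(6.0, 1010.0)` is about 5.9 GB/s, which is what the 1010ms figure implies for a single Gen4 drive.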
The code is all there if you have the hardware to test it. I'd genuinely love to know if the projections hold.
GitHub: Micro Expert Router
