DEV Community: Randy AP

Running Mixtral 8x7B at 21+ TPS on Pure CPU via io_uring and Predictive Caching

Randy AP — Thu, 04 Jun 2026 19:21:38 +0000

The current consensus in AI infrastructure is unyielding: if you want to run frontier Mixture of Experts (MoE) models at usable human-reading inference speeds, you must pay the VRAM premium. The entire model footprint is traditionally pinned into high-bandwidth GPU memory arrays to prevent execution pipelines from grinding to a halt.

At Amalgafy Labs, we built the Micro-Expert-Router (MER) to challenge this assumption.

We wanted to prove that with low-level systems engineering, an intelligent software abstraction layer can turn cheap, abundant, commodity CPU-heavy cloud shapes into high-throughput inference engines.

Yesterday, we took the engine out of the "proven on paper" phase and validated it on live cloud silicon. Running Mixtral 8x7B (47B parameters, q4_0 quantization) on a standard virtual machine utilizing pure host CPU execution, the engine delivered a sustained 21.38 Tokens Per Second (TPS) over a massive 5,000-token context window.

The full source code is now open-source on GitHub: randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER.

The Evaluation Substrate

The benchmark was executed inside an isolated virtual machine environment under strict compute constraints:

Compute Engine: Pure Host CPU execution using native AVX-512 vector extensions. Zero active GPU VRAM or Tensor Cores were utilized for the Feed-Forward Network (FFN) layers.
Memory Footprint: Standard cloud instance profile allocated with 128 GB of System RAM.
Storage Substrate: Attached Local NVMe SSD bypassing standard OS file system overhead via kernel-level io_uring and O_DIRECT asynchronous queues.
Target Model: Mixtral 8x7B (MoE architecture, 46.7B total parameters, Top-2 expert routing per token step).
Precision Format: 4-bit quantization layout (dtype=q4_0).
Test Profile: Continuous execution over 5,000 tokens (Seed: 12648430).

Raw Telemetry Logs


text
2026-06-04T15:10:41.520446Z  INFO stream complete wall_s=233.828879846 sustained_tps=21.383158501605987 avg_throughput_mibps=103.46455072587074 hit_rate_pct=97.46000000000001
2026-06-04T15:10:41.520511Z  INFO ===================== run summary =====================
2026-06-04T15:10:41.520519Z  INFO experts:       256 (top-2), cache=256 slots, pool=258 slots
2026-06-04T15:10:41.520522Z  INFO ffn shape:      d_model=4096  d_ff=14336  bytes/expert=99090432 (dtype=q4_0)
2026-06-04T15:10:41.520534Z  INFO lookups:       hits=9746  misses=254  hit_rate=97.46%
2026-06-04T15:10:41.520540Z  INFO prefetches:    completed=2  predictor_observations=19996
2026-06-04T15:10:41.520546Z  INFO i/o:           reads=254  bytes=24193.00 MiB
2026-06-04T15:10:41.520557Z  INFO i/o latency:   p50=116543us  p95=233599us  p99=360191us
2026-06-04T15:10:41.520563Z  INFO compute:       p50=40255us  p95=41631us  p99=60735us  (SwiGLU FFN per token)
2026-06-04T15:10:41.520569Z  INFO cycle latency: p50=40287us  p95=42047us  p99=286975us  max=431615us
2026-06-04T15:10:41.520576Z  INFO per-token avg: io_wait=5772.7us  compute=40850.5us  (over 5000 tokens)
2026-06-04T15:10:41.520582Z  INFO I/O share:     12.37% of token cycle time spent waiting on SSD reads
2026-06-04T15:10:41.520588Z  INFO energy knobs:  dtype=q4_0  partial_load_fraction=1.00  pinned=0  alias_redirects=0
2026-06-04T15:10:41.520595Z  INFO =======================================================

I streamed Mixtral 8x7B from NVMe on a $0.40/hour VM and got 3.32 tps, here's how

Randy AP — Mon, 01 Jun 2026 05:50:18 +0000

I streamed Mixtral 8x7B from NVMe on a $0.40/hour VM and got 3.32 tps — here's how

Most people assume running Mixtral 8x7B requires an A100 with 80GB of VRAM. That's $2-3/hour minimum and most teams don't have access to it. I spent the last several months building MER, a Rust inference engine that takes a different approach: stream experts directly from NVMe on demand, cache the hot ones in RAM, and let the model be bigger than your memory.

This is the first real benchmark on actual Mixtral 8x7B Instruct weights. Here's what happened.

The architecture in one paragraph

MER uses O_DIRECT pread to stream per-expert weight files from NVMe into an LRU cache. A first-order Markov chain prefetcher learns routing patterns and pre-loads experts before they're needed. The expert dispatch uses AVX512 SwiGLU kernels on CPU with a GPU path via wgpu/CUDA in progress. The whole thing is Rust, BSL 1.1 licensed, and public on GitHub.

The benchmark setup

Model: Mixtral 8x7B Instruct, Q4_0 quantization, 34.6 GiB extracted experts
Hardware: GCP g2-standard-8, 1x NVIDIA L4 GPU (CPU inference only — GPU path coming), local NVMe SSD
Two runs: 500 tokens (burst) and 10,000 tokens (sustained)
16 cache slots, O_DIRECT enabled, seed 12648430 (reproducible)

The numbers

500 token burst vs F32 baseline

Metric	F32 baseline	Q4_0 burst
Sustained tps	0.28	3.32
Cache hit rate	4.90%	15.90%
I/O latency p50	2,654ms	172ms
Compute p50	668ms	111ms
NVMe throughput	413 MiB/s	622 MiB/s
Run duration	~21 min	2.5 min

10,000 token sustained run

Metric	Value
Sustained tps	2.69
Hit rate (steady state)	15.56%
Prefetches completed	6,661
Total NVMe read	2.17 TiB
I/O latency p50	281ms
Compute p50	111ms

The hit rate stabilized at 15.56% across both runs confirming this is the true steady-state ceiling for 16 cache slots out of 256 experts. The compute p50 held rock solid at 111ms throughout — the variance is entirely in I/O, which is expected on virtualized storage.

What the numbers mean

The 12x tps improvement from F32 to Q4_0 comes entirely from smaller expert files — each expert shrinks from 672MB to 99MB, so each NVMe read completes 7x faster. The cache hit rate of 15.9% means roughly 1 in 6 expert loads was served from RAM with zero NVMe cost. The prefetcher completed 6,661 prefetches over 10k tokens — real work being done by the prediction system.

The I/O share of 69% means the engine is genuinely NVMe-bound as designed. On a real PCIe 4.0 NVMe doing 7 GB/s instead of GCP's local SSD, I/O latency would drop significantly, pushing tps to 6-7 on CPU alone. With the GPU inference path wired, the theoretical ceiling on this same hardware is around 20 tps.

What broke during benchmarking and how we fixed it

This wasn't a smooth run. Four bugs surfaced during testing:

Prefetch threshold miscalibration. The Markov prefetcher had a probability threshold calibrated for small expert pools. With 256 experts it was mathematically impossible to clear, so zero prefetches ever fired. Fixed by scaling the threshold relative to 1/num_experts.

Buffer pool semaphore decoupled from headroom. The semaphore allowed prefetch tasks to consume all available pool buffers and starve foreground fetches. Fixed by clamping permits to pool headroom minus one reserved slot for the critical path.

Spin-wait starvation. The foreground fetch used a yield_now() spin loop that gave up in microseconds, losing the race against prefetch tasks holding buffers for multi-millisecond NVMe reads. Fixed by replacing the spin with a proper async wait on the pool's Notify.

Q4_0 converter layout bug. The GGUF converter's native pass-through only handled interleaved tensor layout. TheBloke's GGUFs use per-expert layout, causing silent fallback to F32. Fixed by extending the converter to handle both layouts.

All fixes are in the public repo with full PR history.

What's next

The GPU inference path is the biggest pending item. The wgpu/CUDA backend exists and dispatches through the Backend trait, but needs the Q4_0 expert matmul wired through. On this same L4 VM with GPU compute, ~45ms I/O plus ~5ms GPU compute = roughly 20 tps theoretical ceiling.

Native Q4_K_M support is also pending, currently falls back to F32 dequant because the mixed dtype situation isn't handled yet.

The honest position

MER doesn't beat vLLM on an A100. It runs Mixtral 8x7B on hardware where it doesn't fit, at $0.40/hour, at 3.32 tps burst and 2.69 tps sustained today, with a clear path to 20+ tps once the GPU path is wired.

If you're a solo developer, a startup with GPU budget constraints, or building data-sovereign infrastructure that can't send data to cloud APIs, that's a different tradeoff than most tools offer.

Repo: https://github.com/randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER

If you work on inference infrastructure and want to talk, open an issue or find me on LinkedIn.

I built a Rust inference engine that streams MoE expert weights from NVMe SSDs, no GPU required

Randy AP — Wed, 27 May 2026 03:32:21 +0000

Most people trying to run Mixtral or DeepSeek-V3 locally hit the same wall: they don't have 80GB of VRAM. The common answer is "get better hardware." I wanted to see if there was another way.
The idea is straightforward. Based on Apple’s landmark research paper, titled "LLM in a flash: Efficient Large Language Model Inference with Limited Memory" NVMe SSDs have gotten fast enough, PCIe Gen5 arrays are hitting ~56 GB/s, so you can treat them as a first-class memory tier for LLM inference instead of just storage. For Mixture-of-Experts models specifically, this is interesting because at any given token step, you only need 2 of 8 experts active. That's ~6GB of active weights on Mixtral 8x7B, not 24GB.
Micro-Expert-Router is the result. It's a Rust inference engine that streams MoE expert weights directly from NVMe via io_uring with O_DIRECT, routes tokens through real SwiGLU FFN kernels, and exposes an OpenAI-compatible HTTP API with SSE streaming.
What's in it:

SSD-streamed expert loading via io_uring fixed buffers and O_DIRECT pread
Multi-tier expert cache: SSD → RAM (LRU with pinning) → VRAM
Q4_0, Q4K, Q8_0, F16 quantization with AVX2/AVX-512/AMX dispatch
Speculative decoding with a draft engine tied to the main model embeddings
Continuous batching with weighted round-robin admission
SafeTensors loader, SIGHUP hot reload, TUI dashboard, Helm chart

Honest disclaimer on the numbers:
I don't have the hardware to run full benchmarks yet. The telemetry figures in the repo (11–15 tokens/sec across edge workstation, sovereign box, and RPC sharded cluster topologies) are theoretical ceilings derived from active weight footprint and raw NVMe sequential bandwidth at 80% cache hit rate, not measured results. Cold I/O latency projections range from 108ms on a Quad Gen5 U.2 array down to 1010ms on a PCIe Gen4 M.2. The closest published prior art is Apple's LLM in a Flash paper, this is an attempt at an open source runnable implementation of that idea.
The code is all there if you have the hardware to test it. I'd genuinely love to know if the projections hold.
GitHub: Micro Expert Router