Running Mixtral 8x7B at 21+ TPS on Pure CPU via io_uring and Predictive Caching

#rust #ai #architecture #performance

The current consensus in AI infrastructure is unyielding: if you want to run frontier Mixture of Experts (MoE) models at usable human-reading inference speeds, you must pay the VRAM premium. The entire model footprint is traditionally pinned into high-bandwidth GPU memory arrays to prevent execution pipelines from grinding to a halt.

At Amalgafy Labs, we built the Micro-Expert-Router (MER) to challenge this assumption.

We wanted to prove that with low-level systems engineering, an intelligent software abstraction layer can turn cheap, abundant, commodity CPU-heavy cloud shapes into high-throughput inference engines.

Yesterday, we took the engine out of the "proven on paper" phase and validated it on live cloud silicon. Running Mixtral 8x7B (47B parameters, q4_0 quantization) on a standard virtual machine utilizing pure host CPU execution, the engine delivered a sustained 21.38 Tokens Per Second (TPS) over a massive 5,000-token context window.

The full source code is now open-source on GitHub: randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER.

The Evaluation Substrate

The benchmark was executed inside an isolated virtual machine environment under strict compute constraints:

Compute Engine: Pure Host CPU execution using native AVX-512 vector extensions. Zero active GPU VRAM or Tensor Cores were utilized for the Feed-Forward Network (FFN) layers.
Memory Footprint: Standard cloud instance profile allocated with 128 GB of System RAM.
Storage Substrate: Attached Local NVMe SSD bypassing standard OS file system overhead via kernel-level io_uring and O_DIRECT asynchronous queues.
Target Model: Mixtral 8x7B (MoE architecture, 46.7B total parameters, Top-2 expert routing per token step).
Precision Format: 4-bit quantization layout (dtype=q4_0).
Test Profile: Continuous execution over 5,000 tokens (Seed: 12648430).

Raw Telemetry Logs


text
2026-06-04T15:10:41.520446Z  INFO stream complete wall_s=233.828879846 sustained_tps=21.383158501605987 avg_throughput_mibps=103.46455072587074 hit_rate_pct=97.46000000000001
2026-06-04T15:10:41.520511Z  INFO ===================== run summary =====================
2026-06-04T15:10:41.520519Z  INFO experts:       256 (top-2), cache=256 slots, pool=258 slots
2026-06-04T15:10:41.520522Z  INFO ffn shape:      d_model=4096  d_ff=14336  bytes/expert=99090432 (dtype=q4_0)
2026-06-04T15:10:41.520534Z  INFO lookups:       hits=9746  misses=254  hit_rate=97.46%
2026-06-04T15:10:41.520540Z  INFO prefetches:    completed=2  predictor_observations=19996
2026-06-04T15:10:41.520546Z  INFO i/o:           reads=254  bytes=24193.00 MiB
2026-06-04T15:10:41.520557Z  INFO i/o latency:   p50=116543us  p95=233599us  p99=360191us
2026-06-04T15:10:41.520563Z  INFO compute:       p50=40255us  p95=41631us  p99=60735us  (SwiGLU FFN per token)
2026-06-04T15:10:41.520569Z  INFO cycle latency: p50=40287us  p95=42047us  p99=286975us  max=431615us
2026-06-04T15:10:41.520576Z  INFO per-token avg: io_wait=5772.7us  compute=40850.5us  (over 5000 tokens)
2026-06-04T15:10:41.520582Z  INFO I/O share:     12.37% of token cycle time spent waiting on SSD reads
2026-06-04T15:10:41.520588Z  INFO energy knobs:  dtype=q4_0  partial_load_fraction=1.00  pinned=0  alias_redirects=0
2026-06-04T15:10:41.520595Z  INFO =======================================================

Top comments (1)

mote • Jun 10

97.46% cache hit rate with only 256 slots for 256 experts is wild. Are you using LRU or something smarter? With top-2 routing, the access pattern should be Zipfian — a simple frequency counter might beat LRU since expert popularity doesn't shift intra-sequence.

The io_uring + O_DIRECT choice vs llama.cpp's mmap is the interesting tradeoff here. mmap gives you zero-copy reads when pages are hot, but page faults under memory pressure kill latency predictability. Your O_DIRECT approach gives you deterministic I/O latency at the cost of explicit buffer management. For 128GB RAM vs a 99MB-per-expert model, you had headroom — any particular reason you went with direct I/O instead of just trusting the page cache?

One thing I'd love to see benchmarked: q8_0 on the same hardware. The FFN compute is 87.6% of latency, and AVX-512 throughput doubles with 8-bit vs 4-bit — the question is whether the doubled I/O offsets the compute gain.