The prevailing view in AI infrastructure is firm: if you want to run frontier Mixture of Experts (MoE) models at speeds that keep pace with human reading, you must pay the VRAM premium. The entire model is traditionally pinned in high-bandwidth GPU memory so that the execution pipeline never stalls waiting on weights.
At Amalgafy Labs, we built the Micro-Expert-Router (MER) to challenge this assumption.
We wanted to show that, with careful low-level systems engineering, a software layer can turn cheap, abundant, CPU-heavy commodity cloud instances into high-throughput inference engines.
Yesterday, we took the engine out of the "proven on paper" phase and validated it on live cloud silicon. Running Mixtral 8x7B (46.7B parameters, q4_0 quantization) on a standard virtual machine using pure host CPU execution, the engine delivered a sustained 21.38 tokens per second (TPS) over a continuous 5,000-token run.
The full source code is now available on GitHub: randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER.
The Evaluation Substrate
The benchmark was executed inside an isolated virtual machine environment under strict compute constraints:
- Compute Engine: Pure host CPU execution using native AVX-512 vector extensions. No GPU VRAM or Tensor Cores were used for the Feed-Forward Network (FFN) layers.
- Memory Footprint: Standard cloud instance profile with 128 GB of system RAM.
- Storage Substrate: Attached local NVMe SSD, bypassing standard OS file system overhead via kernel-level `io_uring` and `O_DIRECT` asynchronous queues.
- Target Model: Mixtral 8x7B (MoE architecture, 46.7B total parameters, Top-2 expert routing per token step).
- Precision Format: 4-bit quantization layout (`dtype=q4_0`).
- Test Profile: Continuous execution over 5,000 tokens (Seed: `12648430`).
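To make the storage and precision bullets concrete, here is a minimal Python sketch, not the engine's actual code. It derives the size of one q4_0 expert from Mixtral's published FFN dimensions (4096 x 14336, three matrices per SwiGLU expert; q4_0 packs each block of 32 weights into 18 bytes) and then reads a block-aligned range with O_DIRECT. The assumptions are worth spelling out: the names `q4_0_bytes`, `expert_offset`, and `read_direct` are hypothetical, the back-to-back expert file layout is invented for illustration, the 4096-byte block size is a guess about the device, and the read is a synchronous `os.preadv` on Linux because the Python standard library has no io_uring binding.

```python
import mmap
import os

BLOCK = 4096  # assumed logical block size; O_DIRECT needs aligned offset, length, buffer


def q4_0_bytes(n_weights):
    """Size of n_weights in q4_0: every block of 32 weights takes 18 bytes."""
    return n_weights // 32 * 18


# One SwiGLU expert holds three d_model x d_ff matrices (gate, up, down).
EXPERT_BYTES = q4_0_bytes(3 * 4096 * 14336)


def expert_offset(expert_id):
    """Hypothetical layout: experts stored back to back in one flat file."""
    return expert_id * EXPERT_BYTES


def read_direct(path, offset, length):
    """Read `length` bytes at `offset`, bypassing the page cache when O_DIRECT works."""
    if length <= 0 or offset % BLOCK or length % BLOCK:
        raise ValueError("offset and length must be positive multiples of BLOCK")
    buf = mmap.mmap(-1, length)  # anonymous mappings are page-aligned
    try:
        with memoryview(buf) as view:
            # Try O_DIRECT first; fall back to buffered I/O where the
            # filesystem rejects it (tmpfs, for example).
            for flags in (os.O_RDONLY | getattr(os, "O_DIRECT", 0), os.O_RDONLY):
                try:
                    fd = os.open(path, flags)
                except OSError:
                    continue
                try:
                    # A production reader would loop on short reads.
                    n = os.preadv(fd, [view], offset)
                    return bytes(view[:n])
                except OSError:
                    continue
                finally:
                    os.close(fd)
        raise OSError(f"could not read {path}")
    finally:
        buf.close()
```

Under these assumptions one expert comes to 99,090,432 bytes, which is exactly 24,192 blocks of 4 KiB, so a back-to-back layout keeps every expert read aligned without padding.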
Raw Telemetry Logs
```text
2026-06-04T15:10:41.520446Z INFO stream complete wall_s=233.828879846 sustained_tps=21.383158501605987 avg_throughput_mibps=103.46455072587074 hit_rate_pct=97.46000000000001
2026-06-04T15:10:41.520511Z INFO ===================== run summary =====================
2026-06-04T15:10:41.520519Z INFO experts: 256 (top-2), cache=256 slots, pool=258 slots
2026-06-04T15:10:41.520522Z INFO ffn shape: d_model=4096 d_ff=14336 bytes/expert=99090432 (dtype=q4_0)
2026-06-04T15:10:41.520534Z INFO lookups: hits=9746 misses=254 hit_rate=97.46%
2026-06-04T15:10:41.520540Z INFO prefetches: completed=2 predictor_observations=19996
2026-06-04T15:10:41.520546Z INFO i/o: reads=254 bytes=24193.00 MiB
2026-06-04T15:10:41.520557Z INFO i/o latency: p50=116543us p95=233599us p99=360191us
2026-06-04T15:10:41.520563Z INFO compute: p50=40255us p95=41631us p99=60735us (SwiGLU FFN per token)
2026-06-04T15:10:41.520569Z INFO cycle latency: p50=40287us p95=42047us p99=286975us max=431615us
2026-06-04T15:10:41.520576Z INFO per-token avg: io_wait=5772.7us compute=40850.5us (over 5000 tokens)
2026-06-04T15:10:41.520582Z INFO I/O share: 12.37% of token cycle time spent waiting on SSD reads
2026-06-04T15:10:41.520588Z INFO energy knobs: dtype=q4_0 partial_load_fraction=1.00 pinned=0 alias_redirects=0
2026-06-04T15:10:41.520595Z INFO =======================================================
```
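The headline figures can be re-derived from the raw counters in the run summary. The sketch below is illustrative Python only: the variable names are ours, and `LOG` holds trimmed copies of four of the telemetry lines with the timestamps removed. It parses the `key=value` fields and recomputes throughput, hit rate, and I/O share.

```python
import re

# Trimmed copies of four telemetry lines (timestamps and log level removed).
LOG = """\
stream complete wall_s=233.828879846 sustained_tps=21.383158501605987 avg_throughput_mibps=103.46455072587074
lookups: hits=9746 misses=254 hit_rate=97.46%
i/o: reads=254 bytes=24193.00 MiB
per-token avg: io_wait=5772.7us compute=40850.5us (over 5000 tokens)
"""
TOKENS = 5000  # from "over 5000 tokens"

# Collect every numeric key=value pair; unit suffixes such as "us" are ignored.
fields = {k: float(v) for k, v in re.findall(r"(\w+)=(\d+(?:\.\d+)?)", LOG)}

lookups = fields["hits"] + fields["misses"]

tps = TOKENS / fields["wall_s"]                     # tokens per wall-clock second
hit_rate_pct = 100 * fields["hits"] / lookups       # cache hits over all lookups
lookups_per_token = lookups / TOKENS                # expert lookups per token
throughput_mibps = fields["bytes"] / fields["wall_s"]  # MiB read per second
io_share_pct = 100 * fields["io_wait"] / (fields["io_wait"] + fields["compute"])

print(f"tps={tps:.2f} hit_rate={hit_rate_pct:.2f}% "
      f"lookups/token={lookups_per_token:.1f} "
      f"throughput={throughput_mibps:.2f} MiB/s io_share={io_share_pct:.2f}%")
```

The recomputed TPS, hit rate, and average throughput agree with the reported values, and 10,000 lookups over 5,000 tokens works out to two per token, consistent with top-2 routing. The I/O share comes out at roughly 12.4% from the rounded per-token averages, close to the reported 12.37%.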