<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Randy AP</title>
    <description>The latest articles on DEV Community by Randy AP (@randyap8wq).</description>
    <link>https://dev.to/randyap8wq</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3953463%2F3915f243-ce56-4d89-b38d-e4c122e88d1f.png</url>
      <title>DEV Community: Randy AP</title>
      <link>https://dev.to/randyap8wq</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/randyap8wq"/>
    <language>en</language>
    <item>
      <title>Running Mixtral 8x7B at 21+ TPS on Pure CPU via io_uring and Predictive Caching</title>
      <dc:creator>Randy AP</dc:creator>
      <pubDate>Thu, 04 Jun 2026 19:21:38 +0000</pubDate>
      <link>https://dev.to/randyap8wq/running-mixtral-8x7b-at-21-tps-on-pure-cpu-via-iouring-and-predictive-caching-50cd</link>
      <guid>https://dev.to/randyap8wq/running-mixtral-8x7b-at-21-tps-on-pure-cpu-via-iouring-and-predictive-caching-50cd</guid>
      <description>&lt;p&gt;The current consensus in AI infrastructure is unyielding: if you want to run frontier Mixture of Experts (MoE) models at usable human-reading inference speeds, you must pay the VRAM premium. The entire model footprint is traditionally pinned into high-bandwidth GPU memory arrays to prevent execution pipelines from grinding to a halt.&lt;/p&gt;

&lt;p&gt;At &lt;strong&gt;Amalgafy Labs&lt;/strong&gt;, we built the &lt;strong&gt;Micro-Expert-Router (MER)&lt;/strong&gt; to challenge this assumption. &lt;/p&gt;

&lt;p&gt;We wanted to prove that with low-level systems engineering, an intelligent software abstraction layer can turn cheap, abundant, commodity CPU-heavy cloud shapes into high-throughput inference engines.&lt;/p&gt;

&lt;p&gt;Yesterday, we took the engine out of the "proven on paper" phase and validated it on live cloud silicon. Running &lt;strong&gt;Mixtral 8x7B (47B parameters, &lt;code&gt;q4_0&lt;/code&gt; quantization)&lt;/strong&gt; on a standard virtual machine utilizing &lt;strong&gt;pure host CPU execution&lt;/strong&gt;, the engine delivered a sustained &lt;strong&gt;21.38 Tokens Per Second (TPS)&lt;/strong&gt; over a massive 5,000-token context window.&lt;/p&gt;

&lt;p&gt;The full source code is now open-source on GitHub: &lt;a href="https://github.com/randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER" rel="noopener noreferrer"&gt;randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Evaluation Substrate
&lt;/h2&gt;

&lt;p&gt;The benchmark was executed inside an isolated virtual machine environment under strict compute constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compute Engine:&lt;/strong&gt; Pure Host CPU execution using native &lt;strong&gt;AVX-512 vector extensions&lt;/strong&gt;. &lt;em&gt;Zero active GPU VRAM or Tensor Cores were utilized for the Feed-Forward Network (FFN) layers.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Footprint:&lt;/strong&gt; Standard cloud instance profile allocated with &lt;strong&gt;128 GB of System RAM&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Substrate:&lt;/strong&gt; Attached Local NVMe SSD bypassing standard OS file system overhead via kernel-level &lt;strong&gt;&lt;code&gt;io_uring&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;O_DIRECT&lt;/code&gt;&lt;/strong&gt; asynchronous queues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target Model:&lt;/strong&gt; Mixtral 8x7B (MoE architecture, 46.7B total parameters, Top-2 expert routing per token step).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision Format:&lt;/strong&gt; 4-bit quantization layout (&lt;code&gt;dtype=q4_0&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test Profile:&lt;/strong&gt; Continuous execution over &lt;strong&gt;5,000 tokens&lt;/strong&gt; (Seed: &lt;code&gt;12648430&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Raw Telemetry Logs
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
text
2026-06-04T15:10:41.520446Z  INFO stream complete wall_s=233.828879846 sustained_tps=21.383158501605987 avg_throughput_mibps=103.46455072587074 hit_rate_pct=97.46000000000001
2026-06-04T15:10:41.520511Z  INFO ===================== run summary =====================
2026-06-04T15:10:41.520519Z  INFO experts:       256 (top-2), cache=256 slots, pool=258 slots
2026-06-04T15:10:41.520522Z  INFO ffn shape:      d_model=4096  d_ff=14336  bytes/expert=99090432 (dtype=q4_0)
2026-06-04T15:10:41.520534Z  INFO lookups:       hits=9746  misses=254  hit_rate=97.46%
2026-06-04T15:10:41.520540Z  INFO prefetches:    completed=2  predictor_observations=19996
2026-06-04T15:10:41.520546Z  INFO i/o:           reads=254  bytes=24193.00 MiB
2026-06-04T15:10:41.520557Z  INFO i/o latency:   p50=116543us  p95=233599us  p99=360191us
2026-06-04T15:10:41.520563Z  INFO compute:       p50=40255us  p95=41631us  p99=60735us  (SwiGLU FFN per token)
2026-06-04T15:10:41.520569Z  INFO cycle latency: p50=40287us  p95=42047us  p99=286975us  max=431615us
2026-06-04T15:10:41.520576Z  INFO per-token avg: io_wait=5772.7us  compute=40850.5us  (over 5000 tokens)
2026-06-04T15:10:41.520582Z  INFO I/O share:     12.37% of token cycle time spent waiting on SSD reads
2026-06-04T15:10:41.520588Z  INFO energy knobs:  dtype=q4_0  partial_load_fraction=1.00  pinned=0  alias_redirects=0
2026-06-04T15:10:41.520595Z  INFO =======================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>rust</category>
      <category>ai</category>
      <category>architecture</category>
      <category>performance</category>
    </item>
    <item>
      <title>I streamed Mixtral 8x7B from NVMe on a $0.40/hour VM and got 3.32 tps, here's how</title>
      <dc:creator>Randy AP</dc:creator>
      <pubDate>Mon, 01 Jun 2026 05:50:18 +0000</pubDate>
      <link>https://dev.to/randyap8wq/i-streamed-mixtral-8x7b-from-nvme-on-a-040hour-vm-and-got-332-tps-heres-how-19bf</link>
      <guid>https://dev.to/randyap8wq/i-streamed-mixtral-8x7b-from-nvme-on-a-040hour-vm-and-got-332-tps-heres-how-19bf</guid>
      <description>&lt;h1&gt;
  
  
  I streamed Mixtral 8x7B from NVMe on a $0.40/hour VM and got 3.32 tps — here's how
&lt;/h1&gt;

&lt;p&gt;Most people assume running Mixtral 8x7B requires an A100 with 80GB of VRAM. That's $2-3/hour minimum and most teams don't have access to it. I spent the last several months building MER, a Rust inference engine that takes a different approach: stream experts directly from NVMe on demand, cache the hot ones in RAM, and let the model be bigger than your memory.&lt;/p&gt;

&lt;p&gt;This is the first real benchmark on actual Mixtral 8x7B Instruct weights. Here's what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture in one paragraph
&lt;/h2&gt;

&lt;p&gt;MER uses O_DIRECT pread to stream per-expert weight files from NVMe into an LRU cache. A first-order Markov chain prefetcher learns routing patterns and pre-loads experts before they're needed. The expert dispatch uses AVX512 SwiGLU kernels on CPU with a GPU path via wgpu/CUDA in progress. The whole thing is Rust, BSL 1.1 licensed, and public on GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  The benchmark setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Model: Mixtral 8x7B Instruct, Q4_0 quantization, 34.6 GiB extracted experts&lt;/li&gt;
&lt;li&gt;Hardware: GCP g2-standard-8, 1x NVIDIA L4 GPU (CPU inference only — GPU path coming), local NVMe SSD&lt;/li&gt;
&lt;li&gt;Two runs: 500 tokens (burst) and 10,000 tokens (sustained)&lt;/li&gt;
&lt;li&gt;16 cache slots, O_DIRECT enabled, seed 12648430 (reproducible)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  500 token burst vs F32 baseline
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;F32 baseline&lt;/th&gt;
&lt;th&gt;Q4_0 burst&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sustained tps&lt;/td&gt;
&lt;td&gt;0.28&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.32&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache hit rate&lt;/td&gt;
&lt;td&gt;4.90%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;15.90%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;I/O latency p50&lt;/td&gt;
&lt;td&gt;2,654ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;172ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute p50&lt;/td&gt;
&lt;td&gt;668ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;111ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVMe throughput&lt;/td&gt;
&lt;td&gt;413 MiB/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;622 MiB/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run duration&lt;/td&gt;
&lt;td&gt;~21 min&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.5 min&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  10,000 token sustained run
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sustained tps&lt;/td&gt;
&lt;td&gt;2.69&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hit rate (steady state)&lt;/td&gt;
&lt;td&gt;15.56%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefetches completed&lt;/td&gt;
&lt;td&gt;6,661&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total NVMe read&lt;/td&gt;
&lt;td&gt;2.17 TiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;I/O latency p50&lt;/td&gt;
&lt;td&gt;281ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute p50&lt;/td&gt;
&lt;td&gt;111ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The hit rate stabilized at 15.56% across both runs confirming this is the true steady-state ceiling for 16 cache slots out of 256 experts. The compute p50 held rock solid at 111ms throughout — the variance is entirely in I/O, which is expected on virtualized storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the numbers mean
&lt;/h2&gt;

&lt;p&gt;The 12x tps improvement from F32 to Q4_0 comes entirely from smaller expert files — each expert shrinks from 672MB to 99MB, so each NVMe read completes 7x faster. The cache hit rate of 15.9% means roughly 1 in 6 expert loads was served from RAM with zero NVMe cost. The prefetcher completed 6,661 prefetches over 10k tokens — real work being done by the prediction system.&lt;/p&gt;

&lt;p&gt;The I/O share of 69% means the engine is genuinely NVMe-bound as designed. On a real PCIe 4.0 NVMe doing 7 GB/s instead of GCP's local SSD, I/O latency would drop significantly, pushing tps to 6-7 on CPU alone. With the GPU inference path wired, the theoretical ceiling on this same hardware is around 20 tps.&lt;/p&gt;

&lt;h2&gt;
  
  
  What broke during benchmarking and how we fixed it
&lt;/h2&gt;

&lt;p&gt;This wasn't a smooth run. Four bugs surfaced during testing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prefetch threshold miscalibration.&lt;/strong&gt; The Markov prefetcher had a probability threshold calibrated for small expert pools. With 256 experts it was mathematically impossible to clear, so zero prefetches ever fired. Fixed by scaling the threshold relative to 1/num_experts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Buffer pool semaphore decoupled from headroom.&lt;/strong&gt; The semaphore allowed prefetch tasks to consume all available pool buffers and starve foreground fetches. Fixed by clamping permits to pool headroom minus one reserved slot for the critical path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spin-wait starvation.&lt;/strong&gt; The foreground fetch used a yield_now() spin loop that gave up in microseconds, losing the race against prefetch tasks holding buffers for multi-millisecond NVMe reads. Fixed by replacing the spin with a proper async wait on the pool's Notify.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q4_0 converter layout bug.&lt;/strong&gt; The GGUF converter's native pass-through only handled interleaved tensor layout. TheBloke's GGUFs use per-expert layout, causing silent fallback to F32. Fixed by extending the converter to handle both layouts.&lt;/p&gt;

&lt;p&gt;All fixes are in the public repo with full PR history.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The GPU inference path is the biggest pending item. The wgpu/CUDA backend exists and dispatches through the Backend trait, but needs the Q4_0 expert matmul wired through. On this same L4 VM with GPU compute, ~45ms I/O plus ~5ms GPU compute = roughly 20 tps theoretical ceiling.&lt;/p&gt;

&lt;p&gt;Native Q4_K_M support is also pending, currently falls back to F32 dequant because the mixed dtype situation isn't handled yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest position
&lt;/h2&gt;

&lt;p&gt;MER doesn't beat vLLM on an A100. It runs Mixtral 8x7B on hardware where it doesn't fit, at $0.40/hour, at 3.32 tps burst and 2.69 tps sustained today, with a clear path to 20+ tps once the GPU path is wired.&lt;/p&gt;

&lt;p&gt;If you're a solo developer, a startup with GPU budget constraints, or building data-sovereign infrastructure that can't send data to cloud APIs, that's a different tradeoff than most tools offer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER" rel="noopener noreferrer"&gt;https://github.com/randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you work on inference infrastructure and want to talk, open an issue or find me on LinkedIn.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>performance</category>
      <category>rust</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I built a Rust inference engine that streams MoE expert weights from NVMe SSDs, no GPU required</title>
      <dc:creator>Randy AP</dc:creator>
      <pubDate>Wed, 27 May 2026 03:32:21 +0000</pubDate>
      <link>https://dev.to/randyap8wq/i-built-a-rust-inference-engine-that-streams-moe-expert-weights-from-nvme-ssds-no-gpu-required-3bie</link>
      <guid>https://dev.to/randyap8wq/i-built-a-rust-inference-engine-that-streams-moe-expert-weights-from-nvme-ssds-no-gpu-required-3bie</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F989tsu88kkq3xvlco6w7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F989tsu88kkq3xvlco6w7.png" alt=" " width="800" height="638"&gt;&lt;/a&gt;Most people trying to run Mixtral or DeepSeek-V3 locally hit the same wall: they don't have 80GB of VRAM. The common answer is "get better hardware." I wanted to see if there was another way.&lt;br&gt;
The idea is straightforward. Based on Apple’s landmark research paper, titled "LLM in a flash: Efficient Large Language Model Inference with Limited Memory" NVMe SSDs have gotten fast enough, PCIe Gen5 arrays are hitting ~56 GB/s, so you can treat them as a first-class memory tier for LLM inference instead of just storage. For Mixture-of-Experts models specifically, this is interesting because at any given token step, you only need 2 of 8 experts active. That's ~6GB of active weights on Mixtral 8x7B, not 24GB.&lt;br&gt;
Micro-Expert-Router is the result. It's a Rust inference engine that streams MoE expert weights directly from NVMe via io_uring with O_DIRECT, routes tokens through real SwiGLU FFN kernels, and exposes an OpenAI-compatible HTTP API with SSE streaming.&lt;br&gt;
What's in it:&lt;/p&gt;

&lt;p&gt;SSD-streamed expert loading via io_uring fixed buffers and O_DIRECT pread&lt;br&gt;
Multi-tier expert cache: SSD → RAM (LRU with pinning) → VRAM&lt;br&gt;
Q4_0, Q4K, Q8_0, F16 quantization with AVX2/AVX-512/AMX dispatch&lt;br&gt;
Speculative decoding with a draft engine tied to the main model embeddings&lt;br&gt;
Continuous batching with weighted round-robin admission&lt;br&gt;
SafeTensors loader, SIGHUP hot reload, TUI dashboard, Helm chart&lt;/p&gt;

&lt;p&gt;Honest disclaimer on the numbers:&lt;br&gt;
I don't have the hardware to run full benchmarks yet. The telemetry figures in the repo (11–15 tokens/sec across edge workstation, sovereign box, and RPC sharded cluster topologies) are theoretical ceilings derived from active weight footprint and raw NVMe sequential bandwidth at 80% cache hit rate, not measured results. Cold I/O latency projections range from 108ms on a Quad Gen5 U.2 array down to 1010ms on a PCIe Gen4 M.2. The closest published prior art is Apple's LLM in a Flash paper, this is an attempt at an open source runnable implementation of that idea.&lt;br&gt;
The code is all there if you have the hardware to test it. I'd genuinely love to know if the projections hold.&lt;br&gt;
GitHub: &lt;a href="https://github.com/randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER" rel="noopener noreferrer"&gt;Micro Expert Router&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rust</category>
      <category>moe</category>
    </item>
  </channel>
</rss>
