<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Randy AP</title>
    <description>The latest articles on DEV Community by Randy AP (@randyap8wq).</description>
    <link>https://dev.to/randyap8wq</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3953463%2F3915f243-ce56-4d89-b38d-e4c122e88d1f.png</url>
      <title>DEV Community: Randy AP</title>
      <link>https://dev.to/randyap8wq</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/randyap8wq"/>
    <language>en</language>
    <item>
      <title>I built a Rust inference engine that streams MoE expert weights from NVMe SSDs, no GPU required</title>
      <dc:creator>Randy AP</dc:creator>
      <pubDate>Wed, 27 May 2026 03:32:21 +0000</pubDate>
      <link>https://dev.to/randyap8wq/i-built-a-rust-inference-engine-that-streams-moe-expert-weights-from-nvme-ssds-no-gpu-required-3bie</link>
      <guid>https://dev.to/randyap8wq/i-built-a-rust-inference-engine-that-streams-moe-expert-weights-from-nvme-ssds-no-gpu-required-3bie</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F989tsu88kkq3xvlco6w7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F989tsu88kkq3xvlco6w7.png" alt=" " width="800" height="638"&gt;&lt;/a&gt;Most people trying to run Mixtral or DeepSeek-V3 locally hit the same wall: they don't have 80GB of VRAM. The common answer is "get better hardware." I wanted to see if there was another way.&lt;br&gt;
The idea is straightforward. Based on Apple’s landmark research paper, titled "LLM in a flash: Efficient Large Language Model Inference with Limited Memory" NVMe SSDs have gotten fast enough, PCIe Gen5 arrays are hitting ~56 GB/s, so you can treat them as a first-class memory tier for LLM inference instead of just storage. For Mixture-of-Experts models specifically, this is interesting because at any given token step, you only need 2 of 8 experts active. That's ~6GB of active weights on Mixtral 8x7B, not 24GB.&lt;br&gt;
Micro-Expert-Router is the result. It's a Rust inference engine that streams MoE expert weights directly from NVMe via io_uring with O_DIRECT, routes tokens through real SwiGLU FFN kernels, and exposes an OpenAI-compatible HTTP API with SSE streaming.&lt;br&gt;
What's in it:&lt;/p&gt;

&lt;p&gt;SSD-streamed expert loading via io_uring fixed buffers and O_DIRECT pread&lt;br&gt;
Multi-tier expert cache: SSD → RAM (LRU with pinning) → VRAM&lt;br&gt;
Q4_0, Q4K, Q8_0, F16 quantization with AVX2/AVX-512/AMX dispatch&lt;br&gt;
Speculative decoding with a draft engine tied to the main model embeddings&lt;br&gt;
Continuous batching with weighted round-robin admission&lt;br&gt;
SafeTensors loader, SIGHUP hot reload, TUI dashboard, Helm chart&lt;/p&gt;

&lt;p&gt;Honest disclaimer on the numbers:&lt;br&gt;
I don't have the hardware to run full benchmarks yet. The telemetry figures in the repo (11–15 tokens/sec across edge workstation, sovereign box, and RPC sharded cluster topologies) are theoretical ceilings derived from active weight footprint and raw NVMe sequential bandwidth at 80% cache hit rate, not measured results. Cold I/O latency projections range from 108ms on a Quad Gen5 U.2 array down to 1010ms on a PCIe Gen4 M.2. The closest published prior art is Apple's LLM in a Flash paper, this is an attempt at an open source runnable implementation of that idea.&lt;br&gt;
The code is all there if you have the hardware to test it. I'd genuinely love to know if the projections hold.&lt;br&gt;
GitHub: &lt;a href="https://github.com/randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER" rel="noopener noreferrer"&gt;Micro Expert Router&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rust</category>
      <category>moe</category>
    </item>
  </channel>
</rss>
