<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: CARLOS ENRIQUE CASTRO LAZARO</title>
    <description>The latest articles on DEV Community by CARLOS ENRIQUE CASTRO LAZARO (@onceupontry).</description>
    <link>https://dev.to/onceupontry</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3883934%2Fda35d7a9-88c6-4a1e-b302-6efbdb77d253.png</url>
      <title>DEV Community: CARLOS ENRIQUE CASTRO LAZARO</title>
      <link>https://dev.to/onceupontry</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/onceupontry"/>
    <language>en</language>
    <item>
      <title>RAGE-QUANT: 3x Faster LLM Inference on CPU with Pure Rust Quantized GEMV</title>
      <dc:creator>CARLOS ENRIQUE CASTRO LAZARO</dc:creator>
      <pubDate>Fri, 17 Apr 2026 08:10:28 +0000</pubDate>
      <link>https://dev.to/onceupontry/rage-quant-3x-faster-llm-inference-on-cpu-with-pure-rust-quantized-gemv-1hdn</link>
      <guid>https://dev.to/onceupontry/rage-quant-3x-faster-llm-inference-on-cpu-with-pure-rust-quantized-gemv-1hdn</guid>
      <description>&lt;p&gt;&lt;strong&gt;Skip dequantization. Save 57% RAM. Get 3x faster decode. No GPU required.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Most LLM frameworks (llama.cpp, candle, burn) do some variant of this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GGUF quantized weights → dequantize to f32 → f32 GEMV → result
             4x DRAM bandwidth wasted ^     ^ 3.2 GB RAM for dense cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RAGE-QUANT does this instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GGUF quantized weights → quantized GEMV → result
         reads 1.06 bytes/element instead of 4 bytes = 3.76x less DRAM traffic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;No dequantization step. No f32 cache. 57% less RAM. 3x faster decode.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Real Benchmarks (not theoretical)
&lt;/h2&gt;

&lt;p&gt;Tested on &lt;strong&gt;Qwen3-0.6B-Q8_0.gguf&lt;/strong&gt; | CPU-only | AMD Ryzen 9 9900X | 12 threads&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What we measured&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decode latency per token&lt;/td&gt;
&lt;td&gt;42 ms&lt;/td&gt;
&lt;td&gt;14 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.0x faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;From naive Rust&lt;/td&gt;
&lt;td&gt;120,000 ms&lt;/td&gt;
&lt;td&gt;466 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;257x faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;From sgemm baseline&lt;/td&gt;
&lt;td&gt;74,758 ms&lt;/td&gt;
&lt;td&gt;466 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;160x faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak RAM usage&lt;/td&gt;
&lt;td&gt;3.2 GB&lt;/td&gt;
&lt;td&gt;1.38 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;57% less&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;~24 tok/s&lt;/td&gt;
&lt;td&gt;67-71 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~3x more&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These numbers are real, measured, reproducible. See the &lt;a href="https://github.com/OnCeUponTry/RAGE-QUANT/blob/main/docs/cpu-optimizations.md" rel="noopener noreferrer"&gt;full methodology&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why is it faster?
&lt;/h2&gt;

&lt;p&gt;On modern CPUs, LLM decode (batch=1) is &lt;strong&gt;DRAM bandwidth-limited&lt;/strong&gt;, not compute-limited. By reading 1.06 bytes per element (quantized) instead of 4 bytes (f32), you move 3.76x less data through the memory bus. The speedup follows directly.&lt;/p&gt;
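&lt;p&gt;The 3.76x figure follows directly from the Q8_0 block layout in GGUF: 32 weights share one f16 scale, so each 34-byte block covers 32 elements. A quick sanity check:&lt;/p&gt;

```rust
// Back-of-envelope DRAM traffic: Q8_0 packs 32 weights into one 34-byte
// block (2-byte f16 scale + 32 i8 quants), vs 4 bytes per weight for f32.
fn main() {
    let q8_bytes_per_elem = 34.0f64 / 32.0; // 1.0625
    let ratio = 4.0 / q8_bytes_per_elem;
    println!("Q8_0 reads {q8_bytes_per_elem:.4} bytes/element");
    println!("{ratio:.2}x less DRAM traffic than f32"); // 3.76x
    assert_eq!(q8_bytes_per_elem, 1.0625);
}
```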

&lt;p&gt;Additionally: &lt;strong&gt;LLVM cannot auto-vectorize the i8-to-f32 widening path.&lt;/strong&gt; It tries i8→i16→i32→f32, wasting registers. Manual &lt;code&gt;vpmovsxbd&lt;/code&gt; (i8→i32 direct) via &lt;code&gt;_mm256_cvtepi8_epi32&lt;/code&gt; is required. This is why hand-written AVX2 intrinsics beat the compiler here.&lt;/p&gt;
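&lt;p&gt;A minimal sketch of that widening step, assuming an x86_64 CPU with AVX2; the function name here is illustrative, not the crate's API:&lt;/p&gt;

```rust
// Sketch of the direct i8 -> f32 widening (vpmovsxbd, then cvtdq2ps),
// assuming x86_64 with AVX2; widen8 is an illustrative name.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn widen8(q: [i8; 8]) -> [f32; 8] {
    use core::arch::x86_64::*;
    // Load 8 bytes and sign-extend i8 -> i32 in one vpmovsxbd ...
    let lo = _mm_loadl_epi64(q.as_ptr() as *const __m128i);
    let wide = _mm256_cvtepi8_epi32(lo);
    // ... then convert i32 -> f32 (cvtdq2ps). No i8 -> i16 -> i32 detour.
    let f = _mm256_cvtepi32_ps(wide);
    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(out.as_mut_ptr(), f);
    out
}

#[cfg(target_arch = "x86_64")]
fn main() {
    if is_x86_feature_detected!("avx2") {
        let f = unsafe { widen8([-128, -1, 0, 1, 2, 64, 100, 127]) };
        assert_eq!(f, [-128.0, -1.0, 0.0, 1.0, 2.0, 64.0, 100.0, 127.0]);
        println!("{f:?}");
    } else {
        println!("no AVX2; the scalar fallback would run instead");
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```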




&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[dependencies]&lt;/span&gt;
&lt;span class="py"&gt;rage-quant&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;rage_quant&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;dot_q8_0_f32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dot_q8_0_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quantized_weights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;input_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_elements&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Auto-detects AVX2+FMA at runtime; falls back to scalar on older CPUs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Supported formats: &lt;strong&gt;Q8_0&lt;/strong&gt;, &lt;strong&gt;Q6_K&lt;/strong&gt;, &lt;strong&gt;Q4_K&lt;/strong&gt; (GGUF-native blocks).&lt;/p&gt;
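&lt;p&gt;For intuition, a scalar reference for a single Q8_0 block looks like this; the shape and names are a sketch, not the crate's actual &lt;code&gt;dot_q8_0_f32&lt;/code&gt; signature:&lt;/p&gt;

```rust
// Scalar reference for one Q8_0 block: 32 i8 quants sharing one scale.
// Dequantization happens on the fly inside the dot product, so no f32
// copy of the weights ever exists in memory.
fn dot_block_q8_0(scale: f32, quants: [i8; 32], x: [f32; 32]) -> f32 {
    let mut acc = 0.0f32;
    for i in 0..32 {
        acc += scale * (quants[i] as f32) * x[i];
    }
    acc
}

fn main() {
    // 32 lanes of 0.5 * 2 * 0.25 = 0.25 each, so the block sums to 8.0.
    let y = dot_block_q8_0(0.5, [2i8; 32], [0.25f32; 32]);
    assert_eq!(y, 8.0);
    println!("{y}");
}
```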




&lt;h2&gt;
  
  
  Why not just use llama.cpp?
&lt;/h2&gt;

&lt;p&gt;llama.cpp is excellent, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It is C/C++&lt;/strong&gt; — integrating into a Rust project requires unsafe FFI bindings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is monolithic&lt;/strong&gt; — you cannot extract just the quantized dot product without pulling the entire engine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;rage-quant is a standalone Rust crate&lt;/strong&gt; — &lt;code&gt;cargo add rage-quant&lt;/code&gt; and you have the kernels&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  CPU Optimization Findings (T1-T9)
&lt;/h2&gt;

&lt;p&gt;This crate embodies nine validated CPU inference optimizations (T1-T9) discovered during development; the key findings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;What was optimized&lt;/th&gt;
&lt;th&gt;Measured result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;T1&lt;/td&gt;
&lt;td&gt;GEMV on quantized data (skip f32)&lt;/td&gt;
&lt;td&gt;decode 42ms → 18ms = &lt;strong&gt;2.3x&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T2&lt;/td&gt;
&lt;td&gt;Eliminate dense f32 weight caches&lt;/td&gt;
&lt;td&gt;RSS 3.2GB → 1.38GB = &lt;strong&gt;-57% RAM&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T3&lt;/td&gt;
&lt;td&gt;AVX2 widening i8→f32 intrinsics&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;+18.8%&lt;/strong&gt; on top of T1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T4&lt;/td&gt;
&lt;td&gt;Memory-bound diagnosis&lt;/td&gt;
&lt;td&gt;Proved DRAM is the bottleneck&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T7&lt;/td&gt;
&lt;td&gt;GEMV vs sgemm for m=1 decode&lt;/td&gt;
&lt;td&gt;sgemm 180ms vs GEMV 18ms = &lt;strong&gt;10x&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T8&lt;/td&gt;
&lt;td&gt;QKV fusion (decode-only path)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1.8x&lt;/strong&gt; per-layer QKV compute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T9&lt;/td&gt;
&lt;td&gt;Column-tiling for GEMM prefill&lt;/td&gt;
&lt;td&gt;5091ms → 3057ms = &lt;strong&gt;1.67x&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
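&lt;p&gt;To illustrate T8: QKV fusion reads the input activation once and feeds all three projections from the same cached values. A toy-sized sketch (illustrative names and shapes, not the crate's internals):&lt;/p&gt;

```rust
// Fused QKV projection: one pass over x drives all three GEMVs, so the
// activation is loaded from cache once instead of three times.
const D: usize = 4;

fn fused_qkv(
    wq: [[f32; D]; D],
    wk: [[f32; D]; D],
    wv: [[f32; D]; D],
    x: [f32; D],
) -> ([f32; D], [f32; D], [f32; D]) {
    let mut q = [0.0f32; D];
    let mut k = [0.0f32; D];
    let mut v = [0.0f32; D];
    for row in 0..D {
        for col in 0..D {
            let xv = x[col]; // one load of x feeds all three projections
            q[row] += wq[row][col] * xv;
            k[row] += wk[row][col] * xv;
            v[row] += wv[row][col] * xv;
        }
    }
    (q, k, v)
}

fn main() {
    let id = [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ];
    let x = [1.0, 2.0, 3.0, 4.0];
    let (q, k, v) = fused_qkv(id, id, id, x);
    // Identity weights: each projection reproduces x.
    assert_eq!(q, x);
    assert_eq!(k, x);
    assert_eq!(v, x);
    println!("{q:?} {k:?} {v:?}");
}
```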




&lt;h2&gt;
  
  
  Hardware Requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimum&lt;/strong&gt;: Any x86_64 CPU (scalar fallback works everywhere)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommended&lt;/strong&gt;: AVX2+FMA support (Intel Haswell 2013+ / AMD Zen 2017+)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tested on&lt;/strong&gt;: AMD Ryzen 9 9900X (Zen 5), DDR5, 12 threads&lt;/li&gt;
&lt;/ul&gt;
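&lt;p&gt;The scalar fallback is selected at runtime, so one binary runs everywhere. A sketch of that dispatch check (illustrative, not the crate's internals):&lt;/p&gt;

```rust
// Runtime kernel dispatch: prefer the AVX2+FMA path when the CPU has it,
// otherwise fall back to the portable scalar kernel.
fn kernel_name() -> String {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            if is_x86_feature_detected!("fma") {
                return String::from("avx2+fma");
            }
        }
    }
    String::from("scalar")
}

fn main() {
    println!("dispatching to the {} kernel", kernel_name());
}
```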

&lt;p&gt;ARM NEON and AVX-512 support are planned.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/OnCeUponTry/RAGE-QUANT" rel="noopener noreferrer"&gt;github.com/OnCeUponTry/RAGE-QUANT&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace&lt;/strong&gt;: &lt;a href="https://huggingface.co/TheRagestBoy/rage-quant" rel="noopener noreferrer"&gt;hf.co/TheRagestBoy/rage-quant&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crates.io&lt;/strong&gt;: &lt;a href="https://crates.io/crates/rage-quant" rel="noopener noreferrer"&gt;crates.io/crates/rage-quant&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  License
&lt;/h2&gt;

&lt;p&gt;Dual-licensed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AGPL-3.0&lt;/strong&gt; — free for open-source, personal, and academic use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commercial&lt;/strong&gt; — for proprietary/closed-source use (contact: &lt;a href="mailto:the@angriestboy.com"&gt;the@angriestboy.com&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Published from RAGE-QUANT v0.1.0 — pure Rust, zero dependencies, 3x faster.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>performance</category>
      <category>rust</category>
    </item>
  </channel>
</rss>
