Skip dequantization. Save 57% RAM. Get 3x faster decode. No GPU required.
Every LLM framework (llama.cpp, candle, burn) does this:
```
GGUF quantized weights → dequantize to f32 → f32 GEMV → result
                         ^ 4x DRAM bandwidth wasted
                                             ^ 3.2 GB RAM for the dense f32 cache
```
RAGE-QUANT does this instead:
```
GGUF quantized weights → quantized GEMV → result
```
Reading 1.06 bytes/element instead of 4 bytes means 3.76x less DRAM traffic.
No dequantization step. No f32 cache. 57% less RAM. 3x faster decode.
Real Benchmarks (not theoretical)
Tested on Qwen3-0.6B-Q8_0.gguf | CPU-only | AMD Ryzen 9 9900X | 12 threads
| What we measured | Before | After | Improvement |
|---|---|---|---|
| Decode latency per token | 42 ms | 14 ms | 3.0x faster |
| From naive Rust | 120,000 ms | 466 ms | 257x faster |
| From sgemm baseline | 74,758 ms | 466 ms | 160x faster |
| Peak RAM usage | 3.2 GB | 1.38 GB | 57% less |
| Throughput | ~24 tok/s | 67-71 tok/s | ~3x more |
These numbers are real, measured, and reproducible; see the full methodology.
Why is it faster?
On modern CPUs, LLM decode (batch=1) is DRAM bandwidth-limited, not compute-limited. By reading 1 byte (quantized) instead of 4 bytes (f32), you move 3.76x less data through the memory bus. The speedup follows directly.
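The arithmetic behind the bandwidth claim can be checked on the back of an envelope, assuming the standard GGUF Q8_0 block layout (32 signed bytes plus one 2-byte f16 scale per block, 34 bytes for 32 weights):

```rust
// Back-of-envelope check of the DRAM-traffic figures, assuming the
// standard GGUF Q8_0 layout: 32 x i8 + one f16 scale = 34 bytes/block.
fn q8_0_traffic() -> (f32, f32) {
    let bytes_per_elem_q8 = 34.0 / 32.0; // 1.0625 bytes per weight
    let bytes_per_elem_f32 = 4.0;        // dequantized f32 path
    (bytes_per_elem_q8, bytes_per_elem_f32 / bytes_per_elem_q8)
}

fn main() {
    let (bpe, ratio) = q8_0_traffic();
    // → "1.06 B/element, 3.76x less DRAM traffic"
    println!("{bpe:.2} B/element, {ratio:.2}x less DRAM traffic");
}
```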
Additionally: LLVM cannot auto-vectorize the i8-to-f32 widening path. It tries i8→i16→i32→f32, wasting registers. Manual vpmovsxbd (i8→i32 direct) via _mm256_cvtepi8_epi32 is required. This is why hand-written AVX2 intrinsics beat the compiler here.
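To make the widening path concrete, here is a minimal sketch (not the crate's actual kernel) of an i8→f32 dot product that uses `_mm256_cvtepi8_epi32` (vpmovsxbd) with runtime feature detection and a scalar fallback; all function names here are illustrative:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: widen each i8 to f32 and accumulate.
fn dot_i8_f32_scalar(a: &[i8], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(&q, &x)| q as f32 * x).sum()
}

/// AVX2 path: vpmovsxbd widens i8 -> i32 in one instruction,
/// skipping the i8 -> i16 -> i32 ladder the compiler emits.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_i8_f32_avx2(a: &[i8], b: &[f32]) -> f32 {
    let chunks = a.len() / 8;
    let mut acc = _mm256_setzero_ps();
    for i in 0..chunks {
        // Load 8 signed bytes (only the low 64 bits of the lane are used).
        let raw = _mm_loadl_epi64(a.as_ptr().add(i * 8) as *const __m128i);
        // i8 -> i32 directly (vpmovsxbd), then i32 -> f32.
        let wide = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(raw));
        let xv = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_fmadd_ps(wide, xv, acc);
    }
    // Horizontal sum of the 8 accumulator lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Scalar tail for lengths not divisible by 8.
    for i in chunks * 8..a.len() {
        sum += a[i] as f32 * b[i];
    }
    sum
}

fn dot_i8_f32(a: &[i8], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        return unsafe { dot_i8_f32_avx2(a, b) };
    }
    dot_i8_f32_scalar(a, b)
}

fn main() {
    let a: Vec<i8> = (0..37).map(|i| (i % 21) as i8 - 10).collect();
    let b: Vec<f32> = (0..37).map(|i| i as f32 * 0.25 - 2.0).collect();
    let got = dot_i8_f32(&a, &b);
    let want = dot_i8_f32_scalar(&a, &b);
    assert!((got - want).abs() < 1e-2);
}
```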
Quick Start
```toml
[dependencies]
rage-quant = "0.1"
```

```rust
use rage_quant::dot_q8_0_f32;

// Auto-detects AVX2+FMA at runtime; falls back to scalar on older CPUs.
let result = dot_q8_0_f32(&quantized_weights, &input_vector, num_elements);
```
Supported formats: Q8_0, Q6_K, Q4_K (GGUF-native blocks).
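For readers unfamiliar with the block math, here is a hedged scalar reference of what a Q8_0 dot product computes (illustrative only, not the crate's internals; GGUF stores the scale as f16, simplified here to f32):

```rust
/// One Q8_0 block: 32 quantized weights sharing a single scale.
/// (The real GGUF format uses an f16 scale; f32 is used for simplicity.)
#[allow(non_camel_case_types)]
struct BlockQ8_0 {
    scale: f32,
    qs: [i8; 32],
}

/// Dot product of quantized blocks against an f32 vector with no
/// intermediate f32 weight buffer: scale * sum(q_i * x_i) per block.
fn dot_q8_0_ref(blocks: &[BlockQ8_0], x: &[f32]) -> f32 {
    blocks
        .iter()
        .enumerate()
        .map(|(b, blk)| {
            let xs = &x[b * 32..(b + 1) * 32];
            let partial: f32 =
                blk.qs.iter().zip(xs).map(|(&q, &v)| q as f32 * v).sum();
            blk.scale * partial
        })
        .sum()
}

fn main() {
    // One block: all weights 1, scale 0.5, input all 2.0 -> 0.5 * 32 * 2 = 32.
    let blocks = [BlockQ8_0 { scale: 0.5, qs: [1; 32] }];
    let x = [2.0f32; 32];
    assert_eq!(dot_q8_0_ref(&blocks, &x), 32.0);
}
```

The key point is that the scale multiply happens once per block of 32 weights, so the hot loop touches only the 1-byte quantized values.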
Why not just use llama.cpp?
llama.cpp is excellent, but:
- It is C/C++ — integrating into a Rust project requires unsafe FFI bindings
- It is monolithic — you cannot extract just the quantized dot product without pulling the entire engine
rage-quant is a standalone Rust crate: `cargo add rage-quant` and you have the kernels.
CPU Optimization Findings (T1-T9)
This crate embodies 9 validated CPU inference optimizations discovered during development:
| ID | What was optimized | Measured result |
|---|---|---|
| T1 | GEMV on quantized data (skip f32) | decode 42ms → 18ms = 2.3x |
| T2 | Eliminate dense f32 weight caches | RSS 3.2GB → 1.38GB = -57% RAM |
| T3 | AVX2 widening i8→f32 intrinsics | +18.8% on top of T1 |
| T4 | Memory-bound diagnosis | Proved DRAM is the bottleneck |
| T7 | GEMV vs sgemm for m=1 decode | sgemm 180ms vs GEMV 18ms = 10x |
| T8 | QKV fusion (decode-only path) | 1.8x per-layer QKV compute |
| T9 | Column-tiling for GEMM prefill | 5091ms → 3057ms = 1.67x |
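The column-tiling idea behind T9 can be sketched as follows (an illustrative naive version, not the crate's code, with a hypothetical tile width): by walking the output in column tiles, the k x TILE panel of B stays cache-resident while it is reused across every row of A.

```rust
/// Illustrative column-tiled GEMM: C (m x n) = A (m x k) * B (k x n),
/// all row-major. The j-loop is tiled so the B panel for columns
/// [j0, j1) is reused across all m rows before moving on.
fn gemm_col_tiled(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    const TILE: usize = 64; // hypothetical tile width; tune per cache size
    for j0 in (0..n).step_by(TILE) {
        let j1 = (j0 + TILE).min(n);
        for i in 0..m {
            for j in j0..j1 {
                let mut acc = 0.0f32;
                for p in 0..k {
                    acc += a[i * k + p] * b[p * n + j];
                }
                c[i * n + j] = acc;
            }
        }
    }
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0];           // 2x2
    let b = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0]; // 2x3
    let mut c = [0.0f32; 6];
    gemm_col_tiled(&a, &b, &mut c, 2, 2, 3);
    assert_eq!(c, [1.0, 2.0, 3.0, 3.0, 4.0, 7.0]);
}
```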
Hardware Requirements
- Minimum: Any x86_64 CPU (scalar fallback works everywhere)
- Recommended: AVX2+FMA support (Intel Haswell 2013+ / AMD Zen 2017+)
- Tested on: AMD Ryzen 9 9900X (Zen 5), DDR5, 12 threads
ARM NEON and AVX-512 support are planned.
Links
- GitHub: github.com/OnCeUponTry/RAGE-QUANT
- HuggingFace: hf.co/TheRagestBoy/rage-quant
- Crates.io: crates.io/crates/rage-quant
License
Dual-licensed:
- AGPL-3.0 — free for open-source, personal, and academic use
- Commercial — for proprietary/closed-source use (contact: the@angriestboy.com)
Published from RAGE-QUANT v0.1.0 — pure Rust, zero dependencies, 3x faster.