Skip dequantization. Save 57% RAM. Get 3x faster decode. No GPU required.
Every LLM framework (llama.cpp, candle, burn) does this:
```
GGUF quantized weights → dequantize to f32 → f32 GEMV → result
                         ^ 4x DRAM bandwidth wasted
                                             ^ 3.2 GB RAM for the dense f32 cache
```
RAGE-QUANT does this instead:
```
GGUF quantized weights → quantized GEMV → result
```
Reading 1.06 bytes/element instead of 4 bytes means 3.76x less DRAM traffic.
No dequantization step. No f32 cache. 57% less RAM. 3x faster decode.
Real Benchmarks (not theoretical)
Tested on Qwen3-0.6B-Q8_0.gguf | CPU-only | AMD Ryzen 9 9900X | 12 threads
| What we measured | Before | After | Improvement |
|---|---|---|---|
| Decode latency per token | 42 ms | 14 ms | 3.0x faster |
| From naive Rust | 120,000 ms | 466 ms | 257x faster |
| From sgemm baseline | 74,758 ms | 466 ms | 160x faster |
| Peak RAM usage | 3.2 GB | 1.38 GB | 57% less |
| Throughput | ~24 tok/s | 67-71 tok/s | ~3x more |
These numbers are real, measured, and reproducible; see the full methodology.
Why is it faster?
On modern CPUs, LLM decode (batch=1) is DRAM bandwidth-limited, not compute-limited. By reading 1 byte (quantized) instead of 4 bytes (f32), you move 3.76x less data through the memory bus. The speedup follows directly.
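The arithmetic behind the bandwidth claim can be checked on the back of an envelope, assuming the standard GGUF Q8_0 block layout (32 signed bytes plus one 2-byte f16 scale per block, 34 bytes for 32 weights):

```rust
// Back-of-envelope check of the DRAM-traffic figures, assuming the
// standard GGUF Q8_0 layout: 32 x i8 + one f16 scale = 34 bytes/block.
fn q8_0_traffic() -> (f32, f32) {
    let bytes_per_elem_q8 = 34.0 / 32.0; // 1.0625 bytes per weight
    let bytes_per_elem_f32 = 4.0;        // dequantized f32 path
    (bytes_per_elem_q8, bytes_per_elem_f32 / bytes_per_elem_q8)
}

fn main() {
    let (bpe, ratio) = q8_0_traffic();
    // → "1.06 B/element, 3.76x less DRAM traffic"
    println!("{bpe:.2} B/element, {ratio:.2}x less DRAM traffic");
}
```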
Additionally: LLVM cannot auto-vectorize the i8-to-f32 widening path. It tries i8→i16→i32→f32, wasting registers. Manual vpmovsxbd (i8→i32 direct) via _mm256_cvtepi8_epi32 is required. This is why hand-written AVX2 intrinsics beat the compiler here.
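To make the widening path concrete, here is a minimal sketch (not the crate's actual kernel) of an i8→f32 dot product that uses `_mm256_cvtepi8_epi32` (vpmovsxbd) with runtime feature detection and a scalar fallback; all function names here are illustrative:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: widen each i8 to f32 and accumulate.
fn dot_i8_f32_scalar(a: &[i8], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(&q, &x)| q as f32 * x).sum()
}

/// AVX2 path: vpmovsxbd widens i8 -> i32 in one instruction,
/// skipping the i8 -> i16 -> i32 ladder the compiler emits.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_i8_f32_avx2(a: &[i8], b: &[f32]) -> f32 {
    let chunks = a.len() / 8;
    let mut acc = _mm256_setzero_ps();
    for i in 0..chunks {
        // Load 8 signed bytes (only the low 64 bits of the lane are used).
        let raw = _mm_loadl_epi64(a.as_ptr().add(i * 8) as *const __m128i);
        // i8 -> i32 directly (vpmovsxbd), then i32 -> f32.
        let wide = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(raw));
        let xv = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_fmadd_ps(wide, xv, acc);
    }
    // Horizontal sum of the 8 accumulator lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Scalar tail for lengths not divisible by 8.
    for i in chunks * 8..a.len() {
        sum += a[i] as f32 * b[i];
    }
    sum
}

fn dot_i8_f32(a: &[i8], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        return unsafe { dot_i8_f32_avx2(a, b) };
    }
    dot_i8_f32_scalar(a, b)
}

fn main() {
    let a: Vec<i8> = (0..37).map(|i| (i % 21) as i8 - 10).collect();
    let b: Vec<f32> = (0..37).map(|i| i as f32 * 0.25 - 2.0).collect();
    let got = dot_i8_f32(&a, &b);
    let want = dot_i8_f32_scalar(&a, &b);
    assert!((got - want).abs() < 1e-2);
}
```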
Quick Start
```toml
[dependencies]
rage-quant = "0.1"
```

```rust
use rage_quant::dot_q8_0_f32;

// Auto-detects AVX2+FMA at runtime; falls back to scalar on older CPUs.
let result = dot_q8_0_f32(&quantized_weights, &input_vector, num_elements);
```
Supported formats: Q8_0, Q6_K, Q4_K (GGUF-native blocks).
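For readers unfamiliar with the block math, here is a hedged scalar reference of what a Q8_0 dot product computes (illustrative only, not the crate's internals; GGUF stores the scale as f16, simplified here to f32):

```rust
/// One Q8_0 block: 32 quantized weights sharing a single scale.
/// (The real GGUF format uses an f16 scale; f32 is used for simplicity.)
#[allow(non_camel_case_types)]
struct BlockQ8_0 {
    scale: f32,
    qs: [i8; 32],
}

/// Dot product of quantized blocks against an f32 vector with no
/// intermediate f32 weight buffer: scale * sum(q_i * x_i) per block.
fn dot_q8_0_ref(blocks: &[BlockQ8_0], x: &[f32]) -> f32 {
    blocks
        .iter()
        .enumerate()
        .map(|(b, blk)| {
            let xs = &x[b * 32..(b + 1) * 32];
            let partial: f32 =
                blk.qs.iter().zip(xs).map(|(&q, &v)| q as f32 * v).sum();
            blk.scale * partial
        })
        .sum()
}

fn main() {
    // One block: all weights 1, scale 0.5, input all 2.0 -> 0.5 * 32 * 2 = 32.
    let blocks = [BlockQ8_0 { scale: 0.5, qs: [1; 32] }];
    let x = [2.0f32; 32];
    assert_eq!(dot_q8_0_ref(&blocks, &x), 32.0);
}
```

The key point is that the scale multiply happens once per block of 32 weights, so the hot loop touches only the 1-byte quantized values.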
Why not just use llama.cpp?
llama.cpp is excellent, but:
- It is C/C++ — integrating into a Rust project requires unsafe FFI bindings
- It is monolithic — you cannot extract just the quantized dot product without pulling the entire engine
rage-quant is a standalone Rust crate: `cargo add rage-quant` and you have the kernels.
CPU Optimization Findings (T1-T9)
This crate embodies 9 validated CPU inference optimizations discovered during development:
| ID | What was optimized | Measured result |
|---|---|---|
| T1 | GEMV on quantized data (skip f32) | decode 42ms → 18ms = 2.3x |
| T2 | Eliminate dense f32 weight caches | RSS 3.2GB → 1.38GB = -57% RAM |
| T3 | AVX2 widening i8→f32 intrinsics | +18.8% on top of T1 |
| T4 | Memory-bound diagnosis | Proved DRAM is the bottleneck |
| T7 | GEMV vs sgemm for m=1 decode | sgemm 180ms vs GEMV 18ms = 10x |
| T8 | QKV fusion (decode-only path) | 1.8x per-layer QKV compute |
| T9 | Column-tiling for GEMM prefill | 5091ms → 3057ms = 1.67x |
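The column-tiling idea behind T9 can be sketched as follows (an illustrative naive version, not the crate's code, with a hypothetical tile width): by walking the output in column tiles, the k x TILE panel of B stays cache-resident while it is reused across every row of A.

```rust
/// Illustrative column-tiled GEMM: C (m x n) = A (m x k) * B (k x n),
/// all row-major. The j-loop is tiled so the B panel for columns
/// [j0, j1) is reused across all m rows before moving on.
fn gemm_col_tiled(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    const TILE: usize = 64; // hypothetical tile width; tune per cache size
    for j0 in (0..n).step_by(TILE) {
        let j1 = (j0 + TILE).min(n);
        for i in 0..m {
            for j in j0..j1 {
                let mut acc = 0.0f32;
                for p in 0..k {
                    acc += a[i * k + p] * b[p * n + j];
                }
                c[i * n + j] = acc;
            }
        }
    }
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0];           // 2x2
    let b = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0]; // 2x3
    let mut c = [0.0f32; 6];
    gemm_col_tiled(&a, &b, &mut c, 2, 2, 3);
    assert_eq!(c, [1.0, 2.0, 3.0, 3.0, 4.0, 7.0]);
}
```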
Hardware Requirements
- Minimum: Any x86_64 CPU (scalar fallback works everywhere)
- Recommended: AVX2+FMA support (Intel Haswell 2013+ / AMD Zen 2017+)
- Tested on: AMD Ryzen 9 9900X (Zen 5), DDR5, 12 threads
ARM NEON and AVX-512 support are planned.
Links
- GitHub: github.com/OnCeUponTry/RAGE-QUANT
- HuggingFace: hf.co/TheRagestBoy/rage-quant
- Crates.io: crates.io/crates/rage-quant
License
Dual-licensed:
- AGPL-3.0 — free for open-source, personal, and academic use
- Commercial — for proprietary/closed-source use (contact: the@angriestboy.com)
Published from RAGE-QUANT v0.1.0 — pure Rust, zero dependencies, 3x faster.