I built KVQuant because I wanted to run 70B-parameter models on my gaming laptop. The problem? Even with 4-bit weight quantization, a 128K context window still needs 256GB of RAM just for the KV cache.
## The Problem
When you run an LLM with a long context, the memory bottleneck is not the model weights; it is the KV cache, which grows linearly with context length.
| Model | Weights (4-bit) | KV Cache (128K ctx, fp16) | Total |
|---|---|---|---|
| Llama-3-8B | 5GB | 64GB | 69GB |
| Llama-3-70B | 40GB | 256GB | 296GB |
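
For intuition, the KV cache footprint follows directly from the model shape: every token stores one key and one value vector per layer per attention head. Here is a minimal back-of-the-envelope sketch, assuming full multi-head attention and an fp16 cache (the model configs are assumptions; GQA models cache fewer heads and land much lower):

```python
def kv_cache_gib(layers: int, heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """2 tensors (K and V) * layers * heads * head_dim * tokens * bytes."""
    return 2 * layers * heads * head_dim * context_len * bytes_per_elem / 2**30

# Assumed Llama-3-8B shape: 32 layers, 32 heads, head_dim 128, fp16 cache
print(kv_cache_gib(32, 32, 128, 128 * 1024))  # 64.0 GiB at 128K context
```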
## The Solution
KVQuant compresses the KV cache in real time using per-position adaptive quantization.
Result: 4-6x compression with less than 1% perplexity increase.
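
To make the idea concrete, here is a minimal sketch of per-position quantization, assuming PyTorch tensors: each cached position carries its own scale and zero-point, so one outlier position cannot blow up the error for its neighbors, and an adaptive scheme would additionally pick the bit-width per position to hit the memory target. The `quantize_position`/`dequantize_position` helpers are hypothetical illustrations, not KVQuant's actual internals:

```python
import torch

def quantize_position(x: torch.Tensor, bits: int):
    """Asymmetric uniform quantization of a single position's K or V vector.

    Real implementations would pack two 4-bit codes per byte; we keep one
    code per uint8 element for clarity.
    """
    qmax = 2**bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / qmax          # this position's own scale
    q = ((x - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, lo

def dequantize_position(q: torch.Tensor, scale: torch.Tensor, lo: torch.Tensor):
    return q.float() * scale + lo

# One position's key vector, quantized to 4 bits and reconstructed
k = torch.randn(128)
q, scale, lo = quantize_position(k, bits=4)
k_hat = dequantize_position(q, scale, lo)
print((k - k_hat).abs().max())  # small per-element reconstruction error
```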
## Usage
```python
from kvquant import KVQuant

# Cap total KV cache memory at 8 GB; KVQuant picks the compression to fit
compressor = KVQuant(target_memory_gb=8)
model = compressor.wrap(model)  # returns the wrapped, drop-in model
```
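
For context, here is how that two-line wrap might slot into a typical Hugging Face pipeline. The `transformers` calls are standard; the model name is just an example, and the assumption is that `wrap` returns a drop-in replacement, which is all the snippet above implies:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from kvquant import KVQuant

name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Wrap once; generation below proceeds as usual
model = KVQuant(target_memory_gb=8).wrap(model)

inputs = tokenizer("The KV cache is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```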