I built KVQuant because running large LLMs locally is a nightmare — not because of model weights, but because of the KV cache.
The Problem
| Model | Weights (4-bit) | KV Cache (128K ctx) | Total |
|---|---|---|---|
| Llama-3-70B | 40GB | 256GB | 296GB |
Existing quantization (llama.cpp, etc.) only compresses weights. The KV cache still explodes your memory on long conversations.
What KVQuant Does
Compresses the KV cache with adaptive quantization based on token importance:
| Token Position | Bits | Reason |
|---|---|---|
| Recent (0-256) | 4-bit | Attention often attends here |
| Mid (256-1024) | 3-bit | Medium importance |
| Old (1024+) | 2-bit | Distant context |
Features
- 4-6x KV cache compression with <1% perplexity increase
- Drop-in — single pip install, no model recompilation
- Real-time — adds <5ms latency per token
- Cross-platform — CUDA, MPS (Apple Silicon), CPU
Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer
from kvquant import KVQuant
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")
with KVQuant(model, target_memory_gb=4.0):
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
Benchmarks
| Model | Context | Original KV | Compressed KV | Ratio |
|---|---|---|---|---|
| Llama-3-8B | 128K | 32GB | 8GB | 4x |
Top comments (0)