KVQuant: real terminal proof for KV-cache compression
KVQuant is a cache-compression layer for long-context inference. The interesting bit is not the idea — lots of projects have that — but whether it survives contact with a real model, a real terminal, and a real benchmark table.
This write-up is the boring but useful version: what it does, what I ran, what the numbers were, and where it helps or doesn’t.
Why the KV cache matters
When a model generates text, it keeps a memory of previous tokens in the KV cache. That cache grows with every step. Weight quantisation shrinks the model weights, but it doesn’t directly touch this memory tax.
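To make that growth concrete, here is the back-of-envelope math. This is a sketch: distilgpt2's dimensions (6 layers, 12 heads, head_dim 64) are public, but the fp32-cache assumption is mine, based on the Hugging Face default on CPU, not something the benchmark script asserts.

```python
# Back-of-envelope KV-cache size: each token holds one key and one value
# vector per layer per head, so the cache grows linearly with every step.
# distilgpt2: 6 layers, 12 heads, head_dim 64. dtype_bytes=4 assumes an
# fp32 cache (my assumption -- the Hugging Face default on CPU).

def kv_cache_bytes(tokens: int, layers: int = 6, heads: int = 12,
                   head_dim: int = 64, dtype_bytes: int = 4) -> int:
    return 2 * layers * heads * head_dim * dtype_bytes * tokens  # 2 = K and V

# 17 prompt tokens + 256 generated tokens, as in the benchmark below:
print(f"{kv_cache_bytes(17 + 256) / 2**20:.2f} MiB")  # ~9.60 MiB
```

Under those assumptions this lands within a token or two of the baseline cache sizes reported in the tables below.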
KVQuant targets that cache directly:
- Allocate fewer bits for older tokens
- Pack the cache into smaller storage
- Restore it before the next forward pass
That gives you a real memory win on long-running chats and long-context inference.
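The repo's actual codec is more involved (the age-based bit allocation in particular is not shown here), but the pack/restore round trip is easy to sketch. Everything below is my own minimal NumPy illustration with per-tensor min/max scaling, not KVQuant's API:

```python
import numpy as np

def pack_4bit(cache: np.ndarray):
    """Quantise a cache tensor to 4-bit codes, packing two codes per byte."""
    lo, hi = float(cache.min()), float(cache.max())
    scale = (hi - lo) / 15 or 1.0            # 16 levels for 4 bits
    codes = np.round((cache - lo) / scale).astype(np.uint8)
    flat = codes.reshape(-1)                 # element count is even here
    packed = (flat[0::2] << 4) | flat[1::2]
    return packed, scale, lo, cache.shape

def unpack_4bit(packed, scale, lo, shape):
    """Restore an fp16 tensor before the next forward pass."""
    flat = np.empty(packed.size * 2, dtype=np.uint8)
    flat[0::2], flat[1::2] = packed >> 4, packed & 0x0F
    return (flat.reshape(shape) * scale + lo).astype(np.float16)

kv = np.random.randn(1, 8, 512, 64).astype(np.float16)  # toy K tensor
packed, scale, lo, shape = pack_4bit(kv)
restored = unpack_4bit(packed, scale, lo, shape)
print(kv.nbytes / packed.nbytes)    # 4.0: fp16 in, 4-bit out
print(np.abs(kv - restored).max())  # error bounded by scale/2 at 4-bit resolution
```

The 4.0x byte ratio is exactly the fp16-to-4-bit packing factor that shows up again in the synthetic table further down. The quantise/dequantise work on every step is also where the throughput overhead in the speed table comes from.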
What I benchmarked
I ran two kinds of proof:
- a real Hugging Face model run with distilgpt2
- a deterministic synthetic cache benchmark to make the cache math obvious and reproducible
Real-model result
| Scenario | Prompt tokens | Generated tokens | Baseline cache | KVQuant cache | Saved | Cache ratio | KVQuant compression |
|---|---|---|---|---|---|---|---|
| product-explainer | 17 | 256 | 9.56 MiB | 2.39 MiB | 7.17 MiB | 4.00x | 8.00x |
| developer-note | 19 | 256 | 9.63 MiB | 2.41 MiB | 7.22 MiB | 4.00x | 8.00x |
Total cache saved: 14.40 MiB
Honest speed note
| Scenario | Baseline t/s | KVQuant t/s | Speedup |
|---|---|---|---|
| product-explainer | 21.17 | 16.05 | 0.76x |
| developer-note | 21.88 | 20.10 | 0.92x |
That is the part I do not want to hide: on a small CPU model, compression overhead can offset throughput gains. The memory savings are real; the wall-clock speedup is workload-dependent.
Actual terminal proof
This is the real terminal run I captured. The key part is that it is a direct terminal transcript from a benchmark script, not a dashboard summary.
Exact command run
```bash
source /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/.venv/bin/activate
cd /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/KVQuant
HF_HUB_DISABLE_PROGRESS_BARS=1 PYTHONPATH=. python examples/e2e_benchmark.py --model distilgpt2 --output-dir /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output
```
Step-by-step terminal output
1) Benchmark started
```text
# KVQuant end-to-end benchmark (distilgpt2)
```
2) Model and generation mode
Real Hugging Face causal LM, real greedy generation, and real output tokens.
3) Measured table
| Scenario | Prompt tokens | Generated tokens | Baseline t/s | KVQuant t/s | Speedup | Baseline cache | KVQuant cache | Saved | Cache ratio | KVQuant compression |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| product-explainer | 17 | 256 | 21.17 | 16.05 | 0.76x | 9.56 MiB | 2.39 MiB | 7.17 MiB | 4.00x | 8.00x |
| developer-note | 19 | 256 | 21.88 | 20.10 | 0.92x | 9.63 MiB | 2.41 MiB | 7.22 MiB | 4.00x | 8.00x |
4) Summary
**Average speedup:** 0.84x
**Average cache ratio:** 4.00x
**Average generated tokens:** 256
**Total cache saved:** 14.40 MiB
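Those summary lines are plain arithmetic over the two scenario rows. Recomputing them from the (rounded) table values:

```python
speedups = [0.76, 0.92]
saved_mib = [7.17, 7.22]
print(sum(speedups) / len(speedups))  # 0.84
print(sum(saved_mib))  # 14.39 -- the report's 14.40 MiB comes from pre-rounding values
```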
5) File outputs
```text
HTML: /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output/kvquant-e2e-benchmark.html
JSON: /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output/kvquant-e2e-benchmark.json
Markdown: /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output/kvquant-e2e-benchmark.md
```
Browser-rendered proof
The HTML report above is the same data in a screenshot-friendly table, so you can inspect the numbers without trusting a hand-written summary.
Synthetic cache baseline
I also ran a deterministic cache-growth benchmark to make the storage story easy to read:
| Scenario | Shape | Without KVQuant | With KVQuant | Saved | Ratio |
|---|---|---|---|---|---|
| chat-turn | (1, 8, 512, 64) | 0.50 MiB | 0.13 MiB | 0.38 MiB | 4.00x |
| code-assist | (1, 16, 1024, 64) | 1.00 MiB | 0.25 MiB | 0.75 MiB | 4.00x |
| rag-summary | (1, 16, 2048, 64) | 2.00 MiB | 0.50 MiB | 1.50 MiB | 4.00x |
| tool-agent | (1, 32, 2048, 128) | 8.00 MiB | 2.00 MiB | 6.00 MiB | 4.00x |
| long-context | (1, 32, 4096, 128) | 16.00 MiB | 4.00 MiB | 12.00 MiB | 4.00x |
| tiny-firmware | (1, 4, 256, 64) | 0.0625 MiB | 0.0156 MiB | 0.0469 MiB | 4.00x |
The consistent 4x result is exactly what you’d expect from packing fp16 cache tensors into 4-bit storage.
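For one concrete row, here is the worked check for chat-turn at an fp16 baseline (the dtype the 4x claim implies); this is arithmetic, not output from the benchmark script:

```python
import math

shape = (1, 8, 512, 64)          # chat-turn: (batch, heads, seq, head_dim)
elements = math.prod(shape)      # 262,144 cache values
baseline = elements * 16 // 8    # fp16 -> 524,288 bytes = 0.50 MiB
packed = elements * 4 // 8       # 4-bit -> 131,072 bytes = 0.125 MiB (table rounds to 0.13)
print(baseline / 2**20, packed / 2**20, baseline / packed)  # 0.5 0.125 4.0
```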
Tiny export profile
I added a tiny export profile so the repo has a clean, shippable benchmark output even when you don’t want the full run.
That profile is there for constrained builds, screenshots, and quick proof-of-life checks.
Repro steps
If you want to rerun the exact thing locally:
```bash
git clone https://github.com/AmSach/KVQuant.git
cd KVQuant
pip install -e .
PYTHONPATH=. python examples/e2e_benchmark.py --model distilgpt2 --output-dir ./benchmark-results
```
For the synthetic cache profile:
```bash
PYTHONPATH=. python examples/e2e_benchmark.py --profile tiny
```
What this means in practice
KVQuant is not a magic throughput booster on every machine. That would be fake.
What it is, though, is a real memory reducer for cache-heavy inference:
- useful for long chats
- useful for bigger contexts
- useful when the cache starts becoming the bottleneck
- useful when you need honest numbers instead of vibes
That’s the actual claim.
Source and assets
- Repo: https://github.com/AmSach/KVQuant
- Terminal proof screenshot: https://man42.zo.space/assets/kvquant-terminal-proof.png
- Benchmark report screenshot: https://man42.zo.space/assets/kvquant-e2e-proof.png
- Tiny profile screenshot: https://man42.zo.space/assets/kvquant-real-benchmark.png
If you want one sentence for the headline: KVQuant makes the cache visible, measurable, and smaller.


