KVQuant: real terminal proof for KV-cache compression

Aman Sachan

KVQuant is a cache-compression layer for long-context inference. The interesting bit is not the idea — lots of projects have that — but whether it survives contact with a real model, a real terminal, and a real benchmark table.

This write-up is the boring but useful version: what it does, what I ran, what the numbers were, and where it helps or doesn’t.


Why KV cache matters

When a model generates text, it keeps a memory of previous tokens in the KV cache. That cache grows with every step. Weight quantisation shrinks the model weights, but it doesn’t directly touch this memory tax.

KVQuant targets that cache directly:

  1. Allocate fewer bits for older tokens
  2. Pack the cache into smaller storage
  3. Restore it before the next forward pass

That gives you a real memory win on long-running chats and long-context inference.
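To make those three steps concrete, here is a minimal sketch of the pack/restore cycle. This is not KVQuant's actual API; it is a plain per-tensor 4-bit scheme in PyTorch, with the names (`pack_kv`, `unpack_kv`) invented for illustration. KVQuant's first step (fewer bits for older tokens) would vary the bit width by position; the sketch keeps a single width to stay short.

```python
# Illustrative 4-bit pack/restore for a KV tensor (not KVQuant's real API).
import torch

def pack_kv(cache: torch.Tensor):
    """Quantise a cache tensor to 4-bit codes plus one scale."""
    scale = cache.float().abs().max() / 7            # map values into roughly [-7, 7]
    codes = torch.clamp((cache.float() / scale).round(), -8, 7) + 8
    flat = codes.to(torch.uint8).flatten()           # offset-binary codes in [0, 15]
    packed = flat[0::2] | (flat[1::2] << 4)          # two 4-bit codes per byte
    return packed, scale, cache.shape

def unpack_kv(packed, scale, shape):
    """Restore an approximate fp16 tensor before the next forward pass."""
    lo = (packed & 0x0F).float() - 8                 # low nibble -> even positions
    hi = (packed >> 4).float() - 8                   # high nibble -> odd positions
    codes = torch.stack([lo, hi], dim=-1).flatten()
    return (codes * scale).to(torch.float16).reshape(shape)

kv = torch.randn(1, 8, 512, 64, dtype=torch.float16)
packed, scale, shape = pack_kv(kv)
print(kv.numel() * 2, "bytes ->", packed.numel(), "bytes")  # 4x smaller payload
restored = unpack_kv(packed, scale, shape)
```

Note that the scale has to travel with the packed bytes, which is one reason a net cache ratio can land below the raw bit-width ratio.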


What I benchmarked

I ran two kinds of proof:

  • a real Hugging Face model run with distilgpt2
  • a deterministic synthetic cache benchmark to make the cache math obvious and reproducible

Real-model result

| Scenario | Prompt tokens | Generated tokens | Baseline cache | KVQuant cache | Saved | Cache ratio | KVQuant compression |
|---|---:|---:|---:|---:|---:|---:|---:|
| product-explainer | 17 | 256 | 9.56 MiB | 2.39 MiB | 7.17 MiB | 4.00x | 8.00x |
| developer-note | 19 | 256 | 9.63 MiB | 2.41 MiB | 7.22 MiB | 4.00x | 8.00x |

Total cache saved: 14.40 MiB
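
As a sanity check on those baseline numbers, here is my own back-of-the-envelope, assuming distilgpt2's published shape (6 layers, 12 heads, 64-dim heads), K and V cached in fp32 on CPU, and a cache holding the prompt plus all but the last generated token; the exact token accounting is my guess, not the benchmark script's internals:

```python
# Back-of-the-envelope for the baseline cache sizes (my assumptions):
# distilgpt2 = 6 layers, 12 heads, head dim 64; K and V cached in fp32.
def cache_mib(prompt, generated, layers=6, heads=12, head_dim=64, elem_bytes=4):
    seq = prompt + generated - 1        # cache length at the final forward pass
    return 2 * layers * heads * seq * head_dim * elem_bytes / 2**20

print(cache_mib(17, 256))   # 9.5625    -> matches the 9.56 MiB row
print(cache_mib(19, 256))   # 9.6328... -> matches the 9.63 MiB row
```

That fp32 baseline also explains the two ratio columns as I read them: 8.00x is the element-level compression (32-bit down to 4-bit), while the 4.00x cache ratio is the net saving once quantisation overhead is counted.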

Honest speed note

| Scenario | Baseline t/s | KVQuant t/s | Speedup |
|---|---:|---:|---:|
| product-explainer | 21.17 | 16.05 | 0.76x |
| developer-note | 21.88 | 20.10 | 0.92x |

That is the part I do not want to hide: on a small CPU model, compression overhead can offset throughput gains. The memory savings are real; the wall-clock speedup is workload-dependent.


Actual terminal proof

This is the real terminal run I captured. The key part is that it is a direct terminal transcript from a benchmark script, not a dashboard summary.

[Image: terminal proof screenshot]

Exact command run

```bash
source /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/.venv/bin/activate
cd /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/KVQuant
HF_HUB_DISABLE_PROGRESS_BARS=1 PYTHONPATH=. python examples/e2e_benchmark.py --model distilgpt2 --output-dir /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output
```

Step-by-step terminal output

```text
1) Benchmark started
# KVQuant end-to-end benchmark (distilgpt2)

2) Model and generation mode
Real Hugging Face causal LM, real greedy generation, and real output tokens.

3) Measured table
| Scenario | Prompt tokens | Generated tokens | Baseline t/s | KVQuant t/s | Speedup | Baseline cache | KVQuant cache | Saved | Cache ratio | KVQuant compression |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| product-explainer | 17 | 256 | 21.17 | 16.05 | 0.76x | 9.56 MiB | 2.39 MiB | 7.17 MiB | 4.00x | 8.00x |
| developer-note | 19 | 256 | 21.88 | 20.10 | 0.92x | 9.63 MiB | 2.41 MiB | 7.22 MiB | 4.00x | 8.00x |

4) Summary
**Average speedup:** 0.84x
**Average cache ratio:** 4.00x
**Average generated tokens:** 256
**Total cache saved:** 14.40 MiB

5) File outputs
HTML: /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output/kvquant-e2e-benchmark.html
JSON: /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output/kvquant-e2e-benchmark.json
Markdown: /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output/kvquant-e2e-benchmark.md
```

Browser-rendered proof

[Image: rendered benchmark report]

That report is the same data in a screenshot-friendly table, so you can inspect the numbers without trusting a hand-written summary.


Synthetic cache baseline

I also ran a deterministic cache-growth benchmark to make the storage story easy to read:

| Scenario | Shape | Without KVQuant | With KVQuant | Saved | Ratio |
|---|---|---:|---:|---:|---:|
| chat-turn | (1, 8, 512, 64) | 0.50 MiB | 0.13 MiB | 0.38 MiB | 4.00x |
| code-assist | (1, 16, 1024, 64) | 1.00 MiB | 0.25 MiB | 0.75 MiB | 4.00x |
| rag-summary | (1, 16, 2048, 64) | 2.00 MiB | 0.50 MiB | 1.50 MiB | 4.00x |
| tool-agent | (1, 32, 2048, 128) | 8.00 MiB | 2.00 MiB | 6.00 MiB | 4.00x |
| long-context | (1, 32, 4096, 128) | 16.00 MiB | 4.00 MiB | 12.00 MiB | 4.00x |
| tiny-firmware | (1, 4, 256, 64) | 0.0625 MiB | 0.0156 MiB | 0.0469 MiB | 4.00x |

The consistent 4x result is exactly what you’d expect from packing fp16 cache tensors into 4-bit storage.
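
The arithmetic is easy to check by hand for any row, e.g. chat-turn:

```python
# fp16 is 2 bytes per element; packed 4-bit storage is 0.5 bytes per element.
elems = 1 * 8 * 512 * 64           # the chat-turn shape (1, 8, 512, 64)
print(elems * 2.0 / 2**20)         # 0.50 MiB without KVQuant
print(elems * 0.5 / 2**20)         # 0.125 MiB with 4-bit packing -> 4.00x
```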


Tiny export profile

I added a tiny export profile so the repo has a clean, shippable benchmark shape even when you don’t want the full run.

[Image: tiny export proof]

That profile is there for constrained builds, screenshots, and quick proof-of-life checks.


Repro steps

If you want to rerun the exact thing locally:

```bash
git clone https://github.com/AmSach/KVQuant.git
cd KVQuant
pip install -e .
PYTHONPATH=. python examples/e2e_benchmark.py --model distilgpt2 --output-dir ./benchmark-results
```

For the synthetic cache profile:

```bash
PYTHONPATH=. python examples/e2e_benchmark.py --profile tiny
```

What this means in practice

KVQuant is not a magic throughput booster on every machine. That would be fake.

What it is, though, is a real memory reducer for cache-heavy inference:

  • useful for long chats
  • useful for bigger contexts
  • useful when the cache starts becoming the bottleneck
  • useful when you need honest numbers instead of vibes

That’s the actual claim.


Source and assets

Code and benchmark scripts: https://github.com/AmSach/KVQuant

If you want one sentence for the headline: KVQuant makes the cache visible, measurable, and smaller.
