There are now enough KV cache compression papers that "we beat the competition" is meaningless without specifics. Which competition? On which data? At which compression ratio? With or without calibration?
This post is an honest head-to-head. For each competitor: what they do, their reported numbers from their papers, our numbers, where we win, and where they win.
The comparison table
| Method | Compression | Quality (ΔPPL) | Training-free | When it wins |
|---|---|---|---|---|
| NexusQuant (short ctx) | 10x | +0.14 to +0.90% | Yes | Training-free above 6x |
| NexusQuant (long ctx) | 16.6x | +0.82% | Yes | Best training-free at 16x |
| KVTC (NVIDIA) | up to 20x | < 1pp | No (10 min cal) | Highest compression |
| TurboQuant (Google) | ~5-6x | ~0% | Yes | Best quality at low compression |
| CommVQ (Apple) | ~8x | ~0% | No (training) | Best quality at 8x |
| Palu | 11.4x | ~+1.19% | No (calibration) | Low-rank if you have calibration data |
The competitor numbers are not from our experiments: the KVTC, TurboQuant, CommVQ, and Palu figures come from their papers, and we have not run their code on our data. That is a limitation we are transparent about.
TurboQuant (Google)
What it does: Scalar quantization with learned per-channel scale factors. 193 lines of code, training-free at the core. The simplest competitive approach.
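To make "scalar quantization with per-channel scale factors" concrete, here is a minimal sketch of the core idea. This is not TurboQuant's actual code (their scales are learned; the symmetric max-abs scaling below is a simplifying assumption), just the mechanics of per-channel scalar quantization of a KV slice:

```python
import numpy as np

def quantize_per_channel(kv: np.ndarray, bits: int = 8):
    """Symmetric per-channel scalar quantization of a KV tensor.

    kv: (tokens, channels) slice of the KV cache.
    Returns integer codes plus one scale per channel.
    """
    qmax = 2 ** (bits - 1) - 1                   # e.g. 127 for INT8
    scales = np.abs(kv).max(axis=0) / qmax       # one scale per channel
    scales = np.where(scales == 0, 1.0, scales)  # guard empty channels
    codes = np.round(kv / scales).astype(np.int8)
    return codes, scales

def dequantize(codes, scales):
    return codes.astype(np.float32) * scales

kv = np.random.default_rng(0).standard_normal((128, 64)).astype(np.float32)
codes, scales = quantize_per_channel(kv)
err = np.abs(dequantize(codes, scales) - kv).max()
```

Per-channel (rather than per-tensor) scales matter because KV channels have very different dynamic ranges; a single global scale would waste most of the INT8 range on the quiet channels.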
Their reported numbers: Near-zero quality degradation at 5.3x compression on Llama-class models.
Our numbers at the same point: NexusQuant at 6.71x on Llama-3-8B shows -1.20% PPL (a quality improvement). At 5.3x it would be better still.
Where TurboQuant wins: Quality at low compression (below 6x). If you need minimal quality loss and 4-6x compression, TurboQuant is extremely competitive and simpler than our pipeline.
Where we win: Compression beyond 6x. TurboQuant does not have a token eviction component, so it cannot reach 10-16x ratios. We extend that range significantly while staying training-free.
The honest caveat: We have not run TurboQuant on our exact dataset with our exact evaluation setup. The 193-line simplicity is genuinely elegant. We have a lot of respect for this work.
KVTC (NVIDIA)
What it does: KVTC combines scalar quantization with a temporal coherence coding approach, compressing differences between adjacent KV states. Requires ~10 minutes of calibration on sample data.
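The temporal-coherence idea can be sketched in a few lines: quantize the differences between adjacent token states instead of the raw values. The per-tensor scale and the simple cumulative decode below are our assumptions for illustration, not KVTC's actual coder (real schemes also control error drift across long sequences):

```python
import numpy as np

def delta_encode_quantize(kv, bits=8):
    """Quantize differences between adjacent token KV states.
    Adjacent states are often correlated, so the deltas have a
    smaller dynamic range and quantize more accurately."""
    qmax = 2 ** (bits - 1) - 1
    deltas = np.diff(kv, axis=0, prepend=0)   # row 0 is kept verbatim
    scale = float(np.abs(deltas).max()) / qmax
    if scale == 0:
        scale = 1.0
    codes = np.round(deltas / scale).astype(np.int8)
    return codes, scale

def delta_decode(codes, scale):
    # Cumulative sum undoes the differencing (rounding error accumulates).
    return np.cumsum(codes.astype(np.float32) * scale, axis=0)

rng = np.random.default_rng(0)
kv = np.cumsum(0.01 * rng.standard_normal((256, 16)), axis=0).astype(np.float32)
codes, scale = delta_encode_quantize(kv)
err = np.abs(delta_decode(codes, scale) - kv).max()
```

On a smooth signal like the random walk above, the deltas span a far narrower range than the raw values, which is exactly where this coding wins over direct quantization.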
Their reported numbers: Up to 20x compression with less than 1 perplexity point degradation.
Our numbers at 20x: We have validated 16.6x at +0.82% PPL. At 20x (80% eviction on 1664-token prefix), we get +2.13% PPL. KVTC appears to achieve better quality at the extreme end.
Where KVTC wins: Quality at 20x+ compression. Their temporal coding approach is genuinely effective. And 20x > 16.6x — they reach higher ratios with better quality in that regime.
Where we win: No calibration required. The 10-minute calibration step sounds trivial but matters in practice: it requires representative data, it adds a deployment step, and it means the system is not truly drop-in. Our `with nexusquant_evict(model):` entry point requires no data, no calibration, nothing except the model.
The honest caveat: KVTC's 20x result is the competition's strongest. We are not the best at maximum compression.
CommVQ (Apple)
What it does: Trains a vector quantization codebook on KV cache data with an EM-style training loop. Uses a commutative design that works naturally with RoPE embeddings.
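The "EM-style training loop" at the heart of vector quantization is essentially Lloyd's algorithm: assign vectors to their nearest centroid (E-step), then recompute centroids as member means (M-step). The sketch below shows only that generic loop; CommVQ's commutative, RoPE-aware codebook design is not modeled here:

```python
import numpy as np

def train_codebook(vectors, k=16, iters=10, seed=0):
    """Lloyd's algorithm for a VQ codebook (generic, not CommVQ-specific)."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), k, replace=False)]
    for _ in range(iters):
        dists = ((vectors[:, None, :] - codebook[None]) ** 2).sum(-1)
        assign = dists.argmin(1)              # E-step: nearest centroid
        for j in range(k):                    # M-step: recompute centroids
            members = vectors[assign == j]
            if len(members):
                codebook[j] = members.mean(0)
    dists = ((vectors[:, None, :] - codebook[None]) ** 2).sum(-1)
    assign = dists.argmin(1)                  # final assignment
    return codebook, assign

vecs = np.random.default_rng(1).standard_normal((512, 8)).astype(np.float32)
codebook, assign = train_codebook(vecs)
```

After training, each KV vector is stored as a single small code index, which is where the compression comes from; quality then depends on how well the codebook matches the deployment-time KV distribution, hence the need for training data.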
Their reported numbers: ~8x compression with near-zero quality degradation. The commutative approach is elegant — they avoid the RoPE removal issue entirely by designing the quantizer to commute with the rotation.
Our numbers at 8x: We reach 10.1x at +0.90% PPL (500-token prefix) or 10.4x at +0.14% PPL (1664-token prefix). At exactly 8x, quality would be better still.
Where CommVQ wins: Quality. If you can afford training time and have representative KV cache data, a trained codebook will beat a training-free approach. Also their RoPE handling is mathematically cleaner than our blanket RoPE-removal strategy.
Where we win: Training-free. CommVQ requires a training loop. Our pipeline works on any model from the moment you install the package. No training data, no training time.
The honest caveat: We solve RoPE by removing it before quantization and reapplying after. CommVQ solves it properly at the algebra level. If we revisit this design, their approach is the right one.
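The "remove RoPE, quantize, reapply" strategy works because RoPE is an orthogonal rotation of channel pairs, so it can be inverted exactly before quantization and reapplied after. The sketch below uses the standard RoPE parameterization (base 10000, interleaved channel pairs); the exact layout in any given model's cache is an assumption:

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Per-position, per-frequency rotation angles (standard RoPE)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return positions[:, None] * inv_freq[None, :]   # (tokens, dim/2)

def rotate(x, theta, sign=1.0):
    """Apply RoPE (sign=+1) or its inverse (sign=-1) to channel pairs."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(theta), np.sin(sign * theta)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

keys = np.random.default_rng(0).standard_normal((32, 64))
theta = rope_angles(np.arange(32), 64)
unrotated = rotate(keys, theta, sign=-1.0)  # strip RoPE before quantizing
restored = rotate(unrotated, theta, sign=+1.0)
err = np.abs(restored - keys).max()
```

The round trip is exact up to floating-point error; the quality cost of this strategy comes from quantizing in the unrotated space, not from the rotation itself, which is why CommVQ's commute-with-the-rotation design is the cleaner solution.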
Palu
What it does: Low-rank projection of the KV cache heads, combining PCA-based dimensionality reduction with quantization. Needs calibration data to compute the PCA rotation.
Their reported numbers: 11.4x compression at ~+1.19% PPL. The low-rank projection is a different approach — instead of quantizing all dimensions, they reduce the number of dimensions.
Our numbers at comparable compression: 10.4x at +0.14% PPL (1664-token prefix). Our quality is substantially better at similar compression.
Where Palu wins: If you have calibration data and want a single-architecture low-rank approach, Palu integrates cleanly with fine-tuning workflows.
Where we win: Quality at 10-11x. +0.14% vs +1.19% is a significant gap. Also training-free for the base pipeline.
The main competitive claim
We are the best training-free option above 6x compression. Below 6x, TurboQuant matches or beats us with a simpler system. Above 6x and up to ~16x, we are the only training-free approach with validated sub-1% quality degradation.
If you need to go to 20x+, KVTC is currently ahead (with calibration). If you need near-zero quality loss at 8x and can train, CommVQ is the right choice.
None of this is cherry-picked. The numbers above are the numbers.
What we cannot claim
We have not run any competitor's code on our exact dataset. All competitor numbers come from their papers. It is possible that on wikitext-2 with our specific passage selection, TurboQuant or KVTC would show different numbers than reported. We are aware of this limitation and plan to run a proper head-to-head in our NeurIPS submission.
We also cannot claim "beats TurboQuant" in any general sense. We beat their reported numbers at high compression ratios. At low compression, their simplicity is competitive.
Best regards, João Marques