What I Learned Testing 12 Compression Approaches That Failed
The most useful research I've done this year isn't in the NexusQuant paper. It's the experiments that failed, the ideas that sounded smart in theory and didn't survive contact with real KV cache data.
Negative results build trust. They also save time — if you're working on KV cache compression, this list might save you weeks of effort.
Each entry: what we tried, what we expected, what happened, what we learned.
1. PCA Rotation (3x Worse Distortion)
The idea: Apply PCA to KV vectors to align the quantization axes with the data's principal components. This is optimal for Gaussian data — principal components diagonalize the covariance matrix, and uniform quantization along them minimizes MSE.
What happened: 3x worse distortion than Hadamard rotation. PPL degradation jumped ~0.9 percentage points at the same compression ratio.
Why: PCA is computed per-layer from calibration data. KV distributions shift by layer depth, head index, token position, and input domain. The PCA axes fit the calibration statistics, not the inference statistics. Hadamard rotation, being data-free and orthogonal, spreads energy uniformly without overfitting. It's less optimal in theory but more robust in practice.
Lesson: Data-free rotations outperform data-fitted rotations when distribution shift is unavoidable.
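To make the "data-free" part concrete, here is a minimal sketch of a randomized Hadamard rotation (not the NexusQuant code; the random-sign step and dimensions are illustrative assumptions). It needs no calibration data, preserves norms exactly, and spreads a single outlier channel's energy across all dimensions:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def randomized_hadamard_rotate(x: np.ndarray, seed: int = 0) -> np.ndarray:
    """Data-free orthogonal rotation: random signs, then a normalized
    Hadamard transform. Nothing here is fitted to calibration data."""
    d = x.shape[-1]
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=d)
    H = hadamard(d) / np.sqrt(d)  # orthonormal, so norms are preserved
    return (x * signs) @ H

# An outlier concentrates all energy in one dimension...
x = np.zeros(64)
x[3] = 8.0
y = randomized_hadamard_rotate(x)
# ...after rotation the energy is spread evenly (max/mean magnitude ~ 1)
print(np.linalg.norm(y), np.abs(y).max() / np.abs(y).mean())
```

Because the rotation is orthogonal, it can be undone exactly at dequantization time; no statistics of the inference data are baked in anywhere.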
2. Group Size 32 (Compression Ratio Penalty)
The idea: Use group size 32 instead of 8 for per-group scaling. Larger groups reduce scale overhead — from 2 bits/dim (group-8) to 0.5 bits/dim (group-32).
What happened: The quality drop was larger than expected (+0.4% PPL at the same nominal bit rate). The compression improvement was real but smaller than the accounting suggested, because zstd compresses the scale stream efficiently anyway.
Why: Group-32 means one scale covers 4x more variance. After Hadamard, variance is equalized globally but not perfectly at the group level. Group-32 is more sensitive to the residual non-uniformity. The E8 lattice's quantization error grows with scale mismatch.
Lesson: The "free" bits from larger groups aren't free — they trade against quantization accuracy nonlinearly.
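The scale-overhead arithmetic behind those numbers is simple to write down (assuming FP16, i.e. 16-bit, per-group scales; payload bit rate held fixed, quality not modeled):

```python
# Nominal storage cost per dimension: payload bits plus amortized scale bits.
def bits_per_dim(payload_bits: float, group_size: int, scale_bits: int = 16) -> float:
    return payload_bits + scale_bits / group_size

print(bits_per_dim(2.5, 8))   # group-8:  2.5 + 16/8  = 4.5 bits/dim nominal
print(bits_per_dim(2.5, 32))  # group-32: 2.5 + 16/32 = 3.0 bits/dim nominal
```

The accounting says group-32 saves 1.5 bits/dim, but as the experiment showed, part of that saving is illusory (zstd compresses the scale stream anyway) and part is paid back in quantization error.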
3. Adaptive Bitwidth Allocation (Negligible Gain)
The idea: Assign 4-bit quantization to "important" KV vectors (high attention weight) and 2-bit to unimportant ones. Budget the average to 3 bits. Better attention quality where it matters.
What happened: +0.05% PPL improvement over flat 3-bit. Not worth the complexity — requires attention weights at quantization time, adds a routing pass, complicates the bit-packing scheme.
Why: After token eviction, the surviving tokens are already the important ones; the low-importance tokens that adaptive bitwidth would have downgraded to 2-bit were mostly evicted. So adaptive bitwidth is redundant: eviction already did the importance filtering.
Lesson: Token eviction and quantization solve related problems. Don't make both importance-adaptive: let eviction handle importance on the count axis, and keep quantization precision flat.
4. Per-Head Token Eviction (Catastrophic)
The idea: Each attention head attends to different tokens. Instead of using a consensus importance score across heads, evict different tokens per head based on that head's individual attention pattern.
What happened: +47% PPL degradation. Completely broken.
Why: The KV cache serves all heads simultaneously. If head A evicts token 42 and head B keeps it, you need two separate caches — or you corrupt head B's keys/values. We implemented the wrong version first (corrupting B's cache). The correct implementation requires separate storage per head, which multiplies memory overhead, defeating the compression goal.
Lesson: KV cache is shared infrastructure. Eviction must operate on the shared sequence, not per-head slices.
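A hypothetical sketch of the consensus alternative (the max-over-heads aggregation is my assumption for illustration; the post only says "consensus importance score"). The point is structural: one score per token, so eviction acts on the shared sequence and every head sees the same cache:

```python
import numpy as np

rng = np.random.default_rng(0)
attn = rng.random((32, 512))         # (heads, tokens): attention each token receives

# Aggregate across heads BEFORE ranking: a token that any head needs is kept.
consensus = attn.max(axis=0)         # shape (512,): one score per token
evict_order = np.argsort(consensus)  # evict lowest-consensus tokens first

print(consensus.shape)               # per-token, not per-head-per-token
```

Whatever aggregation is used (max, mean, sum), the invariant is the same: the eviction decision is a property of the token, never of a (head, token) pair.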
5. Token Merging (Temporal Predictive Coding) (+107% PPL)
The idea: Instead of evicting tokens, merge similar adjacent tokens into one representative vector. Saves memory while theoretically preserving all information.
What happened: +107% PPL degradation on Mistral-7B at 60% merge rate. Completely unusable.
Why: Token positions carry semantic meaning. Merging token i and i+1 into their mean destroys the positional distinction. When the model generates the next token, it uses Q·K attention over the merged KV, which now corresponds to no specific position. RoPE embeddings make this worse — each position has a distinct rotation, and the mean of two rotated vectors is not a valid rotated vector.
We tried removing RoPE before merging — still +31% PPL. The positional information is baked into the content, not just the rotation.
Lesson: Token position is semantic, not just a coordinate. Merging destroys it.
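The RoPE failure mode can be demonstrated in two lines: the mean of two rotated vectors is not itself a rotated vector. A minimal 2D sketch (a 90° angle gap is exaggerated for clarity; real RoPE angles are per-channel-pair and much smaller):

```python
import numpy as np

def rope_rotate_2d(v: np.ndarray, theta: float) -> np.ndarray:
    """RoPE applies a position-dependent rotation to each 2D channel pair."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]) @ v

v = np.array([1.0, 0.0])
a = rope_rotate_2d(v, 0.0)         # token at position i
b = rope_rotate_2d(v, np.pi / 2)   # token at position i+1
merged = (a + b) / 2               # token-merging candidate

# Rotations preserve norm; the mean of two rotations does not.
print(np.linalg.norm(a), np.linalg.norm(merged))  # 1.0 vs ~0.707
```

The merged vector has the wrong magnitude and corresponds to no valid position, so Q·K logits against it are systematically distorted.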
6. Entropy-Coded Indices Without Delta Coding (2x Compression Loss)
The idea: Apply Huffman or arithmetic coding directly to E8 lattice indices. The index distribution is non-uniform, so entropy coding should compress well.
What happened: ~1.4x compression from the entropy coder alone vs. the expected ~2.5x.
Why: Raw lattice indices have high entropy because each token's KV vector is quantized independently. Adjacent tokens produce different indices even for similar semantic content. The distribution is wide and moderately uniform.
Delta coding first — subtracting the previous token's index from the current one — produces a distribution tightly peaked at 0 (most tokens change little). Then entropy coding achieves 2.3x. The pipeline order matters enormously.
Lesson: Statistics of compressed representations are not the statistics of the raw data. Profile what you're actually compressing.
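A quick way to see the effect is to compare empirical entropy before and after delta coding. This sketch uses a synthetic slowly-varying index stream (an assumption standing in for real E8 indices), but the mechanism is the same:

```python
import numpy as np

def empirical_entropy_bits(symbols: np.ndarray) -> float:
    """Shannon entropy of the empirical symbol distribution, in bits/symbol."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
# Adjacent tokens change little, so indices form a slow random walk.
steps = rng.integers(-2, 3, size=10_000)
indices = np.cumsum(steps) + 1000

deltas = np.diff(indices)  # delta coding: current index minus previous index

print(empirical_entropy_bits(indices))  # wide distribution: high bits/symbol
print(empirical_entropy_bits(deltas))   # peaked near 0: low bits/symbol
```

The raw stream wanders over hundreds of distinct values; the deltas live on a handful of values near zero, which is exactly what an entropy coder exploits.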
7. Outlier-Aware Quantization (Wasted on KV)
The idea: Popular in weight quantization (SmoothQuant, LLM.int8()) — identify outlier channels, scale them separately, quantize the rest normally. Should handle the heavy tails in KV data.
What happened: +0.02% improvement over standard Hadamard + E8. Negligible.
Why: KV cache outliers exist, but Hadamard rotation already spreads them across all dimensions. By the time E8 sees the data, there are no outlier channels — just mildly non-uniform overall magnitude. Outlier-specific treatment addresses a problem that Hadamard already solved.
Lesson: Don't apply weight-quantization techniques to KV quantization without checking whether the problem definition is the same.
8. Low-Rank Key Approximation (Quality Cliff)
The idea: Compute SVD of the key cache, keep the top-r singular values/vectors. Aggressive rank reduction = high compression. No quantization needed.
What happened: Worked well at r ≥ 64 (head_dim=128, so 50% rank). At r=32, quality fell off a cliff: +8% PPL. At r=16: +40% PPL.
Why: KV caches do not have a clean low-rank structure across the time dimension. They can have low-rank structure within a head's channel dimension, but a single rank-r approximation of the full cache conflates the two. Also, SVD is expensive — O(n²r) per layer per generation step.
Lesson: Dimensionality reduction across the token axis and the channel axis are different problems. SVD on the full cache conflates both.
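The cliff shape can be reproduced on synthetic data by measuring the relative error of the optimal rank-r approximation (Eckart-Young). The matrix below is an assumption, not a real key cache, but it shows how error compounds as rank drops past the spectrum's bulk:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "key cache": (tokens, head_dim), full-rank with decaying spectrum.
K = rng.standard_normal((512, 128)) @ rng.standard_normal((128, 128))

def rank_r_error(K: np.ndarray, r: int) -> float:
    """Relative Frobenius error of the best rank-r approximation (via SVD)."""
    U, S, Vt = np.linalg.svd(K, full_matrices=False)
    K_r = (U[:, :r] * S[:r]) @ Vt[:r]
    return float(np.linalg.norm(K - K_r) / np.linalg.norm(K))

for r in (64, 32, 16):
    print(r, rank_r_error(K, r))  # error grows as r shrinks
```

When the spectrum has no sharp cutoff, every halving of r discards a real chunk of energy, which is consistent with the quality cliff between r=64 and r=16.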
9. FP8 Quantization (Too Naive)
The idea: Just cast KV to FP8 (e5m2 or e4m3). Modern GPUs support FP8 natively. Fast, hardware-accelerated, 2x from FP16.
What happened: +0.8% PPL for only 2x compression. E8 lattice reaches +0.4% PPL at 7-8x compression.
Why: FP8 is 8 bits per element. E8 achieves ~2-3 effective bits per element after entropy coding. FP8 is not a compression technique — it's a lower-precision storage format. The 2x ratio only matches FP16 → FP8, while E8 is targeting FP16 → 2.5 effective bits.
Lesson: Comparing "FP8" to "2-bit E8" conflates precision reduction with compression. They're solving different parts of the same problem.
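The accounting is pure bit-width arithmetic (storage ratios only; quality is not modeled here):

```python
# Compression ratio from bit-width alone.
def ratio(src_bits: float, dst_bits: float) -> float:
    return src_bits / dst_bits

print(ratio(16, 8))    # FP16 -> FP8: 2.0x
print(ratio(16, 2.0))  # FP16 -> 2 effective bits: 8.0x
print(ratio(16, 2.3))  # ~7x: the low end of the "7-8x" range quoted above
```

FP8 caps out at 2x by construction; anything beyond that requires an actual compression pipeline, not a narrower storage format.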
10. Importance Scoring by L2 Norm (Wrong Signal)
The idea: Evict tokens with small L2 norm — small keys probably don't attract attention.
What happened: +12% PPL at 40% eviction. Much worse than cross-head attention scoring.
Why: L2 norm doesn't predict attention weight. A small-norm key can still attract high attention if the corresponding query also points in that direction. Attention weight is Q·K / √d, not |K|. Using |K| alone ignores Q entirely.
Lesson: Importance is relational (Q, K together), not intrinsic to K alone.
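A two-vector counterexample makes the point (dimensions and values are illustrative assumptions): a tiny key aligned with the query out-scores a large key orthogonal to it.

```python
import numpy as np

d = 64
q = np.zeros(d); q[0] = 1.0
k_small = np.zeros(d); k_small[0] = 0.5  # small norm, aligned with q
k_big = np.zeros(d); k_big[1] = 5.0      # 10x the norm, orthogonal to q

def attn_logit(q: np.ndarray, k: np.ndarray) -> float:
    """Pre-softmax attention score: Q.K / sqrt(d)."""
    return float(q @ k / np.sqrt(len(q)))

print(attn_logit(q, k_small))  # positive despite |k| = 0.5
print(attn_logit(q, k_big))    # zero despite |k| = 5.0
```

Norm-based eviction would keep `k_big` and drop `k_small`, exactly backwards for this query.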
11. Dynamic Eviction During Generation (Latency Catastrophe)
The idea: Re-score and re-evict tokens every N generation steps. Adaptive to changing attention patterns as generation progresses.
What happened: 8x inference latency increase. Throughput collapsed.
Why: Re-scoring requires running the importance computation over the full current KV cache (O(seq²) per step). At 100K tokens, this dominates the generation step. We had it right the first time: evict once after prefill, never again during generation.
Lesson: Prefill and decode phases have fundamentally different compute budgets. Prefill is O(n²). Decode is O(n). Don't move O(n²) work into the decode loop.
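The cost asymmetry is easy to put in numbers. This is back-of-envelope FLOP accounting (the O(seq²) scoring cost and the re-score interval are modeling assumptions, not measurements):

```python
# Where eviction runs determines its amortized cost.
def total_scoring_flops(seq_len: int, gen_steps: int, rescore_every=None) -> int:
    """Importance scoring is ~O(seq_len^2) per pass. Evicting once after
    prefill pays it once; re-scoring during decode pays it repeatedly."""
    per_pass = seq_len ** 2
    if rescore_every is None:  # evict once after prefill, never again
        return per_pass
    return per_pass * (1 + gen_steps // rescore_every)

once = total_scoring_flops(100_000, 1000)
repeated = total_scoring_flops(100_000, 1000, rescore_every=16)
print(once, repeated, repeated // once)  # re-scoring multiplies the cost ~63x
```

At 100K tokens a single scoring pass is already 10^10 operations; moving it into the decode loop multiplies that by the number of re-score events.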
12. Calibration Data for Quantization Scales (Overfitting)
The idea: Instead of computing per-group scales from the current KV values, pre-compute scales from a calibration corpus and reuse them. Saves compute at inference.
What happened: +1.9% PPL on out-of-domain data. Scales tuned on Wikipedia performed poorly on code or dialogue.
Why: KV magnitude profiles are domain-specific. Code tokens, dialogue, and factual text have different KV distributions at different layers. Per-inference scaling, though slightly more expensive, adapts to the actual data being compressed.
Lesson: KV cache is input-dependent. Calibration data solves the wrong version of the problem. TurboQuant's approach (calibration for weight quantization, not KV) is fine because weights don't change with input.
What Actually Works
After 12 failures, the surviving pipeline is:
- Cross-head attention importance scoring → token eviction (2.5x from fewer tokens)
- RoPE removal on keys
- Hadamard rotation
- E8 lattice VQ, group-8, per-group FP16 scales (3-8x from fewer bits)
- Delta coding → zstd (2-3x from entropy coding)
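The five stages above can be sketched end-to-end with simplified stand-ins (scalar quantization in place of E8 lattice VQ, no zstd, RoPE removal omitted; this shows the data flow, not the NexusQuant implementation):

```python
import numpy as np

def evict(kv: np.ndarray, scores: np.ndarray, keep_frac: float = 0.4) -> np.ndarray:
    """Stage 1: keep the top-scoring tokens, preserving token order."""
    n_keep = max(1, int(len(kv) * keep_frac))
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return kv[keep]

def hadamard_rotate(x: np.ndarray) -> np.ndarray:
    """Stage 3: data-free orthogonal rotation (Sylvester Hadamard)."""
    d = x.shape[-1]
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return x @ (H / np.sqrt(d))

def quantize_groups(x: np.ndarray, group: int = 8, bits: int = 3):
    """Stage 4 stand-in: per-group scaling + uniform scalar quantization
    (a simplification of E8 lattice VQ with group-8 FP16 scales)."""
    g = x.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) + 1e-8
    q = np.round(g / scale * (2 ** (bits - 1) - 1)).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((256, 128)).astype(np.float32)  # toy (tokens, head_dim)
scores = rng.random(256)                                 # toy importance scores

kept = evict(kv, scores)                       # fewer tokens
rotated = hadamard_rotate(kept)                # outliers spread, norms preserved
q, scale = quantize_groups(rotated)            # fewer bits per dim
deltas = np.diff(q.ravel().astype(np.int16))   # stage 5 input: delta stream
print(kept.shape, q.dtype)
```

Each stage compounds multiplicatively with the others, which is how the per-stage ratios (2.5x, 3-8x, 2-3x) combine into the headline figure.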
Five stages, each justified by a measurable improvement that survived ablation. Everything else, we tried and measured and deleted.
The cleaner version with just eviction + E8 (no entropy coding) still achieves 10x at +0.4% PPL. That's the "high quality" preset. The full pipeline gets to 33x.
Repo: github.com/jagmarques/nexusquant
pip install nexusquant-kv
Best regards, João Marques