João André Gomes Marques
Why E8 lattice quantization beats scalar quantization for KV caches

Most KV cache quantization methods treat each number independently: round each float to the nearest 2-bit or 4-bit value. This works, but it wastes bits.

The E8 lattice quantizes 8 numbers at once, exploiting correlations between dimensions. The result: 3x better compression under entropy coding compared to scalar quantization at the same distortion.

The problem with scalar quantization

Given a 128-dimensional KV vector, scalar INT2 quantization rounds each of the 128 values independently. Each value gets mapped to one of 4 levels. The indices are near-uniformly distributed, so entropy coding (zstd, Huffman) barely helps — maybe 1.2x reduction.
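To make that concrete, here is a minimal scalar INT2 round trip in plain NumPy. The asymmetric min/max level placement is illustrative, not necessarily NexusQuant's exact scalar baseline:

```python
import numpy as np

def int2_quantize(v):
    """Scalar 2-bit quantization: each value independently snapped
    to one of 4 evenly spaced levels spanning [v.min(), v.max()]."""
    lo, hi = float(v.min()), float(v.max())
    scale = (hi - lo) / 3.0
    idx = np.clip(np.round((v - lo) / scale), 0, 3).astype(np.uint8)
    return idx, lo, scale

def int2_dequantize(idx, lo, scale):
    """Reconstruct: level index back to its float value."""
    return lo + idx.astype(np.float32) * scale

v = np.array([0.12, -0.9, 0.4, 1.3], dtype=np.float32)
idx, lo, scale = int2_quantize(v)
# Worst-case reconstruction error is half a quantization step
assert np.max(np.abs(int2_dequantize(idx, lo, scale) - v)) <= scale / 2
```

Every value is rounded in isolation, so no cross-dimension structure survives into the index stream for an entropy coder to exploit.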

E8: quantize 8 at a time

The E8 root lattice is the densest sphere packing in 8 dimensions. Instead of rounding each number to {-1, 0, 1, 2}, we split each 128-dim vector into 16 groups of 8, and snap each group to the nearest E8 lattice point.

```python
from nexusquant.core.e8_lattice import E8Lattice

# Split the 128-dim KV vector into 16 groups of 8
groups = vector.reshape(-1, 8)  # shape: (16, 8)
# Snap each group to its nearest E8 lattice point
lattice_points = E8Lattice.nearest_point(groups)
```

The key insight: E8 nearest-neighbor assignment is non-uniform. Certain lattice points are hit far more often than others because real KV data clusters in specific regions of 8D space. This skew creates a highly compressible distribution.
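The post doesn't show how `E8Lattice.nearest_point` works internally, but the classic Conway–Sloane decoder is short enough to sketch, assuming the standard construction E8 = D8 ∪ (D8 + ½). The function names here are mine, not NexusQuant's:

```python
import numpy as np

def nearest_d8(x):
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    f = np.rint(x)
    if int(f.sum()) % 2 != 0:
        # Parity is wrong: re-round the coordinate with the largest rounding
        # error in the other direction (the cheapest single-coordinate fix).
        i = int(np.argmax(np.abs(x - f)))
        f[i] += 1.0 if x[i] >= f[i] else -1.0
    return f

def nearest_e8(x):
    """Nearest E8 point: decode in both cosets, D8 and D8 + 1/2, keep the closer."""
    a = nearest_d8(x)
    b = nearest_d8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b
```

Decoding is exact and costs only two rounding passes per 8-dim group, so lattice assignment is barely more expensive than scalar rounding.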

The entropy advantage

| Quantizer   | zstd compression on indices |
|-------------|-----------------------------|
| Scalar INT2 | 1.23x                       |
| E8 2-bit    | 3.74x                       |

That's a 3x advantage from the lattice structure alone. It comes from E8's parity constraints and peaked shell occupancy — mathematical properties of the lattice that happen to align with how KV cache data distributes.
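To see why skew pays, compare Shannon entropies: near-uniform indices cost their full bit budget, while a peaked distribution costs far less. The peaked probabilities below are illustrative, not measured NexusQuant statistics:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits: the floor for any lossless entropy coder."""
    p = np.asarray(p, dtype=np.float64)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Near-uniform indices (scalar case): nothing for zstd to exploit
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits per 2-bit index

# Heavily peaked indices (lattice case, illustrative numbers): well under 2 bits
print(entropy_bits([0.70, 0.20, 0.07, 0.03]))
```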

But there's a catch: outliers

Raw KV vectors are heavy-tailed. One outlier dimension inflates the quantization scale for all 8 dimensions in its group. Fix: apply a Hadamard rotation first.

```python
from nexusquant.core.hadamard import hadamard_matrix

H = hadamard_matrix(128)
rotated = vector @ H.T  # spread each component's energy across all 128 dims
# now quantize the rotated vector with E8
```

Hadamard rotation is orthogonal — no information loss. It just spreads each component across all dimensions, making the distribution near-isotropic. After rotation, E8 quantization at 2 bits/dim causes less than 0.1% PPL degradation.
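A quick way to see the effect, using SciPy's `hadamard` for illustration (standing in for `nexusquant.core.hadamard`):

```python
import numpy as np
from scipy.linalg import hadamard

n = 128
H = hadamard(n) / np.sqrt(n)  # orthonormal: H @ H.T = I

rng = np.random.default_rng(0)
v = rng.normal(size=n)
v[7] = 40.0                   # one heavy-tailed outlier dimension

r = H @ v
# The rotation preserves the norm exactly (no information loss)...
assert np.isclose(np.linalg.norm(r), np.linalg.norm(v))
# ...but the outlier's energy is now smeared across all 128 coordinates,
# so no single 8-dim group gets its quantization scale blown up.
print(np.abs(v).max(), np.abs(r).max())
```

An isolated spike of magnitude m shrinks to about m/√n per coordinate (plus the rotated bulk noise), which is what pulls the per-group scales back into a range where 2 bits/dim suffices.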

The combination of Hadamard + E8 is what makes NexusQuant work. Removing either one degrades quality significantly.

GitHub | Paper

Best regards,
João Marques
