Most KV cache quantization methods treat each number independently: round each float to the nearest 2-bit or 4-bit value. This works, but it wastes bits.
The E8 lattice quantizes 8 numbers at once, exploiting correlations between dimensions. The result: 3x better compression under entropy coding compared to scalar quantization at the same distortion.
## The problem with scalar quantization
Given a 128-dimensional KV vector, scalar INT2 quantization rounds each of the 128 values independently. Each value gets mapped to one of 4 levels. The indices are near-uniformly distributed, so entropy coding (zstd, Huffman) barely helps — maybe 1.2x reduction.
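As a concrete sketch of what this looks like (the function name and the absmax-style scale are my own illustration, not any particular library's API), scalar INT2 maps every value independently to one of the four levels {-1, 0, 1, 2}:

```python
import numpy as np

def scalar_int2(v):
    # Per-tensor scale, then independent rounding of every value
    # to one of four levels {-1, 0, 1, 2}, stored as indices 0..3.
    scale = np.abs(v).max() / 2.0  # assumed absmax-style scaling
    idx = np.clip(np.round(v / scale), -1, 2).astype(np.int8) + 1
    dequant = (idx.astype(np.float32) - 1) * scale
    return idx, dequant

v = np.random.default_rng(0).standard_normal(128)
idx, dequant = scalar_int2(v)
```

Each index carries at most 2 bits, and on real KV activations the four levels are hit at comparable rates, which is why a downstream entropy coder has so little slack to exploit.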
## E8: quantize 8 at a time
The E8 root lattice is the densest sphere packing in 8 dimensions. Instead of rounding each number to {-1, 0, 1, 2}, we split each 128-dim vector into 16 groups of 8, and snap each group to the nearest E8 lattice point.
```python
from nexusquant.core.e8_lattice import E8Lattice

# Quantize 8D groups
groups = vector.reshape(-1, 8)
lattice_points = E8Lattice.nearest_point(groups)
```
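The library call above hides the interesting part. Here is a standalone sketch of E8 nearest-point search using the classic decomposition E8 = D8 ∪ (D8 + ½) — the function names are mine, not NexusQuant's, and this is an illustration of one standard decoding algorithm, not necessarily what `E8Lattice.nearest_point` does internally:

```python
import numpy as np

def nearest_d8(x):
    # D8 = integer vectors with even coordinate sum. Round every
    # coordinate; if the sum comes out odd, push the coordinate with
    # the largest rounding error to its second-nearest integer.
    f = np.round(x)
    if f.sum() % 2 != 0:
        i = int(np.argmax(np.abs(x - f)))
        f[i] += 1.0 if x[i] >= f[i] else -1.0
    return f

def nearest_e8(x):
    # E8 is the union of D8 and D8 shifted by (1/2, ..., 1/2):
    # decode in both cosets and keep the closer point.
    a = nearest_d8(x)
    b = nearest_d8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b
```

In practice each chosen lattice point is then mapped to a codebook index; those indices are what get entropy coded.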
The key insight: E8 nearest-neighbor assignment is non-uniform. Certain lattice points are hit far more often than others because real KV data clusters in specific regions of 8D space. This skew creates a highly compressible distribution.
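One way to quantify that skew (a generic sketch, not NexusQuant code) is the empirical entropy of the index stream, which lower-bounds the bits an ideal entropy coder spends per symbol:

```python
import numpy as np

def entropy_bits(indices):
    # Empirical Shannon entropy of an index stream, in bits/symbol.
    _, counts = np.unique(np.asarray(indices), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

A flat 4-symbol stream sits at the full 2.0 bits/symbol; a stream where a few lattice points dominate drops well below that, and the gap is exactly what zstd or Huffman recovers.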
## The entropy advantage
| Quantizer | zstd compression ratio on indices |
|---|---|
| Scalar INT2 | 1.23x |
| E8 2-bit | 3.74x |
That's a 3x advantage from the lattice structure alone. It comes from E8's parity constraints and peaked shell occupancy — mathematical properties of the lattice that happen to align with how KV cache data distributes.
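The measurement methodology is easy to reproduce in spirit. The sketch below uses stdlib `zlib` as a stand-in for zstd and synthetic index distributions, so the exact ratios will differ from the table; the point is that a peaked index histogram compresses and a flat one does not:

```python
import zlib
import numpy as np

def packed_ratio(indices):
    # Pack four 2-bit indices per byte (len must be a multiple of 4),
    # then measure the compression ratio raw / compressed.
    idx = np.asarray(indices, dtype=np.uint8).reshape(-1, 4)
    packed = idx[:, 0] | (idx[:, 1] << 2) | (idx[:, 2] << 4) | (idx[:, 3] << 6)
    raw = packed.astype(np.uint8).tobytes()
    return len(raw) / len(zlib.compress(raw, 9))

rng = np.random.default_rng(0)
uniform = rng.integers(0, 4, 65536)                         # scalar-like: flat
skewed = rng.choice(4, 65536, p=[0.85, 0.09, 0.04, 0.02])   # E8-like: peaked
```

Note the bit-packing step: without it, one-index-per-byte storage would compress trivially for both quantizers and mask the difference.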
## But there's a catch: outliers
Raw KV vectors are heavy-tailed. One outlier dimension inflates the quantization scale for all 8 dimensions in its group. Fix: apply a Hadamard rotation first.
```python
from nexusquant.core.hadamard import hadamard_matrix

H = hadamard_matrix(128)
rotated = vector @ H.T  # spread energy uniformly
# now quantize the rotated vector with E8
```
Hadamard rotation is orthogonal — no information loss. It just spreads each component across all dimensions, making the distribution near-isotropic. After rotation, E8 quantization at 2 bits/dim causes less than 0.1% PPL degradation.
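To see the "no information loss" claim concretely, here is a self-contained Sylvester construction — a hypothetical stand-in for `nexusquant.core.hadamard.hadamard_matrix`, normalized by √n so the matrix is orthonormal (the library may or may not normalize the same way):

```python
import numpy as np

def sylvester_hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # normalized so that H @ H.T = I

H = sylvester_hadamard(128)
x = np.random.default_rng(0).standard_normal(128)
rotated = x @ H.T        # rotate before quantizing
recovered = rotated @ H  # exact inverse, up to float error
```

Because the rotation is orthogonal it preserves norms, so quantization error measured in L2 is unchanged; the only effect is to spread outlier energy evenly across coordinates before grouping into 8s.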
The combination of Hadamard + E8 is what makes NexusQuant work. Removing either one degrades quality significantly.
Best regards,
João Marques