Most quantizers are chosen for convenience. E8 was chosen because the math demanded it — and then it surprised us.
What Makes E8 Special
The E8 lattice is a root lattice in 8 dimensions with 240 nearest neighbors. Its kissing number (240) is the highest possible in 8D. Its packing density is optimal: no other 8D arrangement of equal spheres covers more space. Mathematically, E8 achieves the theoretically maximum sphere-packing density in 8 dimensions — proven by Viazovska in 2016.
For quantization, this matters because denser packing = more codewords per unit volume = lower quantization error.
Why KV Vectors Live in E8-Friendly Space
After applying a Hadamard transform to KV cache vectors, the distribution of each coordinate becomes approximately sub-Gaussian. Specifically:
- The Hadamard spreads energy uniformly across all 8 dimensions
- Each coordinate has zero mean and bounded kurtosis
- The joint distribution approximates a spherically symmetric Gaussian cloud
A spherically symmetric Gaussian is exactly what E8 was designed to quantize. The shell structure of E8 — its concentric layers of lattice points — aligns with the probability mass shells of a Gaussian. More lattice points where the data actually lives.
The Relaxed Parity Discovery
Strict E8 imposes an even-sum parity constraint: the sum of all 8 coordinates (after scaling) must be even. This halves the set of valid codewords and enforces a rigid algebraic structure.
We found something unexpected: relaxing this constraint improves MSE by 0.3–0.4% on KV cache data.
Why? Sub-Gaussian distributions have excess probability mass near the origin compared to a pure Gaussian. Strict E8 parity creates a gap at the origin — the all-zeros vector is forbidden if it violates the even-sum rule. Relaxed parity restores codepoints near zero, which is precisely where sub-Gaussian data concentrates.
This is not a bug. Nature found a better quantizer than the textbook prescribed.
Strict E8: valid if sum(coords mod 2) == 0 → 128 points per shell
Relaxed E8: always valid → 256 points per shell
Gain at origin: more codewords where sub-Gaussian data concentrates
The Numbers
On Mistral-7B KV cache vectors (Hadamard-preprocessed, group size 64):
| Quantizer | MSE (normalized) | PPL delta vs fp16 |
|---|---|---|
| INT8 uniform | 1.000 | +1.2% |
| PQ (Product Quantization) | 0.61 | +0.8% |
| Strict E8 | 0.18 | +0.06% |
| Relaxed E8 (NexusQuant) | 0.14 | -0.03% |
Relaxed E8 beats strict E8 by 22% MSE reduction. It also beats fp16 on perplexity — compression that makes the model more accurate.
Why This Works at Scale
KV cache vectors are not random. They carry structured information — token relationships, positional encodings, semantic content. After Hadamard rotation, this structure disperses into approximately sub-Gaussian noise, but the near-origin concentration persists across layers and models.
E8 with relaxed parity is not a coincidence. It is the right mathematical structure for the right data distribution. The 8-dimensional optimality of E8 matches the head-dimension granularity of modern transformers (head_dim = 64 or 128, divisible by 8).
The pipeline is three lines of math:
- Normalize and scale (NSN)
- Rotate to sub-Gaussian (Hadamard)
- Quantize to nearest E8 point (relaxed parity)
That is the entire compression stack. No neural networks. No training. No calibration data.
Best regards, João Marques
Top comments (0)