Start With a Restaurant
Before we talk about AI, let me tell you about a busy restaurant.
Imagine you manage a popular restaurant. Every order gets written down in full:
Order #1: "One Chicken Biryani with extra raita and no onions,
One Butter Naan, Two Mango Lassi, Table 5, 7:30 PM"
Order #2: "Two Margherita Pizza with extra cheese and thin crust,
One Veg Burger with no mayo, One Coke, Table 12, 7:45 PM"
Order #3: "Three Chicken Burger with extra spicy sauce,
Two Masala Dosa, One Filter Coffee, Table 8, 8:00 PM"
Each order takes ~150 characters. On a busy Friday night with 500 orders, that's 75,000 characters, pages and pages of scribbled notes. The kitchen is drowning in paper.
Now, what if the kitchen agrees on a code system?
Codebook (shared between waiter and kitchen):
Food:
- CB = Chicken Biryani
- MP = Margherita Pizza
- VB = Veg Burger
- KB = Chicken Burger
- BN = Butter Naan
- MD = Masala Dosa
Extras:
- +R = extra raita
- +C = extra cheese
- +SP = extra spicy
- -O = no onions
- -M = no mayo
- TC = thin crust
Drinks:
- ML = Mango Lassi
- CK = Coke
- FC = Filter Coffee
Now those same orders become:
Order #1: 1×CB+R-O 1×BN 2×ML T5 730P
Order #2: 2×MP+C,TC 1×VB-M 1×CK T12 745P
Order #3: 3×KB+SP 2×MD 1×FC T8 800P
The kitchen decodes instantly: "1×CB+R-O" -> "One Chicken Biryani, extra raita, no onions." No ambiguity, no lost information, just shorter.
Full orders: ~150 characters each × 500 orders = 75,000 characters
Coded orders: ~40 characters each × 500 orders = 20,000 characters
Compression: 3.75x smaller. Same food gets cooked.
This is exactly what TurboQuant does to AI models.
Instead of storing each number at full precision (e.g. 0.4567689... at 16 or 32 bits), it stores each KV-cache value as a short code (e.g. 1010, just 4 bits) that maps to an entry in a shared codebook. The codebook is agreed in advance and is the same for every layer and every conversation.
The Trick: Spin, Snap, Pack
TurboQuant compresses each K (or V) vector in the steps below. Let me walk through the entire compression and decompression process with a real 4-dimensional vector you can verify by hand.
We'll use 2-bit quantization (4 grid levels) to keep things small. Real models use something like 4-bit (16 levels) and 256 dimensions, but the math is identical.
Our example vector
K = [1.2, -0.8, 0.5, -1.1]
A normal, balanced vector with no outliers, kept simple to understand.
That's 4 numbers at 4 bytes (float32) each = 16 bytes.
Let's compress it to 3 bytes and then try to bring it back (there will be slight errors/residuals).
Step 1: Measure the Norm and Normalize
Measure the magnitude (norm) of the vector:
norm = sqrt(1.2**2 + (-0.8)**2 + 0.5**2 + (-1.1)**2)
     = 1.8815
We save the norm as a float16 number, which is 2 bytes.
Now divide the vector by norm to get a unit vector
unit = K / 1.8815 = [+0.6378, -0.4252, +0.2657, -0.5846]
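Step 1 can be sketched in a few lines of Python (plain floats here, where a real implementation would use float32 tensors and store the norm as float16):

```python
import math

# Toy K vector from the example
K = [1.2, -0.8, 0.5, -1.1]

# Measure the norm, then divide by it to get a unit vector
norm = math.sqrt(sum(x * x for x in K))
unit = [x / norm for x in K]

print(round(norm, 4))               # 1.8815
print([round(u, 4) for u in unit])  # [0.6378, -0.4252, 0.2657, -0.5846]
```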
Step 2: Spin (Random Rotation)
Multiply the unit vector by a random 4×4 rotation matrix R:
Rotation matrix R (generated once from seed=42, never changes):
[+0.8395, +0.4120, -0.2597, +0.2411]
[+0.2956, -0.5448, -0.4830, -0.6185]
[-0.3277, +0.7138, -0.3802, -0.4884]
[-0.3171, -0.1547, -0.7448, +0.5664]
Each rotated value is a dot product of one row of R with the unit vector:
rotated[0] = (+0.8395 × +0.6378) + (+0.4120 × -0.4252)
+ (-0.2597 × +0.2657) + (+0.2411 × -0.5846)
= +0.5355 - 0.1752 - 0.0690 - 0.1410
= +0.1503
rotated[1] = (+0.2956 × +0.6378) + (-0.5448 × -0.4252)
+ (-0.4830 × +0.2657) + (-0.6185 × -0.5846)
= +0.1886 + 0.2317 - 0.1283 + 0.3615
= +0.6534
rotated[2] = (-0.3277 × +0.6378) + (+0.7138 × -0.4252)
+ (-0.3802 × +0.2657) + (-0.4884 × -0.5846)
= -0.2090 - 0.3035 - 0.1010 + 0.2856
= -0.3280
rotated[3] = (-0.3171 × +0.6378) + (-0.1547 × -0.4252)
+ (-0.7448 × +0.2657) + (+0.5664 × -0.5846)
= -0.2023 + 0.0658 - 0.1979 - 0.3312
= -0.6655
Before and after Rotation
Before (unit vector): [+0.638, -0.425, +0.266, -0.585]
After (rotated): [+0.150, +0.653, -0.328, -0.666]
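The rotation is just a matrix-vector multiply. Here it is as code, using the example matrix above (in a real model R would be a 256x256 orthogonal matrix generated once from a fixed seed):

```python
# The example 4x4 rotation matrix from the text (seed=42, fixed forever)
R = [
    [+0.8395, +0.4120, -0.2597, +0.2411],
    [+0.2956, -0.5448, -0.4830, -0.6185],
    [-0.3277, +0.7138, -0.3802, -0.4884],
    [-0.3171, -0.1547, -0.7448, +0.5664],
]
unit = [+0.6378, -0.4252, +0.2657, -0.5846]

# Each rotated coordinate is the dot product of one row of R with the vector
rotated = [sum(r * u for r, u in zip(row, unit)) for row in R]
print([round(x, 4) for x in rotated])  # [0.1503, 0.6534, -0.328, -0.6655]
```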
Step 3: Snap, Building the Grid
We have 4 rotated numbers: [+0.150, +0.653, -0.328, -0.666]
We need to replace each one with the nearest "allowed" value from a small set. With 2 bits, we get to pick 4 allowed values. But which 4?
Lloyd-Max is the algorithm that finds this smart placement of the 4 grid values (I'm intentionally skipping the math).
It spreads the 4 levels so that the average snapping error is as small as possible.
Final codebook:
Level 0: -0.674
Level 1: -0.219
Level 2: +0.219
Level 3: +0.674
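For the curious, Lloyd-Max boils down to 1-D k-means: assign samples to their nearest level, move each level to the mean of its bucket, repeat. This is only an illustrative sketch, not TurboQuant's exact procedure, and the Gaussian stand-in distribution (std 0.5, a rough proxy for coordinates of a rotated 4-D unit vector) is my assumption:

```python
import random

def lloyd_max(samples, n_levels=4, iters=20):
    """Minimal 1-D Lloyd-Max (k-means) sketch: place n_levels grid
    values so the average snapping error on `samples` is small."""
    levels = sorted(random.sample(samples, n_levels))
    for _ in range(iters):
        buckets = [[] for _ in levels]
        for x in samples:
            i = min(range(len(levels)), key=lambda j: abs(x - levels[j]))
            buckets[i].append(x)
        # Move each level to the centroid of the values that snapped to it
        levels = [sum(b) / len(b) if b else lv
                  for b, lv in zip(buckets, levels)]
    return sorted(levels)

random.seed(42)
samples = [random.gauss(0.0, 0.5) for _ in range(10000)]
levels = lloyd_max(samples)
print([round(lv, 3) for lv in levels])  # 4 levels, roughly symmetric around 0
```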
Step 4: Snap to Grid
Now snap each rotated value to the nearest codebook level
R-Vector: [+0.150, +0.653, -0.328, -0.666]
CodeBook: [-0.674, -0.219, +0.219, +0.674]
Indices: 0 1 2 3
Let's map each value in the R-vector to its closest codebook level:
value0 = +0.150 -> nearest Level 2(+0.219) error = 0.069
value1 = +0.653 -> nearest Level 3(+0.674) error = 0.021 <- tiny!
value2 = -0.328 -> nearest Level 1(-0.219) error = 0.109
value3 = -0.666 -> nearest Level 0(-0.674) error = 0.009 <- almost exact!
The R-Vector gets transformed to indices
[+0.150, +0.653, -0.328, -0.666] --> [2,3,1,0]
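The snapping step is just a nearest-neighbour search over the 4 levels:

```python
# Snap each rotated value to the index of the nearest codebook level
codebook = [-0.674, -0.219, +0.219, +0.674]
rotated = [+0.150, +0.653, -0.328, -0.666]

indices = [min(range(len(codebook)), key=lambda i: abs(x - codebook[i]))
           for x in rotated]
print(indices)  # [2, 3, 1, 0]
```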
Step 5: Pack
We need to store the indices [2,3,1,0]. Each index fits in 2 bits, so all four fit in a single byte: putting index i at bit positions 2i and 2i+1 gives 00 01 11 10 = 0x1E.
So the total is 2 bytes (for the norm) + 1 byte (for the indices) = 3 bytes.
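Packing and unpacking are pure bit twiddling; putting index i at bit position 2i reproduces the 0x1E byte:

```python
# Pack four 2-bit indices into one byte: index i goes to bits 2i..2i+1
indices = [2, 3, 1, 0]
packed = 0
for i, idx in enumerate(indices):
    packed |= idx << (2 * i)
print(hex(packed))  # 0x1e

# Unpack to check the round-trip is lossless
unpacked = [(packed >> (2 * i)) & 0b11 for i in range(4)]
print(unpacked)     # [2, 3, 1, 0]
```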
Decompression/unpacking
- indices to numbers from the codebook [2,3,1,0] --> [+0.219, +0.674, -0.219, -0.674]
- unrotate/unspin by multiplying by R^T (the transpose); for a rotation matrix, the transpose is the inverse
unit_hat[0] = (+0.8395 × +0.219) + (+0.2956 × +0.674)
+ (-0.3277 × -0.219) + (-0.3171 × -0.674)
= +0.184 + 0.199 + 0.072 + 0.214
= +0.669
unit_hat[1] = (+0.4120 × +0.219) + (-0.5448 × +0.674)
+ (+0.7138 × -0.219) + (-0.1547 × -0.674)
= +0.090 - 0.367 - 0.156 + 0.104
= -0.329
unit_hat[2] = (-0.2597 × +0.219) + (-0.4830 × +0.674)
+ (-0.3802 × -0.219) + (-0.7448 × -0.674)
= -0.057 - 0.326 + 0.083 + 0.502
= +0.203
unit_hat[3] = (+0.2411 × +0.219) + (-0.6185 × +0.674)
+ (-0.4884 × -0.219) + (+0.5664 × -0.674)
= +0.053 - 0.417 + 0.107 - 0.382
= -0.639
- scale back with norm value
K_hat = unit_hat × 1.8815
K_hat = [+1.259, -0.620, +0.382, -1.202]
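The whole decompression path fits in a few lines (tiny last-digit differences versus the hand calculation come from intermediate rounding):

```python
# Decompression sketch: codebook lookup, rotate back with R^T, rescale
R = [
    [+0.8395, +0.4120, -0.2597, +0.2411],
    [+0.2956, -0.5448, -0.4830, -0.6185],
    [-0.3277, +0.7138, -0.3802, -0.4884],
    [-0.3171, -0.1547, -0.7448, +0.5664],
]
codebook = [-0.674, -0.219, +0.219, +0.674]
indices = [2, 3, 1, 0]
norm = 1.8815

decoded = [codebook[i] for i in indices]   # [0.219, 0.674, -0.219, -0.674]
# Multiplying by R^T: coordinate j is the dot product of column j of R
unit_hat = [sum(R[i][j] * decoded[i] for i in range(4)) for j in range(4)]
K_hat = [u * norm for u in unit_hat]
print([round(x, 3) for x in K_hat])  # close to the original [1.2, -0.8, 0.5, -1.1]
```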
The Verdict: Original vs Reconstructed
Original Reconstructed Error Relative
──────── ───────────── ───── ────────
dim 0: +1.200 +1.259 0.059 4.9%
dim 1: -0.800 -0.620 0.180 22.6%
dim 2: +0.500 +0.382 0.118 23.6%
dim 3: -1.100 -1.202 0.102 9.3%
Overall relative MSE: 1.7%
Look at that! We crushed a 16-byte vector down to 3 bytes and got it back with only 1.7% relative error. There is still some difference, though.
Now, let's redo the math with 4 bits (16 levels) instead of 2:
Original Reconstructed Error Relative
──────── ───────────── ───── ────────
dim 0: +1.200 +1.175 0.025 2.1%
dim 1: -0.800 -0.716 0.084 10.5%
dim 2: +0.500 +0.433 0.067 13.4%
dim 3: -1.100 -1.081 0.019 1.7%
Overall relative MSE: 0.4%
Since this is a toy example, the results already look good; applying this to a real model with ~256 dimensions (at 4 bits, 16 levels each) gives much better results.
Real Numbers: How Much Memory Does This Save?
I implemented TurboQuant from scratch and tested it on real Gemma 4 models. Here's what I measured:
Gemma 4 E2B (5.1 billion parameters)
At 128K token context:
Without TurboQuant:
Model weights: 10.25 GB
KV cache: 2.25 GB
Total: 12.50 GB
With TurboQuant (4-bit):
Model weights: 10.25 GB (unchanged — we only compress the cache)
KV cache: 0.58 GB
Total: 10.83 GB
Saved: 1.67 GB of KV cache memory
That's 3.9x compression of the KV cache.
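As a sanity check on where 3.9x comes from: with an assumed head dimension of 256 and an fp16 baseline (my assumptions for illustration, not measured Gemma internals), the per-vector byte counts land almost exactly on that ratio:

```python
# Rough per-vector storage, assuming a 256-dim head and fp16 baseline
d = 256
baseline_bytes = d * 2               # fp16: 2 bytes per number = 512
quantized_bytes = d * 4 // 8 + 2     # 4 bits per number + 2-byte norm = 130
print(round(baseline_bytes / quantized_bytes, 2))  # 3.94
```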
Extrapolating the numbers
Extending this 3.9x ratio as the context length and the baseline KV cache grow (these are approximations):
| Context Length | Baseline KV | TurboQuant KV | Saved | Compression |
|---|---|---|---|---|
| 8K tokens | 2.0 GB | 0.5 GB | 1.5 GB | 3.9x |
| 32K tokens | 8.1 GB | 2.0 GB | 6.0 GB | 3.9x |
| 128K tokens | 32.2 GB | 8.2 GB | 24.0 GB | 3.9x |
| 256K tokens | 64.4 GB | 16.4 GB | 48.0 GB | 3.9x |
P.S. I have skipped QJL (Quantized Johnson-Lindenstrauss), which handles the residuals.
Wrapping up
Rotate, snap, pack: three operations, 3.9x less KV cache memory, same answers.