Bharath Kadaluri
TurboQuant: How a Simple Spin Saves Gigabytes of GPU Memory

Start With a Restaurant

Before we talk about AI, let me tell you about a busy restaurant.

Imagine you manage a popular restaurant. Every order gets written down in full:

Order #1: "One Chicken Biryani with extra raita and no onions, 
           One Butter Naan, Two Mango Lassi, Table 5, 7:30 PM"

Order #2: "Two Margherita Pizza with extra cheese and thin crust,
           One Veg Burger with no mayo, One Coke, Table 12, 7:45 PM"

Order #3: "Three Chicken Burger with extra spicy sauce,
           Two Masala Dosa, One Filter Coffee, Table 8, 8:00 PM"

Each order takes ~150 characters. On a busy Friday night with 500 orders, that's 75,000 characters: pages and pages of scribbled notes. The kitchen is drowning in paper.

Now, what if the kitchen agrees on a code system?

Codebook (shared between waiter and kitchen):

  Food:                  Extras:              Drinks:
  CB = Chicken Biryani   +R  = extra raita    ML = Mango Lassi
  MP = Margherita Pizza  +C  = extra cheese   CK = Coke
  VB = Veg Burger        +SP = extra spicy    FC = Filter Coffee
  KB = Chicken Burger    -O  = no onions
  BN = Butter Naan       -M  = no mayo
  MD = Masala Dosa       TC  = thin crust

Now those same orders become:

Order #1:  1×CB+R-O  1×BN  2×ML  T5  730P
Order #2:  2×MP+C,TC  1×VB-M  1×CK  T12  745P
Order #3:  3×KB+SP  2×MD  1×FC  T8  800P

The kitchen decodes instantly: "1×CB+R-O" -> "One Chicken Biryani, extra raita, no onions." No ambiguity, no lost information, just shorter.

Full orders:   ~150 characters each × 500 orders = 75,000 characters
Coded orders:  ~40 characters each  × 500 orders = 20,000 characters

Compression: 3.75x smaller. Same food gets cooked.

This is exactly what TurboQuant does to AI models.

Instead of storing each number at full precision (e.g., 0.4567689... in 16 or 32 bits), TurboQuant stores a short code in the KV cache (e.g., 1010, just 4 bits) that maps to a value in a shared codebook. The codebook is agreed in advance, so it's the same for every layer and every conversation.
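The codebook idea can be sketched in a few lines of NumPy. The uniform grid below is an assumption for illustration (TurboQuant uses a Lloyd-Max-style codebook, covered later); the point is just the encode/decode round trip through shared indices.

```python
import numpy as np

# Toy codebook quantization: 4-bit codes = 16 shared levels.
# A uniform grid is assumed here purely for illustration.
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float32)

values = np.array([0.4567689, -0.83, 0.12], dtype=np.float32)

# Encode: each float32 becomes a 4-bit index into the shared codebook.
indices = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)

# Decode: look the indices back up in the same codebook.
decoded = codebook[indices]

print(indices)   # small integers, 4 bits each
print(decoded)   # close to the original values
```

Every decoded value lands within half a grid step of the original, which is the "no lost food orders, just shorter tickets" property from the analogy.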

The Trick: Spin, Snap, Pack

TurboQuant compresses each K (or V) vector in a few steps. Let me walk through how compression and decompression happen, using a real 4-dimensional vector you can verify by hand.

We'll use 2-bit quantization (4 grid levels) to keep things small. Real models use something like 4-bit (16 levels) and 256 dimensions, but the math is identical.

Our example vector

K = [1.2, -0.8, 0.5, -1.1]

A normal, balanced vector with no outliers, kept simple so you can follow along.

That's 4 numbers at 4 bytes (float32) each = 16 bytes.
Let's compress it to 3 bytes and then try to bring it back (there will be slight errors/residual).

Step 1: Measure the Norm and Normalize

Measure the magnitude/norm of the vector

norm = sqrt(1.2**2 + 0.8**2 + 0.5**2 +1.1**2)
     = 1.8815

We save the norm as a float16 number, which is 2 bytes.

Now divide the vector by the norm to get a unit vector:

unit = K / 1.8815 = [+0.6378, -0.4252, +0.2657, -0.5846]
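Step 1 is two lines of NumPy, a minimal sketch of the norm-and-normalize step:

```python
import numpy as np

# Step 1: measure the norm, store it as float16 (2 bytes), normalize.
K = np.array([1.2, -0.8, 0.5, -1.1], dtype=np.float32)

norm = np.float16(np.linalg.norm(K))   # ~1.88, stored in 2 bytes
unit = K / np.float32(norm)

print(norm)   # ~1.88
print(unit)   # ~[+0.6378, -0.4252, +0.2657, -0.5846]
```

Casting to float16 loses a little precision on the norm, but the unit vector carries the shape of the data and that's what gets quantized next.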

Step 2: Spin (Random Rotation)

Multiply the unit vector by a random 4×4 rotation matrix R:

Rotation matrix R (generated once from seed=42, never changes):

  [+0.8395, +0.4120, -0.2597, +0.2411]
  [+0.2956, -0.5448, -0.4830, -0.6185]
  [-0.3277, +0.7138, -0.3802, -0.4884]
  [-0.3171, -0.1547, -0.7448, +0.5664]

Each rotated value is a dot product of one row of R with the unit vector:

rotated[0] = (+0.8395 × +0.6378) + (+0.4120 × -0.4252)
           + (-0.2597 × +0.2657) + (+0.2411 × -0.5846)
           = +0.5355 - 0.1752 - 0.0690 - 0.1410
           = +0.1503

rotated[1] = (+0.2956 × +0.6378) + (-0.5448 × -0.4252)
           + (-0.4830 × +0.2657) + (-0.6185 × -0.5846)
           = +0.1886 + 0.2317 - 0.1283 + 0.3615
           = +0.6534

rotated[2] = (-0.3277 × +0.6378) + (+0.7138 × -0.4252)
           + (-0.3802 × +0.2657) + (-0.4884 × -0.5846)
           = -0.2090 - 0.3035 - 0.1010 + 0.2856
           = -0.3280

rotated[3] = (-0.3171 × +0.6378) + (-0.1547 × -0.4252)
           + (-0.7448 × +0.2657) + (+0.5664 × -0.5846)
           = -0.2023 + 0.0658 - 0.1979 - 0.3312
           = -0.6655

Before and after rotation:

Before (unit vector):   [+0.638, -0.425, +0.266, -0.585]
After (rotated):        [+0.150, +0.653, -0.328, -0.666]
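A standard way to generate a seeded random rotation is QR decomposition of a random matrix; this sketch uses that approach (the resulting matrix differs from the article's R, which came from another procedure, but the key property is the same):

```python
import numpy as np

# Step 2: spin the unit vector with a fixed random rotation.
# Q from the QR decomposition of a random matrix is orthogonal,
# so it rotates without changing any lengths.
rng = np.random.default_rng(42)
R, _ = np.linalg.qr(rng.standard_normal((4, 4)))

unit = np.array([0.6378, -0.4252, 0.2657, -0.5846])
rotated = R @ unit

# A rotation preserves length: the spun vector is still (nearly) unit.
print(np.linalg.norm(rotated))   # ~1.0
```

Because the matrix is generated once from a fixed seed, encoder and decoder can reconstruct the exact same R without ever storing it.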

Step 3: Build the Grid

We have 4 rotated numbers: [+0.150, +0.653, -0.328, -0.666]

We need to replace each one with the nearest "allowed" value from a small set. With 2 bits, we get to pick 4 allowed values. But which 4?

Lloyd-Max is the algorithm that finds this smart placement (I'm intentionally skipping the math). It produces these 4 grid values:

Final codebook:
    Level 0: -0.674
    Level 1: -0.219
    Level 2: +0.219
    Level 3: +0.674

Step 4: Snap to Grid

Now snap each rotated value to the nearest codebook level.

R-Vector: [+0.150, +0.653, -0.328, -0.666]
Codebook: [-0.674, -0.219, +0.219, +0.674]
Indices:        0        1        2        3

Let's map each value in the R-vector to its closest codebook level:

value0 = +0.150 -> nearest Level 2 (+0.219), error = 0.069
value1 = +0.653 -> nearest Level 3 (+0.674), error = 0.021 <- tiny!
value2 = -0.328 -> nearest Level 1 (-0.219), error = 0.109
value3 = -0.666 -> nearest Level 0 (-0.674), error = 0.009 <- almost exact!

The R-vector gets transformed into indices:

[+0.150, +0.653, -0.328, -0.666] --> [2,3,1,0]
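The snap step above is a one-line nearest-neighbor lookup in NumPy:

```python
import numpy as np

# Step 4: snap each rotated coordinate to its nearest codebook level.
rotated = np.array([0.150, 0.653, -0.328, -0.666])
codebook = np.array([-0.674, -0.219, 0.219, 0.674])

# For each value, find the index of the closest codebook entry.
indices = np.abs(rotated[:, None] - codebook[None, :]).argmin(axis=1)
print(indices)   # [2 3 1 0]
```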

Step 5: Pack

We need to store the indices [2,3,1,0]. Each index fits in 2 bits, so all four pack into a single byte: putting index 0 in the lowest bits gives 0b00011110 = 0x1E.

So total bytes = 2 bytes (for storing the norm) + 1 byte (for the indices) = 3 bytes.

Decompression/unpacking

  • Map the indices back to codebook values: [2,3,1,0] --> [+0.219, +0.674, -0.219, -0.674]
  • Un-rotate (un-spin) by multiplying with R^T (the transpose of R):

unit_hat[0] = (+0.8395 × +0.219) + (+0.2956 × +0.674)
            + (-0.3277 × -0.219) + (-0.3171 × -0.674)
            = +0.184 + 0.199 + 0.072 + 0.214
            = +0.669

unit_hat[1] = (+0.4120 × +0.219) + (-0.5448 × +0.674)
            + (+0.7138 × -0.219) + (-0.1547 × -0.674)
            = +0.090 - 0.367 - 0.156 + 0.104
            = -0.329

unit_hat[2] = (-0.2597 × +0.219) + (-0.4830 × +0.674)
            + (-0.3802 × -0.219) + (-0.7448 × -0.674)
            = -0.057 - 0.326 + 0.083 + 0.502
            = +0.203

unit_hat[3] = (+0.2411 × +0.219) + (-0.6185 × +0.674)
            + (-0.4884 × -0.219) + (+0.5664 × -0.674)
            = +0.053 - 0.417 + 0.107 - 0.382
            = -0.639

  • Scale back with the stored norm:

K_hat = unit_hat × 1.8815
K_hat = [+1.259, -0.620, +0.382, -1.202]
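The whole decode path, lookup, un-rotate, rescale, fits in a few lines, using the article's R matrix verbatim:

```python
import numpy as np

# Decompression: codebook lookup, un-rotate with R^T, rescale by the norm.
R = np.array([
    [+0.8395, +0.4120, -0.2597, +0.2411],
    [+0.2956, -0.5448, -0.4830, -0.6185],
    [-0.3277, +0.7138, -0.3802, -0.4884],
    [-0.3171, -0.1547, -0.7448, +0.5664],
])
codebook = np.array([-0.674, -0.219, 0.219, 0.674])

indices = np.array([2, 3, 1, 0])   # the packed 2-bit codes
norm = 1.8815                      # the stored float16 norm

dequant = codebook[indices]        # [+0.219, +0.674, -0.219, -0.674]
unit_hat = R.T @ dequant           # undo the rotation
K_hat = unit_hat * norm

print(np.round(K_hat, 3))          # close to the original [1.2, -0.8, 0.5, -1.1]
```

Full-precision arithmetic lands within ~0.001 of the hand-rounded values in the walkthrough.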

The Verdict: Original vs Reconstructed

         Original    Reconstructed    Error     Relative
         ────────    ─────────────    ─────     ────────
dim 0:    +1.200      +1.259          0.059      4.9%
dim 1:    -0.800      -0.620          0.180     22.6%
dim 2:    +0.500      +0.382          0.118     23.6%
dim 3:    -1.100      -1.202          0.102      9.3%

Overall relative MSE: 1.7%

Look at that! We crushed a 16-byte vector down to 3 bytes and got it back with 1.7% overall relative error. There is still some difference, though.
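The "overall relative MSE" figure in the table is total squared error divided by total signal power, which you can verify directly:

```python
import numpy as np

# Relative MSE: sum of squared errors over sum of squared signal.
K     = np.array([1.200, -0.800, 0.500, -1.100])
K_hat = np.array([1.259, -0.620, 0.382, -1.202])

rel_mse = np.sum((K - K_hat) ** 2) / np.sum(K ** 2)
print(f"{rel_mse:.1%}")   # 1.7%
```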

Now, let's do the math with 4 bits (16 levels) instead of 2:

        Original    Reconstructed    Error     Relative
        ────────    ─────────────    ─────     ────────
dim 0:    +1.200      +1.175          0.025      2.1%
dim 1:    -0.800      -0.716          0.084     10.5%
dim 2:    +0.500      +0.433          0.067     13.4%
dim 3:    -1.100      -1.081          0.019      1.7%

Overall relative MSE: 0.4%

Even in this toy example the results look good; applying this to a real model with ~256 dimensions gives much better results, because the quantization errors average out across many more coordinates.

Real Numbers: How Much Memory Does This Save?

I implemented TurboQuant from scratch and tested it on real Gemma 4 models. Here's what I measured:

Gemma 4 E2B (5.1 billion parameters)

At 128K token context:

Without TurboQuant:
  Model weights:  10.25 GB
  KV cache:        2.25 GB
  Total:          12.50 GB

With TurboQuant (4-bit):
  Model weights:  10.25 GB  (unchanged — we only compress the cache)
  KV cache:        0.58 GB
  Total:          10.83 GB

Saved: 1.67 GB of KV cache memory

That's a 3.9x reduction in KV-cache size (2.25 GB down to 0.58 GB).
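The ratio follows directly from the measured numbers; the per-vector float16 norm is the overhead that keeps a 16-bit-to-4-bit scheme slightly under a clean 4x:

```python
# Back-of-the-envelope compression from the measured KV-cache sizes above.
baseline_kv_gb = 2.25
turboquant_kv_gb = 0.58

ratio = baseline_kv_gb / turboquant_kv_gb
saved = baseline_kv_gb - turboquant_kv_gb
print(f"{ratio:.1f}x compression, {saved:.2f} GB saved")   # 3.9x, 1.67 GB
```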

Extrapolating the numbers

Extending this 3.9x ratio as the context length (and the baseline KV cache) grows gives the following approximations:

Context Length   Baseline KV   TurboQuant KV   Saved     Compression
8K tokens        2.0 GB        0.5 GB          1.5 GB    3.9x
32K tokens       8.1 GB        2.0 GB          6.0 GB    3.9x
128K tokens      32.2 GB       8.2 GB          24.0 GB   3.9x
256K tokens      64.4 GB       16.4 GB         48.0 GB   3.9x

P.S. I've skipped QJL (Quantized Johnson-Lindenstrauss), which handles the residuals.

Wrapping up

Spin, snap, pack: three operations, 3.9x less KV cache memory, same answers.

