Bharath Kadaluri
TurboQuant: How a Simple Spin Saves Gigabytes of GPU Memory

Start With a Restaurant

Before we talk about AI, let me tell you about a busy restaurant.

Imagine you manage a popular restaurant. Every order gets written down in full:

Order #1: "One Chicken Biryani with extra raita and no onions, 
           One Butter Naan, Two Mango Lassi, Table 5, 7:30 PM"

Order #2: "Two Margherita Pizza with extra cheese and thin crust,
           One Veg Burger with no mayo, One Coke, Table 12, 7:45 PM"

Order #3: "Three Chicken Burger with extra spicy sauce,
           Two Masala Dosa, One Filter Coffee, Table 8, 8:00 PM"

Each order takes ~150 characters. On a busy Friday night with 500 orders, that's 75,000 characters: pages and pages of scribbled notes. The kitchen is drowning in paper.

Now, what if the kitchen agrees on a code system?

Codebook (shared between waiter and kitchen):

  Food:                  Extras:              Drinks:
  CB = Chicken Biryani   +R  = extra raita    ML = Mango Lassi
  MP = Margherita Pizza  +C  = extra cheese   CK = Coke
  VB = Veg Burger        +SP = extra spicy    FC = Filter Coffee
  KB = Chicken Burger    -O  = no onions
  BN = Butter Naan       -M  = no mayo
  MD = Masala Dosa       TC  = thin crust

Now those same orders become:

Order #1:  1×CB+R-O  1×BN  2×ML  T5  730P
Order #2:  2×MP+C,TC  1×VB-M  1×CK  T12  745P
Order #3:  3×KB+SP  2×MD  1×FC  T8  800P

The kitchen decodes instantly: "1×CB+R-O" -> "One Chicken Biryani, extra raita, no onions." No ambiguity, no lost information, just shorter.

Full orders:   ~150 characters each × 500 orders = 75,000 characters
Coded orders:  ~40 characters each  × 500 orders = 20,000 characters

Compression: 3.75x smaller. Same food gets cooked.

This is exactly what TurboQuant does to AI models.

Instead of storing each number at full precision (e.g., 0.4567689... in 16 or 32 bits), TurboQuant stores a short code in the KV cache (e.g., 1010, just 4 bits) that maps to a value in a shared codebook. The codebook is agreed in advance, so it's the same for every layer and every conversation.
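The codebook idea can be sketched in a few lines of NumPy. The uniform grid below is an assumption for illustration (TurboQuant uses a Lloyd-Max-style codebook, covered later); the point is just the encode/decode round trip through shared indices.

```python
import numpy as np

# Toy codebook quantization: 4-bit codes = 16 shared levels.
# A uniform grid is assumed here purely for illustration.
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float32)

values = np.array([0.4567689, -0.83, 0.12], dtype=np.float32)

# Encode: each float32 becomes a 4-bit index into the shared codebook.
indices = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)

# Decode: look the indices back up in the same codebook.
decoded = codebook[indices]

print(indices)   # small integers, 4 bits each
print(decoded)   # close to the original values
```

Every decoded value lands within half a grid step of the original, which is the "no lost food orders, just shorter tickets" property from the analogy.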

The Trick: Spin, Snap, Pack

TurboQuant compresses each K (or V) vector in a few steps. Let me walk through how compression and decompression happen, using a real 4-dimensional vector you can verify by hand.

We'll use 2-bit quantization (4 grid levels) to keep things small. Real models use something like 4-bit (16 levels) and 256 dimensions, but the math is identical.

Our example vector

K = [1.2, -0.8, 0.5, -1.1]

A normal, balanced vector with no outliers, kept simple so you can follow along.

That's 4 numbers at 4 bytes (float32) each = 16 bytes.
Let's compress it to 3 bytes and then try to bring it back (there will be slight errors/residual).

Step 1: Measure the Norm and Normalize

Measure the magnitude/norm of the vector

norm = sqrt(1.2**2 + 0.8**2 + 0.5**2 +1.1**2)
     = 1.8815

We save the norm as a float16 number, which is 2 bytes.

Now divide the vector by the norm to get a unit vector:

unit = K / 1.8815 = [+0.6378, -0.4252, +0.2657, -0.5846]
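Step 1 is two lines of NumPy, a minimal sketch of the norm-and-normalize step:

```python
import numpy as np

# Step 1: measure the norm, store it as float16 (2 bytes), normalize.
K = np.array([1.2, -0.8, 0.5, -1.1], dtype=np.float32)

norm = np.float16(np.linalg.norm(K))   # ~1.88, stored in 2 bytes
unit = K / np.float32(norm)

print(norm)   # ~1.88
print(unit)   # ~[+0.6378, -0.4252, +0.2657, -0.5846]
```

Casting to float16 loses a little precision on the norm, but the unit vector carries the shape of the data and that's what gets quantized next.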

Step 2: Spin (Random Rotation)

Multiply the unit vector by a random 4×4 rotation matrix R:

Rotation matrix R (generated once from seed=42, never changes):

  [+0.8395, +0.4120, -0.2597, +0.2411]
  [+0.2956, -0.5448, -0.4830, -0.6185]
  [-0.3277, +0.7138, -0.3802, -0.4884]
  [-0.3171, -0.1547, -0.7448, +0.5664]

Each rotated value is a dot product of one row of R with the unit vector:

rotated[0] = (+0.8395 × +0.6378) + (+0.4120 × -0.4252)
           + (-0.2597 × +0.2657) + (+0.2411 × -0.5846)
           = +0.5355 - 0.1752 - 0.0690 - 0.1410
           = +0.1503

rotated[1] = (+0.2956 × +0.6378) + (-0.5448 × -0.4252)
           + (-0.4830 × +0.2657) + (-0.6185 × -0.5846)
           = +0.1886 + 0.2317 - 0.1283 + 0.3615
           = +0.6534

rotated[2] = (-0.3277 × +0.6378) + (+0.7138 × -0.4252)
           + (-0.3802 × +0.2657) + (-0.4884 × -0.5846)
           = -0.2090 - 0.3035 - 0.1010 + 0.2856
           = -0.3280

rotated[3] = (-0.3171 × +0.6378) + (-0.1547 × -0.4252)
           + (-0.7448 × +0.2657) + (+0.5664 × -0.5846)
           = -0.2023 + 0.0658 - 0.1979 - 0.3312
           = -0.6655

Before and after rotation:

Before (unit vector):   [+0.638, -0.425, +0.266, -0.585]
After (rotated):        [+0.150, +0.653, -0.328, -0.666]
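A standard way to generate a seeded random rotation is QR decomposition of a random matrix; this sketch uses that approach (the resulting matrix differs from the article's R, which came from another procedure, but the key property is the same):

```python
import numpy as np

# Step 2: spin the unit vector with a fixed random rotation.
# Q from the QR decomposition of a random matrix is orthogonal,
# so it rotates without changing any lengths.
rng = np.random.default_rng(42)
R, _ = np.linalg.qr(rng.standard_normal((4, 4)))

unit = np.array([0.6378, -0.4252, 0.2657, -0.5846])
rotated = R @ unit

# A rotation preserves length: the spun vector is still (nearly) unit.
print(np.linalg.norm(rotated))   # ~1.0
```

Because the matrix is generated once from a fixed seed, encoder and decoder can reconstruct the exact same R without ever storing it.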

Step 3: Build the Grid

We have 4 rotated numbers: [+0.150, +0.653, -0.328, -0.666]

We need to replace each one with the nearest "allowed" value from a small set. With 2 bits, we get to pick 4 allowed values. But which 4?

Lloyd-Max is the algorithm that finds this smart placement (I'm intentionally skipping the math). It produces these 4 grid values:

Final codebook:
    Level 0: -0.674
    Level 1: -0.219
    Level 2: +0.219
    Level 3: +0.674

Step 4: Snap to Grid

Now snap each rotated value to the nearest codebook level.

R-Vector: [+0.150, +0.653, -0.328, -0.666]
Codebook: [-0.674, -0.219, +0.219, +0.674]
Indices:        0        1        2        3

Let's map each value in the R-vector to its closest codebook level:

value0 = +0.150 -> nearest Level 2 (+0.219), error = 0.069
value1 = +0.653 -> nearest Level 3 (+0.674), error = 0.021 <- tiny!
value2 = -0.328 -> nearest Level 1 (-0.219), error = 0.109
value3 = -0.666 -> nearest Level 0 (-0.674), error = 0.009 <- almost exact!

The R-vector gets transformed into indices:

[+0.150, +0.653, -0.328, -0.666] --> [2,3,1,0]
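The snap step above is a one-line nearest-neighbor lookup in NumPy:

```python
import numpy as np

# Step 4: snap each rotated coordinate to its nearest codebook level.
rotated = np.array([0.150, 0.653, -0.328, -0.666])
codebook = np.array([-0.674, -0.219, 0.219, 0.674])

# For each value, find the index of the closest codebook entry.
indices = np.abs(rotated[:, None] - codebook[None, :]).argmin(axis=1)
print(indices)   # [2 3 1 0]
```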

Step 5: Pack

We need to store the indices [2,3,1,0]. Each index fits in 2 bits, so all four pack into a single byte: putting index 0 in the lowest bits gives 0b00011110 = 0x1E.

So total bytes = 2 bytes (for storing the norm) + 1 byte (for the indices) = 3 bytes.

Decompression/unpacking

  • Map the indices back to codebook values: [2,3,1,0] --> [+0.219, +0.674, -0.219, -0.674]
  • Un-rotate (un-spin) by multiplying with R^T (the transpose of R):

unit_hat[0] = (+0.8395 × +0.219) + (+0.2956 × +0.674)
            + (-0.3277 × -0.219) + (-0.3171 × -0.674)
            = +0.184 + 0.199 + 0.072 + 0.214
            = +0.669

unit_hat[1] = (+0.4120 × +0.219) + (-0.5448 × +0.674)
            + (+0.7138 × -0.219) + (-0.1547 × -0.674)
            = +0.090 - 0.367 - 0.156 + 0.104
            = -0.329

unit_hat[2] = (-0.2597 × +0.219) + (-0.4830 × +0.674)
            + (-0.3802 × -0.219) + (-0.7448 × -0.674)
            = -0.057 - 0.326 + 0.083 + 0.502
            = +0.203

unit_hat[3] = (+0.2411 × +0.219) + (-0.6185 × +0.674)
            + (-0.4884 × -0.219) + (+0.5664 × -0.674)
            = +0.053 - 0.417 + 0.107 - 0.382
            = -0.639

  • Scale back with the stored norm:

K_hat = unit_hat × 1.8815
K_hat = [+1.259, -0.620, +0.382, -1.202]
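The whole decode path, lookup, un-rotate, rescale, fits in a few lines, using the article's R matrix verbatim:

```python
import numpy as np

# Decompression: codebook lookup, un-rotate with R^T, rescale by the norm.
R = np.array([
    [+0.8395, +0.4120, -0.2597, +0.2411],
    [+0.2956, -0.5448, -0.4830, -0.6185],
    [-0.3277, +0.7138, -0.3802, -0.4884],
    [-0.3171, -0.1547, -0.7448, +0.5664],
])
codebook = np.array([-0.674, -0.219, 0.219, 0.674])

indices = np.array([2, 3, 1, 0])   # the packed 2-bit codes
norm = 1.8815                      # the stored float16 norm

dequant = codebook[indices]        # [+0.219, +0.674, -0.219, -0.674]
unit_hat = R.T @ dequant           # undo the rotation
K_hat = unit_hat * norm

print(np.round(K_hat, 3))          # close to the original [1.2, -0.8, 0.5, -1.1]
```

Full-precision arithmetic lands within ~0.001 of the hand-rounded values in the walkthrough.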

The Verdict: Original vs Reconstructed

         Original    Reconstructed    Error     Relative
         ────────    ─────────────    ─────     ────────
dim 0:    +1.200      +1.259          0.059      4.9%
dim 1:    -0.800      -0.620          0.180     22.6%
dim 2:    +0.500      +0.382          0.118     23.6%
dim 3:    -1.100      -1.202          0.102      9.3%

Overall relative MSE: 1.7%

Look at that! We crushed a 16-byte vector down to 3 bytes and got it back with 1.7% overall relative error. There is still some difference, though.
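The "overall relative MSE" figure in the table is total squared error divided by total signal power, which you can verify directly:

```python
import numpy as np

# Relative MSE: sum of squared errors over sum of squared signal.
K     = np.array([1.200, -0.800, 0.500, -1.100])
K_hat = np.array([1.259, -0.620, 0.382, -1.202])

rel_mse = np.sum((K - K_hat) ** 2) / np.sum(K ** 2)
print(f"{rel_mse:.1%}")   # 1.7%
```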

Now, let's do the math with 4 bits (16 levels) instead of 2:

        Original    Reconstructed    Error     Relative
        ────────    ─────────────    ─────     ────────
dim 0:    +1.200      +1.175          0.025      2.1%
dim 1:    -0.800      -0.716          0.084     10.5%
dim 2:    +0.500      +0.433          0.067     13.4%
dim 3:    -1.100      -1.081          0.019      1.7%

Overall relative MSE: 0.4%

Even in this toy example the results look good; applying this to a real model with ~256 dimensions gives much better results, because the quantization errors average out across many more coordinates.

Real Numbers: How Much Memory Does This Save?

I implemented TurboQuant from scratch and tested it on real Gemma 4 models. Here's what I measured:

Gemma 4 E2B (5.1 billion parameters)

At 128K token context:

Without TurboQuant:
  Model weights:  10.25 GB
  KV cache:        2.25 GB
  Total:          12.50 GB

With TurboQuant (4-bit):
  Model weights:  10.25 GB  (unchanged — we only compress the cache)
  KV cache:        0.58 GB
  Total:          10.83 GB

Saved: 1.67 GB of KV cache memory

That's a 3.9x reduction in KV-cache size (2.25 GB down to 0.58 GB).
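The ratio follows directly from the measured numbers; the per-vector float16 norm is the overhead that keeps a 16-bit-to-4-bit scheme slightly under a clean 4x:

```python
# Back-of-the-envelope compression from the measured KV-cache sizes above.
baseline_kv_gb = 2.25
turboquant_kv_gb = 0.58

ratio = baseline_kv_gb / turboquant_kv_gb
saved = baseline_kv_gb - turboquant_kv_gb
print(f"{ratio:.1f}x compression, {saved:.2f} GB saved")   # 3.9x, 1.67 GB
```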

Extrapolating the numbers

Extending this 3.9x ratio as the context length (and the baseline KV cache) grows gives the following approximations:

Context Length   Baseline KV   TurboQuant KV   Saved     Compression
8K tokens        2.0 GB        0.5 GB          1.5 GB    3.9x
32K tokens       8.1 GB        2.0 GB          6.0 GB    3.9x
128K tokens      32.2 GB       8.2 GB          24.0 GB   3.9x
256K tokens      64.4 GB       16.4 GB         48.0 GB   3.9x

P.S. I've skipped QJL (Quantized Johnson-Lindenstrauss), which handles the residuals.

Wrapping up

Spin, snap, pack: three operations, 3.9x less KV cache memory, same answers.

