Most conversations about scaling large language models focus on obvious factors like model size, training data, and GPU power. While those matter, they stop being the main constraint surprisingly quickly. Once you start dealing with long conversations and many users, memory becomes the limiting factor. Not just how much memory you have, but how efficiently you use it.
This is especially true during inference, when the model is actively generating responses. At that point, the system is not just running computations; it is also constantly reading and writing large amounts of intermediate data. That data, more than anything else, starts to define both cost and speed.
How LLMs actually store words like “cat”
When you type a word like “cat,” the model does not store it as text. It converts it into a vector of numbers, often thousands of values long. These numbers represent a position in a high-dimensional space where similar words are located near each other.
For example, in a simplified form:
cat → [0.21, -1.3, 0.7, 2.1, …]
dog → [0.25, -1.1, 0.6, 2.0, …]
car → [-2.0, 0.5, 1.2, -0.3, …]
Words like “cat” and “dog” end up close together, while “car” is far away. This is how meaning is encoded: not through definitions, but through relationships between vectors.
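One common way to make "close together" concrete is cosine similarity, which measures whether two vectors point in the same direction. The sketch below is only an illustration: it uses just the first four numbers of the toy vectors above, and the helper name `cosine_similarity` is ours rather than anything a particular model exposes.

```python
import math

# Toy vectors: only the first four values from the simplified example above.
cat = [0.21, -1.3, 0.7, 2.1]
dog = [0.25, -1.1, 0.6, 2.0]
car = [-2.0, 0.5, 1.2, -0.3]

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(cat, dog)  # close to 1: nearly the same direction
cat_car = cosine_similarity(cat, car)  # slightly negative: a different direction
print(f"cat vs dog: {cat_dog:.2f}")
print(f"cat vs car: {cat_car:.2f}")
```

A score near 1 means two vectors point almost the same way, while a score near 0 or below means they are unrelated or opposed.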
In real models, these vectors are much larger. A common size is around 4096 numbers per token. Each number is typically stored using 16 or 32 bits.
Why one word expands into thousands of numbers
What is less obvious is that the embedding vector is only the starting point. As the token moves through the model, it gets transformed at every layer. At each layer, the model produces two new vectors per token, called keys and values. These are stored in what is known as the KV cache so the model can refer back to earlier tokens.
If you calculate how many numbers are stored per token for a model with 32 layers and 4096-dimensional keys and values, it looks like this:
32 layers × 2 (K + V) × 4096 ≈ 262,000 numbers per token
So a single word like “cat” ends up associated with hundreds of thousands of numbers as it moves through the model.
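The same count can be written as a tiny function. This is a sketch under the assumptions used above (32 layers, one key and one value vector per layer, 4096 numbers per vector); the function name is made up for this post, and real models can differ in all three numbers.

```python
# Assumed model shape, taken from the estimate above.
NUM_LAYERS = 32
VECTORS_PER_LAYER = 2   # one key vector and one value vector
HIDDEN_SIZE = 4096      # numbers in each key or value vector

def kv_values_per_token(num_layers, hidden_size, vectors_per_layer=2):
    """Count how many numbers the KV cache holds for a single token."""
    return num_layers * vectors_per_layer * hidden_size

per_token = kv_values_per_token(NUM_LAYERS, HIDDEN_SIZE, VECTORS_PER_LAYER)
print(per_token)  # 262144
```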
Why the KV cache consumes so much RAM
Once you understand how many numbers are involved, the memory usage becomes easier to grasp. Consider a conversation with around 2000 tokens. That leads to:
2000 × 262,000 ≈ 524,000,000 numbers
Even if each number is stored in 16 bits (2 bytes), that comes to:
≈ 1 GB of memory for a single conversation
And this is just for the KV cache. In a real system serving many users simultaneously, this memory usage scales quickly into tens or hundreds of gigabytes.
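To see how quickly this adds up, here is a rough sketch that multiplies the single-conversation estimate by a number of simultaneous conversations. The figure of 100 conversations is a hypothetical load chosen for illustration, and the calculation assumes every conversation holds the full 2000 tokens.

```python
VALUES_PER_TOKEN = 32 * 2 * 4096   # layers x (K + V) x vector size
BYTES_PER_VALUE = 2                # 16-bit storage

def kv_cache_bytes(tokens, conversations=1):
    """Estimate KV cache size in bytes for equally long conversations."""
    return tokens * VALUES_PER_TOKEN * BYTES_PER_VALUE * conversations

single = kv_cache_bytes(2000)
hundred = kv_cache_bytes(2000, conversations=100)  # hypothetical load

print(f"one conversation:  {single / 1e9:.2f} GB")
print(f"100 conversations: {hundred / 1e9:.1f} GB")
```

Under these assumptions, a hundred simultaneous conversations already need on the order of 100 GB for the cache alone.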
An additional complication is that moving this data is expensive. In many cases, reading the cached values out of memory takes longer than performing the actual mathematical operations on them. This means memory bandwidth, not compute, becomes the performance bottleneck.
Why reducing precision alone is not enough
A natural solution is to reduce how many bits are used per number. Instead of 16 or 32 bits, you could use 8 bits. This is a standard technique known as quantization.
It works to some extent, but pushing it too far causes problems. The model relies on subtle numerical relationships, especially in attention calculations. If the numbers become too coarse, those relationships break down and accuracy drops.
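A small experiment shows why coarse numbers hurt. The sketch below uses plain symmetric rounding, which is a generic quantization scheme and not TurboQuant itself, on a handful of made-up values, and reports the worst rounding error at 8, 4, and 2 bits.

```python
def quantize_roundtrip(values, bits):
    """Round values to signed integer codes of the given width, then rebuild them.

    Needs bits >= 2 so that at least one nonzero code exists.
    """
    max_code = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4, 1 for 2
    scale = max(abs(v) for v in values) / max_code
    codes = [round(v / scale) for v in values]
    return [c * scale for c in codes]

def max_error(values, bits):
    """Largest absolute difference between a value and its rebuilt version."""
    rebuilt = quantize_roundtrip(values, bits)
    return max(abs(v - r) for v, r in zip(values, rebuilt))

values = [0.2, -0.9, 1.4, 0.6, 0.05, -0.33]   # made-up sample values
for bits in (8, 4, 2):
    print(bits, "bits -> worst error", round(max_error(values, bits), 3))
```

The worst-case error grows each time the bit width shrinks, which is the effect that more careful schemes try to control.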
The key idea behind TurboQuant: scale plus codes
Instead of storing each number directly, TurboQuant and similar approaches store numbers in a structured way: a scale shared by a group of values, plus a small integer code for each value.
For example, instead of storing:
1.44
you might store:
scale = 0.5
code = 3
and reconstruct:
value ≈ scale × code = 1.5
This allows a small set of integers, represented with very few bits, to approximate a wide range of real values.
Here is a small illustrative example:
Original: [0.2, -0.9, 1.4, 0.6]
scale = 0.47
codes = [0, -2, 3, 1]
Reconstructed ≈ [0, -0.94, 1.41, 0.47]
The numbers are not exact, but they are close enough to preserve the overall structure.
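The illustrative example above can be reproduced in a few lines. This is a minimal sketch of the scale-plus-codes idea only, with the scale of 0.47 taken as given from the example; the function names are ours, and this is not TurboQuant's actual algorithm.

```python
def quantize(values, scale):
    """Turn each value into a small integer code by rounding value / scale."""
    return [round(v / scale) for v in values]

def dequantize(codes, scale):
    """Rebuild approximate values from integer codes and the shared scale."""
    return [c * scale for c in codes]

original = [0.2, -0.9, 1.4, 0.6]
scale = 0.47                       # taken from the example above
codes = quantize(original, scale)
rebuilt = dequantize(codes, scale)

print(codes)
print([round(r, 2) for r in rebuilt])
```

The codes here range from -2 to 3, so each one would fit in a 3-bit signed integer, while the scale is stored once for the whole group.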
How TurboQuant preserves attention accuracy
The main challenge with aggressive compression is that it can distort dot products, which are central to attention. Attention works by comparing vectors and deciding which previous tokens are most relevant.
TurboQuant addresses this by adding a lightweight correction step that reduces systematic errors introduced during compression. The goal is not to perfectly reconstruct every number, but to preserve the relative ordering of attention scores.
For example, suppose the model computes:
Token A → 5.36
Token B → 3.18
After compression, this might become:
Token A → 5.10
Token B → 3.28
The exact values change, but the ordering stays the same. Token A is still more important than Token B, so the model attends to the same tokens and its output stays nearly the same.
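The effect is easy to try on toy data. The sketch below compresses two made-up key vectors with plain scale-plus-codes rounding (codes limited to the range -3 to 3) and compares dot products against a made-up query before and after. It does not include TurboQuant's correction step, and all vectors and names are hypothetical.

```python
def quantize_roundtrip(vector, max_code=3):
    """Compress a vector to integer codes in [-max_code, max_code], then rebuild it."""
    scale = max(abs(v) for v in vector) / max_code
    return [round(v / scale) * scale for v in vector]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Made-up query and key vectors, for illustration only.
query = [1.0, 2.0, 0.5, -1.0]
key_a = [1.2, 1.9, 0.4, -0.7]
key_b = [0.3, 1.1, -0.8, 0.2]

exact = {"A": dot(query, key_a), "B": dot(query, key_b)}
approx = {"A": dot(query, quantize_roundtrip(key_a)),
          "B": dot(query, quantize_roundtrip(key_b))}

print("exact: ", exact)
print("approx:", approx)
print("same ranking:", (exact["A"] > exact["B"]) == (approx["A"] > approx["B"]))
```

With only two well-separated scores, plain rounding is already enough to keep the ranking. The harder case is many tokens with similar scores, which is where a correction step matters.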
How much RAM TurboQuant actually saves
By storing each value in around 3 bits instead of 16, TurboQuant can significantly shrink the KV cache. Reported results suggest a reduction in memory usage of around 6×.
That means a KV cache that previously required about 1 GB could be reduced to:
≈ 150–200 MB
This has several practical benefits:
- longer context windows become feasible
- more users can be served per GPU
- latency improves due to reduced memory movement
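The memory estimate above can be sanity-checked with some quick arithmetic. The sketch below assumes the 2000-token cache from earlier and counts only the bits for the values themselves, ignoring scales and any other metadata, so it is a rough bound rather than a measurement.

```python
VALUES = 2000 * 32 * 2 * 4096          # values in the 2000-token example cache
BITS_BEFORE = 16
BITS_AFTER = 3                         # "around 3 bits" per value

def cache_megabytes(num_values, bits_per_value):
    """Size of the cache in megabytes, counting value bits only."""
    return num_values * bits_per_value / 8 / 1e6

before_mb = cache_megabytes(VALUES, BITS_BEFORE)
after_mb = cache_megabytes(VALUES, BITS_AFTER)   # ignores scales and metadata

print(f"before: {before_mb:.0f} MB")
print(f"after:  {after_mb:.0f} MB")
print(f"ratio:  {before_mb / after_mb:.1f}x")
print(f"at the reported 6x: {before_mb / 6:.0f} MB")
```

Going from 16 bits to exactly 3 bits gives a ratio of 16/3, or about 5.3×, so the bit-count estimate and the reported figure of around 6× both land inside the 150–200 MB range.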
A useful way to think about TurboQuant
One way to understand this is to compare a full transcript with a set of notes. Without compression, the model stores a very detailed numerical record of everything it has seen. With TurboQuant, it stores a compressed version that keeps what matters for future decisions.
The details are not perfectly preserved, but the relationships are. And for the model, that is what matters most.
Why this matters for the future of LLMs
As models continue to grow and context lengths increase, memory and memory bandwidth are becoming central challenges. Techniques like TurboQuant suggest that significant efficiency gains are still possible without changing model architecture or training methods.
At a deeper level, this highlights how LLMs actually work. They operate on vectors and relationships in high-dimensional space. Once you accept that, it becomes clear that exact precision is often unnecessary. What matters is preserving the structure of that space well enough for the model to make correct decisions.
TurboQuant is essentially an answer to one question: how much can we compress these representations while keeping the model’s behavior intact? The answer, at least so far, is that we can compress them far more than most people would expect.