Long-context generation makes the KV cache hard to ignore.
Every generated token reuses keys and values from previous tokens. As the context grows, those cached tensors grow with it. So the natural first idea is simple:
Compress the KV cache, store fewer bytes, and get faster generation.
We tested that idea while exploring TurboQuant-style cache compression in a Hugging Face transformers fork.
Important scope note:
This is not a claim that the official TurboQuant research idea "does not work."
The external context is:
- Google Research introduced TurboQuant as a compression method for extreme KV-cache and vector compression: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
- The TurboQuant paper describes an online vector quantization approach with residual correction for inner-product preservation: https://arxiv.org/abs/2504.19874
- Hugging Face `transformers` exposes several cache strategies, including dynamic and quantized caches: https://huggingface.co/docs/transformers/en/kv_cache
What we tested was narrower:
Can we make a TurboQuant-style compressed-attention path useful inside a local eager `transformers` implementation?
The first major result was not that a particular backend won.
It was this:
Storage compression and attention execution are different problems.
A cache can become dramatically smaller while generation gets slower.
That single distinction changed the rest of the project.
The Mental Model
In decoder-only generation, each new token uses cached keys and values from previous tokens.
Simplified for one attention head:
a = softmax(q K^T)
o = a V
Where:
- `q` is the current query.
- `K` is the historical key cache.
- `V` is the historical value cache.
- `a` is the attention distribution.
- `o` is the output contribution from history.
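Under these definitions, one decode step for a single head can be sketched in a few lines of NumPy. This is a minimal illustration of the formulas above, not the library's implementation; the `1/sqrt(d)` scaling used by real models is omitted to match the simplified equations.

```python
import numpy as np

def decode_step(q, K, V):
    # q: (d,) current query; K: (T, d) cached keys; V: (T, d) cached values
    logits = K @ q                     # one dot product per historical token
    a = np.exp(logits - logits.max())
    a = a / a.sum()                    # softmax over the history dimension
    return a @ V                       # mix historical values with the weights

rng = np.random.default_rng(0)
d, T = 64, 128
q = rng.standard_normal(d)
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))
o = decode_step(q, K, V)               # output contribution, shape (d,)
```

Every decode step touches all `T` rows of both `K` and `V`, which is why both the storage and the execution cost grow with context length.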
Keys decide where to attend. Values provide the information that gets mixed.
When context length grows, both K and V grow.
So compression can target at least two different things:
- Store the cache in fewer bytes.
- Execute attention without reconstructing dense historical tensors.
Those sound related. In practice, they are different engineering targets.
The First Measurement
We started with existing cache behavior in transformers.
The baselines were:
- `DynamicCache`: dense eager execution.
- `quanto`: a strong storage-compression baseline.
- `hqq`: another quantized-cache baseline.
The benchmark below used HuggingFaceTB/SmolLM2-135M-Instruct in a generation scenario with a context of roughly 2048 tokens.
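For readers who want to reproduce a similar comparison, here is a hedged sketch of how these cache backends can be selected through the public `transformers` `generate()` API, following the KV-cache docs linked above. The `nbits=4` setting is an assumption for illustration, and model/input loading is elided.

```python
# Cache-backend selection per baseline, passed as extra kwargs to
# model.generate(). "dynamic" is the default DynamicCache path, so it
# needs no extra arguments.
baseline_kwargs = {
    "dynamic": {},
    "quanto": {"cache_implementation": "quantized",
               "cache_config": {"backend": "quanto", "nbits": 4}},
    "hqq": {"cache_implementation": "quantized",
            "cache_config": {"backend": "HQQ", "nbits": 4}},
}

# Example use (model and inputs assumed to exist):
# out = model.generate(**inputs, max_new_tokens=256, **baseline_kwargs["quanto"])
```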
We measured more than just stored bytes:
- generation latency
- stored cache footprint
- cache bytes per token
- sampled runtime memory
- whether generated outputs matched the dense baseline in simple cases
| Backend | What It Represents | Mean Latency | Cache Footprint | Cache Bytes / Token | Runtime Delta Peak |
|---|---|---|---|---|---|
| `dynamic` | dense eager baseline | 2.250 s | 50.911 MiB | 23040.0 | 0.102 GB |
| `quanto` | strong storage-compression baseline | 3.912 s | 0.913 MiB | 413.3 | 0.048 GB |
| `hqq` | alternative quantized-cache baseline | 9.770 s | 19.133 MiB | 8658.6 | 0.040 GB |
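The dense bytes-per-token figure is just the model's KV geometry. Assuming SmolLM2-135M's published configuration (30 layers, 3 KV heads, head dimension 64, values taken from the model card rather than measured here) and fp16 storage, the arithmetic reproduces the table:

```python
# Assumed SmolLM2-135M geometry: 30 layers, 3 KV heads, head_dim 64,
# 2 bytes per fp16 element. These numbers come from the model config,
# not from this benchmark.
num_layers, num_kv_heads, head_dim, bytes_per_elem = 30, 3, 64, 2

# K and V each store (num_kv_heads * head_dim) elements per layer per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token)  # 23040, matching the dense row
```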
The important row is `quanto`.
It reduced stored cache footprint from:
50.911 MiB -> 0.913 MiB
That is an excellent cache-size result.
But latency went from:
2.250 s -> 3.912 s
So cache storage got much smaller, while generation got slower.
That is not a paradox. It tells us what the backend is optimizing.
Why Smaller Storage Did Not Mean Faster Attention
The current generic quantized-cache shape in transformers is roughly:
- Produce new dense keys and values.
- Quantize them for storage.
- Keep compressed tensors in the cache.
- Later dequantize cached tensors.
- Return dense keys and values to normal attention.
So the attention implementation still consumes dense tensors.
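That pipeline can be made concrete with a minimal absmax-int8 round trip. This is a simplification of what real backends like quanto do (group sizes, zero-points, and bit packing are all omitted), but it shows the structural point: the compressed form exists only in storage.

```python
import numpy as np

def quantize(x):
    # Per-row absmax int8: this is what the cache *stores*.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    return np.round(x / scale).astype(np.int8), scale

def dequantize(codes, scale):
    # Dense tensors are rebuilt *before* attention ever runs.
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
K = rng.standard_normal((128, 64)).astype(np.float32)

codes, scale = quantize(K)            # 1 byte/element sits in the cache
K_dense = dequantize(codes, scale)    # attention still consumes this
```

The savings show up in `codes` (what is kept between steps); the cost shows up in the fact that `K_dense` must be rematerialized for every attention call.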
That means the architecture is:
compressed storage + dense execution
not:
compressed attention
The first design can save cache bytes.
The second design is needed if the goal is to make attention itself faster.
This distinction became the first real output of the project.
Why We Still Looked At TurboQuant-Style Compressed Attention
TurboQuant-style work was interesting because the bigger promise is not simply "store the KV cache with fewer bits."
The stronger target is:
- store historical keys in a compressed representation
- compute attention logits using that compressed representation
- avoid reconstructing every dense historical key each decode step
The ordinary dense key path computes:
logits_t = q . k_t
for every historical token t.
The compressed-key target is closer to:
logits_t ~= compressed_dot(q, code(k_t), residual(k_t))
without materializing every full k_t.
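One hedged way to read `compressed_dot` is the following illustrative construction (not the paper's exact scheme): keep an integer code per key plus a correction term, and form the logit from both pieces without rebuilding the dense key. The residual is kept dense here purely to keep the sketch short; TurboQuant-style schemes quantize the residual as well.

```python
import numpy as np

def encode_key(k):
    # Absmax int8 code plus the leftover quantization error.
    scale = np.abs(k).max() / 127.0
    code = np.round(k / scale).astype(np.int8)
    residual = k - code.astype(np.float32) * scale
    return code, scale, residual

def compressed_dot(q, code, scale, residual):
    # logits_t ~= q . k_t, computed from the compressed pieces
    # instead of a rematerialized dense k_t.
    return scale * (code.astype(np.float32) @ q) + residual @ q

rng = np.random.default_rng(0)
q = rng.standard_normal(64).astype(np.float32)
k = rng.standard_normal(64).astype(np.float32)

approx = compressed_dot(q, *encode_key(k))
exact = float(q @ k)
```

Because the residual here is exact, `approx` matches `exact` up to float rounding; a real scheme trades some of that fidelity for a compressed residual.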
That is an execution-path change.
It requires a different shape than a normal storage-only QuantizedCache backend.
That is why the project became less about "add another cache backend" and more about "change what attention actually consumes."
The Stable Compressed-Key Baseline
We built a stable compressed-key baseline to test that direction.
Internally, we called it `reference`. For a public reader, the better name is:
the stable compressed-key baseline
Its job was not to be the final optimized system. Its job was to prove that an end-to-end compressed-key attention path could exist in a Llama-style eager stack and provide a consistent comparison point for later experiments.
It kept:
- compressed historical keys
- compressed-key attention-logit computation
- residual correction behavior
- a full value path so correctness and fidelity stayed interpretable
That baseline survived the project better than the later value-path experiments.
The key lesson was:
The compressed-key path was not where most failures came from.
The failures came from values.
We also saw some directional evidence that compressed-key work might become more interesting as model/context size changes. But that evidence was not clean enough to be the headline result. The safe claim was narrower:
keep the compressed-key baseline as an internal anchor, but do not call it the final system.
Why Values Became The Hard Part
Attention has two major pieces:
- Compute attention weights from keys.
- Mix values using those weights.
Even if keys are compressed, the output still requires:
o = sum_t a_t v_t
If the implementation still reconstructs or processes values across most of history, the value path remains expensive.
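In code, the expensive part is one line. Every historical value row participates in the weighted sum regardless of how cheaply the key-side logits were produced (a sketch, assuming dense values `V` of shape `(T, d)`):

```python
import numpy as np

def mix_values(a, V):
    # a: (T,) attention weights, V: (T, d) historical values.
    # O(T * d) work per decode step: all T rows of V are touched,
    # even when the key-side logits came from a compressed path.
    return a @ V

rng = np.random.default_rng(0)
T, d = 128, 64
a = np.full(T, 1.0 / T)                  # uniform weights, for illustration
o = mix_values(a, rng.standard_normal((T, d)))
```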
That is exactly what happened.
The project shifted from:
Can we compress the cache?
to:
Can we keep the compressed-key path and make historical value participation structurally cheaper?
That question led to the second half of the work: multiple value-path approximations, most of which failed.
What Part 1 Proved
Part 1 is the architecture lesson.
We learned:
- Existing quantized cache backends can be very good at reducing stored cache footprint.
- Stored-cache size is not the same as runtime attention cost.
- Dense eager execution is a serious baseline because it has a simple hot path.
- TurboQuant-style compressed-key attention is a different target from storage-only cache compression.
- The stable compressed-key path was useful enough to keep as an internal baseline.
- The next bottleneck was historical value mixing.
Part 2 is about what happened when we attacked that value path.
It is the more brutal half of the story.
Notes On Scope
These measurements came from one local fork, one benchmark setup, and a small-model-first workflow. The goal was not to claim universal results for every model and GPU.
The goal was to answer a systems question:
Are we actually reducing attention execution cost, or only cache storage?
For the first phase, the answer was clear.
We had reduced storage.
We had not yet won execution.
This distinction is the reason Part 2 focuses on value-path experiments rather than more storage compression. Once the key path had a stable compressed baseline, the remaining bottleneck was not "can we store fewer bytes?" It was "can we mix historical values cheaply enough without breaking quality?"
