Alankrit Verma

A Smaller KV Cache Did Not Make Transformers Faster

Long-context generation makes the KV cache hard to ignore.

Every generated token reuses keys and values from previous tokens. As the context grows, those cached tensors grow with it. So the natural first idea is simple:

Compress the KV cache, store fewer bytes, and get faster generation.

We tested that idea while exploring TurboQuant-style cache compression in a Hugging Face transformers fork.

Important scope note:

This is not a claim that the official TurboQuant research idea "does not work."

What we tested was narrower:

Can we make a TurboQuant-style compressed-attention path useful inside a local eager transformers implementation?

The first major result was not that a particular backend won.

It was this:

Storage compression and attention execution are different problems.

A cache can become dramatically smaller while generation gets slower.

That single distinction changed the rest of the project.

The Mental Model

In decoder-only generation, each new token uses cached keys and values from previous tokens.

Simplified for one attention head:

```
a = softmax(q K^T)
o = a V
```

Where:

  • q is the current query.
  • K is the historical key cache.
  • V is the historical value cache.
  • a is the attention distribution.
  • o is the output contribution from history.

Keys decide where to attend. Values provide the information that gets mixed.
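
To make the shapes concrete, here is a minimal single-head decode step in PyTorch. The names and sizes are ours for illustration, and the `1/sqrt(d)` scaling is the usual factor omitted above for brevity:

```python
import torch

d, T = 64, 2048              # head dimension, tokens already in history

q = torch.randn(d)           # current query
K = torch.randn(T, d)        # historical key cache
V = torch.randn(T, d)        # historical value cache

a = torch.softmax(K @ q / d**0.5, dim=0)   # attention distribution over history
o = a @ V                                  # output contribution from history
```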

When context length grows, both K and V grow.

So compression can target at least two different things:

  1. Store the cache in fewer bytes.
  2. Execute attention without reconstructing dense historical tensors.

Those sound related. In practice, they are different engineering targets.

The First Measurement

We started with existing cache behavior in transformers.

The baselines were:

  • DynamicCache: dense eager execution.
  • quanto: a strong storage-compression baseline.
  • hqq: another quantized-cache baseline.

The benchmark below used HuggingFaceTB/SmolLM2-135M-Instruct in a generation case with roughly 2048 tokens of context.

We measured more than just stored bytes:

  • generation latency
  • stored cache footprint
  • cache bytes per token
  • sampled runtime memory
  • whether generated outputs matched the dense baseline in simple cases

| Backend | What It Represents | Mean Latency | Cache Footprint | Cache Bytes / Token | Peak Runtime Memory Delta |
| --- | --- | --- | --- | --- | --- |
| dynamic | dense eager baseline | 2.250 s | 50.911 MiB | 23040.0 | 0.102 GB |
| quanto | strong storage-compression baseline | 3.912 s | 0.913 MiB | 413.3 | 0.048 GB |
| hqq | alternative quantized-cache baseline | 9.770 s | 19.133 MiB | 8658.6 | 0.040 GB |
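
As a sanity check on the dense row: SmolLM2-135M uses 30 layers, 3 KV heads, and a head dimension of 64, so an fp16 cache stores 2 tensors (K and V) × 30 layers × 3 KV heads × 64 dims × 2 bytes = 23040 bytes per token, which matches the table. For reproducibility, here is a minimal sketch of how the quanto comparison can be run with the public transformers quantized-cache API; exact flags and versions may differ from our fork, and the short prompt stands in for the real ~2048-token context:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Summarize the history of attention mechanisms."
inputs = tok(prompt, return_tensors="pt").to(model.device)

def timed_generate(**cache_kwargs):
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=2048, do_sample=False, **cache_kwargs)
    return time.perf_counter() - start

dense_s = timed_generate()  # DynamicCache is the default backend
quanto_s = timed_generate(  # requires optimum-quanto to be installed
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(f"dense: {dense_s:.3f}s  quanto: {quanto_s:.3f}s")
```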

The important row is quanto.

It reduced stored cache footprint from:

```
50.911 MiB -> 0.913 MiB
```

That is an excellent cache-size result.

But latency went from:

```
2.250 s -> 3.912 s
```

So cache storage got much smaller, while generation got slower.

That is not a paradox. It tells us what the backend is optimizing.

Why Smaller Storage Did Not Mean Faster Attention

The current generic quantized-cache shape in transformers is roughly:

  1. Produce new dense keys and values.
  2. Quantize them for storage.
  3. Keep compressed tensors in the cache.
  4. Later dequantize cached tensors.
  5. Return dense keys and values to normal attention.
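
In sketch form, with hypothetical quantize/dequantize helpers standing in for a real backend such as quanto:

```python
import torch

def quantize(x):
    """Stand-in for a real backend's quantizer; int8 absmax, for illustration only."""
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    return (x / scale).round().clamp(-127, 127).to(torch.int8), scale

def dequantize(x_q, scale):
    return x_q.to(torch.float32) * scale

# Steps 1-3: new dense keys arrive, get quantized, and only the compressed
# form is kept in the cache.
k_new = torch.randn(1, 8, 1, 64)        # (batch, heads, 1 new token, head_dim)
cache_k = [quantize(k_new)]             # compressed storage

# Steps 4-5: every decode step dequantizes the whole history back to dense
# tensors, and normal dense attention consumes the result.
K_dense = torch.cat([dequantize(x_q, s) for x_q, s in cache_k], dim=2)
```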

So the attention implementation still consumes dense tensors.

That means the architecture is:

compressed storage + dense execution

not:

compressed attention

Storage compression versus execution compression

The first design can save cache bytes.

The second design is needed if the goal is to make attention itself faster.

This distinction became the first real output of the project.

Why We Still Looked At TurboQuant-Style Compressed Attention

TurboQuant-style work was interesting because the bigger promise is not simply "store the KV cache with fewer bits."

The stronger target is:

  • store historical keys in a compressed representation
  • compute attention logits using that compressed representation
  • avoid reconstructing every dense historical key each decode step

The ordinary dense key path computes:

```
logits_t = q . k_t
```

for every historical token t.

The compressed-key target is closer to:

```
logits_t ~= compressed_dot(q, code(k_t), residual(k_t))
```

without materializing every full k_t.
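
A toy version of that target, using a hypothetical shared codebook plus a per-token residual (illustrative structure only, not TurboQuant's actual scheme):

```python
import torch

d, T, n_codes = 64, 2048, 256

codebook = torch.randn(n_codes, d)         # shared codebook of key prototypes
codes = torch.randint(0, n_codes, (T,))    # one code id per historical key
residuals = torch.randn(T, d) * 0.1        # small corrections; low-bit in practice

q = torch.randn(d)

# One pass over the codebook instead of T dense keys
q_dot_codebook = codebook @ q              # (n_codes,)

# Approximate logits: cheap lookup plus a residual correction term
logits = q_dot_codebook[codes] + residuals @ q   # (T,)
```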

That is an execution-path change.

It requires a different shape than a normal storage-only QuantizedCache backend.

That is why the project became less about "add another cache backend" and more about "change what attention actually consumes."

The Stable Compressed-Key Baseline

We built a stable compressed-key baseline to test that direction.

Internally, we called it reference. For a public reader, the better name is:

the stable compressed-key baseline

Its job was not to be the final optimized system. Its job was to prove that an end-to-end compressed-key attention path could exist in a Llama-style eager stack and provide a consistent comparison point for later experiments.

It kept:

  • compressed historical keys
  • compressed-key attention-logit computation
  • residual correction behavior
  • a full value path so correctness and fidelity stayed interpretable
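
In rough structural terms, the per-layer state looked something like this (all names hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class CompressedKeyLayerState:
    """Sketch of the stable compressed-key baseline's per-layer cache."""
    key_codes: list = field(default_factory=list)      # compressed key representation
    key_residuals: list = field(default_factory=list)  # residual correction terms
    values: list = field(default_factory=list)         # full dense values, kept for fidelity

    def append(self, k_code, k_residual, v_dense):
        self.key_codes.append(k_code)
        self.key_residuals.append(k_residual)
        self.values.append(v_dense)                    # value path deliberately left dense
```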

That baseline survived the project better than the later value-path experiments.

The key lesson was:

The compressed-key path was not where most failures came from.

The failures came from values.

We also saw some directional evidence that compressed-key work might become more interesting as model and context sizes grow. But that evidence was not clean enough to be the headline result. The safe claim was narrower:

keep the compressed-key baseline as an internal anchor, but do not call it the final system.

Why Values Became The Hard Part

Attention has two major pieces:

  1. Compute attention weights from keys.
  2. Mix values using those weights.

Even if keys are compressed, the output still requires:

```
o = sum_t a_t v_t
```

If the implementation still reconstructs or processes values across most of history, the value path remains expensive.
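
In code terms: even if the weights came from compressed keys, this step still walks every dense historical value.

```python
import torch

T, d = 2048, 64
a = torch.softmax(torch.randn(T), dim=0)   # weights, however the keys produced them
V = torch.randn(T, d)                      # dense historical values

o = a @ V   # O(T * d) work per head, per decode step, regardless of key compression
```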

That is exactly what happened.

The project shifted from:

Can we compress the cache?

to:

Can we keep the compressed-key path and make historical value participation structurally cheaper?

That question led to the second half of the work: multiple value-path approximations, most of which failed.

What Part 1 Proved

Part 1 is the architecture lesson.

We learned:

  • Existing quantized cache backends can be very good at reducing stored cache footprint.
  • Stored-cache size is not the same as runtime attention cost.
  • Dense eager execution is a serious baseline because it has a simple hot path.
  • TurboQuant-style compressed-key attention is a different target from storage-only cache compression.
  • The stable compressed-key path was useful enough to keep as an internal baseline.
  • The next bottleneck was historical value mixing.

Part 2 is about what happened when we attacked that value path.

It is the more brutal half of the story.

Notes On Scope

These measurements came from one local fork, one benchmark setup, and a small-model-first workflow. The goal was not to claim universal results for every model and GPU.

The goal was to answer a systems question:

Are we actually reducing attention execution cost, or only cache storage?

For the first phase, the answer was clear.

We had reduced storage.

We had not yet won execution.

This distinction is the reason Part 2 focuses on value-path experiments rather than more storage compression. Once the key path had a stable compressed baseline, the remaining bottleneck was not "can we store fewer bytes?" It was "can we mix historical values cheaply enough without breaking quality?"
