I wanted to answer one question:
If I remove eager overhead, can a TurboQuant-style compressed primitive beat dense GPU logits?
The short answer:
it beat eager TurboQuant, but it did not beat dense FP16 logits by enough.
TL;DR
- Exact weighted value decode was mathematically clean, but only improved `value_decode_sec` by about 2.9%.
- A fused packed-codebook logits kernel removed most eager overhead and beat eager TurboQuant main logits by 7x to 18x.
- It still missed the dense FP16 logits gate: the best K0.2 result was 1.56x at 8192 and 1.99x at 16384, where the gate required >=2x.
- The lesson was strict: beating your own eager prototype is not enough. The compressed path must beat the real dense baseline with room left for softmax, values, residuals, and updates.
Evidence
I put the detailed benchmark notes in the public evidence repo:
- Results ledger: https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md
- Exact reference value-decode rewrite: https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/exact-reference-value-decode-summary.md
- Primitive feasibility probe: https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/primitive-feasibility-summary.md
- Fused packed-codebook proof: https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/fused-kernel-proof-summary.md
The question after eager failed
After the eager value-path failure, there were two possible interpretations:
- The compressed-attention idea was weak.
- The eager implementation level was weak.
Those are different claims, so they needed different tests.
The first test was exact cleanup: remove algebraic waste without changing the algorithm. If that produced a large win, I could keep improving the stable path.
It did not.
The second test was a primitive upper bound: remove most eager overhead and ask whether the compressed representation could beat dense GPU logits before adding softmax, values, residuals, and model integration.
That is why the fused proof was intentionally narrow. A narrow upper bound is useful because it can kill a bad integration path before the integration work starts.
The setup:
- the earlier eager value-path experiments failed, ending at a clean stop
- that did not yet prove the compressed-attention idea was bad
- it only proved that my eager implementation shape was weak
So the next question was narrower:
If I remove the obvious eager overhead, can the compressed representation beat dense attention primitives by enough to justify real integration?
The answer was still no.
But the reason is more interesting than "the kernel was slow."
The fused path got much faster than eager TurboQuant. It just did not get fast enough versus dense GPU logits.
First, an exact cleanup
Before kernel work, there was one exact mathematical cleanup to test.
The stable compressed-key baseline had a value path that effectively did this:
decoded_values_t = R^-1(z_t)
o = sum_t a_t decoded_values_t
Here:
- `z_t` is the value representation in rotated space.
- `R^-1` is the inverse rotation.
- `a_t` is the attention weight for token `t`.
- `o` is the weighted value output.
Because R^-1 is linear:
sum_t a_t R^-1(z_t) = R^-1(sum_t a_t z_t)
So I can first compute the weighted sum in rotated space:
z_weighted = sum_t a_t z_t
and then inverse-rotate once:
o = R^-1(z_weighted)
This is not a heuristic.
It is exactly equivalent to the existing codec math, up to normal floating-point accumulation details.
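A minimal PyTorch sketch of that identity, with a random orthogonal matrix standing in for the codec's rotation (the names `R`, `z`, and `attn` are illustrative, not the repo's):

```python
import torch

torch.manual_seed(0)
T, D = 4096, 128                                   # history length, head dim
R = torch.linalg.qr(torch.randn(D, D))[0]          # stand-in orthogonal rotation
z = torch.randn(T, D)                              # value rows in rotated space
attn = torch.softmax(torch.randn(T), dim=0)        # attention weights a_t

# Eager shape: inverse-rotate every historical value row, then weighted-sum.
o_per_token = torch.einsum("t,td->d", attn, z @ R.T)

# Exact cleanup: weighted sum in rotated space, then one inverse rotation.
o_once = (attn @ z) @ R.T

print(torch.allclose(o_per_token, o_once, atol=1e-4))  # True, up to FP accumulation
```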
For a decode step, this changes the rough output-side rotation count from:
H_kv * T
to:
H_kv * G * Q
Where:
- `H_kv` is the number of key/value heads.
- `T` is the historical context length.
- `G` is the number of query groups per KV head.
- `Q` is the number of query positions.
For a single-token decode with long history, that is a large algebraic reduction; with `H_kv = 8`, `G = 8`, `Q = 1`, and 8192 tokens of history, for example, it is 64 output-side rotations instead of 65,536.
It was worth doing.
It was not enough.
The cleanup passed correctness and made the value decode path slightly cleaner, but it did not become a real latency win. In the focused benchmark, `value_decode_sec` improved only about 2.9%.
The lesson:
exact algebraic cleanup is good engineering, but it is not automatically a product-level speedup.
After that, the only serious TurboQuant-family latency question left was a primitive question.
The primitive question
The dense key-logit computation is:
L = Q K^T
For decode, this means comparing the current query against all historical keys.
The compressed-key hope is:
L ~= compressed_logits(Q, codes(K), scales(K), residuals(K))
without materializing full dense historical keys.
This is the part that sounds like the headline TurboQuant promise:
- store fewer key bytes
- compute attention logits from the compressed representation
- avoid dense historical key reads
But the compressed representation has its own costs:
- unpacking low-bit codes
- codebook lookup
- radius or scale multiplication
- residual correction
- query rotation or transformed query math
- extra metadata reads
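To make those costs concrete, here is a rough eager sketch of what "consuming" compressed keys involves, assuming a hypothetical layout of int4 codes packed two per byte, a shared 16-entry scalar codebook, and one scale per token (this is not the repo's exact format, and it assumes a CUDA device):

```python
import torch

def dense_logits(q, k_fp16):
    # Dense baseline: one contiguous FP16 matmul against historical keys.
    return q @ k_fp16.T                                # (G, T)

def compressed_logits(q_rot, packed, codebook, scales):
    # Every step below is work the dense path does not do.
    lo = packed & 0xF                                  # unpack low nibbles
    hi = (packed >> 4) & 0xF                           # unpack high nibbles
    codes = torch.stack([lo, hi], dim=-1).flatten(1)   # (T, D) int4 codes
    keys = codebook[codes.long()]                      # codebook lookup
    keys = keys * scales[:, None]                      # per-token radius / scale
    return q_rot @ keys.T                              # logits from reconstructed keys

G, T, D = 8, 8192, 128                                 # query groups, tokens, head dim
q = torch.randn(G, D, dtype=torch.float16, device="cuda")
k_fp16 = torch.randn(T, D, dtype=torch.float16, device="cuda")
packed = torch.randint(0, 256, (T, D // 2), dtype=torch.uint8, device="cuda")
codebook = torch.linspace(-1, 1, 16, dtype=torch.float16, device="cuda")
scales = torch.rand(T, dtype=torch.float16, device="cuda")
```

Even written this compactly, the compressed path launches several extra kernels and materializes intermediate tensors before it ever reaches the matmul, which is exactly the overhead the dense baseline never pays.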
So the real primitive question was:
Is the compressed representation cheaper to consume than dense FP16 keys?
Not "is it smaller?"
Not "is it faster than my Python/eager version?"
The target was dense GPU logits.
Eager primitive feasibility failed hard
The first primitive feasibility check used synthetic large GQA-style shapes.
It measured whether the current compressed primitive could compete with dense attention/logits at the operation level.
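The operation-level comparison itself follows the usual CUDA-event pattern; a minimal sketch of that kind of timing (not the repo's harness), reusing the illustrative functions above:

```python
import torch

def time_op(fn, iters=200, warmup=20):
    # Milliseconds per call, measured with CUDA events after a warmup.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# dense_ms = time_op(lambda: dense_logits(q, k_fp16))
# comp_ms  = time_op(lambda: compressed_logits(q, packed, codebook, scales))
# print(f"compressed is {comp_ms / dense_ms:.1f}x slower than dense")
```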
The result was not close.
The compressed reference primitive was roughly:
12x to 24x slower than dense
And encode-one-token cost alone was around:
0.65 ms to 0.84 ms
That killed the idea that more eager PyTorch variants would solve the problem.
But it still left one fair objection:
Of course eager PyTorch lost. What if a fused kernel removes the unpack and codebook overhead?
That objection was valid.
So I wrote the fused proof.
The fused K0 proof
The fused proof intentionally started small.
It did not implement full attention.
It did not implement values.
It did not implement residual correction.
It did not integrate into model generation.
It only implemented packed-codebook main logits.
That made it an upper-bound test.
If main logits alone could not beat dense logits by enough, then full attention would not have enough room either.
The kernel did:
- packed int4 unpack
- codebook lookup
- radius multiplication
- dot products against query groups
- output logits
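A heavily simplified Triton sketch of that shape, using the same hypothetical int4-plus-scalar-codebook layout as the eager sketch above (one KV head, whole head dim per tile, zero-padded query groups; this is an illustration of the structure, not the repo's kernel):

```python
import triton
import triton.language as tl

@triton.jit
def fused_main_logits_kernel(
    q_ptr,         # (G_PAD, D) float16 query tile; rows beyond the real G are zeros
    packed_ptr,    # (T, D // 2) uint8 packed int4 key codes
    codebook_ptr,  # (16,) float16 scalar codebook
    scale_ptr,     # (T,) float16 per-token radius / scale
    out_ptr,       # (G_PAD, T) float32 logits
    T, D,
    BLOCK_T: tl.constexpr, BLOCK_D: tl.constexpr, G_PAD: tl.constexpr,
):
    pid = tl.program_id(0)
    t_offs = pid * BLOCK_T + tl.arange(0, BLOCK_T)
    d_offs = tl.arange(0, BLOCK_D)      # assumes BLOCK_D == D
    t_mask = t_offs < T

    # 1) packed int4 unpack: each byte holds two codes (low nibble = even dim).
    byte_offs = t_offs[:, None] * (D // 2) + d_offs[None, :] // 2
    packed = tl.load(packed_ptr + byte_offs, mask=t_mask[:, None], other=0)
    codes = tl.where(d_offs[None, :] % 2 == 0, packed & 0xF, (packed >> 4) & 0xF)

    # 2) codebook lookup and 3) radius multiplication reconstruct approximate keys.
    centers = tl.load(codebook_ptr + codes.to(tl.int32))
    scale = tl.load(scale_ptr + t_offs, mask=t_mask, other=0.0)
    keys = (centers * scale[:, None]).to(tl.float16)           # (BLOCK_T, BLOCK_D)

    # 4) dot products against the padded query-group tile. tl.dot wants tiles of
    #    at least 16, which is where small GQA group counts waste rows as padding.
    g_offs = tl.arange(0, G_PAD)
    q = tl.load(q_ptr + g_offs[:, None] * D + d_offs[None, :])
    logits = tl.dot(q, tl.trans(keys))                         # (G_PAD, BLOCK_T)

    # 5) output logits.
    out_mask = (g_offs[:, None] >= 0) & t_mask[None, :]
    tl.store(out_ptr + g_offs[:, None] * T + t_offs[None, :], logits, mask=out_mask)

# Launch sketch, assuming D = 128 and a zero-padded (16, 128) query tile:
# grid = (triton.cdiv(T, 64),)
# fused_main_logits_kernel[grid](q_pad, packed, codebook, scales, out,
#                                T, 128, BLOCK_T=64, BLOCK_D=128, G_PAD=16)
```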
The benchmark compared:
dense logits
against:
fused packed-codebook main logits
The important target was not eager TurboQuant.
The important target was dense logits.
The kernel worked, but the path still failed
The fused kernel did remove a lot of eager overhead.
On the main-logits operation, it was roughly:
7x to 18x faster than eager TurboQuant main logits
That is a real engineering improvement.
But the dense baseline was extremely strong.
For the llama70b_gqa synthetic profile, the corrected float16 output K0.2 sweep looked like this:
| Block Tokens | Context | Dense Logits | Fused Main Logits | Fused vs Dense | Required | Gate |
|---|---|---|---|---|---|---|
| 32 | 8192 | 0.061 ms | 0.048 ms | 1.27x | >=2.0x | fail |
| 32 | 16384 | 0.093 ms | 0.049 ms | 1.90x | >=2.0x | fail |
| 64 | 8192 | 0.060 ms | 0.039 ms | 1.56x | >=2.0x | fail |
| 64 | 16384 | 0.092 ms | 0.046 ms | 1.99x | >=2.0x | near miss |
| 128 | 8192 | 0.058 ms | 0.108 ms | 0.54x | >=2.0x | fail |
| 256 | 8192 | 0.059 ms | 0.043 ms | 1.38x | >=2.0x | fail |
The best tile was block_tokens=64.
It nearly reached 2x at 16k.
It did not clear the bar at 8k.
And this was still only main logits.
Full attention would add:
- online softmax
- value accumulation
- residual correction
- cache update cost
- model integration overhead
So a near miss on logits-only at 16k was not enough.
The disciplined decision was:
stop the packed-codebook fused-kernel path.
Why dense logits were so hard to beat
Dense FP16 logits have a boring shape, and that is exactly why they are strong.
They use:
- contiguous FP16 data
- regular tensor operations
- optimized GPU paths
- no unpacking
- no codebook lookup
- no token-specific scale reconstruction
The packed-codebook path stores fewer bytes, but each byte is not immediately useful math.
It has to be unpacked and interpreted.
That interpretation cost is the whole fight.
In my kernel, there was another practical issue: query-group padding.
The kernel used `tl.dot`, which wants a tile dimension of at least 16.
For GQA shapes:
- `llama70b_gqa` had `G=8`, so half the tile was padding.
- `llama8b_gqa` had `G=4`, so three quarters of the tile was padding.
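Concretely, the padding looks like this hypothetical sketch (names and shapes are illustrative):

```python
import torch

G, D, G_PAD = 8, 128, 16    # real query groups, head dim, tl.dot's minimum tile rows
q_group = torch.randn(G, D, dtype=torch.float16, device="cuda")

q_tile = torch.zeros(G_PAD, D, dtype=q_group.dtype, device=q_group.device)
q_tile[:G] = q_group        # rows 8..15 are zeros: half the tile is wasted multiply-adds
```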
The lower-level kernel removed Python overhead, but it did not change the fact that the dense baseline had a cleaner GPU execution shape.
What about the TurboQuant claim?
This is where benchmark language matters.
The external TurboQuant claim is not the same as what I was trying to ship.
For context, the public references are:
- Google Research blog: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
- TurboQuant paper: https://arxiv.org/abs/2504.19874
The public TurboQuant framing includes:
- extreme KV/vector compression
- quality retention at low bits
- faster attention-logit computation under a specialized setup
That is not identical to:
drop a cache implementation into Hugging Face generate()
and beat dense FP16/BF16 end-to-end decode
Those are different bars.
In particular:
- attention logits are only one part of full decode
- a primitive result does not include cache update, value path, model layers, or generation overhead
- comparing against FP32 unquantized keys is not the same as comparing against an optimized FP16/BF16 dense path
- H100/JAX-style specialized kernels are not the same environment as a general local `transformers` fork
So this work does not prove:
TurboQuant is wrong.
It proves:
my packed-codebook TurboQuant-family path did not have enough room to become a full decode-latency win in this repo.
That distinction is important.
It is also why the failure was useful.
It stopped me from turning a near-miss primitive into months of residual/value/model integration work.
The reusable lesson
The main lesson from this stage was:
Beating your own unoptimized implementation is not enough.
A compressed path has to beat the real target baseline.
For GPU decode, the real target baseline is dense optimized attention/logits, not the first eager prototype.
The second lesson:
A logits-only win needs enough margin to pay for the rest of attention.
If the best logits-only result barely reaches the required threshold at one context and misses at another, full attention is not going to improve the situation for free.
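As a rough illustration with assumed, not measured, proportions: if main logits were half of the dense attention step and everything else (softmax, values, residuals, cache update) ran at the same speed in both paths, a 2.0x logits-only win would give roughly 1 / (0.5 / 2.0 + 0.5) ≈ 1.33x for the whole step, well short of a 2x end-to-end target.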
Where this left the project
After this stage, the packed-codebook TurboQuant latency path was closed.
Not because compression was fake.
Not because the math was useless.
Because the compressed representation was not cheap enough to consume compared with dense FP16/BF16 GPU math.
That left two possible pivots:
- Try a more hardware-friendly KV representation.
- Stop compressing values and instead reduce dense attention work.
Those became the final experiments.
They also had to pass the same discipline: bytes, speed ceilings, and quality are separate gates.
What I learned
If you only remember one thing from this post, it should be this:
A compressed kernel has to beat the real dense baseline, not just the unoptimized compressed prototype.
The fused logits kernel was useful because it answered that question before I spent time on residual correction, value accumulation, and model integration.
