I wanted to answer one question:
If I remove eager overhead, can a TurboQuant-style compressed primitive beat dense GPU logits?
The short answer:
it beat eager TurboQuant, but it did not beat dense FP16 logits by enough.
TL;DR
- Exact weighted value decode was mathematically clean, but only improved `value_decode_sec` by about 2.9%.
- A fused packed-codebook logits kernel removed most eager overhead and beat eager TurboQuant main logits by 7x to 18x.
- It still missed the dense FP16 logits gate: the best K0.2 result was 1.56x at 8192 and 1.99x at 16384, where the gate required >=2x.
- The lesson was strict: beating your own eager prototype is not enough. The compressed path must beat the real dense baseline with room left for softmax, values, residuals, and updates.
Evidence
I put the detailed benchmark notes in the public evidence repo:
- Results ledger: https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md
- Exact reference value-decode rewrite: https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/exact-reference-value-decode-summary.md
- Primitive feasibility probe: https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/primitive-feasibility-summary.md
- Fused packed-codebook proof: https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/fused-kernel-proof-summary.md
The question after eager failed
After the eager value-path failure, there were two possible interpretations:
- The compressed-attention idea was weak.
- The eager implementation level was weak.
Those are different claims, so they needed different tests.
The first test was exact cleanup: remove algebraic waste without changing the algorithm. If that produced a large win, I could keep improving the stable path.
It did not.
The second test was a primitive upper bound: remove most eager overhead and ask whether the compressed representation could beat dense GPU logits before adding softmax, values, residuals, and model integration.
That is why the fused proof was intentionally narrow. A narrow upper bound is useful because it can kill a bad integration path before the integration work starts.
The setup:
- the earlier eager value-path experiments failed, ending at a clean stop
- that did not yet prove the compressed-attention idea was bad
- it only proved that my eager implementation shape was weak
So the next question was narrower:
If I remove the obvious eager overhead, can the compressed representation beat dense attention primitives by enough to justify real integration?
The answer was still no.
But the reason is more interesting than "the kernel was slow."
The fused path got much faster than eager TurboQuant. It just did not get fast enough versus dense GPU logits.
First, an exact cleanup
Before kernel work, there was one exact mathematical cleanup to test.
The stable compressed-key baseline had a value path that effectively did this:
decoded_values_t = R^-1(z_t)
o = sum_t a_t decoded_values_t
Here:
- `z_t` is the value representation in rotated space.
- `R^-1` is the inverse rotation.
- `a_t` is the attention weight for token `t`.
- `o` is the weighted value output.
Because R^-1 is linear:
sum_t a_t R^-1(z_t) = R^-1(sum_t a_t z_t)
So I can first compute the weighted sum in rotated space:
z_weighted = sum_t a_t z_t
and then inverse-rotate once:
o = R^-1(z_weighted)
This is not a heuristic.
It is exactly equivalent to the existing codec math, up to normal floating-point accumulation details.
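A minimal PyTorch sketch of that identity, with a random orthogonal matrix standing in for the codec's rotation (the names `R`, `z`, and `attn` are illustrative, not the repo's):

```python
import torch

torch.manual_seed(0)
T, D = 4096, 128                                   # history length, head dim
R = torch.linalg.qr(torch.randn(D, D))[0]          # stand-in orthogonal rotation
z = torch.randn(T, D)                              # value rows in rotated space
attn = torch.softmax(torch.randn(T), dim=0)        # attention weights a_t

# Eager shape: inverse-rotate every historical value row, then weighted-sum.
o_per_token = torch.einsum("t,td->d", attn, z @ R.T)

# Exact cleanup: weighted sum in rotated space, then one inverse rotation.
o_once = (attn @ z) @ R.T

print(torch.allclose(o_per_token, o_once, atol=1e-4))  # True, up to FP accumulation
```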
For a decode step, this changes the rough output-side rotation count from:
H_kv * T
to:
H_kv * G * Q
Where:
- `H_kv` is the number of key/value heads.
- `T` is the historical context length.
- `G` is the number of query groups per KV head.
- `Q` is the number of query positions.
For a single-token decode with long history, that is a large algebraic reduction; with `H_kv = 8`, `G = 8`, `Q = 1`, and 8192 tokens of history, for example, it is 64 output-side rotations instead of 65,536.
It was worth doing.
It was not enough.
The cleanup passed correctness and made the value decode path slightly cleaner, but it did not become a real latency win. In the focused benchmark, `value_decode_sec` improved only about 2.9%.
The lesson:
exact algebraic cleanup is good engineering, but it is not automatically a product-level speedup.
After that, the only serious TurboQuant-family latency question left was a primitive question.
The primitive question
The dense key-logit computation is:
L = Q K^T
For decode, this means comparing the current query against all historical keys.
The compressed-key hope is:
L ~= compressed_logits(Q, codes(K), scales(K), residuals(K))
without materializing full dense historical keys.
This is the part that sounds like the headline TurboQuant promise:
- store fewer key bytes
- compute attention logits from the compressed representation
- avoid dense historical key reads
But the compressed representation has its own costs:
- unpacking low-bit codes
- codebook lookup
- radius or scale multiplication
- residual correction
- query rotation or transformed query math
- extra metadata reads
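To make those costs concrete, here is a rough eager sketch of what "consuming" compressed keys involves, assuming a hypothetical layout of int4 codes packed two per byte, a shared 16-entry scalar codebook, and one scale per token (this is not the repo's exact format, and it assumes a CUDA device):

```python
import torch

def dense_logits(q, k_fp16):
    # Dense baseline: one contiguous FP16 matmul against historical keys.
    return q @ k_fp16.T                                # (G, T)

def compressed_logits(q_rot, packed, codebook, scales):
    # Every step below is work the dense path does not do.
    lo = packed & 0xF                                  # unpack low nibbles
    hi = (packed >> 4) & 0xF                           # unpack high nibbles
    codes = torch.stack([lo, hi], dim=-1).flatten(1)   # (T, D) int4 codes
    keys = codebook[codes.long()]                      # codebook lookup
    keys = keys * scales[:, None]                      # per-token radius / scale
    return q_rot @ keys.T                              # logits from reconstructed keys

G, T, D = 8, 8192, 128                                 # query groups, tokens, head dim
q = torch.randn(G, D, dtype=torch.float16, device="cuda")
k_fp16 = torch.randn(T, D, dtype=torch.float16, device="cuda")
packed = torch.randint(0, 256, (T, D // 2), dtype=torch.uint8, device="cuda")
codebook = torch.linspace(-1, 1, 16, dtype=torch.float16, device="cuda")
scales = torch.rand(T, dtype=torch.float16, device="cuda")
```

Even written this compactly, the compressed path launches several extra kernels and materializes intermediate tensors before it ever reaches the matmul, which is exactly the overhead the dense baseline never pays.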
So the real primitive question was:
Is the compressed representation cheaper to consume than dense FP16 keys?
Not "is it smaller?"
Not "is it faster than my Python/eager version?"
The target was dense GPU logits.
Eager primitive feasibility failed hard
The first primitive feasibility check used synthetic large GQA-style shapes.
It measured whether the current compressed primitive could compete with dense attention/logits at the operation level.
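The operation-level comparison itself follows the usual CUDA-event pattern; a minimal sketch of that kind of timing (not the repo's harness), reusing the illustrative functions above:

```python
import torch

def time_op(fn, iters=200, warmup=20):
    # Milliseconds per call, measured with CUDA events after a warmup.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# dense_ms = time_op(lambda: dense_logits(q, k_fp16))
# comp_ms  = time_op(lambda: compressed_logits(q, packed, codebook, scales))
# print(f"compressed is {comp_ms / dense_ms:.1f}x slower than dense")
```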
The result was not close.
The compressed reference primitive was roughly:
12x to 24x slower than dense
And encode-one-token cost alone was around:
0.65 ms to 0.84 ms
That killed the idea that more eager PyTorch variants would solve the problem.
But it still left one fair objection:
Of course eager PyTorch lost. What if a fused kernel removes the unpack and codebook overhead?
That objection was valid.
So I wrote the fused proof.
The fused K0 proof
The fused proof intentionally started small.
It did not implement full attention.
It did not implement values.
It did not implement residual correction.
It did not integrate into model generation.
It only implemented packed-codebook main logits.
That made it an upper-bound test.
If main logits alone could not beat dense logits by enough, then full attention would not have enough room either.
The kernel did:
- packed int4 unpack
- codebook lookup
- radius multiplication
- dot products against query groups
- output logits
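A heavily simplified Triton sketch of that shape, using the same hypothetical int4-plus-scalar-codebook layout as the eager sketch above (one KV head, whole head dim per tile, zero-padded query groups; this is an illustration of the structure, not the repo's kernel):

```python
import triton
import triton.language as tl

@triton.jit
def fused_main_logits_kernel(
    q_ptr,         # (G_PAD, D) float16 query tile; rows beyond the real G are zeros
    packed_ptr,    # (T, D // 2) uint8 packed int4 key codes
    codebook_ptr,  # (16,) float16 scalar codebook
    scale_ptr,     # (T,) float16 per-token radius / scale
    out_ptr,       # (G_PAD, T) float32 logits
    T, D,
    BLOCK_T: tl.constexpr, BLOCK_D: tl.constexpr, G_PAD: tl.constexpr,
):
    pid = tl.program_id(0)
    t_offs = pid * BLOCK_T + tl.arange(0, BLOCK_T)
    d_offs = tl.arange(0, BLOCK_D)      # assumes BLOCK_D == D
    t_mask = t_offs < T

    # 1) packed int4 unpack: each byte holds two codes (low nibble = even dim).
    byte_offs = t_offs[:, None] * (D // 2) + d_offs[None, :] // 2
    packed = tl.load(packed_ptr + byte_offs, mask=t_mask[:, None], other=0)
    codes = tl.where(d_offs[None, :] % 2 == 0, packed & 0xF, (packed >> 4) & 0xF)

    # 2) codebook lookup and 3) radius multiplication reconstruct approximate keys.
    centers = tl.load(codebook_ptr + codes.to(tl.int32))
    scale = tl.load(scale_ptr + t_offs, mask=t_mask, other=0.0)
    keys = (centers * scale[:, None]).to(tl.float16)           # (BLOCK_T, BLOCK_D)

    # 4) dot products against the padded query-group tile. tl.dot wants tiles of
    #    at least 16, which is where small GQA group counts waste rows as padding.
    g_offs = tl.arange(0, G_PAD)
    q = tl.load(q_ptr + g_offs[:, None] * D + d_offs[None, :])
    logits = tl.dot(q, tl.trans(keys))                         # (G_PAD, BLOCK_T)

    # 5) output logits.
    out_mask = (g_offs[:, None] >= 0) & t_mask[None, :]
    tl.store(out_ptr + g_offs[:, None] * T + t_offs[None, :], logits, mask=out_mask)

# Launch sketch, assuming D = 128 and a zero-padded (16, 128) query tile:
# grid = (triton.cdiv(T, 64),)
# fused_main_logits_kernel[grid](q_pad, packed, codebook, scales, out,
#                                T, 128, BLOCK_T=64, BLOCK_D=128, G_PAD=16)
```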
The benchmark compared:
dense logits
against:
fused packed-codebook main logits
The important target was not eager TurboQuant.
The important target was dense logits.
The kernel worked, but the path still failed
The fused kernel did remove a lot of eager overhead.
On the main-logits operation, it was roughly:
7x to 18x faster than eager TurboQuant main logits
That is a real engineering improvement.
But the dense baseline was extremely strong.
For the llama70b_gqa synthetic profile, the corrected float16 output K0.2 sweep looked like this:
| Block Tokens | Context | Dense Logits | Fused Main Logits | Fused vs Dense | Required | Gate |
|---|---|---|---|---|---|---|
| 32 | 8192 | 0.061 ms | 0.048 ms | 1.27x | >=2.0x | fail |
| 32 | 16384 | 0.093 ms | 0.049 ms | 1.90x | >=2.0x | fail |
| 64 | 8192 | 0.060 ms | 0.039 ms | 1.56x | >=2.0x | fail |
| 64 | 16384 | 0.092 ms | 0.046 ms | 1.99x | >=2.0x | near miss |
| 128 | 8192 | 0.058 ms | 0.108 ms | 0.54x | >=2.0x | fail |
| 256 | 8192 | 0.059 ms | 0.043 ms | 1.38x | >=2.0x | fail |
The best tile was block_tokens=64.
It nearly reached 2x at 16k.
It did not clear the bar at 8k.
And this was still only main logits.
Full attention would add:
- online softmax
- value accumulation
- residual correction
- cache update cost
- model integration overhead
So a near miss on logits-only at 16k was not enough.
The disciplined decision was:
stop the packed-codebook fused-kernel path.
Why dense logits were so hard to beat
Dense FP16 logits have a boring shape, and that is exactly why they are strong.
They use:
- contiguous FP16 data
- regular tensor operations
- optimized GPU paths
- no unpacking
- no codebook lookup
- no token-specific scale reconstruction
The packed-codebook path stores fewer bytes, but each byte is not immediately useful math.
It has to be unpacked and interpreted.
That interpretation cost is the whole fight.
In my kernel, there was another practical issue: query-group padding.
The kernel used `tl.dot`, which wants a tile dimension of at least 16.
For GQA shapes:
- `llama70b_gqa` had `G=8`, so half the tile was padding.
- `llama8b_gqa` had `G=4`, so three quarters of the tile was padding.
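Concretely, the padding looks like this hypothetical sketch (names and shapes are illustrative):

```python
import torch

G, D, G_PAD = 8, 128, 16    # real query groups, head dim, tl.dot's minimum tile rows
q_group = torch.randn(G, D, dtype=torch.float16, device="cuda")

q_tile = torch.zeros(G_PAD, D, dtype=q_group.dtype, device=q_group.device)
q_tile[:G] = q_group        # rows 8..15 are zeros: half the tile is wasted multiply-adds
```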
The lower-level kernel removed Python overhead, but it did not change the fact that the dense baseline had a cleaner GPU execution shape.
What about the TurboQuant claim?
This is where benchmark language matters.
The external TurboQuant claim is not the same as what I was trying to ship.
For context, the public references are:
- Google Research blog: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
- TurboQuant paper: https://arxiv.org/abs/2504.19874
The public TurboQuant framing includes:
- extreme KV/vector compression
- quality retention at low bits
- faster attention-logit computation under a specialized setup
That is not identical to:
drop a cache implementation into Hugging Face generate()
and beat dense FP16/BF16 end-to-end decode
Those are different bars.
In particular:
- attention logits are only one part of full decode
- a primitive result does not include cache update, value path, model layers, or generation overhead
- comparing against FP32 unquantized keys is not the same as comparing against an optimized FP16/BF16 dense path
- H100/JAX-style specialized kernels are not the same environment as a general local `transformers` fork
So this work does not prove:
TurboQuant is wrong.
It proves:
my packed-codebook TurboQuant-family path did not have enough room to become a full decode-latency win in this repo.
That distinction is important.
It is also why the failure was useful.
It stopped me from turning a near-miss primitive into months of residual/value/model integration work.
The reusable lesson
The main lesson from this stage was:
Beating your own unoptimized implementation is not enough.
A compressed path has to beat the real target baseline.
For GPU decode, the real target baseline is dense optimized attention/logits, not the first eager prototype.
The second lesson:
A logits-only win needs enough margin to pay for the rest of attention.
If the best logits-only result barely reaches the required threshold at one context and misses at another, full attention is not going to improve the situation for free.
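As a rough illustration with assumed, not measured, proportions: if main logits were half of the dense attention step and everything else (softmax, values, residuals, cache update) ran at the same speed in both paths, a 2.0x logits-only win would give roughly 1 / (0.5 / 2.0 + 0.5) ≈ 1.33x for the whole step, well short of a 2x end-to-end target.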
Where this left the project
After this stage, the packed-codebook TurboQuant latency path was closed.
Not because compression was fake.
Not because the math was useless.
Because the compressed representation was not cheap enough to consume compared with dense FP16/BF16 GPU math.
That left two possible pivots:
- Try a more hardware-friendly KV representation.
- Stop compressing values and instead reduce dense attention work.
Those became the final experiments.
They also had to pass the same discipline: bytes, speed ceilings, and quality are separate gates.
What I learned
If you only remember one thing from this post, it should be this:
A compressed kernel has to beat the real dense baseline, not just the unoptimized compressed prototype.
The fused logits kernel was useful because it answered that question before I spent time on residual correction, value accumulation, and model integration.
