I wanted to answer one question:
After packed-codebook TurboQuant failed, was there still a credible latency path?
The short answer:
there was a real speed ceiling, but no stable quality-preserving implementation path.
TL;DR
- Hardware-friendly int4 K/V passed byte gates but failed real-KV logit quality.
- Qwen2.5-7B work reduction had a real speed ceiling:
p_attn=0.334, with 1.20x to 1.21x projected speedup at 5% selector overhead.
- Oracle quality failed anyway: no implementable selector passed all 4 decode steps.
- The lesson was strict: a speed ceiling is only permission to run a quality gate, not permission to implement.
Evidence
I put the detailed benchmark notes in the public evidence repo:
- Results ledger: https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md
- Hardware-friendly int4 K/V quality probe: https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/hfkv-quality-k0-prep-summary.md
- Work-reduction speed and oracle-quality probe: https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/work-reduction-oracle-summary.md
The rule after the pivot
At this point, another TurboQuant variant would have been circular.
So the rule changed:
no implementation before a speed ceiling and an oracle-quality gate pass on the same target.
That rule matters because each partial result can otherwise be overinterpreted:
- byte compression can pass while quality fails
- synthetic quality can pass while real-KV quality fails
- attention-only speed can pass while full decode speed cannot move enough
- a speed ceiling can pass while no stable selector exists
- a row-level oracle pass can hide step-to-step instability
For me, this became the gate-discipline part of the series: deciding when not to build.
The setup:
- storage compression had already separated itself from latency
- eager value-path approximations had failed
- fused packed-codebook logits had not beaten dense logits by enough
By this point, the original packed-codebook TurboQuant latency path was closed.
The evidence was not subtle:
- eager value-path variants failed
- exact cleanup did not move latency enough
- primitive feasibility failed badly
- fused packed-codebook logits beat eager but missed dense-speed bars
So the next move could not be:
one more TurboQuant variant
It had to change the hypothesis.
I tested two pivots:
- hardware-friendly int4 K/V
- long-context work reduction
Both taught useful things.
Neither justified a runtime latency implementation.
Pivot 1: hardware-friendly int4 K/V
The packed-codebook representation was expensive to consume.
So the next idea was simpler:
What if the representation is less clever but more hardware-friendly?
Instead of codebooks, rotations, and residual machinery, use blockwise int4 K/V.
Two formats were tested:
- symmetric int4
- affine int4
Both quantized over last-dimension blocks with block_size=32.
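As a sketch of what those two formats mean in practice (illustrative code, not the probe's actual implementation; the function name is mine):

```python
import torch

def int4_blockwise(x: torch.Tensor, block_size: int = 32, symmetric: bool = True):
    """Quantize-dequantize round trip over last-dimension blocks.

    Assumes the last dimension is divisible by block_size.
    """
    xb = x.reshape(-1, block_size)
    if symmetric:
        # Symmetric int4: per-block scale, zero point fixed at 0, range [-8, 7].
        scale = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
        q = torch.clamp(torch.round(xb / scale), -8, 7)
        deq = q * scale
    else:
        # Affine int4: per-block scale and zero point, range [0, 15].
        lo = xb.amin(dim=-1, keepdim=True)
        hi = xb.amax(dim=-1, keepdim=True)
        scale = (hi - lo).clamp(min=1e-8) / 15.0
        q = torch.clamp(torch.round((xb - lo) / scale), 0, 15)
        deq = q * scale + lo
    return deq.reshape(x.shape)
```

The appeal is visible in the code: one scale (plus, for affine, one zero point) per 32 values, and nothing else to fetch.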
The hope was:
- simpler unpack/dequant path
- predictable memory layout
- fewer exotic operations
- easier future kernel
This was not integrated into the public cache API.
It was a quality and K0-prep probe only.
The rule was strict:
if real model K/V quality fails, do not write the kernel.
HFKV passed bytes and failed quality
On the real-KV check with HuggingFaceTB/SmolLM2-135M-Instruct, both formats compressed KV substantially.
They also preserved next-token argmax.
But both failed the hard decode-logit MSE gate.
| Format | Top-k@10 | Argmax | Decode-Logit MSE | Required MSE | KV Ratio | Gate |
|---|---|---|---|---|---|---|
| symmetric int4 | 0.800 | yes | 1.739284 | <=0.25 | 3.56x | fail |
| affine int4 | 0.800 | yes | 1.360282 | <=0.25 | 3.20x | fail |
The tempting interpretation would be:
argmax survived, so maybe this is fine.
That is not a good enough bar.
Top-k overlap only exactly met the minimum bar (0.800), and logit MSE was far outside the target.
The correct decision was:
do not build HFKV-K0.
The reusable lesson:
byte compression is not quality.
The synthetic K0-prep numbers looked fine, but synthetic random tensors were not predictive enough. Real model K/V was the gate, and it failed.
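For reference, the shape of the decision rule in that gate is simple enough to write down. This is a sketch, assuming the dense and quantized decode logits for a step are already in hand; the helper name and return format are mine:

```python
import torch

def decode_logit_gate(dense: torch.Tensor, quant: torch.Tensor,
                      mse_gate: float = 0.25, k: int = 10) -> dict:
    # dense, quant: 1-D logits for the same decode step.
    mse = torch.mean((dense - quant) ** 2).item()
    top_d = set(dense.topk(k).indices.tolist())
    top_q = set(quant.topk(k).indices.tolist())
    overlap = len(top_d & top_q) / k
    argmax_ok = dense.argmax().item() == quant.argmax().item()
    return {
        "mse": mse,
        "topk_overlap": overlap,
        "argmax": argmax_ok,
        # All three must hold; argmax alone is not the bar.
        "pass": mse <= mse_gate and overlap >= 0.8 and argmax_ok,
    }
```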
Pivot 2: work reduction
After that, I stopped asking:
can I compress the historical values?
and asked:
does the model actually need all historical tokens for stable decode?
This is a different latency hypothesis.
It is not cache compression.
It is dense-attention work reduction.
The idea is:
full attention over history
becomes:
attention over a selected subset of history
If only a fraction f of historical tokens are active, the idealized attention work shrinks.
But this only matters if attention is a large enough part of full decode.
So the first gate was a speed ceiling.
Speed ceiling math
Let:
- p_attn be the fraction of decode time spent in attention
- f be the active historical fraction
- h be selector/masking overhead as a fraction of the original decode step
The rough projected speedup is:
speedup = 1 / ((1 - p_attn) + p_attn * f + h)
This equation is intentionally simple.
It asks:
even if the selector existed, is there enough attention work to remove?
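In code, the whole ceiling check is a few lines (the small-model p_attn below is illustrative, not a measured number):

```python
def projected_speedup(p_attn: float, f: float, h: float) -> float:
    # p_attn: fraction of the dense decode step spent in attention
    # f:      active historical fraction
    # h:      selector/masking overhead as a fraction of the dense step
    return 1.0 / ((1.0 - p_attn) + p_attn * f + h)

print(projected_speedup(0.15, 0.30, 0.0))     # ~1.12x: small-model-like share (illustrative)
print(projected_speedup(0.334, 0.337, 0.05))  # ~1.21x: the Qwen2.5-7B numbers below
```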
On small models, the answer was mostly no.
For SmolLM2-135M, attention was not a large enough share of full decode. The quality signal was real, but the latency ceiling was too low.
So I moved to a larger real target:
Qwen/Qwen2.5-7B-Instruct
at roughly:
8192 prompt tokens
Qwen had a real speed ceiling
The Qwen2.5-7B speed-ceiling result was the strongest latency signal in the whole project.
In bfloat16, the dense decode step was about:
20.447 ms
The SDPA real-model projection estimated:
p_attn = 0.334
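For context, a dense decode-step number like 20.447 ms typically comes from CUDA-event timing around a single-token forward with a warm, prefilled cache. A minimal sketch of that technique, not the project's actual harness (which is documented in the evidence repo):

```python
import torch

def time_decode_step(model, next_id, past, iters=50, warmup=5):
    # next_id: a (1, 1) token tensor; past: a prefilled KV cache at full context.
    # The cache grows by one token per call, which is negligible at 8192 context.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(warmup):
            model(next_id, past_key_values=past, use_cache=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(next_id, past_key_values=past, use_cache=True)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per decode step
```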
Projected speedups:
| Active Fraction | Projected Speedup, 0% Overhead | Projected Speedup, 5% Overhead | Projected Speedup, 10% Overhead | Gate |
|---|---|---|---|---|
| 0.337 | 1.28x | 1.21x | 1.14x | pass at 5% |
| 0.350 | 1.28x | 1.20x | 1.13x | pass at 5% |
| 0.376 | 1.26x | 1.19x | 1.12x | near miss |
This was not a fake result.
There was real room.
But speed ceiling is only half the story.
The next question was:
can an implementable selector preserve quality while keeping only about 34-38% of history?
Oracle quality failed
The official quality gate used:
- model: Qwen/Qwen2.5-7B-Instruct
- context: 8192
- decode steps: 4
- dtype: bfloat16
- active fraction gate: 0.376
The dense reference was finite on all steps.
So this was a valid quality run.
The headline:
15 selector-step rows passed,
but 0 selector configurations passed all 4 decode steps.
That distinction matters.
A row-level pass says:
this selector worked on this step.
An implementation pass needs:
this selector family worked consistently across decode steps.
No implementable selector did.
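The aggregation behind that distinction is trivial, which is exactly why it is easy to skip. A sketch (the record format here is hypothetical):

```python
from collections import defaultdict

def selectors_passing_all_steps(rows, n_steps=4):
    # rows: iterable of (selector_name, step_index, passed_bool) records.
    passed = defaultdict(set)
    for selector, step, ok in rows:
        if ok:
            passed[selector].add(step)
    # A selector only counts as implementable if every decode step passed.
    return [s for s, steps in passed.items() if len(steps) == n_steps]
```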
| Selector | Active Hist Fraction | Passed Steps | Failed Steps | Max MSE | Main Failure |
|---|---|---|---|---|---|
| global_block_mass:f=0.3760:b=16 | 0.374 | 0,1,3 | 2 | 0.585789 | step 2 MSE |
| global_block_mass:f=0.3500:b=16 | 0.348 | 0,3 | 1,2 | 0.900147 | steps 1,2 MSE |
| global_block_mass:f=0.3370:b=16 | 0.335 | 0,2,3 | 1 | 0.919634 | step 1 MSE |
| recent_sink:sink=4:recent=3072 | 0.375 | none | 0,1,2,3 | 0.163465 | layer-local relative L2 |
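My reading of the global_block_mass label, as a sketch only (the real selector may differ in detail; this is just to make the family concrete): group history into blocks of b tokens, rank blocks by total attention mass, and keep blocks until the active fraction reaches f.

```python
import torch

def global_block_mass_mask(attn_mass: torch.Tensor, f: float = 0.376, b: int = 16):
    # attn_mass: per-history-token attention mass, shape (n,); n assumed divisible by b.
    n = attn_mass.numel()
    block_mass = attn_mass.reshape(-1, b).sum(dim=-1)
    order = block_mass.argsort(descending=True)
    n_blocks_kept = max(1, int(f * n) // b)
    keep = torch.zeros(block_mass.numel(), dtype=torch.bool)
    keep[order[:n_blocks_kept]] = True
    return keep.repeat_interleave(b)  # boolean mask over history tokens
```

Note that this needs the attention mass it is supposed to save you from computing, which is why it is an oracle, not an implementation.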
The most tempting result was recent_sink:sink=4:recent=3072.
It had:
- max decode-logit MSE of 0.163465
- top-k overlap of at least 0.8
- stable argmax

But it failed every step on the layer-local median post-o_proj relative L2 gate:
0.1394 > 0.10
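The recent_sink rule itself is fully implementable and cheap, which is what made it tempting. A sketch, with parameters following the selector label:

```python
import torch

def recent_sink_mask(history_len: int, sink: int = 4, recent: int = 3072) -> torch.Tensor:
    # Keep the first `sink` tokens plus the last `recent` tokens of history.
    mask = torch.zeros(history_len, dtype=torch.bool)
    mask[:sink] = True
    mask[max(0, history_len - recent):] = True
    return mask

print(recent_sink_mask(8192).float().mean())  # ~0.375 active fraction at 8192 tokens
```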
At this point, the dangerous move would be:
relax the quality gate because the result is close.
That is how projects go in circles.
The gate existed before the result.
The result failed the gate.
So the decision was:
do not build the runtime selector.
What survived
The final result is not:
everything was useless.
The surviving lessons are more precise.
Dense GPU decode is a serious baseline
Dense attention is not naive.
It has:
- simple data layout
- optimized kernels
- clean tensor operations
- no unpack/reconstruct overhead
Any compressed path has to beat that, not just beat its own prototype.
Memory and latency are separate scorecards
Cache compression can be valuable even if it does not reduce latency.
The right memory/capacity metrics are:
- cache bytes per token
- maximum context before OOM
- batch size at fixed VRAM
- throughput under memory pressure
- quality at long context
That is a different product goal from:
faster decode when dense already fits
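The first metric on that scorecard, cache bytes per token, is pure arithmetic. Worked here for Qwen2.5-7B-Instruct in bfloat16, using its published config (28 layers, 4 KV heads via GQA, head dim 128):

```python
# K and V, per layer, per KV head, per head dimension, 2 bytes each in bf16.
n_layers, n_kv_heads, head_dim, elem_bytes = 28, 4, 128, 2
per_token = 2 * n_layers * n_kv_heads * head_dim * elem_bytes
print(per_token)                 # 57344 bytes = 56 KiB per token
print(per_token * 8192 / 2**20)  # 448.0 MiB for the 8192-token context
```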
Speed ceiling is necessary but not sufficient
Qwen2.5-7B proved there can be enough attention share for work reduction to matter.
But the selector also has to preserve quality.
The oracle could not find one stable implementable selector under the hard gate.
Paper claims and integration claims are different
A paper can make a valid primitive or memory claim.
A transformers integration needs a different proof:
- full decode timing
- real model quality
- update cost
- value path
- generation overhead
- target hardware
- target dtype baseline
Those are not interchangeable.
Final state
The current honest state is:
No active GPU decode-latency implementation path.
Closed as latency paths:
- eager TurboQuant-family variants
- packed-codebook fused K1/residual/value integration
- hardware-friendly int4 K/V kernel work
- Qwen work-reduction selector implementation
Still potentially useful:
- storage/capacity compression
- exact cleanup as baseline hygiene
- the measurement discipline
- the failure map
The next reasonable project, if the goal continues, is not:
one more latency variant
It is:
a separate memory/capacity plan
with its own scorecard.
Closing
The whole project started with a simple hope:
smaller KV cache, faster transformers.
The actual lesson was harder:
compression only speeds up decode if the compressed representation is cheap to consume on the target execution path.
That condition failed repeatedly:
- in eager value approximations
- in packed-codebook primitive timing
- in fused logits upper bounds
- in simple int4 K/V quality
- in long-context work reduction quality
That is not wasted work.
It is a map of where the obvious traps are.
And for performance engineering, a good negative map is often the thing that prevents the next six months of bad work.
The final lesson:
A speed ceiling is only permission to run a quality gate. It is not permission to implement.
Qwen2.5-7B had enough attention share for work reduction to matter. The oracle still failed to find one stable implementable selector. That is why the latency path stopped.
