I wanted to answer one question:
After packed-codebook TurboQuant failed, was there still a credible latency path?
The short answer:
there was a real speed ceiling, but no stable quality-preserving implementation path.
TL;DR
- Hardware-friendly int4 K/V passed byte gates but failed real-KV logit quality.
- Qwen2.5-7B work reduction had a real speed ceiling:
p_attn=0.334, with 1.20x to 1.21x projected speedup at 5% selector overhead.
- Oracle quality failed anyway: no implementable selector passed all 4 decode steps.
- The lesson was strict: a speed ceiling is only permission to run a quality gate, not permission to implement.
Evidence
I put the detailed benchmark notes in the public evidence repo:
- Results ledger: https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md
- Hardware-friendly int4 K/V quality probe: https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/hfkv-quality-k0-prep-summary.md
- Work-reduction speed and oracle-quality probe: https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/work-reduction-oracle-summary.md
The rule after the pivot
At this point, another TurboQuant variant would have been circular.
So the rule changed:
no implementation before a speed ceiling and an oracle-quality gate pass on the same target.
That rule matters because each partial result can otherwise be overinterpreted:
- byte compression can pass while quality fails
- synthetic quality can pass while real-KV quality fails
- attention-only speed can pass while full decode speed cannot move enough
- a speed ceiling can pass while no stable selector exists
- a row-level oracle pass can hide step-to-step instability
For me, this became the gate-discipline part of the series: deciding when not to build.
The setup:
- storage compression had already separated itself from latency
- eager value-path approximations had failed
- fused packed-codebook logits had not beaten dense logits by enough
By this point, the original packed-codebook TurboQuant latency path was closed.
The evidence was not subtle:
- eager value-path variants failed
- exact cleanup did not move latency enough
- primitive feasibility failed badly
- fused packed-codebook logits beat eager but missed dense-speed bars
So the next move could not be:
one more TurboQuant variant
It had to change the hypothesis.
I tested two pivots:
- hardware-friendly int4 K/V
- long-context work reduction
Both taught useful things.
Neither justified a runtime latency implementation.
Pivot 1: hardware-friendly int4 K/V
The packed-codebook representation was expensive to consume.
So the next idea was simpler:
What if the representation is less clever but more hardware-friendly?
Instead of codebooks, rotations, and residual machinery, use blockwise int4 K/V.
Two formats were tested:
- symmetric int4
- affine int4
Both quantized over last-dimension blocks with block_size=32.
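As a sketch of what those two formats mean in practice (illustrative code, not the probe's actual implementation; the function name is mine):

```python
import torch

def int4_blockwise(x: torch.Tensor, block_size: int = 32, symmetric: bool = True):
    """Quantize-dequantize round trip over last-dimension blocks.

    Assumes the last dimension is divisible by block_size.
    """
    xb = x.reshape(-1, block_size)
    if symmetric:
        # Symmetric int4: per-block scale, zero point fixed at 0, range [-8, 7].
        scale = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
        q = torch.clamp(torch.round(xb / scale), -8, 7)
        deq = q * scale
    else:
        # Affine int4: per-block scale and zero point, range [0, 15].
        lo = xb.amin(dim=-1, keepdim=True)
        hi = xb.amax(dim=-1, keepdim=True)
        scale = (hi - lo).clamp(min=1e-8) / 15.0
        q = torch.clamp(torch.round((xb - lo) / scale), 0, 15)
        deq = q * scale + lo
    return deq.reshape(x.shape)
```

The appeal is visible in the code: one scale (plus, for affine, one zero point) per 32 values, and nothing else to fetch.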
The hope was:
- simpler unpack/dequant path
- predictable memory layout
- fewer exotic operations
- easier future kernel
This was not integrated into the public cache API.
It was a quality and K0-prep probe only.
The rule was strict:
if real model K/V quality fails, do not write the kernel.
HFKV passed bytes and failed quality
On the real-KV check with HuggingFaceTB/SmolLM2-135M-Instruct, both formats compressed KV substantially.
They also preserved next-token argmax.
But both failed the hard decode-logit MSE gate.
| Format | Top-k@10 | Argmax | Decode-Logit MSE | Required MSE | KV Ratio | Gate |
|---|---|---|---|---|---|---|
| symmetric int4 | 0.800 | yes | 1.739284 | <=0.25 | 3.56x | fail |
| affine int4 | 0.800 | yes | 1.360282 | <=0.25 | 3.20x | fail |
The tempting interpretation would be:
argmax survived, so maybe this is fine.
That is not a good enough bar.
Top-k overlap only exactly met the minimum bar (0.800), and logit MSE was far outside the target.
The correct decision was:
do not build HFKV-K0.
The reusable lesson:
byte compression is not quality.
The synthetic K0-prep numbers looked fine, but synthetic random tensors were not predictive enough. Real model K/V was the gate, and it failed.
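For reference, the shape of the decision rule in that gate is simple enough to write down. This is a sketch, assuming the dense and quantized decode logits for a step are already in hand; the helper name and return format are mine:

```python
import torch

def decode_logit_gate(dense: torch.Tensor, quant: torch.Tensor,
                      mse_gate: float = 0.25, k: int = 10) -> dict:
    # dense, quant: 1-D logits for the same decode step.
    mse = torch.mean((dense - quant) ** 2).item()
    top_d = set(dense.topk(k).indices.tolist())
    top_q = set(quant.topk(k).indices.tolist())
    overlap = len(top_d & top_q) / k
    argmax_ok = dense.argmax().item() == quant.argmax().item()
    return {
        "mse": mse,
        "topk_overlap": overlap,
        "argmax": argmax_ok,
        # All three must hold; argmax alone is not the bar.
        "pass": mse <= mse_gate and overlap >= 0.8 and argmax_ok,
    }
```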
Pivot 2: work reduction
After that, I stopped asking:
can I compress the historical values?
and asked:
does the model actually need all historical tokens for stable decode?
This is a different latency hypothesis.
It is not cache compression.
It is dense-attention work reduction.
The idea is:
full attention over history
becomes:
attention over a selected subset of history
If only a fraction f of historical tokens are active, the idealized attention work shrinks.
But this only matters if attention is a large enough part of full decode.
So the first gate was a speed ceiling.
Speed ceiling math
Let:
- p_attn be the fraction of decode time spent in attention
- f be the active historical fraction
- h be selector/masking overhead as a fraction of the original decode step
The rough projected speedup is:
speedup = 1 / ((1 - p_attn) + p_attn * f + h)
This equation is intentionally simple.
It asks:
even if the selector existed, is there enough attention work to remove?
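In code, the whole ceiling check is a few lines (the small-model p_attn below is illustrative, not a measured number):

```python
def projected_speedup(p_attn: float, f: float, h: float) -> float:
    # p_attn: fraction of the dense decode step spent in attention
    # f:      active historical fraction
    # h:      selector/masking overhead as a fraction of the dense step
    return 1.0 / ((1.0 - p_attn) + p_attn * f + h)

print(projected_speedup(0.15, 0.30, 0.0))     # ~1.12x: small-model-like share (illustrative)
print(projected_speedup(0.334, 0.337, 0.05))  # ~1.21x: the Qwen2.5-7B numbers below
```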
On small models, the answer was mostly no.
For SmolLM2-135M, attention was not a large enough share of full decode. The quality signal was real, but the latency ceiling was too low.
So I moved to a larger real target:
Qwen/Qwen2.5-7B-Instruct
at roughly:
8192 prompt tokens
Qwen had a real speed ceiling
The Qwen2.5-7B speed-ceiling result was the strongest latency signal in the whole project.
In bfloat16, the dense decode step was about:
20.447 ms
The SDPA real-model projection estimated:
p_attn = 0.334
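For context, a dense decode-step number like 20.447 ms typically comes from CUDA-event timing around a single-token forward with a warm, prefilled cache. A minimal sketch of that technique, not the project's actual harness (which is documented in the evidence repo):

```python
import torch

def time_decode_step(model, next_id, past, iters=50, warmup=5):
    # next_id: a (1, 1) token tensor; past: a prefilled KV cache at full context.
    # The cache grows by one token per call, which is negligible at 8192 context.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(warmup):
            model(next_id, past_key_values=past, use_cache=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(next_id, past_key_values=past, use_cache=True)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per decode step
```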
Projected speedups:
| Active Fraction | Projected Speedup, 0% Overhead | Projected Speedup, 5% Overhead | Projected Speedup, 10% Overhead | Gate |
|---|---|---|---|---|
| 0.337 | 1.28x | 1.21x | 1.14x | pass at 5% |
| 0.350 | 1.28x | 1.20x | 1.13x | pass at 5% |
| 0.376 | 1.26x | 1.19x | 1.12x | near miss |
This was not a fake result.
There was real room.
But speed ceiling is only half the story.
The next question was:
can an implementable selector preserve quality while keeping only about 34-38% of history?
Oracle quality failed
The official quality gate used:
- model: Qwen/Qwen2.5-7B-Instruct
- context: 8192
- decode steps: 4
- dtype: bfloat16
- active fraction gate: 0.376
The dense reference was finite on all steps.
So this was a valid quality run.
The headline:
15 selector-step rows passed,
but 0 selector configurations passed all 4 decode steps.
That distinction matters.
A row-level pass says:
this selector worked on this step.
An implementation pass needs:
this selector family worked consistently across decode steps.
No implementable selector did.
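The aggregation behind that distinction is trivial, which is exactly why it is easy to skip. A sketch (the record format here is hypothetical):

```python
from collections import defaultdict

def selectors_passing_all_steps(rows, n_steps=4):
    # rows: iterable of (selector_name, step_index, passed_bool) records.
    passed = defaultdict(set)
    for selector, step, ok in rows:
        if ok:
            passed[selector].add(step)
    # A selector only counts as implementable if every decode step passed.
    return [s for s, steps in passed.items() if len(steps) == n_steps]
```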
| Selector | Active Hist Fraction | Passed Steps | Failed Steps | Max MSE | Main Failure |
|---|---|---|---|---|---|
| global_block_mass:f=0.3760:b=16 | 0.374 | 0,1,3 | 2 | 0.585789 | step 2 MSE |
| global_block_mass:f=0.3500:b=16 | 0.348 | 0,3 | 1,2 | 0.900147 | steps 1,2 MSE |
| global_block_mass:f=0.3370:b=16 | 0.335 | 0,2,3 | 1 | 0.919634 | step 1 MSE |
| recent_sink:sink=4:recent=3072 | 0.375 | none | 0,1,2,3 | 0.163465 | layer-local relative L2 |
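My reading of the global_block_mass label, as a sketch only (the real selector may differ in detail; this is just to make the family concrete): group history into blocks of b tokens, rank blocks by total attention mass, and keep blocks until the active fraction reaches f.

```python
import torch

def global_block_mass_mask(attn_mass: torch.Tensor, f: float = 0.376, b: int = 16):
    # attn_mass: per-history-token attention mass, shape (n,); n assumed divisible by b.
    n = attn_mass.numel()
    block_mass = attn_mass.reshape(-1, b).sum(dim=-1)
    order = block_mass.argsort(descending=True)
    n_blocks_kept = max(1, int(f * n) // b)
    keep = torch.zeros(block_mass.numel(), dtype=torch.bool)
    keep[order[:n_blocks_kept]] = True
    return keep.repeat_interleave(b)  # boolean mask over history tokens
```

Note that this needs the attention mass it is supposed to save you from computing, which is why it is an oracle, not an implementation.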
The most tempting result was recent_sink:sink=4:recent=3072.
It had:
- max decode-logit MSE of 0.163465
- top-k overlap of at least 0.8
- stable argmax

But it failed every step on the layer-local median post-o_proj relative L2 gate:
0.1394 > 0.10
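The recent_sink rule itself is fully implementable and cheap, which is what made it tempting. A sketch, with parameters following the selector label:

```python
import torch

def recent_sink_mask(history_len: int, sink: int = 4, recent: int = 3072) -> torch.Tensor:
    # Keep the first `sink` tokens plus the last `recent` tokens of history.
    mask = torch.zeros(history_len, dtype=torch.bool)
    mask[:sink] = True
    mask[max(0, history_len - recent):] = True
    return mask

print(recent_sink_mask(8192).float().mean())  # ~0.375 active fraction at 8192 tokens
```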
At this point, the dangerous move would be:
relax the quality gate because the result is close.
That is how projects go in circles.
The gate existed before the result.
The result failed the gate.
So the decision was:
do not build the runtime selector.
What survived
The final result is not:
everything was useless.
The surviving lessons are more precise.
Dense GPU decode is a serious baseline
Dense attention is not naive.
It has:
- simple data layout
- optimized kernels
- clean tensor operations
- no unpack/reconstruct overhead
Any compressed path has to beat that, not just beat its own prototype.
Memory and latency are separate scorecards
Cache compression can be valuable even if it does not reduce latency.
The right memory/capacity metrics are:
- cache bytes per token
- maximum context before OOM
- batch size at fixed VRAM
- throughput under memory pressure
- quality at long context
That is a different product goal from:
faster decode when dense already fits
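The first metric on that scorecard, cache bytes per token, is pure arithmetic. Worked here for Qwen2.5-7B-Instruct in bfloat16, using its published config (28 layers, 4 KV heads via GQA, head dim 128):

```python
# K and V, per layer, per KV head, per head dimension, 2 bytes each in bf16.
n_layers, n_kv_heads, head_dim, elem_bytes = 28, 4, 128, 2
per_token = 2 * n_layers * n_kv_heads * head_dim * elem_bytes
print(per_token)                 # 57344 bytes = 56 KiB per token
print(per_token * 8192 / 2**20)  # 448.0 MiB for the 8192-token context
```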
Speed ceiling is necessary but not sufficient
Qwen2.5-7B proved there can be enough attention share for work reduction to matter.
But the selector also has to preserve quality.
The oracle could not find one stable implementable selector under the hard gate.
Paper claims and integration claims are different
A paper can make a valid primitive or memory claim.
A transformers integration needs a different proof:
- full decode timing
- real model quality
- update cost
- value path
- generation overhead
- target hardware
- target dtype baseline
Those are not interchangeable.
Final state
The current honest state is:
No active GPU decode-latency implementation path.
Closed as latency paths:
- eager TurboQuant-family variants
- packed-codebook fused K1/residual/value integration
- hardware-friendly int4 K/V kernel work
- Qwen work-reduction selector implementation
Still potentially useful:
- storage/capacity compression
- exact cleanup as baseline hygiene
- the measurement discipline
- the failure map
The next reasonable project, if the goal continues, is not:
one more latency variant
It is:
a separate memory/capacity plan
with its own scorecard.
Closing
The whole project started with a simple hope:
smaller KV cache, faster transformers.
The actual lesson was harder:
compression only speeds up decode if the compressed representation is cheap to consume on the target execution path.
That condition failed repeatedly:
- in eager value approximations
- in packed-codebook primitive timing
- in fused logits upper bounds
- in simple int4 K/V quality
- in long-context work reduction quality
That is not wasted work.
It is a map of where the obvious traps are.
And for performance engineering, a good negative map is often the thing that prevents the next six months of bad work.
The final lesson:
A speed ceiling is only permission to run a quality gate. It is not permission to implement.
Qwen2.5-7B had enough attention share for work reduction to matter. The oracle still failed to find one stable implementable selector. That is why the latency path stopped.
