Alankrit Verma
When A Good Approximation Still Loses

I wanted to answer one question:

Why did a mathematically reasonable value approximation still fail as a runtime optimization?

The short answer:

active fraction is not runtime.

TL;DR

  • I tried to make historical value mixing cheaper after compressed-key attention was stable.
  • Chunk summaries were cheap because they removed information; active chunks had a plausible error argument but terrible eager runtime.
  • The final vectorized eager path kept active fraction low, but value decode got worse: 11.906 ms became about 26 ms.
  • The lesson was concrete: active token fraction is not runtime. Count gathers, decodes, reductions, scatters, and bookkeeping.

Evidence

I put the detailed benchmark notes in the public evidence repo.

My value-side bet

My working hypothesis was not:

randomly approximate values and hope generation survives.

It was more structured:

  1. Keep the compressed-key path stable.
  2. Preserve recent and sink values exactly because they are high-risk.
  3. Summarize low-risk historical values only if the quality loss is bounded.
  4. Select active historical chunks only when the approximation risk is high.
  5. Require the implementation to reduce real hot-path work, not just theoretical token count.

The mistake was that, for too long, step 5 was held to a much weaker standard than steps 1-4.

That is why this post is useful beyond this specific cache implementation. It is a case study in the difference between:

a reasonable approximation argument

and:

a runtime shape that the GPU and framework can execute cheaply.

The setup:

  • a smaller KV cache is not automatically a faster attention path
  • compressed keys can reduce part of the problem
  • values still have to be mixed across history
  • that value path became the bottleneck

The earlier architecture lesson was:

A smaller KV cache is not automatically a faster attention path.

That pushed me toward compressed-attention execution instead of storage-only cache compression.

I built a stable compressed-key baseline. It was not fast enough to be the final answer, but it was coherent. It gave me a way to separate the key side from the value side.

This distinction matters for the rest of the post:

The experiments below are not a verdict on the official TurboQuant paper or every possible fused implementation. They are a verdict on the eager value-path family I built in this transformers fork.

Then the real problem became clear:

even with compressed keys, the model still has to mix historical values.

The rest of this piece follows the experiments that tried to make that value path cheaper.

None became the final answer.

But each one taught something useful.

The value-side problem

After attention weights are computed, the model still needs:

o = sum_t a_t v_t

The expensive part is that t ranges over the history.

If the implementation still decodes or processes most historical values every step, then it has not really escaped the long-context cost.
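
For orientation, here is a minimal PyTorch sketch of the dense value mix this work was trying to avoid (tensor names are illustrative, not from the fork):

```python
import torch

def dense_value_mix(attn: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    # attn:   (H, T)    attention weights over T historical tokens
    # values: (H, T, d) historical value vectors
    # o_h = sum_t a_{h,t} v_{h,t}; this cost grows linearly with T every step
    return torch.einsum("ht,htd->hd", attn, values)
```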

So the value-path goal was:

keep enough value information for quality, but stop paying full historical value cost everywhere.

The hard part is that "enough" is query-dependent, head-dependent, and quality-sensitive.

What survived early: exact recent and sink values

One value-side ingredient consistently made sense:

keep a small exact window for the most recent tokens and a few sink tokens.

Internally this was called exact_recent_sink. The reader-facing name is:

exact recent/sink value window

The reason is simple:

  • recent tokens often matter a lot
  • sink tokens can have special attention behavior
  • preserving them exactly helps stability

This was not a speed breakthrough in eager execution, but it was the only value-side ingredient that kept looking defensible.
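
A minimal sketch of the split, assuming position-based windows (the window sizes here are illustrative, not the fork's actual defaults):

```python
import torch

def split_exact_indices(T: int, num_sink: int = 4, recent_window: int = 64):
    # Sink tokens sit at the start of the sequence; recent tokens at the end.
    sink = torch.arange(0, min(num_sink, T))
    recent = torch.arange(max(T - recent_window, num_sink), T)
    exact = torch.cat([sink, recent])
    # Everything not in `exact` is historical and eligible for approximation.
    approximable = torch.ones(T, dtype=torch.bool)
    approximable[exact] = False
    return exact, approximable
```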

What failed early: more hybrid logic

The first broad hybrid value experiment mixed several ideas:

  • exact recent/sink windows
  • delayed value quantization
  • saliency bookkeeping
  • anchors
  • selective residual logic

Internally this was legacy_hybrid_fast.

It lost.

It added complexity without removing enough dominant work. It was slower than the stable compressed-key baseline and not cleanly better on fidelity.

The reusable lesson:

hybrid logic must remove dominant work, not just add selective corrections.

I also tested saliency-heavy background variants and anchors.

Those did not become forward paths either.

The pattern was the same:

  • more bookkeeping
  • more branches
  • unclear runtime win
  • not enough quality/runtime payoff

These branches are part of the work, but they do not each need their own long section. Their shared lesson was the same, and the later chunk experiments explain it more cleanly.

Chunk summaries were fast because they were wrong

The next idea was to summarize historical values by chunks.

For each chunk C_j, define a mean value:

mu_j = (1 / c) sum_(t in C_j) v_t

where c is the chunk size.

And define the attention mass on that chunk:

M_j = sum_(t in C_j) a_t

Then approximate the chunk contribution as:

M_j mu_j

instead of:

sum_(t in C_j) a_t v_t

This is attractive because it replaces many token-level value contributions with one chunk-level contribution.
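
In code, the summary path collapses each chunk into one fused term; a sketch, assuming the chunk size divides the history length (names illustrative):

```python
import torch

def chunk_summary_mix(attn: torch.Tensor, values: torch.Tensor, c: int):
    # attn: (H, T), values: (H, T, d), chunk size c with T % c == 0
    H, T, d = values.shape
    J = T // c
    mu = values.reshape(H, J, c, d).mean(dim=2)  # (H, J, d): chunk means mu_j
    M = attn.reshape(H, J, c).sum(dim=2)         # (H, J):    chunk masses M_j
    # One contribution per chunk instead of c per-token contributions.
    return torch.einsum("hj,hjd->hd", M, mu)
```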

But the blanket summary version lost too much information.

It could be faster, but it was fast for the wrong reason: it threw away too much of the information the task actually needed.

Reusable lesson:

a speedup that comes from destroying quality is not an optimization.

Active chunks had better math and terrible runtime

The next idea tried to keep the good part of chunk summaries without summarizing everything.

Instead of treating every historical chunk the same:

  • keep important chunks active and exact
  • summarize the inactive chunks

For inactive chunks:

o_j_hat = M_j mu_j

For active chunks:

o_j = sum_(t in C_j) a_t v_t

To choose active chunks, I used a score:

score_j = mass_j * spread_j

The intuition was reasonable:

  • high attention mass means the chunk matters
  • high value spread means the mean may be a bad summary
  • high mass * spread means the chunk is risky to approximate
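
A minimal sketch of that selection rule (tensor names are illustrative; the ratio and minimum-chunk parameters match the benchmark grid later in the post):

```python
import torch

def select_active_chunks(M: torch.Tensor, spread: torch.Tensor,
                         active_ratio: float = 0.125, min_active: int = 1):
    # M:      (H, J) attention mass per chunk
    # spread: (H, J) dispersion of each chunk's values around its mean
    J = M.shape[-1]
    k = max(min_active, int(active_ratio * J))
    score = M * spread                      # high mass AND high spread = risky
    return score.topk(k, dim=-1).indices    # (H, k) indices of active chunks
```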

This was a real approximation argument.

It was not a runtime argument.

The eager implementation was disastrous.

| Path | Decode 2048 | Long-Context 2048 | Long Next-Token MSE |
| --- | --- | --- | --- |
| stable compressed-key baseline | 54.607 ms | 13.717 s | 0.349 |
| exact recent/sink window | 70.205 ms | 14.485 s | 0.326 |
| active-chunk value approximation | 1148.843 ms | 311.692 s | 0.494 |

(Figure: active-chunk approximation runtime blow-up)

The profile explained the failure.

The active-chunk implementation:

  • selected active chunks
  • looped over those chunks in Python
  • decoded tiny slices repeatedly
  • scattered small contributions back repeatedly

At decode 2048, value decode time went from:

9.080 ms -> 1079.421 ms

That is not a small miss. That is an implementation shape failure.

The major lesson:

I had proved an approximation bound, not a runtime win.

The process lesson was just as important:

Before a large run, every approximation needs a written hot-path shape: how many gathers, decodes, reductions, scatters, and Python-level loops are actually in the decode step.
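
One lightweight way to make that obligation concrete is a shape record that must be filled in before the run (a sketch; the fields simply mirror the list above):

```python
from dataclasses import dataclass

@dataclass
class HotPathShape:
    # Per-decode-step counts, written down before any large benchmark run.
    gathers: int
    decodes: int
    reductions: int
    scatters: int
    python_loops: int  # anything > 0 here is a red flag in a decode hot path
```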

(Figure: experiment timeline)

The final eager test: vectorize active chunks

After that failure, the next question was narrow:

Was the active-chunk idea bad, or was the eager implementation shape bad?

So I ran one final eager viability test.

Internal name:

vectorized_active_background

Reader-facing name:

final vectorized active-background value test

The goal was to keep the same compressed-key baseline and redesign only historical value participation.

The design used:

  • exact sink tokens
  • exact recent tokens
  • a dense staging buffer for values that had aged out of the recent window
  • fixed-size historical chunks
  • active exact chunks selected by attention mass and spread
  • inactive chunk summaries
  • vectorized gather/decode/reduction instead of Python loops

The intended output decomposition was:

o = o_sink + o_recent + o_staging + o_history

For historical chunks:

o_history ~= sum_(j in A) sum_(t in C_j) a_t v_t
           + sum_(j not in A) M_j mu_j

where A is the active chunk set.
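
A single-head sketch of the vectorized historical mix (gather once, reduce once; everything here is illustrative rather than the fork's actual code):

```python
import torch

def history_mix(attn: torch.Tensor, values: torch.Tensor,
                c: int, active: torch.Tensor) -> torch.Tensor:
    # attn: (T,), values: (T, d), chunk size c, active: (k,) chunk indices
    T, d = values.shape
    J = T // c
    a = attn.reshape(J, c)
    v = values.reshape(J, c, d)
    # Active chunks: exact per-token mixing, gathered in one shot.
    o_active = torch.einsum("jc,jcd->d", a[active], v[active])
    # Inactive chunks: one M_j * mu_j term each.
    inactive = torch.ones(J, dtype=torch.bool)
    inactive[active] = False
    M = a[inactive].sum(dim=1)        # (J - k,)
    mu = v[inactive].mean(dim=1)      # (J - k, d)
    return o_active + torch.einsum("j,jd->d", M, mu)
```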

This test had a hard benchmark gate.

It needed:

  • at least 20% better decode step latency than the stable compressed-key baseline
  • at least 40% lower value-decode time
  • active token fraction at most 0.25
  • at least 20% better long-context latency
  • no material fidelity collapse
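
A sketch of the gate as a single check (thresholds are the ones above; field names and the MSE budget are illustrative):

```python
def passes_gate(base: dict, cand: dict, mse_budget: float) -> bool:
    return (cand["decode_ms"] <= 0.80 * base["decode_ms"]                  # >=20% faster decode
            and cand["value_decode_ms"] <= 0.60 * base["value_decode_ms"]  # >=40% less value decode
            and cand["active_fraction"] <= 0.25
            and cand["long_ctx_s"] <= 0.80 * base["long_ctx_s"]            # >=20% better long context
            and cand["mse"] <= mse_budget)                                 # no fidelity collapse
```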

The benchmark grid was intentionally small:

| Reader Label | Summary Chunk Size | Active Chunk Ratio | Min Active Chunks |
| --- | --- | --- | --- |
| chunk 16, 12.5% active | 16 | 0.125 | 1 |
| chunk 16, 25% active | 16 | 0.25 | 1 |
| chunk 32, 12.5% active | 32 | 0.125 | 1 |

This was not a sweep for the sake of sweeping. It was a falsification test.

The result: active fraction was not runtime

The stable compressed-key baseline for this run was:

| Metric | Baseline |
| --- | --- |
| Decode 2048 step latency | 70.604 ms |
| Decode 2048 value-decode time | 11.906 ms |
| Prefill 2048 latency | 58.128 ms |
| Long-context latency | 14.212 s |
| Fidelity MSE vs dense baseline | 0.2326 |
| Fidelity top-k overlap | 1.0 |

The candidates:

| Reader Label | Decode Step | Decode vs Baseline | Value Decode | Active Token Fraction | Long-Context Latency | Fidelity MSE | Top-k |
| --- | --- | --- | --- | --- | --- | --- | --- |
| chunk 16, 12.5% active | 68.726 ms | +2.7% | 26.461 ms | 0.125 | 17.756 s | 1.3115 | 0.5 |
| chunk 16, 25% active | 68.768 ms | +2.6% | 26.149 ms | 0.25 | 17.884 s | 0.3314 | 0.7 |
| chunk 32, 12.5% active | 68.386 ms | +3.1% | 26.341 ms | 0.1333 | 17.780 s | 2.1598 | 0.4 |

No configuration passed.

The active token fraction stayed within budget.

But the actual value-decode time got much worse:

11.906 ms -> about 26 ms

Long-context latency got about 25% worse.

Fidelity regressed.

(Figure: final eager gate results)

The core lesson:

active token fraction is not runtime.

The runtime is closer to:

total_cost = selection
           + gather
           + active_decode
           + active_reduce
           + inactive_reduce
           + bookkeeping

not:

total_cost = active_fraction * dense_cost

The final vectorized eager path removed the catastrophic Python loop, but it still did not produce a frontier-moving result.

So the disciplined decision was:

freeze eager variants in this compressed-attention family.

The test sequence was deliberately not broad:

  • focused correctness tests
  • decode at roughly 2048 prompt tokens
  • prefill at roughly 2048 prompt tokens
  • long-context memory/latency at roughly 2048 prompt tokens
  • next-token logit fidelity
  • profile counters for active gather/decode/reduce, inactive summary work, and staging flushes

That matters because this was a kill-gate, not a hyperparameter search.

What this does and does not prove

It does prove:

  • this eager value-path family should not keep getting new variants
  • vectorizing a bad workload is not automatically enough
  • runtime-cost modeling must happen before broad benchmarks
  • quality gates matter as much as speed gates

It does not prove:

  • all compressed attention is useless
  • TurboQuant-style key ideas are worthless
  • lower-level fused implementations cannot work
  • that the official TurboQuant result is invalid

The correct conclusion is narrower:

the eager implementation level failed for this family.

That is still a useful result. It prevents the next round of work from being "one more eager variant with one more heuristic."

What came next

At this point, the eager value-path family was frozen.

But that did not yet answer a lower-level question:

Was the compressed-attention idea weak, or was the eager implementation level weak?

Before doing kernel work, there was one exact cleanup worth proving.

The stable compressed-key baseline currently reconstructs rotated value information in a way that appears algebraically wasteful.

Current form:

decoded_values_t = R^-1(z_t)
o = sum_t a_t decoded_values_t

Because R^-1 is linear:

sum_t a_t R^-1(z_t) = R^-1(sum_t a_t z_t)

So instead of inverse-rotating every historical token value, I can first compute the weighted sum in rotated space and inverse-rotate once:

z_weighted = sum_t a_t z_t
o = R^-1(z_weighted)

This is exact relative to the current value codec.
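
A quick numerical check of that identity, using a random orthogonal map as a stand-in for R (everything here is illustrative):

```python
import torch

torch.manual_seed(0)
d, T = 64, 2048
R = torch.linalg.qr(torch.randn(d, d, dtype=torch.float64)).Q  # orthogonal "rotation"
R_inv = R.T                                    # inverse of an orthogonal map
a = torch.softmax(torch.randn(T, dtype=torch.float64), dim=0)  # attention weights
z = torch.randn(T, d, dtype=torch.float64)     # values in rotated space

per_token = torch.einsum("t,td->d", a, z @ R_inv.T)  # apply R^-1 to all T tokens
once = torch.einsum("t,td->d", a, z) @ R_inv.T       # apply R^-1 exactly once
assert torch.allclose(per_token, once)               # equal up to float error
```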

It is not:

  • a new heuristic
  • a new approximation
  • another active-background branch

It is a baseline cleanup.

For SmolLM2-135M at T = 2048, the rough inverse-rotation count changes from:

H_kv * T = 3 * 2048 = 6144

to:

H_kv * G * Q = 3 * 3 * 1 = 9

for the output rotations.

Here H_kv is the number of KV heads, G is the number of query groups per KV head, and Q is the number of query positions in a decode step.

That is why it deserved a quick proof.

If that exact cleanup did not move the profile enough, then the only remaining TurboQuant-family question was lower-level:

can a fused/Triton/CUDA-style compressed-attention primitive beat dense attention by a large enough margin to justify integration?

That is a different proof level. It is no longer a cache-API experiment; it is a primitive benchmark against dense GPU attention/logits.

The lessons worth keeping

  1. Storage compression and runtime execution must be measured separately.
  2. A stable baseline is more valuable than a pile of incomparable branches.
  3. Every approximation needs a runtime proof obligation.
  4. Python loops in decode hot paths are presumed guilty until proven otherwise.
  5. Active fraction, cache size, and runtime are different metrics.
  6. A failed eager implementation does not automatically disprove the math.
  7. A failed hard gate should stop the branch, not invite endless tuning.
  8. The next proof should be exact if possible, primitive-level if necessary, and killed quickly if weak.
  9. Benchmark-only experiment names should stay separate from public API promises.
  10. Every serious run should emit enough configuration and profiling data to explain the result later.

Closing

This value-path work did not produce a production-ready faster cache path.

But it did produce a useful map:

  • exact recent/sink values were defensible but not enough
  • summary-only background was fast because it lost too much information
  • active chunks had a reasonable approximation argument and a bad hot path
  • vectorization removed the catastrophic loop shape but still missed the gate
  • active fraction, cache size, and runtime were different metrics

That is the part worth keeping.

Not because the final result won. It did not.

Because now I know what not to try again at the eager value-path level.

The remaining question was lower-level:

What if the compressed path only failed because eager execution was the wrong implementation level?

That question needed a different kind of proof: a primitive-level comparison against dense GPU attention/logits, not another eager cache variant.
