Alankrit Verma
When A Good Approximation Still Loses

I wanted to answer one question:

Why did a mathematically reasonable value approximation still fail as a runtime optimization?

The short answer:

active fraction is not runtime.

TL;DR

  • I tried to make historical value mixing cheaper after compressed-key attention was stable.
  • Chunk summaries were cheap because they removed information; active chunks had a plausible error argument but terrible eager runtime.
  • The final vectorized eager path kept active fraction low, but value decode got worse: 11.906 ms became about 26 ms.
  • The lesson was concrete: active token fraction is not runtime. Count gathers, decodes, reductions, scatters, and bookkeeping.

Evidence

I put the detailed benchmark notes in the public evidence repo.

My value-side bet

My working hypothesis was not:

randomly approximate values and hope generation survives.

It was more structured:

  1. Keep the compressed-key path stable.
  2. Preserve recent and sink values exactly because they are high-risk.
  3. Summarize low-risk historical values only if the quality loss is bounded.
  4. Select active historical chunks only when the approximation risk is high.
  5. Require the implementation to reduce real hot-path work, not just theoretical token count.

The mistake was that, for too long, step 5 was held to a much weaker standard than steps 1-4.

That is why this post is useful beyond this specific cache implementation. It is a case study in the difference between:

a reasonable approximation argument

and:

a runtime shape that the GPU and framework can execute cheaply.

The setup:

  • a smaller KV cache is not automatically a faster attention path
  • compressed keys can reduce part of the problem
  • values still have to be mixed across history
  • that value path became the bottleneck

The earlier architecture lesson was:

A smaller KV cache is not automatically a faster attention path.

That pushed me toward compressed-attention execution instead of storage-only cache compression.

I built a stable compressed-key baseline. It was not fast enough to be the final answer, but it was coherent. It gave me a way to separate the key side from the value side.

This distinction matters for the rest of the post:

The experiments below are not a verdict on the official TurboQuant paper or every possible fused implementation. They are a verdict on the eager value-path family I built in this transformers fork.

Then the real problem became clear:

even with compressed keys, the model still has to mix historical values.

The rest of this piece follows the experiments that tried to make that value path cheaper.

None became the final answer.

But each one taught something useful.

The value-side problem

After attention weights are computed, the model still needs:

o = sum_t a_t v_t

The expensive part is that t ranges over the history.

If the implementation still decodes or processes most historical values every step, then it has not really escaped the long-context cost.
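
For orientation, here is a minimal PyTorch sketch of the dense value mix this work was trying to avoid (tensor names are illustrative, not from the fork):

```python
import torch

def dense_value_mix(attn: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    # attn:   (H, T)    attention weights over T historical tokens
    # values: (H, T, d) historical value vectors
    # o_h = sum_t a_{h,t} v_{h,t}; this cost grows linearly with T every step
    return torch.einsum("ht,htd->hd", attn, values)
```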

So the value-path goal was:

keep enough value information for quality, but stop paying full historical value cost everywhere.

The hard part is that "enough" is query-dependent, head-dependent, and quality-sensitive.

What survived early: exact recent and sink values

One value-side ingredient consistently made sense:

keep a small exact window for the most recent tokens and a few sink tokens.

Internally this was called exact_recent_sink. The reader-facing name is:

exact recent/sink value window

The reason is simple:

  • recent tokens often matter a lot
  • sink tokens can have special attention behavior
  • preserving them exactly helps stability

This was not a speed breakthrough in eager execution, but it was the only value-side ingredient that kept looking defensible.
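
A minimal sketch of the split, assuming position-based windows (the window sizes here are illustrative, not the fork's actual defaults):

```python
import torch

def split_exact_indices(T: int, num_sink: int = 4, recent_window: int = 64):
    # Sink tokens sit at the start of the sequence; recent tokens at the end.
    sink = torch.arange(0, min(num_sink, T))
    recent = torch.arange(max(T - recent_window, num_sink), T)
    exact = torch.cat([sink, recent])
    # Everything not in `exact` is historical and eligible for approximation.
    approximable = torch.ones(T, dtype=torch.bool)
    approximable[exact] = False
    return exact, approximable
```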

What failed early: more hybrid logic

The first broad hybrid value experiment mixed several ideas:

  • exact recent/sink windows
  • delayed value quantization
  • saliency bookkeeping
  • anchors
  • selective residual logic

Internally this was legacy_hybrid_fast.

It lost.

It added complexity without removing enough dominant work. It was slower than the stable compressed-key baseline and not cleanly better on fidelity.

The reusable lesson:

hybrid logic must remove dominant work, not just add selective corrections.

I also tested saliency-heavy background variants and anchors.

Those did not become forward paths either.

The pattern was the same:

  • more bookkeeping
  • more branches
  • unclear runtime win
  • not enough quality/runtime payoff

These branches are part of the work, but they do not each need their own long section. Their shared lesson was the same, and the later chunk experiments explain it more cleanly.

Chunk summaries were fast because they were wrong

The next idea was to summarize historical values by chunks.

For each chunk C_j, define a mean value:

mu_j = (1 / c) sum_(t in C_j) v_t

where c is the chunk size.

And define the attention mass on that chunk:

M_j = sum_(t in C_j) a_t

Then approximate the chunk contribution as:

M_j mu_j

instead of:

sum_(t in C_j) a_t v_t

This is attractive because it replaces many token-level value contributions with one chunk-level contribution.
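
In code, the summary path collapses each chunk into one fused term; a sketch, assuming the chunk size divides the history length (names illustrative):

```python
import torch

def chunk_summary_mix(attn: torch.Tensor, values: torch.Tensor, c: int):
    # attn: (H, T), values: (H, T, d), chunk size c with T % c == 0
    H, T, d = values.shape
    J = T // c
    mu = values.reshape(H, J, c, d).mean(dim=2)  # (H, J, d): chunk means mu_j
    M = attn.reshape(H, J, c).sum(dim=2)         # (H, J):    chunk masses M_j
    # One contribution per chunk instead of c per-token contributions.
    return torch.einsum("hj,hjd->hd", M, mu)
```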

But the blanket summary version lost too much information.

It could be faster, but it was fast for the wrong reason: it threw away too much of the information the task actually needed.

Reusable lesson:

a speedup that comes from destroying quality is not an optimization.

Active chunks had better math and terrible runtime

The next idea tried to keep the good part of chunk summaries without summarizing everything.

Instead of treating every historical chunk the same:

  • keep important chunks active and exact
  • summarize the inactive chunks

For inactive chunks:

o_j_hat = M_j mu_j

For active chunks:

o_j = sum_(t in C_j) a_t v_t

To choose active chunks, I used a score:

score_j = mass_j * spread_j

The intuition was reasonable:

  • high attention mass means the chunk matters
  • high value spread means the mean may be a bad summary
  • high mass * spread means the chunk is risky to approximate
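
A minimal sketch of that selection rule (tensor names are illustrative; the ratio and minimum-chunk parameters match the benchmark grid later in the post):

```python
import torch

def select_active_chunks(M: torch.Tensor, spread: torch.Tensor,
                         active_ratio: float = 0.125, min_active: int = 1):
    # M:      (H, J) attention mass per chunk
    # spread: (H, J) dispersion of each chunk's values around its mean
    J = M.shape[-1]
    k = max(min_active, int(active_ratio * J))
    score = M * spread                      # high mass AND high spread = risky
    return score.topk(k, dim=-1).indices    # (H, k) indices of active chunks
```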

This was a real approximation argument.

It was not a runtime argument.

The eager implementation was disastrous.

| Path | Decode 2048 | Long-Context 2048 | Long Next-Token MSE |
| --- | --- | --- | --- |
| stable compressed-key baseline | 54.607 ms | 13.717 s | 0.349 |
| exact recent/sink window | 70.205 ms | 14.485 s | 0.326 |
| active-chunk value approximation | 1148.843 ms | 311.692 s | 0.494 |

(Figure: active-chunk approximation runtime blow-up)

The profile explained the failure.

The active-chunk implementation:

  • selected active chunks
  • looped over those chunks in Python
  • decoded tiny slices repeatedly
  • scattered small contributions back repeatedly

At decode 2048, value decode time went from:

9.080 ms -> 1079.421 ms

That is not a small miss. That is an implementation shape failure.

The major lesson:

I had proved an approximation bound, not a runtime win.

The process lesson was just as important:

Before a large run, every approximation needs a written hot-path shape: how many gathers, decodes, reductions, scatters, and Python-level loops are actually in the decode step.
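
One lightweight way to make that obligation concrete is a shape record that must be filled in before the run (a sketch; the fields simply mirror the list above):

```python
from dataclasses import dataclass

@dataclass
class HotPathShape:
    # Per-decode-step counts, written down before any large benchmark run.
    gathers: int
    decodes: int
    reductions: int
    scatters: int
    python_loops: int  # anything > 0 here is a red flag in a decode hot path
```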

(Figure: experiment timeline)

The final eager test: vectorize active chunks

After that failure, the next question was narrow:

Was the active-chunk idea bad, or was the eager implementation shape bad?

So I ran one final eager viability test.

Internal name:

vectorized_active_background

Reader-facing name:

final vectorized active-background value test

The goal was to keep the same compressed-key baseline and redesign only historical value participation.

The design used:

  • exact sink tokens
  • exact recent tokens
  • a dense staging buffer for values that had aged out of the recent window
  • fixed-size historical chunks
  • active exact chunks selected by attention mass and spread
  • inactive chunk summaries
  • vectorized gather/decode/reduction instead of Python loops

The intended output decomposition was:

o = o_sink + o_recent + o_staging + o_history

For historical chunks:

o_history ~= sum_(j in A) sum_(t in C_j) a_t v_t
           + sum_(j not in A) M_j mu_j

where A is the active chunk set.
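
A single-head sketch of the vectorized historical mix (gather once, reduce once; everything here is illustrative rather than the fork's actual code):

```python
import torch

def history_mix(attn: torch.Tensor, values: torch.Tensor,
                c: int, active: torch.Tensor) -> torch.Tensor:
    # attn: (T,), values: (T, d), chunk size c, active: (k,) chunk indices
    T, d = values.shape
    J = T // c
    a = attn.reshape(J, c)
    v = values.reshape(J, c, d)
    # Active chunks: exact per-token mixing, gathered in one shot.
    o_active = torch.einsum("jc,jcd->d", a[active], v[active])
    # Inactive chunks: one M_j * mu_j term each.
    inactive = torch.ones(J, dtype=torch.bool)
    inactive[active] = False
    M = a[inactive].sum(dim=1)        # (J - k,)
    mu = v[inactive].mean(dim=1)      # (J - k, d)
    return o_active + torch.einsum("j,jd->d", M, mu)
```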

This test had a hard benchmark gate.

It needed:

  • at least 20% better decode step latency than the stable compressed-key baseline
  • at least 40% lower value-decode time
  • active token fraction at most 0.25
  • at least 20% better long-context latency
  • no material fidelity collapse
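
A sketch of the gate as a single check (thresholds are the ones above; field names and the MSE budget are illustrative):

```python
def passes_gate(base: dict, cand: dict, mse_budget: float) -> bool:
    return (cand["decode_ms"] <= 0.80 * base["decode_ms"]                  # >=20% faster decode
            and cand["value_decode_ms"] <= 0.60 * base["value_decode_ms"]  # >=40% less value decode
            and cand["active_fraction"] <= 0.25
            and cand["long_ctx_s"] <= 0.80 * base["long_ctx_s"]            # >=20% better long context
            and cand["mse"] <= mse_budget)                                 # no fidelity collapse
```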

The benchmark grid was intentionally small:

| Reader Label | Summary Chunk Size | Active Chunk Ratio | Min Active Chunks |
| --- | --- | --- | --- |
| chunk 16, 12.5% active | 16 | 0.125 | 1 |
| chunk 16, 25% active | 16 | 0.25 | 1 |
| chunk 32, 12.5% active | 32 | 0.125 | 1 |

This was not a sweep for the sake of sweeping. It was a falsification test.

The result: active fraction was not runtime

The stable compressed-key baseline for this run was:

| Metric | Baseline |
| --- | --- |
| Decode 2048 step latency | 70.604 ms |
| Decode 2048 value-decode time | 11.906 ms |
| Prefill 2048 latency | 58.128 ms |
| Long-context latency | 14.212 s |
| Fidelity MSE vs dense baseline | 0.2326 |
| Fidelity top-k overlap | 1.0 |

The candidates:

| Reader Label | Decode Step | Decode vs Baseline | Value Decode | Active Token Fraction | Long-Context Latency | Fidelity MSE | Top-k |
| --- | --- | --- | --- | --- | --- | --- | --- |
| chunk 16, 12.5% active | 68.726 ms | +2.7% | 26.461 ms | 0.125 | 17.756 s | 1.3115 | 0.5 |
| chunk 16, 25% active | 68.768 ms | +2.6% | 26.149 ms | 0.25 | 17.884 s | 0.3314 | 0.7 |
| chunk 32, 12.5% active | 68.386 ms | +3.1% | 26.341 ms | 0.1333 | 17.780 s | 2.1598 | 0.4 |

No configuration passed.

The active token fraction stayed within budget.

But the actual value-decode time got much worse:

11.906 ms -> about 26 ms

Long-context latency got about 25% worse.

Fidelity regressed.

(Figure: final eager gate results)

The core lesson:

active token fraction is not runtime.

The runtime is closer to:

total_cost = selection
           + gather
           + active_decode
           + active_reduce
           + inactive_reduce
           + bookkeeping

not:

total_cost = active_fraction * dense_cost

The final vectorized eager path removed the catastrophic Python loop, but it still did not produce a frontier-moving result.

So the disciplined decision was:

freeze eager variants in this compressed-attention family.

The test sequence was deliberately not broad:

  • focused correctness tests
  • decode at roughly 2048 prompt tokens
  • prefill at roughly 2048 prompt tokens
  • long-context memory/latency at roughly 2048 prompt tokens
  • next-token logit fidelity
  • profile counters for active gather/decode/reduce, inactive summary work, and staging flushes

That matters because this was a kill-gate, not a hyperparameter search.

What this does and does not prove

It does prove:

  • this eager value-path family should not keep getting new variants
  • vectorizing a bad workload is not automatically enough
  • runtime-cost modeling must happen before broad benchmarks
  • quality gates matter as much as speed gates

It does not prove:

  • all compressed attention is useless
  • TurboQuant-style key ideas are worthless
  • lower-level fused implementations cannot work
  • that the official TurboQuant result is invalid

The correct conclusion is narrower:

the eager implementation level failed for this family.

That is still a useful result. It prevents the next round of work from being "one more eager variant with one more heuristic."

What came next

At this point, the eager value-path family was frozen.

But that did not yet answer a lower-level question:

Was the compressed-attention idea weak, or was the eager implementation level weak?

Before doing kernel work, there was one exact cleanup worth proving.

The stable compressed-key baseline currently reconstructs rotated value information in a way that appears algebraically wasteful.

Current form:

decoded_values_t = R^-1(z_t)
o = sum_t a_t decoded_values_t

Because R^-1 is linear:

sum_t a_t R^-1(z_t) = R^-1(sum_t a_t z_t)

So instead of inverse-rotating every historical token value, I can first compute the weighted sum in rotated space and inverse-rotate once:

z_weighted = sum_t a_t z_t
o = R^-1(z_weighted)

This is exact relative to the current value codec.
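
A quick numerical check of that identity, using a random orthogonal map as a stand-in for R (everything here is illustrative):

```python
import torch

torch.manual_seed(0)
d, T = 64, 2048
R = torch.linalg.qr(torch.randn(d, d, dtype=torch.float64)).Q  # orthogonal "rotation"
R_inv = R.T                                    # inverse of an orthogonal map
a = torch.softmax(torch.randn(T, dtype=torch.float64), dim=0)  # attention weights
z = torch.randn(T, d, dtype=torch.float64)     # values in rotated space

per_token = torch.einsum("t,td->d", a, z @ R_inv.T)  # apply R^-1 to all T tokens
once = torch.einsum("t,td->d", a, z) @ R_inv.T       # apply R^-1 exactly once
assert torch.allclose(per_token, once)               # equal up to float error
```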

It is not:

  • a new heuristic
  • a new approximation
  • another active-background branch

It is a baseline cleanup.

For SmolLM2-135M at T = 2048, the rough inverse-rotation count changes from:

H_kv * T = 3 * 2048 = 6144

to:

H_kv * G * Q = 3 * 3 * 1 = 9

for the output rotations.

Here H_kv is the number of KV heads, G is the number of query groups per KV head, and Q is the number of query positions in a decode step.

That is why it deserved a quick proof.

If that exact cleanup did not move the profile enough, then the only remaining TurboQuant-family question was lower-level:

can a fused/Triton/CUDA-style compressed-attention primitive beat dense attention by a large enough margin to justify integration?

That is a different proof level. It is no longer a cache-API experiment; it is a primitive benchmark against dense GPU attention/logits.

The lessons worth keeping

  1. Storage compression and runtime execution must be measured separately.
  2. A stable baseline is more valuable than a pile of incomparable branches.
  3. Every approximation needs a runtime proof obligation.
  4. Python loops in decode hot paths are presumed guilty until proven otherwise.
  5. Active fraction, cache size, and runtime are different metrics.
  6. A failed eager implementation does not automatically disprove the math.
  7. A failed hard gate should stop the branch, not invite endless tuning.
  8. The next proof should be exact if possible, primitive-level if necessary, and killed quickly if weak.
  9. Benchmark-only experiment names should stay separate from public API promises.
  10. Every serious run should emit enough configuration and profiling data to explain the result later.

Closing

This value-path work did not produce a production-ready faster cache path.

But it did produce a useful map:

  • exact recent/sink values were defensible but not enough
  • summary-only background was fast because it lost too much information
  • active chunks had a reasonable approximation argument and a bad hot path
  • vectorization removed the catastrophic loop shape but still missed the gate
  • active fraction, cache size, and runtime were different metrics

That is the part worth keeping.

Not because the final result won. It did not.

Because now I know what not to try again at the eager value-path level.

The remaining question was lower-level:

What if the compressed path only failed because eager execution was the wrong implementation level?

That question needed a different kind of proof: a primitive-level comparison against dense GPU attention/logits, not another eager cache variant.
