DEV Community

jidonglab

FP8 KV Cache Quantization: The Memory Math and the Accuracy Cliff

A single 128K-token conversation against Llama-3.1-70B (the 128K-context release of Llama 3 70B) holds roughly 40 GB of KV cache in FP16. That is more than the model's activations, and often the reason your context length is capped long before the model's actual limit. Weight quantization gets all the attention, but on a busy inference server the KV cache is what runs you out of VRAM first. FP8 KV cache quantization halves that number, and INT4 quarters it. The catch is that the KV cache is not a safe place to be sloppy with bits, and the failure mode is subtle: your throughput graphs look great while your long-context retrieval quietly rots.

Key takeaways

  • KV cache size per token = 2 × layers × kv_heads × head_dim × bytes_per_element. The only term you control at serving time without retraining is bytes_per_element.
  • FP8 (E4M3) KV cache halves memory vs FP16 and is close to free on accuracy for most workloads — vLLM exposes it as one flag, kv_cache_dtype="fp8".
  • INT4 KV cache quarters memory but needs per-token or per-channel scaling (KIVI, KVQuant) to stay usable; naive INT4 falls off a cliff.
  • The accuracy cliff shows up on long-context retrieval and reasoning first, not on short chat. Test needle-in-a-haystack and multi-hop tasks, not perplexity.
  • Keys are harder to quantize than values. Keys have channel-wise outliers that wreck per-tensor scaling; values are flatter.

Why is the KV cache the thing that runs out of memory?

Because it grows linearly with every token and every concurrent request, while weights are fixed. During decoding, each new token attends to the keys and values of all previous tokens, so the model stores them instead of recomputing. The size of that store, per token, is:

kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element
                     ↑
                     one for K, one for V

Plug in Llama-3-70B (Llama 3.1 70B has the same shape), which uses grouped-query attention with 80 layers, 8 KV heads, and a head dimension of 128:

2 × 80 layers × 8 kv_heads × 128 head_dim × 2 bytes (FP16)
= 327,680 bytes ≈ 320 KB per token

At 128K context that's 320 KB × 131,072 ≈ 40 GB for one sequence. On an 80 GB H100 hosting a 70B model (~140 GB in FP16, so already sharded across GPUs), the KV cache — not the weights — decides how many users you can serve concurrently and how long their conversations can get.
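
The arithmetic above is easy to script when you want to try another model shape or dtype. A minimal sketch in plain Python; the function names are illustrative and not part of any library, and the INT4 row ignores the extra bytes that scales and zero-points would need:

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache one token occupies across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element  # 2 = K and V


def kv_gib_per_sequence(context_len, **shape):
    """KV cache for one full sequence, in GiB (2**30 bytes)."""
    return kv_bytes_per_token(**shape) * context_len / 2**30


llama3_70b = dict(num_layers=80, num_kv_heads=8, head_dim=128)

for name, nbytes in [("fp16", 2), ("fp8", 1), ("int4", 0.5)]:
    per_token = kv_bytes_per_token(bytes_per_element=nbytes, **llama3_70b)
    per_seq = kv_gib_per_sequence(131072, bytes_per_element=nbytes, **llama3_70b)
    print(f"{name}: {per_token / 1024:.0f} KiB/token, {per_seq:.0f} GiB at 128K")
# fp16: 320 KiB/token, 40 GiB at 128K
# fp8: 160 KiB/token, 20 GiB at 128K
# int4: 80 KiB/token, 10 GiB at 128K
```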

GQA already cut num_kv_heads from 64 down to 8. RoPE scaling stretched the context window. The one remaining lever that doesn't require touching the model architecture is bytes_per_element. That's what KV cache quantization attacks.

What does FP8 KV cache quantization actually change?

It stores each key and value element in 8 bits instead of 16, halving bytes_per_element from 2 to 1. The attention math still runs in higher precision — you dequantize K and V on the fly inside the attention kernel — so this is a storage format, not a compute format. That distinction matters: you're not doing FP8 matmuls (though you can), you're just paying half the memory bandwidth and half the capacity to hold the cache.

FP8 comes in two flavors and the choice is not cosmetic:

  • E4M3 — 4 exponent bits, 3 mantissa bits. More precision, less range. This is the right default for KV cache. Keys and values are already bounded activations, so you want mantissa bits, not dynamic range.
  • E5M2 — 5 exponent, 2 mantissa. Wider range, coarser steps. Designed for gradients, not activations. Using it for the KV cache throws away precision you needed.
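
To see the precision-versus-range trade in numbers, here is a small simulator that rounds a Python float to the nearest value each format can represent. It is a sketch for intuition, not a bit-exact codec: it assumes the common E4M3 variant with a maximum finite value of 448, E5M2 with a maximum of 57344, and saturation instead of overflow. The helper names are made up for this example.

```python
import math


def round_to_fp8(x, mantissa_bits, min_normal_exp, max_value):
    """Round x to the nearest value of a small float format, saturating at max_value."""
    if x == 0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), max_value)            # saturate instead of overflowing
    exp = math.frexp(a)[1] - 1            # floor(log2(a))
    exp = max(exp, min_normal_exp)        # below this, subnormals share one step size
    step = 2.0 ** (exp - mantissa_bits)   # spacing between neighbours at this exponent
    return sign * min(round(a / step) * step, max_value)


def e4m3(x):
    return round_to_fp8(x, mantissa_bits=3, min_normal_exp=-6, max_value=448.0)


def e5m2(x):
    return round_to_fp8(x, mantissa_bits=2, min_normal_exp=-14, max_value=57344.0)


for x in [1.1, 3.3, 1000.0]:
    print(x, e4m3(x), e5m2(x))
# 1.1 1.125 1.0
# 3.3 3.25 3.5
# 1000.0 448.0 1024.0
```

For values in the range activations usually occupy, E4M3 lands closer to the input. E5M2 only wins once the input exceeds 448, which is the case a scaling factor is meant to prevent.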

In vLLM this is a single knob:

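from vllm import LLM

llm = LLM(
    # Llama 3.1 is the 128K-context release; the original Llama 3 70B stops at 8K,
    # so max_model_len=131072 would normally be rejected for it.
    model="meta-llama/Llama-3.1-70B-Instruct",
    kv_cache_dtype="fp8",          # maps to fp8_e4m3 on Hopper/Ada
    # optional: load per-tensor scaling factors calibrated offline.
    # Accepted by older vLLM releases; newer ones expect the scales inside the
    # checkpoint, so check the docs for your version.
    quantization_param_path="kv_scales.json",
    tensor_parallel_size=4,        # assumption: shard the 70B weights across 4 GPUs
    max_model_len=131072,
)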

Without a calibration file, vLLM uses a default scaling factor of 1.0, which works because E4M3 covers the typical activation range well enough. With calibration you compute per-tensor (or better, per-head) scaling factors from a representative dataset, and you claw back most of the residual error. On Hopper and Ada GPUs the FP8 path also lets the attention kernel read the cache at FP8 bandwidth, so you often get a small latency win on top of the capacity win — decoding is memory-bound, and you just halved the bytes moved per step.
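
As a rough illustration of what a calibration pass computes, the sketch below derives a scale from the largest magnitude seen in sample activations, once per tensor and once per head. This is one common "amax" convention and an assumption on my part; it is not vLLM's calibration tooling, and the sample values are invented.

```python
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value


def amax_scale(samples, fp8_max=FP8_E4M3_MAX):
    """Scale chosen so the largest observed magnitude maps onto fp8_max."""
    amax = max(abs(v) for v in samples)
    return amax / fp8_max if amax > 0 else 1.0


def per_head_scales(samples_by_head):
    """One scale per KV head: {head_index: [observed values]} -> {head_index: scale}."""
    return {head: amax_scale(vals) for head, vals in samples_by_head.items()}


# Hypothetical calibration activations for two KV heads.
observed = {0: [0.5, -3.0, 2.25], 1: [-896.0, 12.0, 40.0]}
scales = per_head_scales(observed)

# Storage convention assumed here: stored = x / scale, recovered = stored * scale.
print(scales[1])  # 2.0, so the -896.0 outlier is stored as -448.0 instead of saturating
```

A single per-tensor scale would have to be 2.0 for both heads. Per-head scales let head 0 use the format's range independently of head 1's outlier.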

Why does INT4 KV cache need special handling when FP8 mostly doesn't?

Because 4 bits gives you 16 levels, and a single per-tensor scale can't span the range of a KV tensor without crushing the small values to zero. FP8's exponent bits give it a floating scale for free; INT4 is uniform, so every design decision is about where you put the scale.

Two techniques define the practical space:

  • KIVI quantizes keys per-channel and values per-token. This asymmetry is the whole trick. The key tensor has a few channels with large-magnitude outliers (an artifact of how RoPE and the attention pattern concentrate energy), and a per-tensor scale would let those outliers dominate the quantization grid. Scaling each channel independently isolates them. Values don't have that structure, so per-token scaling is cheaper and sufficient.
  • KVQuant pushes further with non-uniform datatypes and outlier isolation, targeting near-lossless INT4 or even lower, at the cost of a calibration step and a custom kernel.
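
The effect of where the scale lives can be reproduced with a toy experiment. The sketch below is not KIVI or KVQuant. It is a plain asymmetric uniform INT4 quantizer applied to synthetic keys with one outlier channel, using one scale per tensor, per token, or per channel. The data and all names are assumptions made for illustration.

```python
import random


def quantize_group(xs, bits=4):
    """Asymmetric uniform quantization of one group; returns dequantized values."""
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / (2**bits - 1) or 1.0   # 16 levels for 4 bits
    return [lo + round((x - lo) / scale) * scale for x in xs]


def quantize_matrix(m, group_by):
    """Quantize a [tokens][channels] matrix with one scale per tensor, token, or channel."""
    rows, cols = len(m), len(m[0])
    if group_by == "tensor":
        flat = quantize_group([v for row in m for v in row])
        return [flat[r * cols:(r + 1) * cols] for r in range(rows)]
    if group_by == "token":
        return [quantize_group(row) for row in m]
    if group_by == "channel":
        columns = [quantize_group([m[r][c] for r in range(rows)]) for c in range(cols)]
        return [[columns[c][r] for c in range(cols)] for r in range(rows)]
    raise ValueError(group_by)


def mean_abs_error(a, b):
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


rng = random.Random(0)
# Synthetic keys: 64 tokens x 8 channels, channel 0 is the outlier channel.
keys = [[rng.uniform(-20, 20) if c == 0 else rng.uniform(-0.5, 0.5) for c in range(8)]
        for _ in range(64)]

for mode in ("tensor", "token", "channel"):
    print(mode, round(mean_abs_error(keys, quantize_matrix(keys, mode)), 4))
```

With a per-tensor scale, the outlier channel sets the grid spacing for every element, so the seven small channels are rounded on a grid far coarser than their own range. Per-channel scaling confines that coarse grid to the outlier channel alone.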

The reason you can't just set something like kv_cache_dtype="int4" and walk away is that the naive version (per-tensor uniform INT4) visibly degrades. You'll see it as garbled long-range recall while short prompts still look fine, which is exactly the kind of bug that passes a smoke test and fails in production.

Where does the accuracy cliff actually show up?

On long-context retrieval and multi-step reasoning — not on perplexity, and not on short chat. This is the single most important thing to internalize, because it determines how you test.

Quantization error in the KV cache is per-token noise. On a short prompt there aren't many tokens, so the noise is small and averages out. As context grows, two things compound: there are more noisy keys competing in the softmax, and the tasks that use long context (find the one relevant fact in 100K tokens, chain three retrieved facts together) are precisely the ones where a slightly-wrong attention score flips the answer. A model that's fine at 4K can start missing needles at 64K under aggressive KV quantization.
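
A toy model makes the "more noisy keys competing" argument concrete. It assumes one relevant key whose attention score beats the distractors by a fixed margin, adds independent Gaussian noise to every score as a stand-in for quantization error, and counts how often the relevant key still has the highest score. The margin, the noise level, and the Gaussian assumption are arbitrary choices for illustration, not measurements of any model.

```python
import random


def needle_stays_top(n_distractors, margin, noise_std, rng):
    """True if the needle's noisy score still beats every noisy distractor score."""
    needle = margin + rng.gauss(0.0, noise_std)
    return all(rng.gauss(0.0, noise_std) < needle for _ in range(n_distractors))


def top1_rate(n_distractors, margin=2.0, noise_std=0.5, trials=200, seed=0):
    """Fraction of trials in which the needle keeps the highest score."""
    rng = random.Random(seed)
    hits = sum(needle_stays_top(n_distractors, margin, noise_std, rng) for _ in range(trials))
    return hits / trials


for n in (64, 1024, 4096):
    print(n, top1_rate(n))
```

The per-key noise is identical at every context length. What changes is the number of chances for some distractor to land above the needle, which is why the rate falls as the context grows.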

So evaluate the way the failure manifests:

# Don't gate on perplexity — it barely moves under FP8 KV.
# Gate on tasks that stress long-range attention:
#   1. Needle-in-a-haystack at your MAX context length, multiple depths
#   2. Multi-hop QA where the answer requires two+ retrieved spans
#   3. Long-output code/reasoning where early tokens must stay consistent
#
# Compare FP16 KV vs FP8 vs INT4 on the SAME prompts and diff the outputs.

Perplexity is a trap here: it's dominated by easy, high-frequency tokens and will tell you FP8 and even INT4 are "basically lossless" while your retrieval accuracy has dropped several points at long context.
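
A needle-in-a-haystack gate does not need a framework to get started. The sketch below builds prompts with the needle at several depths and compares two sets of generations on identical prompts. The helper names, the filler text, and the needle are all made up, and generation itself is left to whatever serving stack you use.

```python
def build_needle_prompt(filler_sentence, needle, question, n_sentences, depth):
    """Place `needle` at fractional `depth` (0.0 = start, 1.0 = end) of a filler haystack."""
    position = round(depth * n_sentences)
    sentences = [filler_sentence] * n_sentences
    sentences.insert(position, needle)
    return " ".join(sentences) + "\n\n" + question


def retrieval_accuracy(outputs, expected):
    """Fraction of outputs that contain the expected answer string."""
    return sum(expected in out for out in outputs) / len(outputs)


def compare_runs(baseline_outputs, quantized_outputs, expected):
    """Accuracy of each run plus the indices where only the quantized run failed."""
    regressions = [i for i, (b, q) in enumerate(zip(baseline_outputs, quantized_outputs))
                   if expected in b and expected not in q]
    return (retrieval_accuracy(baseline_outputs, expected),
            retrieval_accuracy(quantized_outputs, expected),
            regressions)


depths = [0.0, 0.25, 0.5, 0.75, 1.0]
prompts = [build_needle_prompt("The sky was grey that day.", "The vault code is 7413.",
                               "What is the vault code?", n_sentences=1000, depth=d)
           for d in depths]
# Generate with each KV dtype using your own serving stack, then:
# base_acc, quant_acc, regressions = compare_runs(fp16_outputs, fp8_outputs, "7413")
```

Scale n_sentences until the prompt reaches your real maximum context length. The regression indices tell you which depths broke under quantization, which is more useful than a single aggregate score.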

FP8 or INT4 — which should you actually ship?

Ship FP8 E4M3 unless you have measured a specific capacity wall that INT4 solves. FP8 is the default recommendation because the cost/benefit is lopsided: one flag, half the KV memory, near-zero accuracy loss on most workloads, and a kernel path that's well supported on current hardware. You roughly double the number of concurrent long-context sessions per GPU for almost nothing.
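
The "roughly double" claim follows directly from the per-token formula. Here is a back-of-the-envelope sketch, assuming a hypothetical KV budget of 160 GiB summed across the serving GPUs and ignoring paging fragmentation and scale storage:

```python
def max_sessions(kv_budget_gib, context_len, bytes_per_element,
                 num_layers=80, num_kv_heads=8, head_dim=128):
    """How many full-length sequences fit in a given KV cache budget."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    per_sequence = per_token * context_len
    return int(kv_budget_gib * 2**30 // per_sequence)


# Hypothetical budget: 160 GiB of KV cache, 32K-token sessions.
for name, nbytes in [("fp16", 2), ("fp8", 1)]:
    print(name, max_sessions(160, context_len=32768, bytes_per_element=nbytes))
# fp16 16
# fp8 32
```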

Reach for INT4 (KIVI/KVQuant-style, with per-channel key scaling) only when FP8 still isn't enough — for example, serving very long contexts on constrained GPUs, or maximizing batch size on a memory-bound deployment. Budget for it: you'll need a calibration pass, a kernel that supports the asymmetric scaling, and a real long-context eval before and after. INT4 that skips the per-channel key handling is not a smaller-but-fine version of FP8; it's a different accuracy regime.

A few operational notes that bite people:

  • Keys are the hard part. If you're debugging INT4 degradation, quantize values to INT4 but keep keys at FP8 or INT8 as an experiment — if accuracy recovers, your key scaling is the problem.
  • Calibrate on in-distribution data. KV activation ranges shift with domain; scales tuned on generic web text can be wrong for code or a specialized RAG corpus.
  • Watch the interaction with prefix caching. Quantized KV blocks are still cacheable, but if you change kv_cache_dtype, you invalidate cached blocks — don't flip it under a warm cache and expect hits.

Bottom line: how much memory does FP8 KV cache save and what does it cost?

FP8 E4M3 KV cache quantization halves your KV cache memory — from ~320 KB to ~160 KB per token on a GQA-8 70B model, turning a 40 GB single-sequence 128K cache into ~20 GB — for a near-zero accuracy hit on typical workloads, enabled in vLLM with the single flag kv_cache_dtype="fp8". INT4 quarters the memory but only survives if you use per-channel scaling for keys and per-token scaling for values (KIVI/KVQuant), because keys carry channel outliers that uniform quantization destroys. The cost of both is paid on long-context retrieval and multi-hop reasoning, not on perplexity — so gate your rollout on needle-in-a-haystack and multi-hop QA at your real maximum context length, comparing quantized against FP16 on identical prompts. Start with FP8, measure, and only descend to INT4 when a specific capacity wall forces it.
