DEV Community

Tech_Nuggets

KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break

You just deployed a 70B Llama fine-tune on 8x H100s, and your serving box happily handles 200 concurrent 8k contexts. Then product says "can you do 32k?" and suddenly the math stops working. With BF16, the KV cache alone for a 70B Llama-3 at 32k context is roughly 2 (K and V) × 80 layers × 8 KV heads × 32768 tokens × 128 head_dim × 2 bytes ≈ 10.7 GB per request. Two hundred of those is about 2.1 TB against 640 GB of HBM, and the H100s are paging to CPU. The model itself fits; the attention state doesn't. This is the problem KV cache quantization is built for, and it's the natural follow-up to last week's piece on speculative decoding, because the two features interact in ways that don't always show up in vendor benchmarks.
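To make the arithmetic above easy to rerun for other shapes, here is a minimal sketch of the size formula. The function name is mine, and the shape values are the ones quoted in the paragraph (80 layers, 8 KV heads, head_dim 128); swap in your own model's config.

```python
def kv_cache_bytes(layers: int, kv_heads: int, seq_len: int,
                   head_dim: int, bytes_per_elem: float) -> float:
    """Bytes of KV cache for one request: K and V, every layer, every token."""
    return 2 * layers * kv_heads * seq_len * head_dim * bytes_per_elem


# Shape quoted in the text for a 70B Llama-3: 80 layers, 8 KV heads, head_dim 128.
bf16_32k = kv_cache_bytes(80, 8, 32_768, 128, 2)   # 2 bytes per BF16 element
fp8_32k = kv_cache_bytes(80, 8, 32_768, 128, 1)    # 1 byte per FP8/INT8 element

print(f"BF16 @ 32k: {bf16_32k / 1e9:.1f} GB per request")
print(f"FP8  @ 32k: {fp8_32k / 1e9:.1f} GB per request")
print(f"200 BF16 requests: {200 * bf16_32k / 1e12:.2f} TB")
```

The formula ignores the scale tensors that quantized formats carry, which are small next to the cache itself.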

Here's how it works, what the formats are, and where the footguns hide.

Why this matters in practice

The KV cache is the largest dynamic piece of memory in a serving LLM. The model weights are fixed at load time. The activations get freed after each forward pass. The KV cache grows with batch_size × seq_len and stays allocated until the request ends. On a long-context workload, it dominates.

KV cache quantization trades a small amount of representational precision for a 2x or 4x reduction in cache footprint, with no model-weight change. FP8 and INT8 give ~50% of the BF16 footprint. INT4 (KIVI, KVQuant, ZipCache-style) gives 25%. The question is what that compression costs in output quality, in serving complexity, and — the part most blog posts skip — in compatibility with the other serving features you already turned on.

The economic case is straightforward. On a 70B at 32k, staying on BF16 costs about 5.4 GB more HBM per request than an 8-bit cache. That works out to roughly one extra H100 for every 10 to 15 concurrent users, or about half as many concurrent users per box. The quality cost of FP8 KV cache, measured on the standard long-context benchmarks, is typically under 0.5 percentage points on retrieval-heavy tasks. That's up to a 50% saving on KV cache memory for a sub-half-point accuracy loss. The trade is favorable; the engineering is not free.
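A rough capacity check makes the trade concrete. This is a back-of-the-envelope sketch, not a scheduler model: it assumes 8 H100s at 80 GB each, 140 GB of BF16 weights, a 0.92 memory utilization cap, and it ignores activations and scale tensors. The function and constant names are hypothetical.

```python
def max_concurrent_requests(total_hbm_gb: float, weights_gb: float,
                            kv_gb_per_request: float,
                            utilization: float = 0.92) -> int:
    """Requests that fit in the HBM left over after weights (activations ignored)."""
    kv_budget_gb = total_hbm_gb * utilization - weights_gb
    return max(0, int(kv_budget_gb // kv_gb_per_request))


KV_BF16_32K_GB = 10.74          # per request at 32k, from the size formula
TOTAL_HBM_GB = 8 * 80           # assumed: 8x H100 80 GB
WEIGHTS_GB = 140                # 70B parameters in BF16

bf16_users = max_concurrent_requests(TOTAL_HBM_GB, WEIGHTS_GB, KV_BF16_32K_GB)
fp8_users = max_concurrent_requests(TOTAL_HBM_GB, WEIGHTS_GB, KV_BF16_32K_GB / 2)

print(f"BF16 KV cache: {bf16_users} concurrent 32k requests")
print(f"FP8 KV cache:  {fp8_users} concurrent 32k requests")
```

Because the weights are a fixed cost, halving the per-request cache slightly more than doubles the number of requests that fit.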

What KV cache quantization actually is

Standard BF16 attention stores the K and V tensors at full precision. At every attention step, the model reads every past K and V. Quantization compresses these stored tensors using a lower-precision format, with a dequantization step fused into the attention kernel right before the matmul.

The pipeline looks like this:

flowchart LR
    A[New token<br/>embedding] --> B[Project to Kt Vt<br/>BF16, in registers]
    B --> C[Quantize Kt Vt<br/>per-token / per-head]
    C --> D[Store in<br/>KV cache: FP8/INT8]
    D --> E[On next step:<br/>load cached K and V]
    E --> F[Dequantize on-the-fly<br/>inside attention kernel]
    F --> G[Attention matmul<br/>BF16, full precision]
    G --> H[Output projection]

Three things to notice. First, new K and V values are quantized only at storage time, so the full BF16 values are available for the scale calculation. Second, the attention matmul still happens in BF16 or FP16: you save memory capacity and bandwidth, not FLOPs. Third, the per-token or per-head scales are stored alongside the cache in BF16. They are small next to the cache itself (one 2-byte scale per 128-element head vector adds under 2% to an 8-bit cache), and they are what makes the rest of the math work.
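The write and read paths described above can be sketched in a few lines. This is a plain-Python illustration of symmetric per-token INT8 for a single head vector, not the fused kernel a serving stack actually runs; the function names are hypothetical and the input is random data.

```python
import random


def quantize_per_token(vec, qmax=127):
    """Symmetric INT8: one scale per token vector, computed from full-precision values."""
    amax = max(abs(x) for x in vec)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in vec]
    return codes, scale


def dequantize(codes, scale):
    """Read path: restore approximate values right before the attention matmul."""
    return [c * scale for c in codes]


random.seed(0)
head_dim = 128
k_t = [random.gauss(0.0, 1.0) for _ in range(head_dim)]   # new token's K for one head

q, scale = quantize_per_token(k_t)   # write path: int8 codes plus one scale
k_hat = dequantize(q, scale)         # read path: dequantize on the fly

max_err = max(abs(a - b) for a, b in zip(k_t, k_hat))
print(f"stored: {head_dim} one-byte codes + one scale (BF16 would be {2 * head_dim} bytes)")
print(f"max abs error: {max_err:.4f} (rounding bound is scale/2 = {scale / 2:.4f})")
```

The error per element is bounded by half the scale, which is why the scale granularity matters so much: one large value in the group inflates the step size for everything that shares its scale.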

The formats you'll actually see

Five formats dominate production serving stacks in 2026. The list is in roughly the order they were adopted.

| Format | Bits | Granularity | Hardware support | Used by |
| --- | --- | --- | --- | --- |
| BF16 (baseline) | 16 | n/a | Native on Ampere+ | Everything |
| FP8 E4M3 | 8 | Per-tensor, per-head, or per-token | H100, H200, B100, B200, MI300X | vLLM, TRT-LLM, SGLang |
| FP8 E5M2 | 8 | Same as above | Same as above | Less common for KV; wider dynamic range |
| INT8 (per-token) | 8 | Per-token, asymmetric | Universal via Triton/CUDA | vLLM, TGI, llama.cpp |
| INT4 (KVQuant / KIVI / ZipCache) | 4 | Mixed: K per-channel, V per-token | Universal | Research, llama.cpp (some targets) |

A few notes on the table:

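  • FP8 E4M3 vs E5M2. E4M3 has more precision and less range (3 mantissa bits, max finite value 448); E5M2 has more range and less precision (2 mantissa bits, max finite value 57344). For KV cache, E4M3 dominates because K and V values, once divided by their scale, fit comfortably inside its range, so the extra mantissa bit is worth more than the extra exponent bit. E5M2 was originally intended for gradients.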
  • INT8 per-token asymmetric. The workhorse format. Each token's K and V get their own (scale, zero_point) pair. Per-channel (one scale per head_dim slice) is faster on hardware but slightly less accurate. Per-tensor (one scale for the whole cache) is cheapest and loses the most.
  • Mixed-precision 4-bit (KVQuant, KIVI, ZipCache). Quantize K per-channel (where outliers live) and V per-token, getting 4-bit storage with much smaller accuracy loss than naive INT4. vLLM doesn't ship 4-bit KV as of v0.22.1; llama.cpp supports it on CPU and some Apple Silicon paths.
  • NVFP4 (E2M1 + block scales). A separate format for weights that landed in vLLM v0.22.0 (DeepSeek V4's NVFP4 fused MoE). Not a KV cache format — different scaling, different code path.
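To see why the granularity choice matters, and why the mixed schemes quantize K per-channel, here is a small synthetic comparison. It is a sketch under loud assumptions: random Gaussian data, one artificially inflated channel standing in for a K outlier channel, and symmetric 8-bit rounding rather than the asymmetric variant described above. None of the names come from a real library.

```python
import random


def quant_dequant(values, qmax=127):
    """Symmetric quantize-then-dequantize of one group that shares a single scale."""
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(v / scale) * scale for v in values]


def mse(a, b):
    """Mean squared error between two equally shaped lists of rows."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


random.seed(0)
tokens, channels, outlier = 256, 64, 3
K = [[random.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(tokens)]
for row in K:
    row[outlier] *= 50.0   # synthetic assumption: one channel carries large values

# Per-tensor: one scale for the whole matrix.
flat = quant_dequant([v for row in K for v in row])
per_tensor = [flat[t * channels:(t + 1) * channels] for t in range(tokens)]

# Per-token: one scale per row.
per_token = [quant_dequant(row) for row in K]

# Per-channel: one scale per column, then transpose back to rows.
cols = [quant_dequant([K[t][c] for t in range(tokens)]) for c in range(channels)]
per_channel = [[cols[c][t] for c in range(channels)] for t in range(tokens)]

for name, approx in [("per-tensor", per_tensor),
                     ("per-token", per_token),
                     ("per-channel", per_channel)]:
    print(f"{name:12s} MSE = {mse(K, approx):.5f}")
```

On this toy input the outlier channel inflates every per-token scale, while per-channel scaling confines the damage to the one channel that caused it. Real K and V tensors are messier, so treat the ordering as an illustration rather than a benchmark.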

How a vLLM deploy uses it

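The CLI flag is --kv-cache-dtype. In vLLM v0.22.1, accepted values are auto, fp8 (E4M3), fp8_e5m2, int8, and bf16. The default is auto, which follows the model's dtype (BF16 for most checkpoints) unless the model is detected as FP8-native. The accepted set has changed between releases, so confirm it with vllm serve --help on the version you deploy. For an OpenAI-compatible serve: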

vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92

For programmatic use:

from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=8,
    kv_cache_dtype="fp8",
    max_model_len=32768,
)

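Which kernel serves the FP8 path depends on the GPU generation and the attention backend vLLM selects at startup (FlashAttention-3's FP8 kernels, for example, target Hopper), so check the startup logs rather than assuming. On Ampere (A100) there's no native FP8 tensor core, and the FP8 flag is a no-op or a slow path. Ada Lovelace cards such as the RTX 4090 do have FP8 tensor cores, but backend support for them is less consistent than on Hopper. INT8, by contrast, runs everywhere via Triton.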

One production detail: --kv-cache-dtype fp8 on an H100 reduces KV cache memory by ~50% but does not reduce the model's weight footprint. The 70B in BF16 is still 140 GB. The savings are real but bounded by the cache-to-weight ratio of your workload — long-context, high-concurrency workloads benefit most.
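The "bounded by the cache-to-weight ratio" point is easy to quantify. The sketch below assumes 140 GB of weights and two made-up workloads (8 users at 2k context, 40 users at 32k); the function name and workload numbers are illustrative only.

```python
def total_hbm_saving(weights_gb: float, kv_gb_per_request: float,
                     concurrent: int, kv_ratio: float = 0.5) -> float:
    """Fraction of (weights + KV cache) memory saved when the cache shrinks to kv_ratio."""
    kv_gb = kv_gb_per_request * concurrent
    before = weights_gb + kv_gb
    after = weights_gb + kv_gb * kv_ratio
    return 1.0 - after / before


WEIGHTS_GB = 140.0                    # 70B in BF16
KV_GB_PER_TOKEN = 10.74 / 32_768      # BF16 cache cost per token, from the 32k figure

short_ctx = total_hbm_saving(WEIGHTS_GB, KV_GB_PER_TOKEN * 2_048, concurrent=8)
long_ctx = total_hbm_saving(WEIGHTS_GB, KV_GB_PER_TOKEN * 32_768, concurrent=40)

print(f"2k context, 8 users:   {short_ctx:.1%} of total memory saved")
print(f"32k context, 40 users: {long_ctx:.1%} of total memory saved")
```

The saving approaches 50% only as the cache comes to dominate the weights, which is the long-context, high-concurrency regime.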

How it interacts with speculative decoding

This is the silent footgun. Last week's post on speculative decoding described the acceptance probability r = min(1, M_p(x) / M_q(x)) and the speedup formula in terms of μ, the mean accepted tokens per cycle. KV cache quantization breaks the implicit assumption underneath: that the target model's logit at the proposal position is computed at the same numerical precision as the draft model's.

The mechanism:

  1. The draft model proposes a token x_t using its own KV cache (draft cache, typically BF16).
  2. The target model does one forward pass over K+1 positions to score all proposals. The target reads from its quantized KV cache, dequantizes on the fly, and runs attention in BF16.
  3. The acceptance check M_p(x_t) vs M_q(x_t) is still computed — but M_p is now using K and V values rounded to FP8 or INT8.
  4. The acceptance probability is still mathematically well-defined, but the target's distribution has shifted slightly relative to the BF16 baseline. This shift changes the empirical μ.
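The steps above can be sketched numerically. For a proposal drawn from the draft distribution M_q, the expected value of r = min(1, M_p(x) / M_q(x)) equals the sum over the vocabulary of min(M_p(x), M_q(x)). The code below computes that quantity for a toy vocabulary. The Gaussian noise added to the target logits is purely a hypothetical stand-in for quantization error, which in practice is structured and depends on context.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def expected_acceptance(p, q):
    """E over x ~ q of min(1, p(x)/q(x)), which simplifies to sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


random.seed(0)
vocab = 1000
target_logits = [random.gauss(0.0, 3.0) for _ in range(vocab)]
# Assumed: the draft model approximates the BF16 target with some error.
draft_logits = [l + random.gauss(0.0, 0.5) for l in target_logits]
# Hypothetical: a quantized KV cache perturbs the target's logits slightly.
quant_logits = [l + random.gauss(0.0, 0.1) for l in target_logits]

q_draft = softmax(draft_logits)
a_base = expected_acceptance(softmax(target_logits), q_draft)
a_quant = expected_acceptance(softmax(quant_logits), q_draft)

print(f"expected acceptance, BF16 target:      {a_base:.4f}")
print(f"expected acceptance, perturbed target: {a_quant:.4f}")
```

The check stays valid either way; what moves is how often it passes, which is why the empirical μ has to be re-measured on real traffic rather than inferred from a toy like this one.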

The magnitude depends on the format and context length. From community benchmarks and published work on spec-decoding with quantized caches, mean accepted tokens per cycle typically drops 0.3–0.8 for FP8 E4M3 and 0.5–1.5 for INT8 per-token. That sounds small until you remember the speedup curve has a knee around μ = 4. A drop from 4.5 to 3.5 can wipe out 20–30% of the speedup you thought you had.
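A quick worked example shows where the 20–30% figure can come from. The cost model here is an assumption, not the formula from the earlier post: each cycle runs k draft passes (each costing a fixed fraction of a target pass) plus one target pass, and yields μ tokens on average. The values of k and the draft cost are illustrative.

```python
def speedup(mu: float, k: int, draft_cost: float) -> float:
    """Tokens per cycle divided by cycle cost, in units of one target forward pass.

    Assumed model: k draft passes (each `draft_cost` of a target pass) plus one
    target pass per cycle, yielding mu tokens on average.
    """
    return mu / (1.0 + k * draft_cost)


K_SPEC, DRAFT_COST = 5, 0.05          # illustrative values, not measurements

before = speedup(4.5, K_SPEC, DRAFT_COST)
after = speedup(3.5, K_SPEC, DRAFT_COST)

lost_speedup = 1.0 - after / before               # share of the speedup factor lost
lost_gain = (before - after) / (before - 1.0)     # share of the gain over no speculation

print(f"speedup before: {before:.2f}x, after: {after:.2f}x")
print(f"speedup factor lost: {lost_speedup:.1%}")
print(f"gain over baseline lost: {lost_gain:.1%}")
```

Under this model the speedup factor drops by about 22% regardless of k and draft cost, and the gain over plain decoding drops by about 31% for these particular values.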

The vLLM v0.18.0 release notes called this out for one specific case: degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#37618). The lesson generalizes: when stacking serving optimizations, each one shifts the optimal settings of the others. Speculative decoding was tuned assuming BF16 attention. Re-tune num_speculative_tokens and re-measure μ after turning on --kv-cache-dtype fp8.

Common pitfalls

  • "FP8" without specifying E4M3 vs E5M2. Different backends default differently. TRT-LLM often defaults to E5M2 for KV; vLLM to E4M3. They give different accuracy profiles. Pin the variant explicitly in your deploy config.
  • Assuming the savings apply to weights too. They don't. --kv-cache-dtype fp8 only changes the attention state. To compress the model, you need a separate quantization step (GPTQ, AWQ, FP8 weights) with its own quality/throughput tradeoffs.
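  • Pre-Hopper GPUs. A100 (Ampere) has no native FP8 tensor cores. RTX 4090 (Ada Lovelace) does have them, but FP8 attention backends are mostly built and tuned for Hopper and later, so support varies. On either card the flag may be a slow path, a no-op, or (depending on the backend) a silent fallback to BF16. Check that the path is actually executing.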
  • Quantization-aware eval set. Quality loss from KV cache quantization is concentrated in long-context retrieval and counting tasks. If your eval set is GSM8K + MMLU, you'll see no difference. If it's Needle-in-a-Haystack at 32k+, you will.
  • Interaction with prefix caching. If you share a KV cache prefix across requests (a common RAG and chat-template trick), the cached prefix lives at the precision it was written at. Mixing FP8 and BF16 prefixes in the same engine is generally not supported — pick one and stick to it.
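  • Forgetting to measure end-to-end throughput, not just memory. In memory-bandwidth-bound decode, FP8 shrinks the bytes read per step and lets more requests share a batch, so expect less queueing; how much of that shows up as tokens per second depends on the dequantization overhead in the kernel. In compute-bound phases such as long prefills, FP8 KV buys capacity but not speed, because the matmuls still run in BF16.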
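For the eval-set pitfall in the list above, a long-context retrieval probe is cheap to build. The sketch below generates a needle-in-a-haystack prompt and a trivial scorer; the helper names, filler text, and passphrase are all hypothetical, and calling the model is left to whatever client you already use. Run the same prompts against a BF16 and a quantized deployment and compare hit rates.

```python
import random


def build_needle_prompt(n_filler: int, depth: float, secret: str, seed: int = 0):
    """Return (prompt, expected_answer) with one fact buried at relative `depth` in [0, 1]."""
    rng = random.Random(seed)
    filler = [
        f"Note {i}: the value of item {rng.randint(1000, 9999)} is unremarkable."
        for i in range(n_filler)
    ]
    needle = f"The secret passphrase is {secret}."
    pos = int(depth * n_filler)
    lines = filler[:pos] + [needle] + filler[pos:]
    question = "What is the secret passphrase? Answer with the passphrase only."
    return "\n".join(lines) + "\n\n" + question, secret


def score(model_output: str, expected: str) -> bool:
    """Case-insensitive substring match; crude, but enough to compare two deployments."""
    return expected.lower() in model_output.lower()


prompt, answer = build_needle_prompt(n_filler=2000, depth=0.35, secret="amber-falcon-42")
print(f"prompt length: {len(prompt)} characters, needle at 35% depth")
```

Sweep both depth and n_filler: the losses described in the pitfall tend to show up at long contexts, so a single short prompt will not tell you much.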

When NOT to use it

KV cache quantization is the wrong choice if:

  • You're on pre-Hopper GPUs and don't have a Triton-fused INT8 kernel path. The flag will be a no-op or a slow simulation. Don't enable it for the sake of consistency across clusters.
  • Your workload is short-context. If your median request is under 2k tokens, the KV cache isn't your bottleneck — activations, weights, and prefill compute are. Quantizing the cache won't move the needle.
  • You're stacking speculative decoding with a draft-target pair that's already on the edge of acceptance. If your measured μ is already low in BF16 (say, below 3.0), the additional 0.3–0.8 drop in μ from FP8 (more for INT8) eats a large share of the remaining speedup and, for pairs close to break-even, can turn the algorithm into a net loss. Measure first, then enable.
  • You're under a hard accuracy SLO that you can't re-validate. If your domain (medical, legal, financial) requires sub-0.1% regression, FP8 KV cache is not a switch you flip. It needs a per-deployment accuracy validation, not just a benchmark check.
  • Your model has heavy head-specific outliers. Some architectures (certain MoE routers, MLA with strong outlier channels) put a lot of magnitude in a few K/V values per head. Per-tensor and per-head quantization collapse badly here. Finer-grained scales (per-token, or per-channel for K as in the KIVI-style schemes above) are mandatory.

TL;DR

  • KV cache quantization compresses the per-request K and V tensors to FP8 or INT8, with dequantization fused into the attention kernel. The compute stays in BF16; the storage and memory bandwidth shrink.
  • The cache size scales as 2 × layers × kv_heads × seq_len × head_dim × bytes. For a 70B Llama-3 at 32k BF16, that's ~10.7 GB per request. FP8 or INT8 halves it; 4-bit schemes quarter it.
  • In vLLM v0.22.1, set --kv-cache-dtype fp8 or int8. FP8 is H100/H200/B100/B200/MI300X only; INT8 runs everywhere via Triton.
  • The quality cost is usually under 0.5 points on long-context retrieval benchmarks, but the loss is concentrated — short-context evals hide it.
  • The speculative-decoding interaction is the silent footgun: FP8/INT8 caches shift the target model's logit distribution, which can drop the mean accepted tokens per cycle by 0.3–1.5. Re-tune num_speculative_tokens after enabling it.
  • Don't enable it on pre-Hopper GPUs without a Triton path, on short-context workloads, on top of a draft/target pair already at low acceptance rate, or under a hard accuracy SLO that hasn't been re-validated for the specific deployment.

Next post: prefix caching at scale — when it saves you 80% of prefill cost, and the eviction policies that quietly turn it into a 5% saving in production.
