The KV cache quantization results for Gemma 4 and Qwen 3.6 are worth unpacking a bit more, because the KL divergence framing only tells part of the story.
KL divergence measures how far the next-token distribution drifts from the unquantized model's, but in practice the degradation at q4_0 KV cache tends to show up unevenly: it's relatively benign on broad factual recall and much more pronounced on tasks that need precise numerical reasoning or long-range dependency tracking. The reason is that q4_0 introduces quantization noise that's roughly uniform across token positions, but the attention mechanism amplifies that noise on keys with high cosine similarity to the current query (near-retrieval events). So a model can pass a perplexity benchmark while still failing on structured outputs that require attending back to an exact numerical value early in the context.
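As a toy illustration of one way this shows up (pure numpy, with a crude stand-in for ggml's q4_0 kernel and made-up dimensions, so the absolute rates mean nothing, only the trend), the sketch below quantizes keys to 4 bits and counts how often a near-duplicate distractor key out-scores the true target. The flip rate climbs sharply as the distractor's cosine similarity to the target approaches 1, while low-similarity distractors are essentially never affected:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 128  # head dimension (illustrative)

def fake_q4_0(x, block=32):
    """Crude symmetric 4-bit block quantizer -- a stand-in for ggml's q4_0, not the real kernel."""
    flat = x.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(flat / scale), -8, 7)
    return (q * scale).reshape(x.shape)

def unit(v):
    return v / np.linalg.norm(v)

def flip_rate(distractor_cos, trials=2000):
    """How often quantizing the keys lets a near-duplicate distractor out-score the true target."""
    flips = 0
    for _ in range(trials):
        k_t = unit(rng.standard_normal(D))            # key the query should retrieve
        orth = rng.standard_normal(D)
        orth = unit(orth - (orth @ k_t) * k_t)        # direction orthogonal to the target
        k_d = distractor_cos * k_t + np.sqrt(1 - distractor_cos**2) * orth
        q = k_t                                       # query aligned with the target key
        # Without quantization the target always wins this comparison by construction.
        if q @ fake_q4_0(k_d) > q @ fake_q4_0(k_t):
            flips += 1
    return flips / trials

for cos_sim in (0.50, 0.90, 0.98, 0.995):
    print(f"distractor cos-sim {cos_sim:5.3f}: flip rate {flip_rate(cos_sim):.3f}")
```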
The 384K output capability in DeepSeek v4 is architecturally interesting, because generating that many output tokens implies one of two things: (a) the KV cache grows to 384K entries during generation, which at FP16 on a 7B-class model with full multi-head attention is on the order of 200 GB of cache alone, or (b) they're using some form of sliding-window or sparse attention to keep the cache bounded. Most practical long-output use cases probably stay in the low tens of thousands of tokens per call, so 384K functions more as a ceiling statement than an everyday operating point.
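To put rough numbers on (a), here's the standard back-of-the-envelope cache-size calculation. The layer counts and head configurations below are assumptions for a generic 7B-class transformer, not DeepSeek's published configuration:

```python
# Back-of-the-envelope KV cache sizing. All architecture numbers below are assumptions
# for a generic 7B-class transformer, not DeepSeek's actual configuration.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt):
    # 2x for keys and values, stored per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

GB = 1024 ** 3
seq = 384_000

# Full multi-head attention (every head keeps its own K/V), FP16 cache.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=seq, bytes_per_elt=2)
# Grouped-query attention with 8 KV heads, FP16 cache.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=seq, bytes_per_elt=2)

print(f"MHA 7B-class, FP16 cache at 384K tokens:    {mha / GB:.0f} GiB")
print(f"GQA (8 KV heads), FP16 cache at 384K tokens: {gqa / GB:.0f} GiB")
print(f"GQA with a ~1 byte/elt q8_0-style cache:     {gqa / 2 / GB:.0f} GiB")
```

For what it's worth, the DeepSeek architectures that have been published (V2/V3) use multi-head latent attention, which compresses the cache well below either figure, so something along the lines of (b) seems the likelier answer in practice.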
For local inference practitioners: q8_0 KV cache is almost always safe; the KL divergence bump is minimal and the VRAM savings are meaningful (roughly half the cache footprint of FP16). q4_0 is worth benchmarking against your actual task distribution rather than trusting aggregate perplexity, because the quality gap relative to an unquantized cache widens significantly for anything that requires precise token recall from early context.
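If you want to benchmark this yourself, one approach (sketched below in plain numpy, with a hypothetical logits-dump workflow; none of this is a llama.cpp API) is to dump per-position logits from a run with an FP16 cache and a run with a quantized cache on the same prompt, then look at where the per-token KL divergence spikes instead of only at the mean. Recall-heavy positions tend to show up as localized spikes late in the sequence:

```python
import numpy as np

def per_token_kl(base_logits, quant_logits):
    """KL(P_base || P_quant) per position. Inputs: (seq_len, vocab) arrays of raw
    logits from the same prompt, e.g. dumped as .npy from your own harness
    (the dump step depends on your inference stack and is not shown here)."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    lp_base = log_softmax(base_logits.astype(np.float64))
    lp_quant = log_softmax(quant_logits.astype(np.float64))
    p_base = np.exp(lp_base)
    return (p_base * (lp_base - lp_quant)).sum(axis=-1)

# Tiny synthetic example so the function runs standalone; replace with real dumps.
rng = np.random.default_rng(1)
base = rng.standard_normal((256, 32000))
quant = base + 0.05 * rng.standard_normal((256, 32000))   # pretend quantization noise

kl = per_token_kl(base, quant)
worst = np.argsort(kl)[-5:][::-1]
print(f"mean KL: {kl.mean():.4f}  max KL: {kl.max():.4f}")
print("worst positions:", worst.tolist())   # inspect what those tokens were in your prompt
```

If memory serves, llama.cpp also exposes KV-cache type flags and a KL-divergence mode in its perplexity tool, but check `--help` on your build rather than trusting my recollection of the exact flag names.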