plasmon
Q4 KV Cache Fit 32K Context into 8GB VRAM — Only Math Broke

The biggest VRAM hog in LLM inference isn't always the model weights.

Once context length grows, KV cache memory consumption overtakes the model itself. Llama-3-8B (Q4_K_M, 4.9GB) at 32K context burns roughly 4GB on KV cache alone. That's nearly 9GB before runtime overhead. An RTX 4060's 8GB can't hold it.

# KV cache memory calculation
def kv_cache_memory(
    n_layers: int,
    n_heads_kv: int,
    head_dim: int,
    context_length: int,
    dtype_bytes: int = 2,  # FP16
) -> float:
    """KV cache memory usage in GB"""
    # K + V, two tensors
    bytes_total = 2 * n_layers * n_heads_kv * head_dim * context_length * dtype_bytes
    return bytes_total / (1024 ** 3)

# Llama-3-8B (GQA: 8 KV heads)
llama3_8b = kv_cache_memory(
    n_layers=32,
    n_heads_kv=8,      # GQA: 32 attention heads -> 8 KV heads
    head_dim=128,
    context_length=32768,  # 32K
    dtype_bytes=2,      # FP16
)
# -> 4.0 GB

# Qwen2.5-32B (GQA: 8 KV heads)
qwen25_32b = kv_cache_memory(
    n_layers=64,
    n_heads_kv=8,
    head_dim=128,
    context_length=32768,
    dtype_bytes=2,
)
# -> 8.0 GB — KV cache alone fills 8GB. No room left for the model.

You can quantize the model to fit in VRAM, but if the KV cache stays FP16, it doesn't matter. The moment you stretch the context, VRAM overflows.

That's where KV cache quantization comes in.


What KV Cache Quantization Actually Is

How It Differs from Model Quantization

Model weight quantization (GGUF Q4_K_M, etc.) is well understood. KV cache quantization solves a fundamentally different problem.

# Two types of quantization compared
quantization_types = {
    "Model weight quantization": {
        "target": "Pre-trained parameters (offline)",
        "timing": "One-time conversion before inference",
        "quality_impact": "Well-studied. Q4_K_M is practical for most tasks",
        "tools": "llama.cpp GGUF, GPTQ, AWQ, bitsandbytes",
        "vram_effect": "Reduces proportional to model size",
    },
    "KV cache quantization": {
        "target": "Attention intermediate states generated dynamically during inference",
        "timing": "Real-time quantization on every token generation",
        "quality_impact": "Active research area. Highly task-dependent",
        "tools": "llama.cpp (--cache-type-k, --cache-type-v), vLLM (FP8)",
        "vram_effect": "Reduces proportional to context length",
    },
}

The critical distinction: model weight quantization compresses static data, done once before inference. KV cache quantization compresses data generated in real time during inference, happening on every forward pass. That means overhead.

Using It in llama.cpp

llama.cpp supports KV cache quantization through --cache-type-k and --cache-type-v options.

# FP16 KV cache (default)
llama-cli -m model.gguf -c 32768

# Q8_0 KV cache (halves memory, minimal quality impact)
# Note: llama.cpp requires flash attention (-fa) to quantize the V cache
llama-cli -m model.gguf -c 32768 -fa --cache-type-k q8_0 --cache-type-v q8_0

# Q4_0 KV cache (quarters memory, noticeable quality impact)
llama-cli -m model.gguf -c 32768 -fa --cache-type-k q4_0 --cache-type-v q4_0

Memory Math on an RTX 4060 8GB

Llama-3-8B Q4_K_M + KV Cache Quantization

# VRAM usage estimates on RTX 4060 8GB
configs = {
    "FP16 KV (default)": {
        "model": 4.9,        # Q4_K_M
        "kv_32k": 4.0,       # FP16
        "overhead": 0.5,     # CUDA context, etc.
        "total": 9.4,
        "fits_8gb": False,
    },
    "Q8_0 KV": {
        "model": 4.9,
        "kv_32k": 2.0,       # Half of FP16
        "overhead": 0.5,
        "total": 7.4,
        "fits_8gb": True,    # Barely
    },
    "Q4_0 KV": {
        "model": 4.9,
        "kv_32k": 1.0,       # Quarter of FP16
        "overhead": 0.5,
        "total": 6.4,
        "fits_8gb": True,    # Comfortable
    },
}

# FP16 can't do 32K context at all
# Q8_0 barely fits
# Q4_0 leaves 1.6GB headroom -> room to co-locate an embedding model

Practical Configurations Compared

Config 1: Llama-3-8B Q4_K_M + FP16 KV
  Context: ~16K max (VRAM 7.4GB)
  Use case: Short conversations, code generation

Config 2: Llama-3-8B Q4_K_M + Q8_0 KV
  Context: Up to 32K (VRAM 7.4GB)
  Use case: Medium-length document processing, longer conversations
  Quality: Nearly identical to FP16 (details below)

Config 3: Llama-3-8B Q4_K_M + Q4_0 KV
  Context: 32K (VRAM 6.4GB) -> can co-locate BGE-M3
  Use case: RAG + long context
  Quality: Degradation visible on some tasks

Config 4: Qwen2.5-32B Q4_K_M + Q4_0 KV (partial offload)
  Model: 18GB -> GPU 7.5GB + CPU 10.5GB
  KV (8K): Q4_0 at 0.5GB -> GPU
  Use case: Short context but need high-quality answers

The essence of KV cache quantization: it reshapes the tradeoff between model size and context length. With FP16 KV, you're stuck with "small model x short context." With Q4_0 KV, you unlock "small model x long context" and "large model x short context + RAG."


Where Quality Breaks Down

Findings from the KIVI Paper

KIVI (Liu et al., 2024, arXiv:2402.02750) provides the most systematic study of KV cache quantization.

# KIVI: Key-Value Cache Quantization — key findings
kivi_findings = {
    "method": "Key uses per-channel quantization, Value uses per-token quantization",
    "rationale": {
        "Key": "Magnitudes vary widely across channels (outlier channels) -> per-channel scales isolate them",
        "Value": "No channel-aligned outliers; magnitudes vary across tokens -> per-token is appropriate",
    },
    "results": {
        "2bit_KV": "Downstream task accuracy drops by at most ~2%",
        "2bit_KV_longbench": "LongBench: 44.27 vs FP16's 44.52 (0.56% gap)",
        "VRAM_savings": "2-bit vs FP16: 87.5% KV cache reduction (1/8). Paper reports 2.6x total peak memory reduction",
    },
    "critical_note": "Key and Value need different quantization axes. Quantizing both the same way causes quality to collapse",
}

KIVI's core finding: Key and Value quantization must be designed separately. Each Key channel's value range is stable across tokens but varies wildly between channels. Value is the opposite. Apply the same quantization scheme to both, and one of them falls apart.
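The axis distinction can be demonstrated in a few lines of NumPy. This is a toy sketch, not the KIVI implementation: the synthetic Key matrix, the outlier pattern, and the 4-bit absmax round-trip are all illustrative assumptions.

```python
import numpy as np

def quant_roundtrip(x: np.ndarray, axis: int, bits: int = 4) -> np.ndarray:
    """Absmax-quantize along `axis`, then dequantize (round-trip)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero slices
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
# Synthetic Key cache, shape (tokens, channels), with a few outlier
# channels, mimicking KIVI's observation that Key outliers are channel-aligned
keys = rng.normal(size=(512, 128))
keys[:, :4] *= 20.0

per_channel = quant_roundtrip(keys, axis=0)  # one scale per channel
per_token = quant_roundtrip(keys, axis=1)    # one scale per token

mse_channel = float(np.mean((keys - per_channel) ** 2))
mse_token = float(np.mean((keys - per_token) ** 2))
print(f"per-channel MSE: {mse_channel:.4f}")
print(f"per-token  MSE: {mse_token:.4f}")
# Per-channel error comes out far smaller: each outlier channel gets its
# own scale instead of inflating the scale of every token's block
```

Quantizing this matrix per token makes every token's scale track the outlier channels, crushing the resolution of the normal channels. Swap the roles (a Value-like matrix with token-aligned outliers) and the conclusion flips, which is exactly why Key and Value need different axes.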

Which Tasks Break First

# KV cache quantization tolerance by task type
task_sensitivity = {
    "High tolerance (Q4 still practical)": [
        "Simple Q&A (fact retrieval)",
        "Summarization (short to short)",
        "Classification tasks",
        "Code completion (short context)",
    ],
    "Medium tolerance (Q8 recommended)": [
        "Long document summarization (16K+ tokens)",
        "Multi-turn conversation (10+ turns)",
        "Document reference in RAG",
        "Translation (especially technical docs)",
    ],
    "Low tolerance (FP16 recommended)": [
        "Mathematical reasoning (CoT)",
        "Exact numerical citation",
        "Long-range information retrieval (needle-in-haystack)",
        "Code logical consistency (long functions)",
    ],
}

A pattern emerges. Tasks that require precise information retention are the most sensitive to quantization. Summarization and classification survive on the "gist" of the input, but math and needle-in-haystack depend on exactly retaining the values attached to specific tokens.

This mirrors model weight quantization behavior. Q4_K_M handles casual conversation fine but shows degradation on math benchmarks. The same law applies to KV cache quantization.


Implementation Pattern: Dynamic Switching Based on Context Length

The ideal setup switches KV cache quantization level based on context length. Short context gets FP16 for maximum quality. As it grows, drop to Q8_0 or Q4_0 to prevent overflow.

# Auto-select KV cache config based on context length
def select_kv_config(
    model_vram_gb: float,
    gpu_vram_gb: float = 8.0,
    target_context: int = 32768,
    n_layers: int = 32,
    n_heads_kv: int = 8,
    head_dim: int = 128,
) -> dict:
    """Select KV cache config based on available VRAM"""
    overhead = 0.5  # CUDA context, etc.
    available = gpu_vram_gb - model_vram_gb - overhead

    kv_sizes = {}
    for dtype, factor in [("f16", 2), ("q8_0", 1), ("q4_0", 0.5)]:  # approx bytes/element; ignores per-block scale storage
        bytes_per_token = 2 * n_layers * n_heads_kv * head_dim * factor
        max_ctx = int(available * (1024**3) / bytes_per_token)
        kv_sizes[dtype] = {
            "max_context": max_ctx,
            "vram_at_target": (bytes_per_token * target_context) / (1024**3),
        }

    # Pick the highest quality that fits
    for dtype in ["f16", "q8_0", "q4_0"]:
        if kv_sizes[dtype]["max_context"] >= target_context:
            return {
                "cache_type": dtype,
                "max_context": kv_sizes[dtype]["max_context"],
                "kv_vram": kv_sizes[dtype]["vram_at_target"],
                "total_vram": model_vram_gb + kv_sizes[dtype]["vram_at_target"] + overhead,
            }

    return {"cache_type": "q4_0", "max_context": kv_sizes["q4_0"]["max_context"],
            "note": "target_context exceeds available VRAM even with Q4_0"}

# Llama-3-8B Q4_K_M on RTX 4060 8GB
config = select_kv_config(model_vram_gb=4.9, target_context=32768)
# -> {"cache_type": "q8_0", "max_context": ~42600, "kv_vram": 2.0, "total_vram": 7.4}
# FP16 tops out around 21K here, so Q8_0 is the highest quality that reaches 32K

# Qwen3.5-4B Q4_K_M on RTX 4060 8GB
# Note: Qwen3.5-4B uses a hybrid architecture (32 layers, only 8 full-attention layers, rest are Gated DeltaNet)
# KV cache only applies to the 8 attention layers
config_small = select_kv_config(model_vram_gb=2.7, n_layers=8, n_heads_kv=4, target_context=32768)
# -> FP16 fits 32K easily (2.7 + 0.5 + 0.5 = 3.7GB, tons of room)

With a small enough model, you might not need KV cache quantization at all. Qwen3.5-4B (~2.7GB) has a hybrid architecture where only 8 of 32 layers use traditional attention with KV cache. FP16 KV at 32K context uses barely 0.5GB. Sometimes "use a smaller high-accuracy model at full precision" beats "cram a bigger model in with quantization."


llama.cpp KV Cache Quantization Internals

Q8_0 vs Q4_0 Under the Hood

# llama.cpp KV cache quantization schemes
kv_quant_details = {
    "q8_0": {
        "bit_width": 8,
        "block_size": 32,
        "method": "absmax symmetric quantization",
        "computation": "For each 32-element block, find max(abs(x)) and store one scale factor",
        "quality": "Near-identical to FP16 (KIVI paper shows negligible degradation at Q8)",
        "speed": "Decode speed equal to or slightly faster than FP16 (reduced memory transfer)",
        "recommendation": "Safe drop-in replacement for default",
    },
    "q4_0": {
        "bit_width": 4,
        "block_size": 32,
        "method": "absmax symmetric quantization (4-bit)",
        "computation": "Compress each 32 elements to 4-bit. One scale factor",
        "quality": "Task-dependent. Fine for simple tasks, degrades on math and long-range retrieval",
        "speed": "Decode can be faster when bandwidth-bound (reduced transfer volume)",
        "recommendation": "Only when VRAM constraints are severe",
    },
}
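The absmax scheme in the table above is easy to sketch. This is a simplified NumPy model of the Q8_0/Q4_0 block format (one scale per 32-element block), not llama.cpp's actual ggml kernels, and it ignores the FP16 scale storage and packing details.

```python
import numpy as np

def block_quant_roundtrip(x: np.ndarray, bits: int, block_size: int = 32) -> np.ndarray:
    """Absmax symmetric quantization per block, then dequantize.
    Simplified model of llama.cpp's Q8_0/Q4_0 layout: each 32-element
    block stores integer codes plus one scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    blocks = x.reshape(-1, block_size)
    scale = np.max(np.abs(blocks), axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero blocks
    q = np.clip(np.round(blocks / scale), -qmax - 1, qmax)
    return (q * scale).reshape(x.shape)

rng = np.random.default_rng(1)
kv = rng.normal(size=(4096,)).astype(np.float32)  # stand-in for one KV row

results = {}
for bits, name in [(8, "q8_0"), (4, "q4_0")]:
    recon = block_quant_roundtrip(kv, bits)
    results[name] = float(np.sqrt(np.mean((kv - recon) ** 2)))
print(results)
# q4_0 round-trip error lands roughly an order of magnitude above q8_0:
# every bit removed doubles the quantization step within each block
```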

Impact on Decode Speed

KV cache quantization has a surprising side effect. When memory bandwidth is the bottleneck during decode, smaller KV cache means less data to transfer, which can make decoding faster.

# Estimated decode speed impact on RTX 4060 (272 GB/s)
decode_impact = {
    "FP16 KV, 32K context": {
        "kv_read_per_token": "4.0 GB (KV across all layers)",
        "theoretical_speed": "272 / (4.9 + 4.0) = 30 t/s",
        "note": "Reads model weights + entire KV for attention computation",
    },
    "Q8_0 KV, 32K context": {
        "kv_read_per_token": "2.0 GB",
        "theoretical_speed": "272 / (4.9 + 2.0) = 39 t/s",
        "improvement": "+30%",
    },
    "Q4_0 KV, 32K context": {
        "kv_read_per_token": "1.0 GB",
        "theoretical_speed": "272 / (4.9 + 1.0) = 46 t/s",
        "improvement": "+53%",
    },
}
# KV cache quantization doesn't just "make it fit" — it makes it faster
# Though dequantization CPU/GPU overhead means real numbers will be lower

Not just VRAM savings — bandwidth savings too. This is effectively a Layer 2 optimization (reducing read volume) against the bandwidth wall (separate article). A software-side weapon that increases effective bandwidth without changing hardware.
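The back-of-envelope numbers above generalize to a one-line formula. A sketch, assuming decode is purely memory-bandwidth-bound, so this is an optimistic upper bound, not a benchmark.

```python
def decode_speed_upper_bound(bandwidth_gbs: float, model_gb: float, kv_gb: float) -> float:
    """Bandwidth-bound ceiling on decode tokens/sec: each generated token
    must stream the full model weights plus the entire KV cache from VRAM."""
    return bandwidth_gbs / (model_gb + kv_gb)

# RTX 4060 (272 GB/s), Llama-3-8B Q4_K_M (4.9 GB), 32K context
for name, kv_gb in [("f16", 4.0), ("q8_0", 2.0), ("q4_0", 1.0)]:
    print(f"{name}: <= {decode_speed_upper_bound(272, 4.9, kv_gb):.1f} t/s")
# Real throughput lands below the ceiling: dequantization cost, attention
# FLOPs, and scheduler overhead all eat into the bandwidth budget
```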


Combining with Other KV Cache Optimizations

KV cache quantization isn't meant to be used alone. It hits hardest when combined with other optimizations.

Method                  VRAM Reduction  Quality Impact  Implementation
────────────────────────────────────────────────────────────────────────
GQA (Grouped Query)     50-75%          None            Decided at model design time
KV Quant (Q8_0)         50%             Negligible      One llama.cpp flag
KV Quant (Q4_0)         75%             Task-dependent  One llama.cpp flag
Sliding Window          Fixed cap       Long-range loss  Decided at model design time
Sparse Attention        Significant     Task-dependent  Requires custom implementation
Paged Attention         Defrag only     None            Automatic in vLLM, etc.

Combined example: Llama-3-8B (GQA 4x) + Q8_0 KV
  GQA 75% reduction x Q8_0 50% reduction = 12.5% of the MHA FP16 baseline
  Hypothetical MHA FP16 cache at 32K: 16GB -> 2.0GB actual
  32K context fits comfortably on an 8GB GPU

GQA (Grouped Query Attention) is already baked into most recent models, so users don't need to think about it. Llama-3, Qwen2.5, Mistral all benefit from it. Stacking KV quantization on top cuts it by another half to quarter.
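The multiplicative stacking can be sanity-checked with a small helper. A sketch using this article's formula; the `gqa_group` and `bytes_per_elem` parameters are my naming, and block-scale overhead is ignored as before.

```python
def combined_kv_gb(n_layers: int, n_heads: int, head_dim: int, ctx: int,
                   gqa_group: int = 1, bytes_per_elem: float = 2.0) -> float:
    """KV cache size in GB with GQA (n_heads / gqa_group KV heads) and
    quantization (bytes_per_elem: 2.0 = FP16, ~1.0 = Q8_0, ~0.5 = Q4_0)."""
    n_kv_heads = n_heads // gqa_group
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

# Llama-3-8B at 32K: hypothetical MHA FP16 baseline vs actual GQA 4x + Q8_0
baseline = combined_kv_gb(32, 32, 128, 32768)                                    # MHA, FP16
optimized = combined_kv_gb(32, 32, 128, 32768, gqa_group=4, bytes_per_elem=1.0)  # GQA + Q8_0
print(baseline, optimized, optimized / baseline)  # 16.0 2.0 0.125
```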


When You Shouldn't Quantize on 8GB

I've been making the case for KV cache quantization, but the opposite perspective matters too.

# Cases where KV cache quantization is unnecessary
unnecessary_cases = {
    "Small model + short context": {
        "example": "Qwen3.5-4B (~2.7GB, only 8 attention layers) + 8K context",
        "KV_FP16": "~0.13 GB (8 attention layers × 4 KV heads)",
        "total": "~3.3 GB",
        "verdict": "Under half of 8GB. Quantization is pointless",
    },
    "Task-specialized models": {
        "example": "Function calling only (short I/O)",
        "KV_FP16": "< 0.1 GB",
        "verdict": "Short context means KV cache is never the problem",
    },
    "Accuracy-critical tasks": {
        "example": "Math reasoning, code review (correctness is paramount)",
        "verdict": "Better to use a smaller model and keep KV at FP16",
    },
}

The realistic approach is to combine this with a multi-model routing strategy (separate article) — switching models based on the task:

  • Function calling -- Qwen3.5-4B (~2.7GB) + FP16 KV. Short context + few attention layers, no quantization needed
  • Long-context RAG -- Llama-3-8B (4.9GB) + Q8_0 KV. 32K context with quality preserved
  • Knowledge Q&A -- Qwen2.5-32B (18GB, CPU/GPU offload) + Q4_0 KV. Short context only

Don't apply the same KV config to every model. Pick based on task and context length.
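The routing above can be expressed as a small config table. This is a hypothetical sketch: the route names, thresholds, and helper function are illustrative, not a library API; only the llama-cli flags themselves come from llama.cpp.

```python
# Hypothetical task -> (model, KV cache type) routing table, following the
# three configurations above. Names and context caps are illustrative.
ROUTES = {
    "function_calling": {"model": "qwen3.5-4b.gguf", "cache_type": "f16", "max_ctx": 8192},
    "long_context_rag": {"model": "llama-3-8b.gguf", "cache_type": "q8_0", "max_ctx": 32768},
    "knowledge_qa": {"model": "qwen2.5-32b.gguf", "cache_type": "q4_0", "max_ctx": 8192},
}

def llama_cpp_args(task: str) -> list[str]:
    """Build a llama-cli invocation for a task from the routing table."""
    r = ROUTES[task]
    return ["llama-cli", "-m", r["model"], "-c", str(r["max_ctx"]),
            "--cache-type-k", r["cache_type"], "--cache-type-v", r["cache_type"]]

print(llama_cpp_args("long_context_rag"))
# -> ['llama-cli', '-m', 'llama-3-8b.gguf', '-c', '32768',
#     '--cache-type-k', 'q8_0', '--cache-type-v', 'q8_0']
```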


References

  1. "KIVI: A Tuning-Free KV Cache Quantization Plugin for Large Language Models" (2024) arXiv:2402.02750
  2. llama.cpp KV cache quantization: --cache-type-k, --cache-type-v options
  3. "Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023) arXiv:2309.06180
