DEV Community

zxpmail

KV Cache Is Eating Your VRAM — Here's How to Estimate It Before You Run Out

Every LLM inference engineer hits this wall eventually.

You deployed a model, it works in testing, then production traffic arrives. Suddenly your 80GB A100 is OOM on a 70B model that "should fit."

The culprit is almost always the KV Cache. But most discussions stop at "it caches the Key and Value matrices" — which doesn't help you predict when you'll run out of memory.

This post gives you a quick estimator formula, explains when to worry, and shows which levers actually help.


The One-Number Formula

Here's the quick estimator:

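KV cache (bytes) = 2 × (layers) × (hidden_dim) × (context_length) × (batch_size) × (bytes_per_param)

The leading 2 is because you cache both K and V. Divide by 10^9 to get GB. This form assumes one cache entry per attention head; for models with grouped-query attention (Lever 1), replace hidden_dim with kv_heads × head_dim.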

For a Llama 3.1 70B-shaped model (80 layers, hidden_dim 8192, FP16), caching K and V for every attention head and a single sequence:

  • Per token: 2 × 80 × 8192 × 2 bytes = 2.6 MB
  • At 8K context: 2.6 MB × 8192 = 21 GB
  • At 128K context: 2.6 MB × 131072 = 340 GB (doesn't fit on one A100)

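That's right: with full multi-head attention, a single 128K-token sequence on a 70B-shaped model needs about 340 GB of KV cache, more than the model weights themselves (140 GB in FP16). Llama 3.1 70B actually ships with grouped-query attention, which cuts that figure by 8× (see Lever 1), but the scaling is the same: every extra token and every extra sequence adds cache.

At long contexts or high concurrency, the KV cache is the bottleneck, not the weights.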
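Here's that arithmetic as a few lines of Python, in case you want to plug in your own model. It's a minimal sketch: the function name `kv_bytes_per_token` is illustrative, it uses decimal units (1 GB = 10^9 bytes), and it assumes one cache entry per attention head, like the formula above.

```python
def kv_bytes_per_token(layers: int, hidden_dim: int, bytes_per_param: float = 2) -> float:
    """Bytes of K and V stored per token across all layers (one entry per attention head)."""
    return 2 * layers * hidden_dim * bytes_per_param


# 70B-shaped model from the text: 80 layers, hidden_dim 8192, FP16 (2 bytes).
per_token = kv_bytes_per_token(layers=80, hidden_dim=8192)
print(f"per token: {per_token / 1e6:.2f} MB")

for context_len in (8_192, 131_072):
    total_gb = per_token * context_len / 1e9
    print(f"{context_len:>7} tokens: {total_gb:.1f} GB")
```

This prints 2.62 MB per token, 21.5 GB at 8K, and 343.6 GB at 128K, which the text rounds to 2.6 MB, 21 GB, and 340 GB.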


Why It Matters More Than Weights

Model weights are static. You load them once, they sit in VRAM. 70B in FP16 = ~140GB. That's a known cost.

KV Cache is dynamic. It grows linearly with:

  • Batch size — cached for every sequence in the batch
  • Context length — cached for every token position
  • Number of layers — cached for every transformer layer (the full stack)

The wall you'll hit first:

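| Scenario (FP16) | Weights | KV cache (8K) | KV cache (128K) |
|---|---|---|---|
| 70B, batch=1 | 140 GB | 21 GB | 340 GB (OOM) |
| 70B, batch=4 | 140 GB | 86 GB (OOM) | 1.4 TB (OOM) |
| 7B, batch=32 | 14 GB | 137 GB (OOM) | 2.2 TB (OOM) |

Figures use the formula above with one cache entry per attention head. The 7B row assumes a Llama-2-7B-style shape (32 layers, hidden_dim 4096). OOM marks a cache that exceeds a single 80 GB card on its own.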

At long contexts or high batch sizes, the KV cache dominates total memory — and it's the part that grows with traffic, not the part you can amortize.
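One way to turn this into a capacity number is to ask how many full-length sequences fit beside the weights. The sketch below is a rough estimate, not a scheduler model: `max_batch_size` and its `overhead_gb` allowance (activations, runtime buffers) are assumptions for illustration, and the 2 GB default is a placeholder, not a measured value.

```python
def max_batch_size(vram_gb: float, weights_gb: float, kv_bytes_per_token: float,
                   context_len: int, overhead_gb: float = 2.0) -> int:
    """Largest number of full-length sequences whose KV cache fits next to the weights."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    if free_bytes <= 0:
        return 0  # the weights alone already exhaust the budget
    return int(free_bytes // (kv_bytes_per_token * context_len))


# Assumed 7B shape: 32 layers, hidden_dim 4096, FP16, one cache entry per head.
per_token_7b = 2 * 32 * 4096 * 2  # bytes per token

print(max_batch_size(80, 14, per_token_7b, 8_192))    # sequences at 8K on one 80 GB card
print(max_batch_size(80, 14, per_token_7b, 131_072))  # sequences at 128K
```

Under these assumptions an 80 GB card holds 14 concurrent 8K sequences and not even one at 128K.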

If you're running Speculative Decoding, both the draft model and the target model maintain their own KV caches. For a 7B draft + 70B target pair, budget roughly 10-15% extra KV cache for the draft as a starting point, then compute both caches separately: the real share depends on each model's layer count and KV-head count.
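The draft's share swings widely with architecture, so it's worth computing for your own pair. A minimal sketch, assuming both caches hold the same number of tokens; the shapes below are hypothetical 7B-class and 70B-class configurations, and `draft_overhead` is an illustrative helper.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_param=2):
    """Bytes of K and V cached per token, given the number of KV heads."""
    return 2 * layers * kv_heads * head_dim * bytes_per_param


def draft_overhead(target_per_token, draft_per_token):
    """Draft cache as a fraction of the target cache, for the same tokens."""
    return draft_per_token / target_per_token


# Hypothetical shapes, FP16. Swap in the real configs of your own pair.
draft_gqa = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)    # 7B-class, 8 KV heads
target_gqa = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)   # 70B-class, 8 KV heads
target_mha = kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128)  # 70B-class, full MHA

print(f"GQA draft vs GQA target: {draft_overhead(target_gqa, draft_gqa):.0%}")
print(f"GQA draft vs MHA target: {draft_overhead(target_mha, draft_gqa):.0%}")
```

With these assumed shapes the draft adds 40% on top of a GQA target but only 5% on top of an MHA target, which is why a single rule of thumb is risky.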


What Actually Reduces KV Cache Memory

There are six levers, and they're not all created equal.

Lever 1: Multi-Query Attention (MQA) / Grouped Query Attention (GQA)

This is the most impactful architectural fix. Instead of caching K and V for every attention head, share K and V across query heads.

  • Original MHA: KV cache per layer, per token = 2 × hidden_dim elements
  • GQA (8 KV heads): KV cache per layer, per token = 2 × hidden_dim / group_size elements (where group_size = num_attn_heads / kv_heads, e.g. 64/8 = 8)
  • MQA (1 KV head): KV cache per layer, per token = 2 × hidden_dim / num_attn_heads elements

In practice: Llama 3.1 70B uses GQA with 8 key-value heads. That reduces the KV cache to about 1/8 of what MHA would require: from roughly 2.6 MB per token to 0.33 MB per token.

| Architecture | KV per token (70B, FP16, hidden_dim 8192, 64 attn heads, head_dim 128) |
|---|---|
| MHA (64 KV heads) | 2.6 MB |
| GQA (8 KV heads) | 0.33 MB |
| MQA (1 KV head) | 0.04 MB |

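GQA is close to a free lunch: it barely affects quality and cuts cache memory by 4-8×. It is fixed when the model is trained, though, so you can't switch it on at inference time. If your model uses full MHA, consider moving to a variant or successor that was trained with GQA.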
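To make the sharing concrete, here is a small sketch of how query heads map onto KV heads. It assumes the common layout where consecutive query heads form a group; `kv_head_for_query_head` is an illustrative helper, not a function from any particular library.

```python
def kv_head_for_query_head(q_head: int, num_attn_heads: int, kv_heads: int) -> int:
    """Index of the KV head that a given query head reads from."""
    if num_attn_heads % kv_heads != 0:
        raise ValueError("num_attn_heads must be a multiple of kv_heads")
    group_size = num_attn_heads // kv_heads  # query heads per shared KV head
    return q_head // group_size


# 64 query heads with 8 KV heads: query heads 0-7 share KV head 0, 8-15 share KV head 1, ...
mapping = [kv_head_for_query_head(q, 64, 8) for q in range(64)]
print(mapping[:10])

# Only the distinct KV heads are cached, so the cache shrinks by the group size.
print(len(set(mapping)))
```

Setting `kv_heads` to 1 gives MQA (every query head reads KV head 0), and setting it equal to `num_attn_heads` gives plain MHA.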

Lever 2: Quantization (FP16 → FP8 → INT4)

The KV cache is usually less sensitive to quantization than weights. FP8 typically costs little measurable quality; INT4 can also work, but test it on your own data first.

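| Precision | Bytes per param | KV cache (7B, 8K, batch=16) |
|---|---|---|
| FP16 | 2 | 69 GB |
| FP8 | 1 | 34 GB |
| INT4 | 0.5 | 17 GB |

The 7B figures assume a Llama-2-7B-style shape (32 layers, hidden_dim 4096, one cache entry per attention head).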

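FP8 KV cache quantization is supported by the major inference frameworks, including TensorRT-LLM and vLLM; INT4 support varies, so check your framework's documentation. The quality impact is usually small, but it is workload-dependent, so validate on your own prompts, especially at long contexts.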
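To build intuition for why fewer bits means more error, here is a toy round trip through symmetric integer quantization with one scale per vector. This is a simplified stand-in, not what any framework implements: FP8 is a floating-point format, and production kernels choose scales and granularity differently. The sample values are made up.

```python
def quantize(values, bits):
    """Symmetric integer quantization with a single absmax scale for the vector."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax or 1.0  # fall back to 1.0 for all-zero input
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def max_abs_error(values, bits):
    """Worst-case element error after a quantize/dequantize round trip."""
    codes, scale = quantize(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


key_vector = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up sample, not real model data
for bits in (8, 4):
    print(f"{bits}-bit max error: {max_abs_error(key_vector, bits):.4f}")
```

The round-trip error is bounded by half the scale, and the scale grows as the bit width shrinks, which is why INT4 deserves a quality check that FP8 often passes without trouble.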

Lever 3: Sliding Window Attention

Instead of caching all positions, only cache the last N tokens. For models that use ALiBi or Rotary Position Encoding without a strict context limit, this can cap KV cache growth.

The tradeoff: the model loses access to tokens beyond the window. For tasks that need long-range dependencies (summarization, document QA), this degrades quality.

For conversational or streaming use cases, sliding window is a no-brainer. For RAG, it depends on where in the context the relevant information sits.
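The memory effect is easy to see with a toy cache that keeps only the last N entries. This is a sketch of the bookkeeping only: `SlidingWindowCache` is a hypothetical class, it stores token positions in place of real K/V tensors, and the 4096-token window is an arbitrary example value.

```python
from collections import deque


class SlidingWindowCache:
    """Toy KV cache that keeps only the most recent `window` token entries."""

    def __init__(self, window: int):
        self.entries = deque(maxlen=window)  # oldest entry is dropped automatically

    def append(self, token_kv) -> None:
        self.entries.append(token_kv)

    def __len__(self) -> int:
        return len(self.entries)


cache = SlidingWindowCache(window=4096)
for position in range(10_000):
    cache.append(position)  # a real cache would store K/V tensors, not the position

print(len(cache))        # cache size stays at the window, not the sequence length
print(cache.entries[0])  # oldest position still visible to the model
```

After 10,000 tokens the cache still holds 4096 entries, and the oldest one is position 5904: everything before it is gone, which is exactly the tradeoff described above.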

Lever 4: PagedAttention (vLLM)

vLLM's contribution is memory management, not cache reduction. It fragments less.

Traditional inference allocates contiguous blocks per sequence. If a sequence has 512 tokens of cache and the allocator uses 1024-sized blocks, 50% is wasted.

PagedAttention allocates in smaller (16-256 token) pages, reducing fragmentation from 30-50% down to 1-4%.

Net effect: 30-50% effective memory gain on the same hardware, with no quality impact and no model changes.

This is why teams see such dramatic improvements switching to vLLM — it's not faster compute, it's better memory packing.
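The block-size effect can be checked with a few lines of arithmetic. The sketch below models only the waste from rounding each sequence up to whole blocks; it ignores other sources of waste such as reserving space for a maximum length up front. `wasted_fraction` is an illustrative helper and the sequence lengths are made up.

```python
import math


def wasted_fraction(seq_lens, block_tokens):
    """Share of allocated cache slots left unused when each sequence rounds up to whole blocks."""
    allocated = sum(math.ceil(n / block_tokens) * block_tokens for n in seq_lens)
    return 1 - sum(seq_lens) / allocated


seq_lens = [512, 700, 90, 1500]  # made-up sequence lengths
for block_tokens in (1024, 16):
    print(f"block={block_tokens:>4}: {wasted_fraction(seq_lens, block_tokens):.1%} wasted")
```

For this sample, 1024-token blocks waste 45.3% of the allocated slots, while 16-token pages waste 0.5%.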

Lever 5: Reduce Context Length

This is the most brute-force lever, but sometimes the right one.

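| Max context | KV cache (7B, FP16, batch=16) |
|---|---|
| 2K | 17 GB |
| 8K | 69 GB |
| 32K | 275 GB |
| 128K | 1.1 TB |

The 7B figures assume a Llama-2-7B-style shape (32 layers, hidden_dim 4096, one cache entry per attention head).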

If 99% of your requests are under 4K tokens, don't support 128K context. Supporting a context length you don't use is burning VRAM for no reason.

vLLM lets you cap this at the engine level: set max_model_len (the --max-model-len flag when serving) to fit your workload rather than the model's theoretical maximum.
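If you have request logs, you can pick the cap from data instead of guessing. A minimal sketch, assuming you can extract token counts per request: `context_limit_for`, the 99% coverage default, and the rounding to a multiple of 256 are all illustrative choices, and the traffic below is made up.

```python
import math


def context_limit_for(request_lengths, coverage=0.99, multiple=256):
    """Smallest limit (rounded up to `multiple`) that covers `coverage` of observed requests."""
    if not request_lengths:
        raise ValueError("need at least one observed request length")
    ordered = sorted(request_lengths)
    index = min(len(ordered) - 1, math.ceil(coverage * len(ordered)) - 1)
    return math.ceil(ordered[index] / multiple) * multiple


# Made-up traffic: 995 short requests and 5 long outliers.
lengths = [1_800] * 995 + [60_000] * 5
print(context_limit_for(lengths))       # limit that serves 99% of requests
print(context_limit_for(lengths, 1.0))  # limit that serves every request
```

Here the 99% limit is 2048 tokens while serving every request would need 60160, roughly 29× more cache per sequence slot for five outliers. The first number is the kind of value you could pass as max_model_len.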

Lever 6: Use a Smaller Model

Sometimes the best optimization is admitting the model is too big for your use case.

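Context length can matter more than parameter count: a 7B model serving a full 128K context needs more KV cache than a 70B model serving 2K. At the same context length, though, the smaller model needs less of everything (fewer layers, a narrower hidden_dim, smaller weights). If your task needs long context and a smaller model meets your quality bar, it wins on both weights and cache.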
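Both comparisons can be checked directly. The sketch below assumes FP16, a single sequence, one cache entry per attention head, and illustrative shapes (32 layers × 4096 for 7B, 80 layers × 8192 for 70B); `kv_cache_gb` is a hypothetical helper.

```python
def kv_cache_gb(layers, hidden_dim, context_len, batch_size=1, bytes_per_param=2):
    """KV cache in decimal GB, one cache entry per attention head."""
    return 2 * layers * hidden_dim * context_len * batch_size * bytes_per_param / 1e9


# Assumed FP16 shapes: 7B = 32 layers x 4096, 70B = 80 layers x 8192.
print(f"7B at 128K: {kv_cache_gb(32, 4096, 131_072):.1f} GB")
print(f"70B at 2K:  {kv_cache_gb(80, 8192, 2_048):.1f} GB")

# Same 32K context, weights included (14 GB and 140 GB in FP16).
print(f"7B total at 32K:  {14 + kv_cache_gb(32, 4096, 32_768):.1f} GB")
print(f"70B total at 32K: {140 + kv_cache_gb(80, 8192, 32_768):.1f} GB")
```

Under these assumptions the 7B cache at 128K is 68.7 GB against 5.4 GB for the 70B at 2K, and at a shared 32K context the totals are 31.2 GB versus 225.9 GB.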


The Quick Decision Tree

Running out of KV cache memory? Here's the order to try:

1. Switch to vLLM. ~30-50% effective memory gain. No model changes. Start here.

2. Quantize KV cache to FP8. ~2× memory reduction. Minimal quality impact.

3. Check GQA groups. If your model has full MHA, find a GQA variant. 4-8× reduction.

4. Implement sliding window or reduce max context. Only if your workload allows it.

5. Quantize to INT4. ~4× reduction from FP16. Test quality impact on your data first.

6. Reduce batch size. Last resort. Hurts throughput.


A Quick Estimator Script

def kv_cache_memory(layers, hidden_dim, context_len, batch_size, kv_heads, num_attn_heads, bytes_per_param=2):
    """
    Estimate KV cache memory in GB (decimal: 1 GB = 10**9 bytes).

    layers: number of transformer layers
    hidden_dim: model hidden dimension
    context_len: max context length in tokens
    batch_size: number of concurrent sequences
    kv_heads: number of KV heads (1 for MQA, n for GQA, num_attn_heads for MHA)
    num_attn_heads: number of attention heads
    bytes_per_param: 2 for FP16, 1 for FP8, 0.5 for INT4
    """
    if num_attn_heads % kv_heads != 0:
        raise ValueError("num_attn_heads must be a multiple of kv_heads")
    head_dim = hidden_dim // num_attn_heads  # width of a single attention head
    # K and V (the leading 2) for every layer and KV head, for one token position
    kv_per_position = 2 * layers * kv_heads * head_dim * bytes_per_param
    total = kv_per_position * context_len * batch_size
    return total / 1e9  # bytes to decimal GB, the unit used in the estimates above

# Example: Llama 3.1 70B, 8K context, batch=4, GQA-8
# layers=80, hidden_dim=8192, attn_heads=64, kv_heads=8
print(f"{kv_cache_memory(80, 8192, 8192, 4, 8, 64, 2):.1f} GB")  # ~10.7 GB

# Same model, MHA (kv_heads = attn_heads = 64)
print(f"{kv_cache_memory(80, 8192, 8192, 4, 64, 64, 2):.1f} GB")  # ~85.9 GB

Run it before you deploy. It's cheaper than an OOM at 3 AM.


Closing

The KV cache is the silent memory killer in LLM inference. Model weights get all the attention — they're static, visible, and easy to estimate. The KV cache is dynamic, grows with traffic, and often exceeds the weight memory at production batch sizes and context lengths.

The fix isn't one lever. It's knowing which lever to pull first.

Start with memory management (vLLM). Then quantization (FP8). Then architecture (GQA). Then context limits. In that order. Most teams will run out of problems before they run out of levers.

And if you're exploring Speculative Decoding — the acceleration technique comes with its own memory tax: both models need room for their KV caches. Make sure your estimate accounts for both.

KV cache memory estimation should be part of your pre-deployment checklist. A few lines of Python will tell you if a 3 AM page is waiting for you.


June 2026. One formula, six levers, one decision tree. Estimate before you deploy: it's cheaper than an OOM at 3 AM.
