DEV Community

João André Gomes Marques


KV cache memory calculator: how much does your LLM actually use?

Before you can compress something, you need to know how big it is.

Most engineers know the KV cache is "large," but few have actually calculated the exact number. This post gives you the formula, a table for popular models, and a one-liner to compute it yourself.


The formula

KV cache bytes = 2 × L × H × d × T × 2

Where:

  • 2 — one K tensor and one V tensor
  • L — number of transformer layers
  • H — number of attention heads (or KV heads for GQA models)
  • d — head dimension (= hidden_size / num_attention_heads)
  • T — sequence length in tokens
  • 2 — bytes per value in FP16

That's it. No approximation. This is the exact allocation.
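As a worked check, plugging in Llama-3-8B's published shape (L=32, 8 KV heads, d=128) at a 4K context:

```python
# Llama-3-8B: 32 layers, 8 KV heads (GQA), head_dim 128, 4096-token context
L, H, d, T = 32, 8, 128, 4096
bytes_total = 2 * L * H * d * T * 2   # K and V tensors, 2 bytes each in FP16
print(bytes_total / 1e9)              # ≈ 0.54 GB
```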


Memory table: popular models

Llama-3-8B (L=32, H=8 KV heads, d=128)

| Context | KV cache |
|-------------|----------|
| 4K tokens   | 0.5 GB   |
| 32K tokens  | 4 GB     |
| 128K tokens | 16 GB    |

Mistral-7B (L=32, H=8 KV heads, d=128)

| Context | KV cache |
|-------------|----------|
| 4K tokens   | 0.5 GB   |
| 32K tokens  | 4 GB     |
| 128K tokens | 16 GB    |

Llama-3-70B (L=80, H=8 KV heads, d=128)

| Context | KV cache |
|-------------|----------|
| 4K tokens   | 1.3 GB   |
| 32K tokens  | 10.5 GB  |
| 128K tokens | 42 GB    |

Mixtral-8x7B (L=32, H=8 KV heads, d=128)

| Context | KV cache |
|-------------|----------|
| 4K tokens   | 0.5 GB   |
| 32K tokens  | 4 GB     |
| 128K tokens | 16 GB    |

Note: Mixtral uses the same attention architecture as Mistral-7B per expert; MoE only affects the FFN layers, so KV cache size is identical.
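The tables above can be regenerated in a few lines; the shapes below are hardcoded from the models' public configs:

```python
def kv_bytes(L, H, d, T):
    """Exact KV cache allocation in bytes: K + V tensors in FP16."""
    return 2 * L * H * d * T * 2

# (layers, KV heads, head_dim) per model
models = {
    "Llama-3-8B":   (32, 8, 128),
    "Mistral-7B":   (32, 8, 128),
    "Llama-3-70B":  (80, 8, 128),
    "Mixtral-8x7B": (32, 8, 128),
}
for name, (L, H, d) in models.items():
    gbs = [kv_bytes(L, H, d, T) / 1e9 for T in (4096, 32_000, 128_000)]
    print(f"{name:<13} {gbs[0]:5.1f} / {gbs[1]:5.1f} / {gbs[2]:6.1f} GB")
```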


After NexusQuant compression

Llama-3-8B at 128K context (baseline: 16 GB)

| Preset | Compression | KV cache after | PPL delta |
|----------|------|---------|-------|
| high     | 10x  | 1.6 GB  | +0.4% |
| balanced | 17x  | 0.94 GB | +1.3% |
| max      | 33x  | 0.48 GB | +2.6% |

Llama-3-70B at 128K context (baseline: 42 GB — more than half of an 80 GB A100 before the weights are even loaded)

| Preset | Compression | KV cache after |
|----------|------|--------|
| high     | 10x  | 4.2 GB |
| balanced | 17x  | 2.5 GB |
| max      | 33x  | 1.3 GB |

At the high preset, a 70B model's 128K KV cache shrinks from 42 GB to 4.2 GB, leaving nearly the full 80 GB of a single A100 for weights and activations.
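The compressed sizes follow directly from the formula: take the 128K baseline for the 70B shape (L=80) and divide by each preset's ratio:

```python
kv_gb = lambda L, H, d, T: 2 * L * H * d * T * 2 / 1e9

baseline = kv_gb(80, 8, 128, 128_000)   # Llama-3-70B at 128K ≈ 41.9 GB
for preset, ratio in [("high", 10), ("balanced", 17), ("max", 33)]:
    print(f"{preset:<9} {baseline / ratio:5.2f} GB")
```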


Python one-liner

kv_gb = lambda L, H, d, T: 2 * L * H * d * T * 2 / 1e9

Examples:

# Llama-3-8B at 128K
print(kv_gb(32, 8, 128, 128_000))   # 16.78 GB

# Llama-3-70B at 128K
print(kv_gb(80, 8, 128, 128_000))   # 41.94 GB

# Mistral-7B at 32K
print(kv_gb(32, 8, 128, 32_000))    # 4.19 GB

# Any model at any context
print(kv_gb(L=48, H=16, d=64, T=1_000_000))  # compute your own

For GQA models (Llama-3, Mistral), use the number of KV heads, not the full attention heads. For MHA models (GPT-2, older Llama), H = num_attention_heads.
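To see what GQA buys you, compare Llama-3-8B's 8 KV heads against a hypothetical full-MHA variant of the same shape (32 heads), at 128K context:

```python
kv_gb = lambda L, H, d, T: 2 * L * H * d * T * 2 / 1e9

gqa = kv_gb(32, 8, 128, 128_000)    # Llama-3-8B as shipped: 8 KV heads
mha = kv_gb(32, 32, 128, 128_000)   # hypothetical full-MHA variant: 32 heads
print(gqa, mha, mha / gqa)          # ≈ 16.78 GB vs 67.11 GB: GQA saves 4x
```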


How to read this from a HuggingFace config

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
L = cfg.num_hidden_layers          # 32
H = cfg.num_key_value_heads        # 8  (GQA)
d = cfg.hidden_size // cfg.num_attention_heads  # 4096 // 32 = 128

T = 128_000  # your target context
kv_gb = 2 * L * H * d * T * 2 / 1e9
print(f"KV cache at {T//1000}K tokens: {kv_gb:.1f} GB")
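Not every config exposes `num_key_value_heads` (pure-MHA models like GPT-2 omit it), so a more defensive version falls back to `num_attention_heads`. A sketch, shown here with a stubbed config so it runs offline; with `transformers` installed you would pass the result of `AutoConfig.from_pretrained` instead:

```python
from types import SimpleNamespace

def kv_cache_gb(cfg, T):
    """KV cache in GB at context length T, from standard HF config attributes."""
    L = cfg.num_hidden_layers
    # GQA models expose num_key_value_heads; pure-MHA configs may not
    H = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads)
    d = cfg.hidden_size // cfg.num_attention_heads
    return 2 * L * H * d * T * 2 / 1e9

# Stub built from Mistral-7B's published config values
cfg = SimpleNamespace(num_hidden_layers=32, num_key_value_heads=8,
                      hidden_size=4096, num_attention_heads=32)
print(kv_cache_gb(cfg, 128_000))   # ≈ 16.78 GB
```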

The practical takeaway

For a 7B-class model, KV cache at 128K is already 16 GB — as large as a mid-range GPU's total memory. At 1M tokens, it's ~131 GB. The model weights themselves (in FP16) are only ~14-16 GB.
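Another way to frame it: solve for the context length at which the cache overtakes the weights. For Llama-3-8B, approximating the FP16 weights as 8e9 params × 2 bytes ≈ 16 GB:

```python
# Llama-3-8B: KV cache bytes per token = 2 * L * H * d * 2
per_token = 2 * 32 * 8 * 128 * 2     # 131,072 bytes per token
weights = 8e9 * 2                    # ~16 GB of FP16 weights
crossover = weights / per_token
print(f"{crossover:,.0f} tokens")    # past ~122K tokens, the cache outweighs the model
```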

For 70B+ models, the absolute numbers are worse: 42 GB at 128K, and ~328 GB at 1M tokens. Long-context inference on large models without compression is a multi-GPU problem by default.

NexusQuant makes it a single-GPU problem.

pip install nexusquant-kv

GitHub: https://github.com/jagmarques/nexusquant


Best regards, João Marques
