Before you can compress something, you need to know how big it is.
Most engineers know the KV cache is "large" but few have actually calculated the exact number. This post gives you the formula, a table for popular models, and a one-liner to compute it yourself.
## The formula
KV cache bytes = 2 × L × H × d × T × 2
Where:
- 2 — one K tensor and one V tensor
- L — number of transformer layers
- H — number of attention heads (or KV heads for GQA models)
- d — head dimension (= hidden_size / num_heads)
- T — sequence length in tokens
- 2 — bytes per value in FP16
That's it. No approximation: this is the exact FP16 allocation per sequence. For batched serving, multiply by the batch size.
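As a sanity check, here is the multiplication written out for the first table entry below, Llama-3-8B at a 4K (4096-token) context:

```python
# Llama-3-8B: L=32 layers, H=8 KV heads (GQA), d=128 head dim.
L, H, d, T = 32, 8, 128, 4096
k_and_v = 2      # one K tensor and one V tensor per layer
fp16_bytes = 2   # bytes per value in FP16
kv_bytes = k_and_v * L * H * d * T * fp16_bytes
print(f"{kv_bytes:,} bytes = {kv_bytes / 1e9:.2f} GB")  # 536,870,912 bytes = 0.54 GB
```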
## Memory table: popular models

### Llama-3-8B (L=32, H=8 KV heads, d=128)
| Context | KV cache |
|---|---|
| 4K tokens | 0.5 GB |
| 32K tokens | 4 GB |
| 128K tokens | 16 GB |
### Mistral-7B (L=32, H=8 KV heads, d=128)
| Context | KV cache |
|---|---|
| 4K tokens | 0.5 GB |
| 32K tokens | 4 GB |
| 128K tokens | 16 GB |
### Llama-3-70B (L=80, H=8 KV heads, d=128)

| Context | KV cache |
|---|---|
| 4K tokens | 1.3 GB |
| 32K tokens | 10.5 GB |
| 128K tokens | 42 GB |

Same KV heads and head dimension as the 8B, but 2.5× the layers, so 2.5× the cache.
### Mixtral-8x7B (L=32, H=8 KV heads, d=128)
| Context | KV cache |
|---|---|
| 4K tokens | 0.5 GB |
| 32K tokens | 4 GB |
| 128K tokens | 16 GB |
Note: Mixtral uses the same attention architecture as Mistral-7B per expert; MoE only affects the FFN layers, so KV cache size is identical.
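All four tables can be regenerated in a few lines from the formula; a quick sketch using the layer/head/dimension values listed above (note that with L=80 and 8 KV heads, Llama-3-70B works out to about 42 GB at 128K):

```python
kv_gb = lambda L, H, d, T: 2 * L * H * d * T * 2 / 1e9

models = {                      # name: (layers, KV heads, head dim)
    "Llama-3-8B":   (32, 8, 128),
    "Mistral-7B":   (32, 8, 128),
    "Llama-3-70B":  (80, 8, 128),
    "Mixtral-8x7B": (32, 8, 128),
}
for name, (L, H, d) in models.items():
    row = "  ".join(f"{T // 1000}K: {kv_gb(L, H, d, T):5.1f} GB"
                    for T in (4_000, 32_000, 128_000))
    print(f"{name:<13} {row}")
```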
## After NexusQuant compression

### Llama-3-8B at 128K context (baseline: 16 GB)
| Preset | Compression | KV cache after | PPL delta |
|---|---|---|---|
high |
10x | 1.6 GB | +0.4% |
balanced |
17x | 0.94 GB | +1.3% |
max |
33x | 0.48 GB | +2.6% |
### Llama-3-70B at 128K context (baseline: 42 GB, on top of ~140 GB of FP16 weights)

| Preset | Compression | KV cache after |
|---|---|---|
| high | 10x | 4.2 GB |
| balanced | 17x | 2.5 GB |
| max | 33x | 1.3 GB |
At the high preset, a 70B model's 128K-token KV cache drops from 42 GB, more than half of an 80 GB A100, to roughly 4 GB, leaving the weights as the only remaining multi-GPU constraint.
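The after-compression sizes are just the baseline divided by the compression ratio (the presets and ratios are NexusQuant's claims; the arithmetic below only reproduces the division):

```python
# Llama-3-70B at 128K: baseline KV cache from the formula (~41.9 GB).
baseline_gb = 2 * 80 * 8 * 128 * 128_000 * 2 / 1e9
for preset, ratio in [("high", 10), ("balanced", 17), ("max", 33)]:
    print(f"{preset:<9} {ratio}x -> {baseline_gb / ratio:.1f} GB")
```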
## Python one-liner
```python
kv_gb = lambda L, H, d, T: 2 * L * H * d * T * 2 / 1e9
```
Examples:
```python
# Llama-3-8B at 128K
print(kv_gb(32, 8, 128, 128_000))   # 16.78 GB

# Llama-3-70B at 128K
print(kv_gb(80, 8, 128, 128_000))   # 41.94 GB

# Mistral-7B at 32K
print(kv_gb(32, 8, 128, 32_000))    # 4.19 GB

# Any model at any context
print(kv_gb(L=48, H=16, d=64, T=1_000_000))  # compute your own
```
For GQA models (Llama-3, Mistral), use the number of KV heads, not the full attention heads. For MHA models (GPT-2, older Llama), H = num_attention_heads.
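The head-count distinction is worth a concrete check: Llama-3-8B has 32 query heads but only 8 KV heads, so plugging in the full head count overstates the cache by exactly the GQA ratio:

```python
kv_gb = lambda L, H, d, T: 2 * L * H * d * T * 2 / 1e9

wrong = kv_gb(32, 32, 128, 128_000)  # num_attention_heads: overcounts for GQA
right = kv_gb(32, 8, 128, 128_000)   # num_key_value_heads: the actual allocation
print(f"{wrong:.1f} GB vs {right:.1f} GB ({wrong / right:.0f}x overestimate)")
```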
## How to read this from a HuggingFace config
```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
L = cfg.num_hidden_layers                       # 32
H = cfg.num_key_value_heads                     # 8 (GQA)
d = cfg.hidden_size // cfg.num_attention_heads  # 4096 // 32 = 128
T = 128_000                                     # your target context

kv_gb = 2 * L * H * d * T * 2 / 1e9
print(f"KV cache at {T//1000}K tokens: {kv_gb:.1f} GB")
```
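A small wrapper makes this portable across GQA and older MHA checkpoints; a sketch (the `getattr` fallback covers configs that predate `num_key_value_heads`):

```python
def kv_cache_gb(cfg, T: int) -> float:
    """FP16 KV cache size in GB for a T-token sequence, from an HF-style config."""
    L = cfg.num_hidden_layers
    # GQA configs expose num_key_value_heads; pure-MHA configs do not.
    H = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads)
    d = cfg.hidden_size // cfg.num_attention_heads
    return 2 * L * H * d * T * 2 / 1e9
```

For example, `kv_cache_gb(AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1"), 128_000)` returns about 16.8 GB.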
## The practical takeaway
For a 7B-class model, the KV cache at 128K is already 16 GB, as large as a mid-range GPU's total memory. At 1M tokens it is roughly 131 GB. The model weights themselves (in FP16) are only ~14 GB.

For 70B+ models the picture is worse: the FP16 weights alone are ~140 GB, and the KV cache adds another ~42 GB per 128K-token sequence, growing linearly with both context length and batch size. Long-context inference on large models without compression is a multi-GPU problem by default.
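One way to internalize this is to solve for the context length at which the cache equals the weights; everything beyond that point is cache-dominated. A sketch (the weight sizes are round-number assumptions):

```python
def crossover_tokens(weight_bytes: float, L: int, H: int, d: int) -> int:
    """Context length at which the FP16 KV cache matches the weights in size."""
    per_token_bytes = 2 * L * H * d * 2   # K and V, FP16
    return int(weight_bytes / per_token_bytes)

# 7B-class: ~14 GB FP16 weights, L=32, H=8 KV heads, d=128
print(crossover_tokens(14e9, 32, 8, 128))   # ~107K tokens
```

For a 70B-class model (~140 GB of weights, L=80, same KV heads and head dim), the crossover sits near 427K tokens.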
NexusQuant makes it a single-GPU problem.
```bash
pip install nexusquant-kv
```
GitHub: https://github.com/jagmarques/nexusquant
Best regards, João Marques