Before you can compress something, you need to know how big it is.
Most engineers know the KV cache is "large" but few have actually calculated the exact number. This post gives you the formula, a table for popular models, and a one-liner to compute it yourself.
## The formula
KV cache bytes = 2 × L × H × d × T × 2
Where:
- 2 — one K tensor and one V tensor
- L — number of transformer layers
- H — number of attention heads (or KV heads for GQA models)
- d — head dimension (= hidden_size / num_heads)
- T — sequence length in tokens
- 2 — bytes per value in FP16
That's it. No approximation: this is the exact FP16 allocation per sequence. For batched serving, multiply by the batch size.
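As a sanity check, here is the multiplication written out for the first table entry below, Llama-3-8B at a 4K (4096-token) context:

```python
# Llama-3-8B: L=32 layers, H=8 KV heads (GQA), d=128 head dim.
L, H, d, T = 32, 8, 128, 4096
k_and_v = 2      # one K tensor and one V tensor per layer
fp16_bytes = 2   # bytes per value in FP16
kv_bytes = k_and_v * L * H * d * T * fp16_bytes
print(f"{kv_bytes:,} bytes = {kv_bytes / 1e9:.2f} GB")  # 536,870,912 bytes = 0.54 GB
```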
## Memory table: popular models

### Llama-3-8B (L=32, H=8 KV heads, d=128)
| Context | KV cache |
|---|---|
| 4K tokens | 0.5 GB |
| 32K tokens | 4 GB |
| 128K tokens | 16 GB |
### Mistral-7B (L=32, H=8 KV heads, d=128)
| Context | KV cache |
|---|---|
| 4K tokens | 0.5 GB |
| 32K tokens | 4 GB |
| 128K tokens | 16 GB |
### Llama-3-70B (L=80, H=8 KV heads, d=128)

| Context | KV cache |
|---|---|
| 4K tokens | 1.3 GB |
| 32K tokens | 10.5 GB |
| 128K tokens | 42 GB |

Same KV heads and head dimension as the 8B, but 2.5× the layers, so 2.5× the cache.
### Mixtral-8x7B (L=32, H=8 KV heads, d=128)
| Context | KV cache |
|---|---|
| 4K tokens | 0.5 GB |
| 32K tokens | 4 GB |
| 128K tokens | 16 GB |
Note: Mixtral uses the same attention architecture as Mistral-7B per expert; MoE only affects the FFN layers, so KV cache size is identical.
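All four tables can be regenerated in a few lines from the formula; a quick sketch using the layer/head/dimension values listed above (note that with L=80 and 8 KV heads, Llama-3-70B works out to about 42 GB at 128K):

```python
kv_gb = lambda L, H, d, T: 2 * L * H * d * T * 2 / 1e9

models = {                      # name: (layers, KV heads, head dim)
    "Llama-3-8B":   (32, 8, 128),
    "Mistral-7B":   (32, 8, 128),
    "Llama-3-70B":  (80, 8, 128),
    "Mixtral-8x7B": (32, 8, 128),
}
for name, (L, H, d) in models.items():
    row = "  ".join(f"{T // 1000}K: {kv_gb(L, H, d, T):5.1f} GB"
                    for T in (4_000, 32_000, 128_000))
    print(f"{name:<13} {row}")
```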
## After NexusQuant compression

### Llama-3-8B at 128K context (baseline: 16 GB)
| Preset | Compression | KV cache after | PPL delta |
|---|---|---|---|
high |
10x | 1.6 GB | +0.4% |
balanced |
17x | 0.94 GB | +1.3% |
max |
33x | 0.48 GB | +2.6% |
### Llama-3-70B at 128K context (baseline: 42 GB, on top of ~140 GB of FP16 weights)

| Preset | Compression | KV cache after |
|---|---|---|
| high | 10x | 4.2 GB |
| balanced | 17x | 2.5 GB |
| max | 33x | 1.3 GB |
At the high preset, a 70B model's 128K-token KV cache drops from 42 GB, more than half of an 80 GB A100, to roughly 4 GB, leaving the weights as the only remaining multi-GPU constraint.
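The after-compression sizes are just the baseline divided by the compression ratio (the presets and ratios are NexusQuant's claims; the arithmetic below only reproduces the division):

```python
# Llama-3-70B at 128K: baseline KV cache from the formula (~41.9 GB).
baseline_gb = 2 * 80 * 8 * 128 * 128_000 * 2 / 1e9
for preset, ratio in [("high", 10), ("balanced", 17), ("max", 33)]:
    print(f"{preset:<9} {ratio}x -> {baseline_gb / ratio:.1f} GB")
```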
## Python one-liner
```python
kv_gb = lambda L, H, d, T: 2 * L * H * d * T * 2 / 1e9
```
Examples:
```python
# Llama-3-8B at 128K
print(kv_gb(32, 8, 128, 128_000))   # 16.78 GB

# Llama-3-70B at 128K
print(kv_gb(80, 8, 128, 128_000))   # 41.94 GB

# Mistral-7B at 32K
print(kv_gb(32, 8, 128, 32_000))    # 4.19 GB

# Any model at any context
print(kv_gb(L=48, H=16, d=64, T=1_000_000))  # compute your own
```
For GQA models (Llama-3, Mistral), use the number of KV heads, not the full attention heads. For MHA models (GPT-2, older Llama), H = num_attention_heads.
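The head-count distinction is worth a concrete check: Llama-3-8B has 32 query heads but only 8 KV heads, so plugging in the full head count overstates the cache by exactly the GQA ratio:

```python
kv_gb = lambda L, H, d, T: 2 * L * H * d * T * 2 / 1e9

wrong = kv_gb(32, 32, 128, 128_000)  # num_attention_heads: overcounts for GQA
right = kv_gb(32, 8, 128, 128_000)   # num_key_value_heads: the actual allocation
print(f"{wrong:.1f} GB vs {right:.1f} GB ({wrong / right:.0f}x overestimate)")
```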
## How to read this from a HuggingFace config
```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
L = cfg.num_hidden_layers                       # 32
H = cfg.num_key_value_heads                     # 8 (GQA)
d = cfg.hidden_size // cfg.num_attention_heads  # 4096 // 32 = 128
T = 128_000                                     # your target context

kv_gb = 2 * L * H * d * T * 2 / 1e9
print(f"KV cache at {T//1000}K tokens: {kv_gb:.1f} GB")
```
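A small wrapper makes this portable across GQA and older MHA checkpoints; a sketch (the `getattr` fallback covers configs that predate `num_key_value_heads`):

```python
def kv_cache_gb(cfg, T: int) -> float:
    """FP16 KV cache size in GB for a T-token sequence, from an HF-style config."""
    L = cfg.num_hidden_layers
    # GQA configs expose num_key_value_heads; pure-MHA configs do not.
    H = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads)
    d = cfg.hidden_size // cfg.num_attention_heads
    return 2 * L * H * d * T * 2 / 1e9
```

For example, `kv_cache_gb(AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1"), 128_000)` returns about 16.8 GB.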
## The practical takeaway
For a 7B-class model, the KV cache at 128K is already 16 GB, as large as a mid-range GPU's total memory. At 1M tokens it is roughly 131 GB. The model weights themselves (in FP16) are only ~14 GB.

For 70B+ models the picture is worse: the FP16 weights alone are ~140 GB, and the KV cache adds another ~42 GB per 128K-token sequence, growing linearly with both context length and batch size. Long-context inference on large models without compression is a multi-GPU problem by default.
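One way to internalize this is to solve for the context length at which the cache equals the weights; everything beyond that point is cache-dominated. A sketch (the weight sizes are round-number assumptions):

```python
def crossover_tokens(weight_bytes: float, L: int, H: int, d: int) -> int:
    """Context length at which the FP16 KV cache matches the weights in size."""
    per_token_bytes = 2 * L * H * d * 2   # K and V, FP16
    return int(weight_bytes / per_token_bytes)

# 7B-class: ~14 GB FP16 weights, L=32, H=8 KV heads, d=128
print(crossover_tokens(14e9, 32, 8, 128))   # ~107K tokens
```

For a 70B-class model (~140 GB of weights, L=80, same KV heads and head dim), the crossover sits near 427K tokens.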
NexusQuant makes it a single-GPU problem.
```bash
pip install nexusquant-kv
```
GitHub: https://github.com/jagmarques/nexusquant
Best regards, João Marques