# How Much GPU Memory Does NexusQuant Actually Save?
KV cache compression numbers like "10x" sound impressive in a paper. But what does that mean in practice, for a real GPU, serving real users? Let me give you a concrete memory calculator so you can answer this for your own setup.
## The KV Cache Formula
For any transformer, the KV cache size is:

KV_bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element

The leading 2 covers keys AND values. num_kv_heads is the number of key/value heads, which for GQA models like Mistral-7B (8 KV heads serving 32 query heads) is smaller than the total head count. bytes_per_element is 2 for FP16, 4 for FP32.
For Mistral-7B (32 layers, 8 KV heads, head_dim=128, FP16):
KV_bytes = 2 × 32 × 8 × 128 × seq_len × 2
= 131,072 × seq_len bytes
≈ 128 KB per token
At a 128K-token context: 131,072 bytes × 128,000 tokens ≈ 16.8 GB just for the KV cache.
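A quick sanity check of these numbers (decimal GB throughout):

```python
# Per-token KV cost for Mistral-7B: 32 layers, 8 KV heads, head_dim 128, FP16.
per_token = 2 * 32 * 8 * 128 * 2
print(per_token)                      # 131072 bytes = 128 KiB per token

# Total for a 128K-token context, in decimal GB.
total_gb = per_token * 128_000 / 1e9
print(round(total_gb, 1))             # 16.8
```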
## GPU Memory Table
Here's what that means on real hardware:
| GPU | VRAM | Max KV tokens (FP16, no NQ) | With NexusQuant 10x | With NexusQuant 17x | With NexusQuant 33x |
|---|---|---|---|---|---|
| RTX 3090 | 24 GB | ~76K | ~760K | ~1.3M | ~2.5M |
| A10G | 24 GB | ~76K | ~760K | ~1.3M | ~2.5M |
| A100 40GB | 40 GB | ~200K | ~2M | ~3.4M | ~6.5M |
| A100 80GB | 80 GB | ~500K | ~5M | ~8.6M | ~16.6M |
| H100 80GB | 80 GB | ~500K | ~5M | ~8.6M | ~16.6M |

Numbers reserve ~14 GB for 7B model weights and activations; whatever remains goes to KV cache at 131,072 bytes per token.
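A minimal helper that reproduces the max-token column, assuming ~14 GB stays reserved for a 7B model's weights and activations (the figure used in the scenarios below). The name `max_kv_tokens` is mine, not part of NexusQuant:

```python
# Mistral-7B per-token KV cost: 2 (K and V) x 32 layers x 8 KV heads
# x head_dim 128 x 2 bytes (FP16) = 131,072 bytes.
PER_TOKEN_BYTES = 2 * 32 * 8 * 128 * 2

def max_kv_tokens(vram_gb: float, reserved_gb: float = 14.0,
                  compression: float = 1.0) -> int:
    """Tokens of KV cache that fit after reserving memory for weights."""
    free_bytes = (vram_gb - reserved_gb) * 1e9
    return int(free_bytes * compression / PER_TOKEN_BYTES)

print(max_kv_tokens(24))                  # ~76K on a 24 GB card, uncompressed
print(max_kv_tokens(80, compression=17))  # ~8.6M at the 17x preset
```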
## Practical Scenarios

### Scenario A: A10G serving 128K context

Without NexusQuant: 128,000 tokens × 131,072 bytes ≈ 16.8 GB. With only 24 GB total and ~14 GB for the 7B model, you're OOM before the first token generates.

With the NexusQuant 17x preset: 16.8 / 17 ≈ 0.99 GB for KV. Fits comfortably. You can now serve 128K context on a single A10G.
### Scenario B: A100 80GB, maximize throughput

A single request at 128K takes 16.8 GB of KV. With 80 GB total and ~14 GB for weights, you have ~66 GB for KV, enough for about 4 concurrent sessions at 128K.
At 17x compression, each session uses ~1 GB of KV. You can serve ~66 concurrent sessions instead of 4. That's 16x more users per GPU.
### Scenario C: Long context research

You want a 1M-token context on Mistral-7B. KV cache = 1,000,000 × 131,072 bytes ≈ 131 GB. Impossible on a single GPU.

At 17x: ~7.7 GB. Fits on a single A100 40GB alongside the model weights.
## Python Calculator
```python
def kv_cache_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    dtype_bytes: int = 2,  # FP16
) -> float:
    """Returns KV cache size in decimal GB."""
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes
    return total_bytes / 1e9


def savings_report(model_name: str, num_layers: int, num_kv_heads: int,
                   head_dim: int, seq_len: int, gpu_vram_gb: float,
                   model_weights_gb: float) -> None:
    base_gb = kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len)
    available = gpu_vram_gb - model_weights_gb
    print(f"\n=== {model_name} @ {seq_len:,} tokens ===")
    print(f"Base KV cache: {base_gb:.1f} GB")
    print(f"GPU available for KV (after weights): {available:.0f} GB")
    for ratio, label in [(10, "high quality"), (17, "balanced"), (33, "max")]:
        compressed_gb = base_gb / ratio
        fits = "YES" if compressed_gb < available else "NO"
        concurrent = int(available / compressed_gb)
        print(f"  {ratio}x ({label}): {compressed_gb:.2f} GB | "
              f"fits={fits} | concurrent sessions={concurrent}")


# Mistral-7B examples
savings_report("Mistral-7B", 32, 8, 128, 128_000, 80, 14)
savings_report("Mistral-7B", 32, 8, 128, 1_000_000, 80, 14)

# Llama-3-8B
savings_report("Llama-3-8B", 32, 8, 128, 128_000, 40, 16)

# Llama-3-70B (multi-GPU: assume 160 GB total)
savings_report("Llama-3-70B", 80, 8, 128, 128_000, 160, 140)
```
Sample output:
```
=== Mistral-7B @ 128,000 tokens ===
Base KV cache: 16.8 GB
GPU available for KV (after weights): 66 GB
  10x (high quality): 1.68 GB | fits=YES | concurrent sessions=39
  17x (balanced): 0.99 GB | fits=YES | concurrent sessions=66
  33x (max): 0.51 GB | fits=YES | concurrent sessions=129

=== Mistral-7B @ 1,000,000 tokens ===
Base KV cache: 131.1 GB
GPU available for KV (after weights): 66 GB
  10x (high quality): 13.11 GB | fits=YES | concurrent sessions=5
  17x (balanced): 7.71 GB | fits=YES | concurrent sessions=8
  33x (max): 3.97 GB | fits=YES | concurrent sessions=16

=== Llama-3-8B @ 128,000 tokens ===
Base KV cache: 16.8 GB
GPU available for KV (after weights): 24 GB
  10x (high quality): 1.68 GB | fits=YES | concurrent sessions=14
  17x (balanced): 0.99 GB | fits=YES | concurrent sessions=24
  33x (max): 0.51 GB | fits=YES | concurrent sessions=47

=== Llama-3-70B @ 128,000 tokens ===
Base KV cache: 41.9 GB
GPU available for KV (after weights): 20 GB
  10x (high quality): 4.19 GB | fits=YES | concurrent sessions=4
  17x (balanced): 2.47 GB | fits=YES | concurrent sessions=8
  33x (max): 1.27 GB | fits=YES | concurrent sessions=15
```
## The Model Weight Overhead
One thing papers often gloss over: the model itself takes memory. Here are real-world numbers:
| Model | Weights (BF16/FP16) |
|---|---|
| TinyLlama-1.1B | 2.2 GB |
| Mistral-7B | 14 GB |
| Llama-3-8B | 16 GB |
| Llama-3-70B | 140 GB (2× A100 80GB) |
| Llama-3-405B | ~810 GB |
At 70B+, KV cache compression becomes less about fitting the model and more about fitting the context. A 70B model's KV cache at 128K (~42 GB) is only about 30% of its own weights. NexusQuant is still valuable here, but the math changes: you're extending context headroom, not primarily enabling the model to load.
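To make the contrast concrete, a small sketch comparing KV cache size at 128K against the FP16 weight figures from the table above (the helper name is mine, not from any library; both models use 8 KV heads and head_dim 128):

```python
# KV cache in decimal GB for a 128K-token context, FP16.
def kv_at_128k_gb(num_layers: int) -> float:
    return 2 * num_layers * 8 * 128 * 128_000 * 2 / 1e9

for name, layers, weights_gb in [("Mistral-7B", 32, 14),
                                 ("Llama-3-70B", 80, 140)]:
    kv = kv_at_128k_gb(layers)
    # For the 7B, KV exceeds the weights; for the 70B it is ~30% of them.
    print(f"{name}: KV {kv:.1f} GB = {kv / weights_gb:.0%} of weights")
```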
## What We Measured
All numbers in NexusQuant are validated with `torch.cuda.memory_allocated()` before and after cache materialization. No estimate-only ratios. The paper's compression column includes all overhead: scales (FP16), lattice indices (packed uint8), and metadata. If it's not in memory, it's not in the ratio.
Repo: github.com/jagmarques/nexusquant

```
pip install nexusquant-kv
```

```python
from nexusquant import nexusquant_evict

with nexusquant_evict(model, quality="balanced"):  # 17x
    output = model.generate(input_ids, max_new_tokens=512)
```
Best regards, João Marques