Most people dismiss million-token context windows as a hardware problem. It is not. It is a math problem — and the math has a solution.
The Raw Numbers
A 70B-class model (Llama-2-70B: 80 layers, 64 attention heads, 128 head-dim, no GQA) stores its KV cache at 2 bytes per element (fp16). The KV cache per token is:
bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_element
= 2 * 80 * 64 * 128 * 2
= 2,621,440 bytes ≈ 2.6 MB/token
(the leading 2 counts the K and V tensors). At 1M tokens: ~2.6 TB. Two H100s hold 160 GB combined. You are 16× short.
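As a quick sanity check, here is that arithmetic in code, plus how far an uncompressed cache actually gets you. This assumes a Llama-2-70B shape (80 layers, 64 MHA heads, head-dim 128, fp16) and counts the KV cache only:

```python
# KV bytes per token: K and V (2) x layers x heads x head_dim x fp16 bytes,
# assuming a Llama-2-70B shape (80 layers, 64 heads, head_dim 128).
bytes_per_token = 2 * 80 * 64 * 128 * 2
print(bytes_per_token)  # 2621440, i.e. ~2.6 MB/token

# Context length two 80 GB H100s can hold with no compression (KV cache only):
budget_bytes = 160e9
print(int(budget_bytes // bytes_per_token))  # 61035 tokens -- nowhere near 1M
```

Uncompressed, two H100s cap out around 61K tokens of 70B KV cache, before counting weights or activations.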
The Compression Table
| Model | Context | No compression | 5× | 10× | 17× | 33× |
|---|---|---|---|---|---|---|
| 7B | 1M tokens | 524 GB | 105 GB | 52 GB | 31 GB | 16 GB |
| 13B | 1M tokens | 819 GB | 164 GB | 82 GB | 48 GB | 25 GB |
| 70B | 1M tokens | 2,621 GB | 524 GB | 262 GB | 154 GB | 79 GB |
| 70B | 128K tokens | 344 GB | 69 GB | 34 GB | 20 GB | 10 GB |
(Figures count the KV cache only, assuming Llama-2-class MHA shapes; model weights need their own memory or quantization.)
17× compression: 70B at 1M tokens drops to 154 GB, fitting on 2× H100 (160 GB).
33× compression: 70B at 1M tokens drops to 79 GB, fitting on a single H100 (80 GB).
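The 17× and 33× figures are break-even driven. A sketch of the compression ratio needed to hit a given GPU budget, assuming the Llama-2-70B shape (80 layers, 64 heads, head-dim 128, fp16) and counting the KV cache only:

```python
# Uncompressed 70B KV cache at 1M tokens, in GB (KV cache only).
base_gb = 1_000_000 * 80 * 2 * 64 * 128 * 2 / 1e9  # ~2621 GB

for label, target_gb in [("2x H100", 160), ("1x H100", 80)]:
    print(f"{label}: need {base_gb / target_gb:.1f}x compression")
# 2x H100: need 16.4x compression
# 1x H100: need 32.8x compression
```

16.4× and 32.8× are the break-even points; the 17× and 33× presets clear them with a small margin.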
The Python Formula
```python
# (num_layers, num_heads) for standard Llama-2-class MHA shapes
MODEL_CONFIGS = {
    7: (32, 32),
    13: (40, 40),
    70: (80, 64),
}

def kv_cache_gb(
    model_params_b,         # e.g. 70 for 70B
    context_length,         # e.g. 1_000_000
    compression_ratio=1,    # NexusQuant preset ratio
    bytes_per_element=2,    # fp16
):
    num_layers, num_heads = MODEL_CONFIGS[model_params_b]
    head_dim = 128
    kv_bytes = (
        context_length
        * num_layers
        * 2                 # K and V
        * num_heads
        * head_dim
        * bytes_per_element
    )
    # Note: GQA models cut this by num_heads / num_kv_heads
    # (e.g. 8 KV heads on a 70B -> 8x smaller before any compression).
    return kv_bytes / compression_ratio / 1e9

# Examples
print(kv_cache_gb(70, 1_000_000, compression_ratio=17))  # ~154 GB
print(kv_cache_gb(70, 1_000_000, compression_ratio=33))  # ~79 GB
print(kv_cache_gb(7, 1_000_000, compression_ratio=5))    # ~105 GB
```
What This Means in Practice
NexusQuant presets map directly to GPU configurations:
- Preset S (5×): 7B, 1M context → 105 GB → 2× H100
- Preset M (10×): 13B, 1M context → 82 GB → 2× H100 (or a single 94 GB H100 NVL)
- Preset L (17×): 70B, 1M context → 154 GB → 2× H100
- Preset XL (33×): 70B, 1M context → 79 GB → single H100
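The mapping is just a ceiling division against per-GPU memory. A minimal sketch (the PRESETS dict, shape table, and `h100s_needed` helper here are illustrative assumptions, not the NexusQuant API; KV cache only, weights not counted):

```python
import math

PRESETS = {"S": 5, "M": 10, "L": 17, "XL": 33}       # compression ratio per preset
SHAPES = {7: (32, 32), 13: (40, 40), 70: (80, 64)}   # params_b -> (layers, heads)

def h100s_needed(params_b, context_length, preset, gpu_gb=80):
    layers, heads = SHAPES[params_b]
    # KV GB after compression: ctx x layers x 2 (K,V) x heads x head_dim x fp16
    gb = context_length * layers * 2 * heads * 128 * 2 / PRESETS[preset] / 1e9
    return math.ceil(gb / gpu_gb)

print(h100s_needed(70, 1_000_000, "L"))   # 2
print(h100s_needed(70, 1_000_000, "XL"))  # 1
```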
The bottleneck was never the model weights. It was always the KV cache. The math is solved.
```python
with nexusquant(model, preset="L"):  # 17x, -0.03% quality
    output = model.generate(million_token_prompt)
```
Best regards, João Marques