Most people dismiss million-token context windows as a hardware problem. It is not. It is a math problem — and the math has a solution.
The Raw Numbers
A 70B-class model (Llama-2-70B: 80 layers, 64 attention heads, 128 head-dim, no GQA) stores its KV cache at 2 bytes per element (fp16). The KV cache per token is:
bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_element
= 2 * 80 * 64 * 128 * 2
= 2,621,440 bytes ≈ 2.6 MB/token
(the leading 2 counts the K and V tensors). At 1M tokens: ~2.6 TB. Two H100s hold 160 GB combined. You are 16× short.
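As a quick sanity check, here is that arithmetic in code, plus how far an uncompressed cache actually gets you. This assumes a Llama-2-70B shape (80 layers, 64 MHA heads, head-dim 128, fp16) and counts the KV cache only:

```python
# KV bytes per token: K and V (2) x layers x heads x head_dim x fp16 bytes,
# assuming a Llama-2-70B shape (80 layers, 64 heads, head_dim 128).
bytes_per_token = 2 * 80 * 64 * 128 * 2
print(bytes_per_token)  # 2621440, i.e. ~2.6 MB/token

# Context length two 80 GB H100s can hold with no compression (KV cache only):
budget_bytes = 160e9
print(int(budget_bytes // bytes_per_token))  # 61035 tokens -- nowhere near 1M
```

Uncompressed, two H100s cap out around 61K tokens of 70B KV cache, before counting weights or activations.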
The Compression Table
| Model | Context | No compression | 5× | 10× | 17× | 33× |
|---|---|---|---|---|---|---|
| 7B | 1M tokens | 524 GB | 105 GB | 52 GB | 31 GB | 16 GB |
| 13B | 1M tokens | 819 GB | 164 GB | 82 GB | 48 GB | 25 GB |
| 70B | 1M tokens | 2,621 GB | 524 GB | 262 GB | 154 GB | 79 GB |
| 70B | 128K tokens | 344 GB | 69 GB | 34 GB | 20 GB | 10 GB |
(Figures count the KV cache only, assuming Llama-2-class MHA shapes; model weights need their own memory or quantization.)
17× compression: 70B at 1M tokens drops to 154 GB, fitting on 2× H100 (160 GB).
33× compression: 70B at 1M tokens drops to 79 GB, fitting on a single H100 (80 GB).
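The 17× and 33× figures are break-even driven. A sketch of the compression ratio needed to hit a given GPU budget, assuming the Llama-2-70B shape (80 layers, 64 heads, head-dim 128, fp16) and counting the KV cache only:

```python
# Uncompressed 70B KV cache at 1M tokens, in GB (KV cache only).
base_gb = 1_000_000 * 80 * 2 * 64 * 128 * 2 / 1e9  # ~2621 GB

for label, target_gb in [("2x H100", 160), ("1x H100", 80)]:
    print(f"{label}: need {base_gb / target_gb:.1f}x compression")
# 2x H100: need 16.4x compression
# 1x H100: need 32.8x compression
```

16.4× and 32.8× are the break-even points; the 17× and 33× presets clear them with a small margin.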
The Python Formula
```python
# (num_layers, num_heads) for standard Llama-2-class MHA shapes
MODEL_CONFIGS = {
    7: (32, 32),
    13: (40, 40),
    70: (80, 64),
}

def kv_cache_gb(
    model_params_b,         # e.g. 70 for 70B
    context_length,         # e.g. 1_000_000
    compression_ratio=1,    # NexusQuant preset ratio
    bytes_per_element=2,    # fp16
):
    num_layers, num_heads = MODEL_CONFIGS[model_params_b]
    head_dim = 128
    kv_bytes = (
        context_length
        * num_layers
        * 2                 # K and V
        * num_heads
        * head_dim
        * bytes_per_element
    )
    # Note: GQA models cut this by num_heads / num_kv_heads
    # (e.g. 8 KV heads on a 70B -> 8x smaller before any compression).
    return kv_bytes / compression_ratio / 1e9

# Examples
print(kv_cache_gb(70, 1_000_000, compression_ratio=17))  # ~154 GB
print(kv_cache_gb(70, 1_000_000, compression_ratio=33))  # ~79 GB
print(kv_cache_gb(7, 1_000_000, compression_ratio=5))    # ~105 GB
```
What This Means in Practice
NexusQuant presets map directly to GPU configurations:
- Preset S (5×): 7B, 1M context → 105 GB → 2× H100
- Preset M (10×): 13B, 1M context → 82 GB → 2× H100 (or a single 94 GB H100 NVL)
- Preset L (17×): 70B, 1M context → 154 GB → 2× H100
- Preset XL (33×): 70B, 1M context → 79 GB → single H100
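The mapping is just a ceiling division against per-GPU memory. A minimal sketch (the PRESETS dict, shape table, and `h100s_needed` helper here are illustrative assumptions, not the NexusQuant API; KV cache only, weights not counted):

```python
import math

PRESETS = {"S": 5, "M": 10, "L": 17, "XL": 33}       # compression ratio per preset
SHAPES = {7: (32, 32), 13: (40, 40), 70: (80, 64)}   # params_b -> (layers, heads)

def h100s_needed(params_b, context_length, preset, gpu_gb=80):
    layers, heads = SHAPES[params_b]
    # KV GB after compression: ctx x layers x 2 (K,V) x heads x head_dim x fp16
    gb = context_length * layers * 2 * heads * 128 * 2 / PRESETS[preset] / 1e9
    return math.ceil(gb / gpu_gb)

print(h100s_needed(70, 1_000_000, "L"))   # 2
print(h100s_needed(70, 1_000_000, "XL"))  # 1
```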
The bottleneck was never the model weights. It was always the KV cache. The math is solved.
```python
with nexusquant(model, preset="L"):  # 17x, -0.03% quality
    output = model.generate(million_token_prompt)
```
Best regards, João Marques