João André Gomes Marques

How Much GPU Memory Does NexusQuant Actually Save?

KV cache compression numbers like "10x" sound impressive in a paper. But what does that mean in practice, for a real GPU, serving real users? Let me give you a concrete memory calculator so you can answer this for your own setup.

The KV Cache Formula

For any transformer, the per-sequence KV cache size is:

KV_bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element

The 2 is for keys AND values. bytes_per_element is 2 for FP16, 4 for FP32. Note that num_kv_heads is the number of key/value heads; with grouped-query attention this is smaller than the number of query heads (8 KV heads vs 32 query heads for Mistral-7B).

For Mistral-7B (32 layers, 8 KV heads, head_dim=128, FP16):

KV_bytes = 2 × 32 × 8 × 128 × seq_len × 2
         = 131,072 × seq_len bytes
         ≈ 128 KB per token

At a 128K context (128,000 tokens): 131,072 bytes × 128,000 ≈ 16.8 GB just for the KV cache.
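You can sanity-check that arithmetic in a couple of lines of Python; the values here are just the Mistral-7B numbers from the formula above:

```python
# Per-token KV footprint for Mistral-7B: keys + values across all layers.
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2  # FP16
per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(per_token)                   # 131072 bytes = 128 KB per token
print(per_token * 128_000 / 1e9)   # ~16.8 GB for a 128K-token context
```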

GPU Memory Table

Here's what that means on real hardware:

GPU         VRAM   Max KV tokens (FP16, no NQ)   NexusQuant 10x   NexusQuant 17x   NexusQuant 33x
RTX 3090    24 GB  ~122K                         ~1.2M            ~2.1M            ~4.0M
A10G        24 GB  ~122K                         ~1.2M            ~2.1M            ~4.0M
A100 40GB   40 GB  ~244K                         ~2.4M            ~4.2M            ~8.1M
A100 80GB   80 GB  ~549K                         ~5.5M            ~9.3M            ~18.1M
H100 80GB   80 GB  ~549K                         ~5.5M            ~9.3M            ~18.1M

These figures leave 8 GB of headroom for model weights and activations on 7B-class models: the base column is the remaining VRAM divided by 128 KB per token, and each NexusQuant column multiplies it by the compression ratio.
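The base column can be reproduced with a short helper. This is just a sketch of the estimate behind the table, assuming the same 8 GB headroom and 131,072 bytes of KV per token; `max_kv_tokens` is not part of NexusQuant, only an illustration:

```python
# Max KV tokens that fit after reserving headroom for weights/activations.
# Assumes Mistral-7B-style KV: 131,072 bytes (128 KB) per token in FP16.
def max_kv_tokens(vram_gb: float, headroom_gb: float = 8.0,
                  bytes_per_token: int = 131_072) -> int:
    return int((vram_gb - headroom_gb) * 1e9 / bytes_per_token)

for name, vram in [("RTX 3090 / A10G", 24), ("A100 40GB", 40),
                   ("A100 80GB / H100", 80)]:
    base = max_kv_tokens(vram)
    print(f"{name}: ~{base:,} tokens base, ~{base * 17:,} tokens at 17x")
```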

Practical Scenarios

Scenario A: A10G serving 128K context

Without NexusQuant: 128,000 tokens × 131,072 bytes ≈ 16.8 GB. With only 24 GB total and ~14 GB for the 7B model's weights, you're OOM before the first token generates.

With the NexusQuant 17x preset: 16.8 / 17 ≈ 0.99 GB for KV. Fits comfortably. You can now serve 128K context on a single A10G.

Scenario B: A100 80GB, maximize throughput

A single request at 128K takes ~16.8 GB of KV. With 80 GB total and ~14 GB for weights, you have ~66 GB for KV: enough for about 4 concurrent sessions at 128K.

At 17x compression, each session uses ~1 GB of KV. You can serve ~66 concurrent sessions instead of 4. That's 16x more users per GPU.

Scenario C: Long context research

You want a 1M-token context on Mistral-7B. KV cache = 1,000,000 tokens × 131,072 bytes ≈ 131 GB. Impossible on a single GPU.

At 17x: ~7.7 GB. Fits on a single A100 40GB alongside the model weights.

Python Calculator

def kv_cache_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    dtype_bytes: int = 2,  # FP16
) -> float:
    """Returns KV cache size in GB."""
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes
    return total_bytes / (1024 ** 3)


def savings_report(model_name: str, num_layers: int, num_kv_heads: int,
                   head_dim: int, seq_len: int, gpu_vram_gb: float,
                   model_weights_gb: float) -> None:
    base_gb = kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len)
    available = gpu_vram_gb - model_weights_gb

    print(f"
=== {model_name} @ {seq_len:,} tokens ===")
    print(f"Base KV cache: {base_gb:.1f} GB")
    print(f"GPU available for KV (after weights): {available:.0f} GB")

    for ratio, label in [(10, "high quality"), (17, "balanced"), (33, "max")]:
        compressed_gb = base_gb / ratio
        concurrent = int(available / compressed_gb)
        fits = "YES" if base_gb / ratio < available else "NO"
        print(f"  {ratio}x ({label}): {compressed_gb:.2f} GB | fits={fits} | concurrent sessions={concurrent}")


# Mistral-7B examples
savings_report("Mistral-7B", 32, 8, 128, 128_000, 80, 14)
savings_report("Mistral-7B", 32, 8, 128, 1_000_000, 80, 14)

# Llama-3-8B
savings_report("Llama-3-8B", 32, 8, 128, 128_000, 40, 16)

# Llama-3-70B (multi-GPU: assume 160 GB total)
savings_report("Llama-3-70B", 80, 8, 128, 128_000, 160, 140)

Sample output:

=== Mistral-7B @ 128,000 tokens ===
Base KV cache: 16.8 GB
GPU available for KV (after weights): 66 GB
  10x (high quality): 1.68 GB | fits=YES | concurrent sessions=39
  17x (balanced): 0.99 GB | fits=YES | concurrent sessions=66
  33x (max): 0.51 GB | fits=YES | concurrent sessions=129

=== Mistral-7B @ 1,000,000 tokens ===
Base KV cache: 131.1 GB
GPU available for KV (after weights): 66 GB
  10x (high quality): 13.11 GB | fits=YES | concurrent sessions=5
  17x (balanced): 7.71 GB | fits=YES | concurrent sessions=8
  33x (max): 3.97 GB | fits=YES | concurrent sessions=16

=== Llama-3-8B @ 128,000 tokens ===
Base KV cache: 16.8 GB
GPU available for KV (after weights): 24 GB
  10x (high quality): 1.68 GB | fits=YES | concurrent sessions=14
  17x (balanced): 0.99 GB | fits=YES | concurrent sessions=24
  33x (max): 0.51 GB | fits=YES | concurrent sessions=47

=== Llama-3-70B @ 128,000 tokens ===
Base KV cache: 41.9 GB
GPU available for KV (after weights): 20 GB
  10x (high quality): 4.19 GB | fits=YES | concurrent sessions=4
  17x (balanced): 2.47 GB | fits=YES | concurrent sessions=8
  33x (max): 1.27 GB | fits=YES | concurrent sessions=15

The Model Weight Overhead

One thing papers often gloss over: the model itself takes memory. Here are real-world numbers:

Model            Weights (BF16/FP16)
TinyLlama-1.1B   2.2 GB
Mistral-7B       14 GB
Llama-3-8B       16 GB
Llama-3-70B      140 GB (2× A100 80GB)
Llama-3-405B     ~810 GB

At 70B+, KV cache compression becomes less about fitting the model and more about fitting the context. A 70B model at 128K context uses ~42 GB of KV cache, well under a third of its 140 GB of weights. NexusQuant is still valuable here, but the math changes: you're extending context headroom and concurrency, not preventing an immediate OOM.
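Plugging Llama-3-70B's shape into the same formula (80 layers, 8 KV heads, head_dim 128, as in the calculator above) makes that ratio concrete:

```python
# KV cache vs weights for Llama-3-70B at a 128K-token context, FP16.
num_layers, num_kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
kv_gb = 2 * num_layers * num_kv_heads * head_dim * 128_000 * dtype_bytes / 1e9
print(f"KV: {kv_gb:.1f} GB vs ~140 GB of weights ({kv_gb / 140:.2f}x)")
```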

What We Measured

All numbers in NexusQuant are validated with torch.cuda.memory_allocated() before and after cache materialization. No estimate-only ratios. The paper's compression column includes all overhead: scales (FP16), lattice indices (packed uint8), and metadata. If it's not in memory, it's not in the ratio.

Repo: github.com/jagmarques/nexusquant

pip install nexusquant-kv
from nexusquant import nexusquant_evict

with nexusquant_evict(model, quality="balanced"):  # 17x
    output = model.generate(input_ids, max_new_tokens=512)

Best regards, João Marques
