GPU memory is the binding constraint for LLM deployment. The model's parameters must reside in VRAM alongside everything the runtime needs: the key-value cache, intermediate activations, and the serving framework's own buffers. Getting this budget wrong in either direction has real consequences. Underprovisioning leads to OOM crashes under load.
Overprovisioning means paying for VRAM that sits idle, and the difference between a two-GPU and a four-GPU configuration can run $2,000-$4,000 per month.
The weight formula
Memory (GB) = Parameters (B) x Bytes per Parameter
The formula covers weights only. In practice, budget an additional 20-40% for the KV cache, activations, and framework overhead.
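The formula is easy to script. A minimal sketch, using the standard bytes-per-parameter values (FP32 = 4, FP16/BF16 = 2, INT8 = 1, INT4 = 0.5); the model sizes and the mid-range 30% overhead factor are illustrative assumptions, not measurements:

```python
# Bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Memory (GB) = Parameters (B) x Bytes per Parameter."""
    return params_billion * BYTES_PER_PARAM[dtype]

def with_overhead(weights_gb: float, overhead: float = 0.30) -> float:
    """Add an assumed 30% allowance for KV cache, activations, and framework buffers."""
    return weights_gb * (1 + overhead)

for params in (7, 13, 70):
    w = weight_memory_gb(params, "fp16")
    print(f"{params}B FP16: {w:.0f} GB weights, ~{with_overhead(w):.0f} GB provisioned")
```

The overhead factor is the part to tune per workload: 20% may suffice for short contexts and small batches, while long-context, high-concurrency serving can blow well past 40%.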
The KV cache is where teams underestimate
Model weights are the predictable part. What makes GPU sizing deceptive is the key-value cache: for each concurrent request, the model stores key and value vectors for every token in the sequence, and this cache grows linearly with context length and batch size.
A 70B model's weights might fit comfortably on two A100 80GB GPUs. But add KV cache for 32K context across 8 concurrent requests and you need another 40+ GB on top of that.
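That growth is easy to sanity-check with the standard per-token KV formula: 2 (one each for K and V) x layers x KV heads x head dim x bytes per element. The 70B-class shape below (80 layers, 8 grouped-query KV heads, head dim 128, FP16) is an assumed configuration for illustration:

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 bytes_per_elem: int, context_len: int, batch_size: int) -> float:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes, per token."""
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    total_tokens = context_len * batch_size
    return per_token_bytes * total_tokens / 2**30

# Assumed 70B-class shape with grouped-query attention:
# 80 layers, 8 KV heads, head_dim 128, FP16 (2 bytes per element).
print(kv_cache_gib(80, 8, 128, 2, context_len=32_768, batch_size=8))  # → 80.0 GiB
```

Even with grouped-query attention the cache rivals the weights at this context and batch size; with full multi-head attention (64 KV heads instead of 8) it would be 8x larger, which is why serving frameworks manage KV memory so aggressively.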
Quick sizing reference
Approximate requirements in FP16 with vLLM, batch size 8, 4K context:
Quantization is the single most impactful lever. INT4 cuts memory by 75% compared to FP16, and for most production inference tasks the quality difference is negligible.
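The 75% figure falls straight out of the bytes-per-parameter ratio. A quick sanity check (a sketch that ignores the small extra cost of quantization scales and zero-points):

```python
def savings_vs_fp16(bytes_per_param: float) -> float:
    """Fraction of FP16 (2 bytes/param) weight memory saved by a lower-precision format."""
    return 1 - bytes_per_param / 2.0

print(f"INT8 saves {savings_vs_fp16(1.0):.0%}")  # 50%
print(f"INT4 saves {savings_vs_fp16(0.5):.0%}")  # 75%
# For a 70B model: 140 GB of FP16 weights drop to 35 GB in INT4 --
# from two 80 GB GPUs down to a single GPU with headroom to spare.
```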
Calculate it for your workload
The formulas above work for back-of-envelope estimates, but real workloads involve specific batch sizes, context distributions, and throughput targets. We built a GPU sizing calculator that estimates VRAM, throughput, and latency using a roofline model validated against vLLM benchmarks.
The methodology is public if you want to verify the assumptions. The model catalog lets you filter and compare across providers on pricing, benchmarks, and capabilities.