The Problem Hits at Batch Size 4
Load Llama 3.1 70B on two A100 80GB GPUs with vLLM's default settings, and you'll get about three batches in before CUDA throws OutOfMemoryError. Not gradual slowdown — instant crash.
This isn't a "close the browser tabs" situation. The math doesn't add up: 70B parameters at FP16 is roughly 140GB. Two A100s give you 160GB. That's 20GB headroom for KV cache, which should handle at least 8-10 concurrent requests at 2048 tokens each. But vLLM dies at 4.
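That headroom estimate can be sanity-checked with back-of-envelope arithmetic. The per-token figure below is an assumption based on Llama 3.1 70B's published architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128), not anything vLLM reports; with those numbers, 20GB of headroom looks even roomier than 8-10 requests:

```python
# Back-of-envelope KV cache math for Llama 3.1 70B at FP16.
# Architecture numbers (80 layers, 8 KV heads via GQA, head_dim 128)
# are assumptions taken from the published model config.
BYTES_FP16 = 2
layers, kv_heads, head_dim = 80, 8, 128

# K and V tensors per token, summed across all layers
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * BYTES_FP16
kv_gb_per_request = kv_bytes_per_token * 2048 / 1024**3

headroom_gb = 2 * 80 - 140  # two A100 80GB minus FP16 weights
max_requests = int(headroom_gb // kv_gb_per_request)

print(f"{kv_bytes_per_token} bytes of KV per token")      # 327680, ~0.31 MB
print(f"{kv_gb_per_request:.3f} GB per 2048-token request")
print(f"{max_requests} concurrent requests fit in {headroom_gb} GB")
```

With GQA, each 2048-token request needs only ~0.63GB of KV cache, so naive math says dozens of requests should fit. Which makes the crash at 4 even stranger, until you see where the budget actually goes.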
The issue shows up in production when you scale from the demo (batch size 1) to actual traffic. Single requests work fine. Queue up five users asking 1500-token questions, and the server crashes.
Why the Default KV Cache Allocation Fails
vLLM pre-allocates GPU memory for the KV cache based on gpu_memory_utilization, which defaults to 0.9. Sounds reasonable — leave 10% free for overhead.
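For reference, this is the knob in question. The launch command below is a sketch (the model name is an assumption), but `--gpu-memory-utilization` and `--tensor-parallel-size` are vLLM's actual CLI flags:

```shell
# Sketch of a two-GPU launch; 0.90 is the default being discussed,
# so passing it explicitly here changes nothing.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90
```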
Here's what actually happens. gpu_memory_utilization caps vLLM's total footprint, not just the cache: the 140GB of model weights come out of the same budget, and KV cache blocks only get whatever is left over. On two 80GB GPUs, that's:
$$
0.9 \times 160\,\text{GB} = 144\,\text{GB total budget}
$$

Subtract the 140GB of weights and only about 4GB remains for KV cache blocks: a few thousand tokens' worth, shared across every active request. Single prompts squeak by. A handful of 1500-token requests blows straight past it, and the allocator crashes.
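The budget arithmetic is worth running end to end. The ~0.31 MB/token KV figure below is an assumption from Llama 3.1 70B's GQA config (80 layers, 8 KV heads, head dimension 128):

```python
# What vLLM's 0.9 budget actually leaves for KV cache on 2x A100 80GB.
total_gb = 2 * 80
budget_gb = 0.9 * total_gb   # gpu_memory_utilization caps the TOTAL footprint
weights_gb = 140             # Llama 3.1 70B at FP16
kv_cache_gb = budget_gb - weights_gb

# ~327,680 bytes of KV per token, assuming the GQA config above
kv_bytes_per_token = 2 * 80 * 8 * 128 * 2
tokens = kv_cache_gb * 1024**3 / kv_bytes_per_token

print(f"{kv_cache_gb:.0f} GB left for KV cache -> ~{tokens:,.0f} tokens total")
```

Roughly 13,000 tokens of cache, pooled across all requests. Four or five 1500-token prompts, plus the tokens they generate, exhaust it. That's the batch-size-4 crash.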