The Problem Hits at Batch Size 4
Load Llama 3.1 70B on two A100 80GB GPUs with vLLM's default settings, and you'll get about three batches in before CUDA throws OutOfMemoryError. Not gradual slowdown — instant crash.
This isn't a "close the browser tabs" situation. The math doesn't add up: 70B parameters at FP16 is roughly 140GB. Two A100s give you 160GB. That's 20GB headroom for KV cache, which should handle at least 8-10 concurrent requests at 2048 tokens each. But vLLM dies at 4.
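That headroom estimate can be sanity-checked with back-of-envelope arithmetic. The per-token figure below is an assumption based on Llama 3.1 70B's published architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128), not anything vLLM reports; with those numbers, 20GB of headroom looks even roomier than 8-10 requests:

```python
# Back-of-envelope KV cache math for Llama 3.1 70B at FP16.
# Architecture numbers (80 layers, 8 KV heads via GQA, head_dim 128)
# are assumptions taken from the published model config.
BYTES_FP16 = 2
layers, kv_heads, head_dim = 80, 8, 128

# K and V tensors per token, summed across all layers
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * BYTES_FP16
kv_gb_per_request = kv_bytes_per_token * 2048 / 1024**3

headroom_gb = 2 * 80 - 140  # two A100 80GB minus FP16 weights
max_requests = int(headroom_gb // kv_gb_per_request)

print(f"{kv_bytes_per_token} bytes of KV per token")      # 327680, ~0.31 MB
print(f"{kv_gb_per_request:.3f} GB per 2048-token request")
print(f"{max_requests} concurrent requests fit in {headroom_gb} GB")
```

With GQA, each 2048-token request needs only ~0.63GB of KV cache, so naive math says dozens of requests should fit. Which makes the crash at 4 even stranger, until you see where the budget actually goes.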
The issue shows up in production when you scale from the demo (batch size 1) to actual traffic. Single requests work fine. Queue up five users asking 1500-token questions, and the server crashes.
Why the Default KV Cache Allocation Fails
vLLM pre-allocates GPU memory for the KV cache based on gpu_memory_utilization, which defaults to 0.9. Sounds reasonable — leave 10% free for overhead.
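For reference, this is the knob in question. The launch command below is a sketch (the model name is an assumption), but `--gpu-memory-utilization` and `--tensor-parallel-size` are vLLM's actual CLI flags:

```shell
# Sketch of a two-GPU launch; 0.90 is the default being discussed,
# so passing it explicitly here changes nothing.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90
```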
Here's what actually happens. gpu_memory_utilization caps vLLM's total footprint, not just the cache: the 140GB of model weights come out of the same budget, and KV cache blocks only get whatever is left over. On two 80GB GPUs, that's:
$$
0.9 \times 160\,\text{GB} = 144\,\text{GB total budget}
$$

Subtract the 140GB of weights and only about 4GB remains for KV cache blocks: a few thousand tokens' worth, shared across every active request. Single prompts squeak by. A handful of 1500-token requests blows straight past it, and the allocator crashes.
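The budget arithmetic is worth running end to end. The ~0.31 MB/token KV figure below is an assumption from Llama 3.1 70B's GQA config (80 layers, 8 KV heads, head dimension 128):

```python
# What vLLM's 0.9 budget actually leaves for KV cache on 2x A100 80GB.
total_gb = 2 * 80
budget_gb = 0.9 * total_gb   # gpu_memory_utilization caps the TOTAL footprint
weights_gb = 140             # Llama 3.1 70B at FP16
kv_cache_gb = budget_gb - weights_gb

# ~327,680 bytes of KV per token, assuming the GQA config above
kv_bytes_per_token = 2 * 80 * 8 * 128 * 2
tokens = kv_cache_gb * 1024**3 / kv_bytes_per_token

print(f"{kv_cache_gb:.0f} GB left for KV cache -> ~{tokens:,.0f} tokens total")
```

Roughly 13,000 tokens of cache, pooled across all requests. Four or five 1500-token prompts, plus the tokens they generate, exhaust it. That's the batch-size-4 crash.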