Jovan Chan

Originally published at runaihome.com

vLLM Won't Start? Every Fix for the Engine Init, CUDA, and OOM Errors (2026)

This article was originally published on runaihome.com

TL;DR: Most vLLM startup failures are one of three things: the engine reserves more KV-cache memory than your card has (No available memory for the cache blocks), the CUDA driver is older than the wheel was built for (The NVIDIA driver on your system is too old), or a multi-GPU run hangs at NCCL init. The fixes are nearly always flags, not code: pin --max-model-len, tune --gpu-memory-utilization, add --enforce-eager, or set a couple of NCCL env vars. Read the last line of the traceback first — it tells you which of the three you have.

What you'll be able to do after this:

  • Read a vLLM startup traceback and know in one glance whether it's a KV-cache/OOM problem, a driver/CUDA problem, or a multi-GPU networking hang.
  • Apply the exact flag or environment variable that fixes each class, with the values that actually work on 12–24 GB consumer cards.
  • Stop guessing from nvidia-smi — which lies about how much memory vLLM can actually use — and trust the startup log instead.

Honest take: vLLM is a server engine, not a desktop app. If you just want a model running on one consumer GPU with the least friction, Ollama or LM Studio will get you there faster. Reach for vLLM when you need throughput under concurrency — many requests at once — and you're willing to learn three flags. Once you know those flags, 90% of the "it won't start" pain disappears.

This guide assumes vLLM v0.23.0 (released June 13, 2026), which ships on PyTorch 2.11 with the default PyPI wheel now built for CUDA 13.0 and Python 3.14 added to the supported list. Older forum threads reference very different defaults, so the version tag matters when you're copying fixes from 2024–2025 posts.

Step 0: Read the actual error, not the wall of logs

vLLM prints a lot of output on startup — model download progress, worker spawn messages, CUDA graph capture. None of that is the error. The error is the last Python traceback, and specifically its final line. Three lines account for the overwhelming majority of "vLLM won't start" reports:

| The line you see | What it actually means | Jump to |
| --- | --- | --- |
| ValueError: No available memory for the cache blocks | KV cache doesn't fit after weights load | OOM section |
| RuntimeError: The NVIDIA driver on your system is too old | Wheel built for newer CUDA than your driver | Driver section |
| Hangs forever after Started a worker / NCCL lines | Multi-GPU collective setup stuck | NCCL section |

If you can't tell which bucket you're in, restart with debug logging on and capture the tail:

VLLM_LOGGING_LEVEL=DEBUG vllm serve Qwen/Qwen2.5-7B-Instruct 2>&1 | tee vllm.log

The DEBUG level is documented in vLLM's own troubleshooting guide and is the single most useful thing you can do before asking anyone for help.
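If you capture logs often, the triage can be sketched as a small script. This is a minimal sketch, not part of vLLM: the two error strings are the ones quoted in this guide, while the function name `classify_vllm_log`, the bucket labels, and the NCCL rule (no traceback, NCCL mentioned in the last 20 lines) are assumptions made for illustration.

```python
# Error substrings quoted in this guide, mapped to hypothetical bucket labels.
SIGNATURES = [
    ("kv-cache/oom", "No available memory for the cache blocks"),
    ("driver/cuda", "The NVIDIA driver on your system is too old"),
]


def classify_vllm_log(log_text: str) -> str:
    """Guess which of the three startup-failure buckets a captured log belongs to."""
    lines = [ln.strip() for ln in log_text.splitlines() if ln.strip()]

    # Known error messages win, wherever they appear in the log.
    for bucket, needle in SIGNATURES:
        if any(needle in ln for ln in lines):
            return bucket

    # Assumption: a hang produces no traceback at all, and the log
    # simply stops somewhere around the NCCL setup messages.
    has_traceback = any("Traceback (most recent call last)" in ln for ln in lines)
    if not has_traceback and any("NCCL" in ln for ln in lines[-20:]):
        return "nccl-hang?"

    return "unknown"
```

Calling `classify_vllm_log(open("vllm.log").read())` on the file written by the `tee` command gives a first guess; an `unknown` result means you still need to read the final traceback line yourself.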

The #1 startup error: "No available memory for the cache blocks"

This is the error people hit first, and it's been the top vLLM startup complaint since at least issue #2248. The full message reads:

ValueError: No available memory for the cache blocks. Try increasing
`gpu_memory_utilization` when initializing the engine.

Why it happens

vLLM loads the model weights first, then tries to carve the remaining VRAM into KV-cache blocks. The KV cache is sized from --max-model-len (the maximum sequence length) and the number of concurrent sequences. If the weights plus the requested context budget exceed your card, there's nothing left for blocks, and the engine refuses to start rather than crash mid-request.

The trap: vLLM's default --max-model-len is the model's full trained context — often 32K or higher. A 7B model at Q4 might be ~4.5 GB of weights, but a 32K context KV cache for several parallel sequences can dwarf that. On a 12 GB card the math simply doesn't close. This is exactly the failure mode reported for 7B–13B models on the RTX 3060 12GB in issue #27934 — and it affects every Ampere 12 GB card, not just the 3060.
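A back-of-envelope calculation shows why the context budget dominates. The sketch below uses the standard per-token KV-cache size (one key and one value vector per layer per KV head). The model shape values for Qwen2.5-7B-Instruct are assumptions to verify against the model's config.json, and the figure is worst-case demand with every sequence filled to the maximum length, not the amount vLLM reserves up front.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(max_model_len: int, num_seqs: int, **shape) -> float:
    """Worst case: every concurrent sequence filled to max_model_len."""
    return kv_bytes_per_token(**shape) * max_model_len * num_seqs / 2**30


# Assumed shape for Qwen2.5-7B-Instruct; check the model's config.json.
QWEN25_7B = dict(num_layers=28, num_kv_heads=4, head_dim=128)

full = kv_cache_gib(32768, 8, **QWEN25_7B)   # 8 sequences at 32K, 16-bit KV
pinned = kv_cache_gib(4096, 8, **QWEN25_7B)  # same load, context pinned to 4K
print(f"32K x 8 seqs: {full:.2f} GiB, 4K x 8 seqs: {pinned:.2f} GiB")
# prints "32K x 8 seqs: 14.00 GiB, 4K x 8 seqs: 1.75 GiB"
```

Under these assumptions, eight full-length sequences want more KV cache than a 12 GB card has in total, while the same eight sequences at 4K need under 2 GiB. Passing `dtype_bytes=1` models an 8-bit KV cache and halves both numbers.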

The fix, in the order to try it

1. Pin --max-model-len to what you actually need. This is the highest-leverage fix and most people skip it. If your prompts are 4K, don't pay for 32K of KV cache:

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85

2. Raise --gpu-memory-utilization. Default is 0.9. The error message tells you to increase it, and that's legitimate: it lets vLLM claim a larger slice of the card for blocks. On a dedicated inference box, 0.92–0.95 is reasonable. But on a card that's also driving a display, going too high starves the desktop and can crash X. On 12 GB cards, counterintuitively, lowering it to 0.75–0.80 sometimes fixes init OOMs because it leaves more headroom for the CUDA context and fragmentation overhead that the allocator needs up front.

3. Cap concurrency with --max-num-seqs. Fewer simultaneous sequences means a smaller KV-cache budget. Dropping from the default to --max-num-seqs 16 (or 8) frees real memory on tight cards.

4. Quantize the KV cache. --kv-cache-dtype fp8 roughly halves KV-cache memory at a small quality cost — often the difference between fitting and not on 16 GB.

5. Add --enforce-eager. CUDA graph capture pre-allocates extra memory. Disabling it with --enforce-eager reclaims roughly 300–500 MiB. Treat it as a last resort, for when you need context length more than peak throughput.

A combined low-VRAM launch that works on most 12 GB cards:

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 8 \
  --kv-cache-dtype fp8 \
  --enforce-eager
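If you would rather step through the fixes one at a time than apply them all at once, the ordering can be sketched as a list of progressively more conservative commands. This is an illustrative helper only: `fix_ladder` is a hypothetical name, the default values are the ones used in this guide, and the flags are the same ones shown in the commands above.

```python
import shlex


def fix_ladder(model, max_model_len, gpu_memory_utilization=0.85, max_num_seqs=8):
    """Return candidate `vllm serve` commands, each adding one more fix."""
    # Rung 0: pin the context length and set the memory fraction (fixes 1 and 2).
    cmd = ["vllm", "serve", model,
           "--max-model-len", str(max_model_len),
           "--gpu-memory-utilization", str(gpu_memory_utilization)]
    rungs = [list(cmd)]

    # Later rungs: cap concurrency, quantize the KV cache, disable CUDA graphs.
    for extra in (["--max-num-seqs", str(max_num_seqs)],
                  ["--kv-cache-dtype", "fp8"],
                  ["--enforce-eager"]):
        cmd = cmd + extra
        rungs.append(cmd)
    return rungs


for rung in fix_ladder("Qwen/Qwen2.5-7B-Instruct", 8192, gpu_memory_utilization=0.80):
    print(shlex.join(rung))
```

Try the first printed command, and move to the next only if startup still fails. The last rung matches the combined launch above, apart from number formatting (0.8 rather than 0.80).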

Don't trust nvidia-smi here

A subtle point that wastes hours: nvidia-smi reports driver-level reserved memory, not the segments the CUDA allocator can actually hand to vLLM. vLLM's block allocator queries CUDA directly and can OOM even when nvidia-smi shows a couple of GB "free." When the two disagree, trust vLLM's startup log, not the system monitor.
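To see memory from the CUDA side rather than through nvidia-smi, you can ask PyTorch directly. The sketch below uses `torch.cuda.mem_get_info`, which returns free and total bytes for a device. The helper names are hypothetical, and the budget function only restates how --gpu-memory-utilization is applied as a fraction of total memory. It is not vLLM's own profiling logic.

```python
def cuda_memory_snapshot(device: int = 0):
    """Return (free_bytes, total_bytes) as CUDA reports them, or None."""
    try:
        import torch  # optional: only needed for the live query
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.mem_get_info(device)


def vllm_budget_gib(total_bytes: int, gpu_memory_utilization: float) -> float:
    """Memory vLLM may claim: a fraction of the card's total, not of what is free."""
    return total_bytes * gpu_memory_utilization / 2**30


snapshot = cuda_memory_snapshot()
if snapshot is not None:
    free_b, total_b = snapshot
    print(f"CUDA sees {free_b / 2**30:.2f} GiB free of {total_b / 2**30:.2f} GiB")
    print(f"Budget at 0.80: {vllm_budget_gib(total_b, 0.80):.2f} GiB")
```

If the free figure is well below the budget, another process (often the desktop session) is already holding memory, and lowering --gpu-memory-utilization or freeing the card is the likely fix.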

If — and only if — the log explicitly mentions fragmentation, add:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Don't set this reflexively. It's a fix for fragmentation specifically, not a magic OOM cure, and the same PYTORCH_CUDA_ALLOC_CONF knob shows up across the broader CUDA out-of-memory fix guide for Ollama, llama.cpp, and ComfyUI too.

"The NVIDIA driver on your system is too old"

The second-most-common wall, especially right after a fresh pip install vllm:

RuntimeError: The NVIDIA driver on your system is too old (found version XXXX).

Why it happens

vLLM wheels are compiled against a specific CUDA toolkit. As of v0.23.0 the default PyPI wheel targets CUDA 13.0. If your installed driver predates that toolkit, the compiled kernels can't run. This bites people on stable LTS distros and on cloud images that pin older drivers.
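A quick way to confirm this diagnosis is to compare the installed driver version with the minimum your CUDA toolkit needs. The sketch below is a hedged example: the helper names are hypothetical, the nvidia-smi query flags are the standard ones, and the minimum version passed in the usage lines (580.65.06 for CUDA 13.0 on Linux) is an assumption to verify against NVIDIA's CUDA release notes.

```python
import subprocess


def parse_version(text: str) -> tuple:
    """Turn '580.65.06' into (580, 65, 6) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_new_enough(installed: str, minimum: str) -> bool:
    return parse_version(installed) >= parse_version(minimum)


def installed_driver_version():
    """Ask nvidia-smi for the driver version; None if it cannot be queried."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    # One line per GPU; all GPUs share the same driver, so take the first.
    return out.splitlines()[0].strip() if out.strip() else None


version = installed_driver_version()
if version is not None:
    # Assumed minimum for CUDA 13.0 on Linux; check NVIDIA's release notes.
    print(version, "ok" if driver_is_new_enough(version, "580.65.06") else "too old")
```

On a machine without nvidia-smi on the PATH the script prints nothing, which is itself a hint that the driver is missing rather than merely old.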

The fixes

Option A — enable CUDA forward compatibility (no driver upgrade). If you're on the official vLLM Docker image, add:

docker run --gpus all -e VLLM_ENABLE_CUDA_COMPATIBILITY=1 ... vllm/vllm-openai:latest

Outside Docker, install the matching cuda-compat package and point vLLM at it:

sudo apt-get install cuda-compat-13-0
export VLLM_ENABLE_CUDA_COMPATIBILITY=1
export VLLM_CUDA_COMPATIBILITY_PATH=/usr/local/cuda/compat

Option B — upgrade the driver. The cleaner long-term fix. Match your driver to the CUDA 13.0 toolkit minimum. On WSL2, install the latest Game Ready or Studio driver on the Windows host — never install a Linux GPU driver inside the WSL distro, which is the same trap that breaks Ollama GPU detection.

**Option C — install a wheel
