Jovan Chan

Posted on Jun 1 • Originally published at aifoss.dev

vLLM Setup Guide 2026: Serve Any LLM via OpenAI API

#linux #selfhosted #ai #opensource


TL;DR: By the end of this guide you'll be serving any Hugging Face model through a local OpenAI-compatible API endpoint. vLLM v0.21.0 handles the heavy lifting — PagedAttention, continuous batching, multi-GPU tensor parallelism — once you give it the right flags. The bottleneck is VRAM, not setup complexity.

What you'll have running after this guide:

A vLLM server on http://localhost:8000/v1 accepting standard OpenAI API requests
A tested endpoint you can point any OpenAI Python client, LangChain app, or chat frontend at
Optional: Docker-based deployment with API-key authentication for team use

Honest take: pip install vllm is genuinely that simple. The decision points are which GPU flags to set for your hardware and whether to go bare-metal or Docker. Both paths are below.

vLLM is open-source under the Apache 2.0 license — no usage restrictions for commercial or internal deployments.

Prerequisites

Before you start, check your hardware against what vLLM can actually serve:

Setup	GPU VRAM	System RAM	What runs well
Entry point	16 GB (e.g. RTX 4080)	32 GB	Llama 3.2 3B (FP16), Mistral 7B
Comfortable	24 GB (e.g. RTX 3090)	64 GB	Llama 3.1 8B (FP16), Qwen2.5 14B (quantized)
Multi-user API	48 GB+ (2× RTX 4090)	128 GB	Llama 3.1 70B (INT4), DeepSeek V3-lite
Cloud alternative	—	—	RunPod A100 or H100 on-demand

If your GPU VRAM is under 16 GB, you'll be constrained to smaller models or quantized variants — that's fine, just set --dtype float16 and --max-model-len conservatively (covered in Step 4).
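As a rough way to check a model against the table above, weight memory is approximately parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate, not something vLLM exposes: the helper name is hypothetical, the 4-bit figure (0.5 bytes per parameter) ignores quantization overhead, and KV cache and activation memory are left out entirely.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Hypothetical helper for illustration; not part of vLLM.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,  # assumption: ignores scale/zero-point overhead
}


def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for size_b, dtype in [(3, "float16"), (8, "float16"),
                          (70, "int4"), (70, "bfloat16")]:
        print(f"{size_b}B in {dtype}: ~{weight_memory_gb(size_b, dtype):.0f} GB")
```

Whatever the estimate says, leave room on top of it: the server also needs VRAM for the KV cache and runtime overhead.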

Software requirements:

Linux — Ubuntu 22.04 or 24.04 recommended. macOS and Windows are not supported by vLLM.
Python 3.10–3.14 (Python 3.12 is the sweet spot; tested and well-supported)
NVIDIA GPU with CUDA 12.4 and driver 550+. Verify: nvidia-smi
At least 50 GB free disk space for model weights
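A quick preflight script can check most of the list above before you install anything. This is a minimal sketch using only the Python standard library; the function name and the choice to measure free space on the home directory's filesystem (where the Hugging Face cache usually lives) are assumptions, and it only checks that nvidia-smi is on the PATH, not the driver or CUDA version.

```python
import shutil
import sys
from pathlib import Path


def preflight(min_free_gb: float = 50.0, path: str = "") -> dict:
    """Collect basic environment facts for the requirements listed above."""
    target = path or str(Path.home())  # assumption: weights are cached under $HOME
    free_gb = shutil.disk_usage(target).free / 1e9
    return {
        "python_ok": sys.version_info >= (3, 10),
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "free_disk_gb": round(free_gb, 1),
        "disk_ok": free_gb >= min_free_gb,
    }


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```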

Not on Linux? Use Ollama instead — it supports macOS and Windows natively with near-identical API compatibility. See Ollama vs vLLM 2026 for the full comparison.

Step 1: Install vLLM

Use a fresh Python virtual environment. The vLLM wheel ships pre-compiled CUDA kernels that are tied to a specific PyTorch and CUDA version, so installing vLLM into an existing environment causes version conflicts that are hard to debug.

python3 -m venv ~/.venvs/vllm
source ~/.venvs/vllm/bin/activate

pip install vllm

The wheel bundles PyTorch 2.11 and pre-compiled CUDA kernels. Expect a 5–10 minute install on a fresh environment — most of the time is download, not compilation.

Verify the install:

python -c "import vllm; print(vllm.__version__)"
# 0.21.0

If you'd rather use conda to manage the environment:

conda create -n vllm-env python=3.12 -y
conda activate vllm-env
pip install vllm   # still use pip, not conda install

Using conda for environment isolation is fine, but install the package itself with pip: the conda-forge build lags behind and often has NCCL conflicts with multi-GPU setups.

Step 2: Serve your first model

The vllm serve command starts an HTTP server. Pass any Hugging Face model ID:

vllm serve meta-llama/Llama-3.2-3B-Instruct

On first run, the model downloads to ~/.cache/huggingface/hub/. Subsequent runs load from cache — fast.

By default the server listens on port 8000 and is reachable at http://localhost:8000/v1, the same /v1 path layout the OpenAI API uses. You'll see startup output like this when it's ready:

INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
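Model loading can take a while, so scripts that start the server usually need to wait for it. One hedged way to do that is to poll the /v1/models endpoint until it answers with HTTP 200. The function name and timeouts below are illustrative, and the sketch assumes the server was started without --api-key (an authenticated server would reject this unauthenticated request).

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str = "http://localhost:8000/v1/models",
                    timeout_s: float = 300.0,
                    interval_s: float = 2.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not reachable yet (connection refused, timeout, HTTP error)
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    print("ready" if wait_for_server() else "gave up waiting")
```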

For gated models (Meta's Llama repositories, including Llama 3.1, 3.2, and 3.3, and similar models), you need a Hugging Face access token. Get one at huggingface.co/settings/tokens, accept the model's terms on the model page, then:

export HF_TOKEN=hf_your_token_here
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct

Step 3: Test the endpoint

The /v1/chat/completions endpoint is OpenAI-compatible. Test with curl before touching any application code:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is PagedAttention?"}],
    "max_tokens": 200
  }'

The model field must match exactly what you passed to vllm serve, including the full Hugging Face path. If you served a different model in Step 2, substitute its ID here. Check what the server sees as its loaded model:

curl http://localhost:8000/v1/models
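The same check can be done from Python so that client code picks up the served model ID instead of hard-coding it. This sketch assumes the standard OpenAI-style list response (a "data" array of objects with an "id" field); the helper names are hypothetical.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8000/v1") -> list[str]:
    """Query a running server; raises if the server is unreachable."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return model_ids(json.load(resp))


if __name__ == "__main__":
    print(fetch_model_ids())
```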

With the OpenAI Python client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",  # any non-empty string works when --api-key is not set
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize vLLM in two sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)

Streaming works identically to the OpenAI SDK — pass stream=True and iterate response. vLLM also supports the /v1/completions (legacy text completion) and /v1/embeddings endpoints.
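A minimal streaming sketch, assuming the openai package is installed and the server from Step 2 is running without --api-key. The helper names are illustrative; the chunk shape (choices[0].delta.content) is the OpenAI SDK's streaming format.

```python
def collect_stream(chunks) -> str:
    """Join the text deltas from an OpenAI-style streaming response."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        delta = chunk.choices[0].delta.content
        if delta:  # role-only and final chunks may carry None
            parts.append(delta)
    return "".join(parts)


def stream_chat(prompt: str, model: str,
                base_url: str = "http://localhost:8000/v1") -> str:
    """Stream one chat completion from the local server and return its text."""
    from openai import OpenAI  # imported lazily so collect_stream has no dependency

    client = OpenAI(base_url=base_url, api_key="not-used")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        stream=True,
    )
    return collect_stream(stream)


if __name__ == "__main__":
    print(stream_chat("Summarize vLLM in two sentences.",
                      "meta-llama/Meta-Llama-3.1-8B-Instruct"))
```

In an interactive app you would print each delta as it arrives rather than joining them at the end.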

Step 4: Key configuration flags

The defaults work out of the box for most single-user setups. These flags matter when you hit limits or serve real traffic.

GPU memory utilization

vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90

Default is 0.90 (90% of VRAM reserved for vLLM). Raise to 0.95 if a model barely fits. Lower to 0.80 if other processes share the GPU. The remaining headroom prevents OOM errors from CUDA context overhead.
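To make the trade-off concrete, the arithmetic behind this flag can be sketched as follows. This is a simplification with a hypothetical helper name: it treats everything inside the budget that is not weights as available for KV cache, and ignores activation memory and other runtime overhead.

```python
def kv_cache_budget_gb(total_vram_gb: float,
                       weights_gb: float,
                       gpu_memory_utilization: float = 0.90) -> float:
    """Approximate VRAM left for KV cache inside vLLM's memory budget.

    Simplification: ignores activations and other runtime overhead.
    A negative result means the weights alone exceed the budget.
    """
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_vram_gb * gpu_memory_utilization - weights_gb


if __name__ == "__main__":
    # Assumed example: a 24 GB GPU holding roughly 16 GB of FP16 weights.
    for util in (0.80, 0.90, 0.95):
        print(f"{util:.2f}: ~{kv_cache_budget_gb(24, 16, util):.1f} GB for KV cache")
```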

Data type

vllm serve ... --dtype float16

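auto (default) follows the dtype in the model's config, which is bfloat16 for most recent models. float16 and bfloat16 both use 2 bytes per parameter, so switching between them does not save memory; pass --dtype float16 when your GPU predates Ampere and lacks native bfloat16 support. If you run out of GPU memory, use a quantized checkpoint or lower --max-model-len instead. float32 is almost never what you want; it doubles VRAM usage.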

Context window cap

vllm serve ... --max-model-len 8192

vLLM defaults to the model's maximum context length (often 128k for modern models). vLLM sizes its KV cache from --gpu-memory-utilization, but that cache must be able to hold at least one sequence of the maximum length, so a long default can keep a model from starting on a smaller GPU even if your requests only use 2k tokens. Setting --max-model-len 8192 lowers that requirement and raises the number of full-length requests the server can guarantee to run concurrently. Use the longest context your actual use case needs, not the model's theoretical maximum.
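To see why the cap matters, note that the KV cache for a single sequence grows linearly with context length: each token stores one key and one value vector per layer per KV head. The sketch below works through that arithmetic. The helper names are hypothetical, and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache values) is an assumption based on the published Llama 3.1 8B config; check your model's config.json for its real values.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: keys plus values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_len: int, **model_shape) -> float:
    """KV cache size in GiB for one sequence of `context_len` tokens."""
    return context_len * kv_bytes_per_token(**model_shape) / 2**30


# Assumed shape for Llama 3.1 8B; verify against the model's config.json.
LLAMA_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for ctx in (8192, 32768, 131072):
        print(f"{ctx:>6} tokens: {kv_cache_gib(ctx, **LLAMA_8B):.1f} GiB")
```

Under these assumptions a single 128k-token sequence needs 16 GiB of cache, while an 8k-token sequence needs 1 GiB.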

Host and port

vllm serve ... --host 0.0.0.0 --port 8080

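--host 0.0.0.0 exposes the server on all network interfaces, which Docker containers and remote team access require. Do not assume the default is localhost-only: the startup log in Step 2 already reports 0.0.0.0. To restrict the server to the local machine, pass --host 127.0.0.1 explicitly.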

API key authentication

vllm serve ... --api-key your-secret-key-here

Once set, all requests must include:

Authorization: Bearer your-secret-key-here

The OpenAI Python client handles this automatically when you pass api_key="your-secret-key-here" to the OpenAI() constructor.
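For clients that do not use the OpenAI SDK, the header has to be set by hand. A minimal standard-library sketch follows; the function name is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_chat_request(api_key: str, model: str, prompt: str,
                       base_url: str = "http://localhost:8000/v1"
                       ) -> urllib.request.Request:
    """Build an authenticated POST request for /v1/chat/completions."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # must match --api-key
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request("your-secret-key-here",
                             "meta-llama/Meta-Llama-3.1-8B-Instruct",
                             "What is PagedAttention?")
    with urllib.request.urlopen(req, timeout=60) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```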

Step 5: Multi-GPU tensor parallelism

For models larger than a single GPU's VRAM, vLLM distributes the model across GPUs via tensor parallelism. The --tensor-parallel-size value must divide evenly into the model's attention head count — for most models, 2, 4, or 8 GPUs work.

vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max-model-len 32768

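Mind the memory math: Llama 3.1 70B in bfloat16 needs about 140 GB for weights alone (70B parameters × 2 bytes), so the command above requires far more than two RTX 4090s (48 GB combined VRAM). On 2× RTX 4090, serve a 4-bit quantized checkpoint instead, as the prerequisites table indicates, and expect to lower --max-model-len as well. Check GPU visibility first: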

nvidia-smi

Multi-GPU containers need extra shared memory for NCCL communication — handled in the Docker step below. If you're running bare-metal, vLLM manages this automatically.
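The divisibility rule for --tensor-parallel-size described in this step can be sketched as a small helper. The function name is hypothetical, the head count of 64 used in the example is an assumption taken from the published Llama 3.1 70B config, and vLLM may enforce further constraints beyond this one check.

```python
def valid_tp_sizes(num_attention_heads: int, num_gpus: int) -> list[int]:
    """Return GPU counts up to `num_gpus` that evenly divide the head count."""
    return [n for n in range(1, num_gpus + 1)
            if num_attention_heads % n == 0]


if __name__ == "__main__":
    # Assumed: 64 attention heads (Llama 3.1 70B), 8 GPUs available.
    print(valid_tp_sizes(64, 8))
```

With 64 heads and 8 GPUs this yields 1, 2, 4, and 8, which matches the 2, 4, or 8 GPU guidance above.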

Step 6: Docker deployment

For repeatabl
