This article was originally published on aifoss.dev
TL;DR: By the end of this guide you'll be serving any Hugging Face model through a local OpenAI-compatible API endpoint. vLLM v0.21.0 handles the heavy lifting — PagedAttention, continuous batching, multi-GPU tensor parallelism — once you give it the right flags. The bottleneck is VRAM, not setup complexity.
What you'll have running after this guide:
- A vLLM server on http://localhost:8000/v1 accepting standard OpenAI API requests
- A tested endpoint you can point any OpenAI Python client, LangChain app, or chat frontend at
- Optional: Docker-based deployment with API-key authentication for team use
Honest take: pip install vllm is genuinely that simple. The decision points are which GPU flags to set for your hardware and whether to go bare-metal or Docker. Both paths are below.
vLLM is open-source under the Apache 2.0 license — no usage restrictions for commercial or internal deployments.
Prerequisites
Before you start, check your hardware against what vLLM can actually serve:
| Setup | GPU VRAM | System RAM | What runs well |
|---|---|---|---|
| Entry point | 16 GB (e.g. RTX 4080) | 32 GB | Llama 3.2 3B (FP16), Mistral 7B |
| Comfortable | 24 GB (e.g. RTX 3090) | 64 GB | Llama 3.1 8B (FP16), Qwen2.5 14B |
| Multi-user API | 48 GB+ (2× RTX 4090) | 128 GB | Llama 3.1 70B (INT4), DeepSeek V3-lite |
| Cloud alternative | — | — | RunPod A100 or H100 on-demand |
If your GPU has less than 16 GB of VRAM, you'll be constrained to smaller models or quantized variants. That's fine: set --max-model-len conservatively, and use --dtype float16 on older GPUs without bfloat16 support (both flags are covered in Step 4).
Software requirements:
- Linux — Ubuntu 22.04 or 24.04 recommended. macOS and Windows are not supported by vLLM.
- Python 3.10–3.14 (Python 3.12 is the sweet spot; tested and well-supported)
- NVIDIA GPU with CUDA 12.4 and driver 550+. Verify: nvidia-smi
- At least 50 GB free disk space for model weights
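The requirements above can be checked with a short script. This is a minimal sketch, not an official vLLM tool: the helper names (`parse_gpu_csv`, `check_host`) and the 15,000 MiB threshold are assumptions made for illustration, and the sample text stands in for real `nvidia-smi` output so the block runs on a machine without a GPU.

```python
import sys

MIN_PYTHON = (3, 10)      # lower bound from the requirements list
ENTRY_VRAM_MIB = 15_000   # assumed cutoff: 16 GB cards report slightly under 16384 MiB


def parse_gpu_csv(text):
    """Parse `name, memory.total` CSV rows into (name, MiB) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def check_host(gpus, python_version=sys.version_info[:2]):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if python_version < MIN_PYTHON:
        problems.append(f"Python {python_version} is older than {MIN_PYTHON}")
    if not gpus:
        problems.append("no NVIDIA GPU reported")
    for name, mib in gpus:
        if mib < ENTRY_VRAM_MIB:
            problems.append(f"{name}: {mib} MiB is below the 16 GB entry tier")
    return problems


# Hard-coded sample so the sketch runs anywhere; replace with real output.
sample = "NVIDIA GeForce RTX 4090, 24564\nNVIDIA GeForce RTX 3060, 12288\n"
gpus = parse_gpu_csv(sample)
print(check_host(gpus, python_version=(3, 12)))
```

On a real host, feed `parse_gpu_csv` the output of `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits`, for example via `subprocess.run(..., capture_output=True, text=True).stdout`.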
Not on Linux? Use Ollama instead — it supports macOS and Windows natively with near-identical API compatibility. See Ollama vs vLLM 2026 for the full comparison.
Step 1: Install vLLM
Use a fresh Python virtual environment. vLLM's wheels ship CUDA kernels built against a specific PyTorch and CUDA version, so installing vLLM into an existing environment tends to cause version conflicts that are hard to debug.
python3 -m venv ~/.venvs/vllm
source ~/.venvs/vllm/bin/activate
pip install vllm
The wheel ships pre-compiled CUDA kernels and pulls in a matching PyTorch build (2.11 for this release) as a dependency. Expect a 5–10 minute install on a fresh environment; most of that time is download, not compilation.
Verify the install:
python -c "import vllm; print(vllm.__version__)"
# 0.21.0
If you'd rather use conda to manage the environment:
conda create -n vllm-env python=3.12 -y
conda activate vllm-env
pip install vllm # still use pip, not conda install
Using conda for isolation is fine, but install the package itself with pip: the conda-forge build lags behind and often has NCCL conflicts with multi-GPU setups.
Step 2: Serve your first model
The vllm serve command starts an HTTP server. Pass any Hugging Face model ID:
vllm serve meta-llama/Llama-3.2-3B-Instruct
On first run, the model downloads to ~/.cache/huggingface/hub/. Subsequent runs load from cache — fast.
The server listens on port 8000 by default and serves the API under /v1, so http://localhost:8000/v1 works as the OpenAI base URL. You'll see startup output like this when it's ready:
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
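Model loading can take minutes, so scripts that start the server usually need to wait for it. The sketch below polls a URL until it answers with HTTP 200. It assumes the server exposes a `/health` route, as vLLM's OpenAI-compatible server does in the versions I am aware of; the function names and the injectable `probe` and `sleep` parameters are illustrative choices, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def http_ok(url):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url, timeout_s=600.0, interval_s=2.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        sleep(interval_s)
    return False
```

Typical use would be `wait_until_ready("http://localhost:8000/health")` before sending the first request; if your version lacks that route, polling `http://localhost:8000/v1/models` should serve the same purpose.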
Gated models, which include Meta's Llama releases (the Llama 3.2 model above as well as Llama 3.1 and 3.3), need a Hugging Face access token. Get one at huggingface.co/settings/tokens, accept the model's terms on the model page, then:
export HF_TOKEN=hf_your_token_here
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct
Step 3: Test the endpoint
The /v1/chat/completions endpoint is OpenAI-compatible. Test with curl before touching any application code:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "What is PagedAttention?"}],
"max_tokens": 200
}'
The model field must match exactly what you passed to vllm serve, including the full Hugging Face path. Check what the server sees as its loaded model:
curl http://localhost:8000/v1/models
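To avoid typing the model name by hand, a client can read it from that same endpoint. This is a small sketch using only the standard library; `model_ids` and `fetch_model_ids` are hypothetical helper names, and the sample payload is trimmed to the fields the OpenAI list-models format defines.

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model IDs from a decoded /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url="http://localhost:8000/v1"):
    """Ask a running server which model IDs it will accept."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return model_ids(json.load(resp))


# Shape of the response, trimmed to the fields used here.
sample = {
    "object": "list",
    "data": [{"id": "meta-llama/Meta-Llama-3.1-8B-Instruct", "object": "model"}],
}
print(model_ids(sample))
```

With the server running, `fetch_model_ids()[0]` gives a value that is safe to pass as the model field in the requests that follow.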
With the OpenAI Python client:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-used", # any non-empty string works when --api-key is not set
)
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Summarize vLLM in two sentences."}],
max_tokens=200,
)
print(response.choices[0].message.content)
Streaming works identically to the OpenAI SDK — pass stream=True and iterate response. vLLM also supports the /v1/completions (legacy text completion) and /v1/embeddings endpoints.
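A streaming call might look like the sketch below. `stream_reply` is a hypothetical helper, written to accept any client object with the OpenAI SDK's `chat.completions.create` interface; `main` assumes the server from Step 2 is running on localhost and that the `openai` package is installed.

```python
def stream_reply(client, model, prompt, max_tokens=200):
    """Yield text fragments from a streamed chat completion."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta (role-only or final chunk).
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


def main():
    # Imported here so stream_reply itself has no hard dependency on the SDK.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
    model = "meta-llama/Meta-Llama-3.1-8B-Instruct"
    for piece in stream_reply(client, model, "Summarize vLLM in two sentences."):
        print(piece, end="", flush=True)
    print()
```

Call `main()` once the server is up. Printing with `flush=True` makes tokens appear as they arrive instead of when the buffer fills.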
Step 4: Key configuration flags
The defaults work out of the box for most single-user setups. These flags matter when you hit limits or serve real traffic.
GPU memory utilization
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.90
Default is 0.90 (90% of VRAM reserved for vLLM). Raise to 0.95 if a model barely fits. Lower to 0.80 if other processes share the GPU. The remaining headroom prevents OOM errors from CUDA context overhead.
Data type
vllm serve ... --dtype float16
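auto (default) follows the dtype of the checkpoint, which is bfloat16 for most recent models. float16 and bfloat16 both use 2 bytes per parameter, so switching between them does not reduce memory usage; set float16 when your GPU predates Ampere and lacks bfloat16 support. If you run out of GPU memory, lower --max-model-len or use a quantized checkpoint instead. float32 is almost never what you want; it doubles VRAM usage.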
Context window cap
vllm serve ... --max-model-len 8192
vLLM defaults to the model's maximum context length (often 128k for modern models). At startup it must be able to fit the KV cache for at least one sequence of that length into whatever remains of the memory budget after the weights load, even if your requests only use 2k tokens; when it cannot, the server refuses to start. Setting --max-model-len 8192 shrinks that requirement, which lets larger models fit on the same card and leaves the cache available for more concurrent requests. Use the longest context your actual use case needs, not the model's theoretical maximum.
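The effect of the context cap is easy to estimate by hand. The sketch below computes the KV cache size of one full-length sequence, using the published Llama 3.1 8B architecture (32 layers, 8 key/value heads, head dimension 128) and 2-byte cache entries. It is back-of-the-envelope arithmetic and ignores activation memory and allocator overhead.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Llama 3.1 8B: 32 layers, 8 KV heads (grouped-query attention), head dim 128.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

GIB = 1024 ** 3
for context_len in (131072, 32768, 8192):
    size_gib = per_token * context_len / GIB
    print(f"{context_len:>7} tokens -> {size_gib:.1f} GiB for one full-length sequence")
```

One token costs 128 KiB here, so a single 128k-token sequence needs 16 GiB of cache on top of roughly 15 GiB of 16-bit weights, while an 8k cap needs 1 GiB.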
Host and port
vllm serve ... --host 0.0.0.0 --port 8080
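--host 0.0.0.0 binds the server to all network interfaces, which Docker containers and remote team access require. The startup log in Step 2 already shows 0.0.0.0, so do not assume the default is localhost-only: pass --host 127.0.0.1 explicitly for local-only access, and set an API key before exposing the port to a network.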
API key authentication
vllm serve ... --api-key your-secret-key-here
Once set, all requests must include:
Authorization: Bearer your-secret-key-here
The OpenAI Python client handles this automatically when you pass api_key="your-secret-key-here" to the OpenAI() constructor.
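For clients that do not use the OpenAI SDK, the header has to be set by hand. The sketch below builds the request with the standard library and does not send it; `build_chat_request` is a hypothetical helper, and the key and model name are the placeholders used elsewhere in this guide.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt):
    """Build (but do not send) an authenticated chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000/v1",
    "your-secret-key-here",
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "ping",
)
print(req.get_header("Authorization"))
```

Sending it is a matter of `urllib.request.urlopen(req)`. A request without the header, or with the wrong key, is expected to be rejected with an HTTP 401.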
Step 5: Multi-GPU tensor parallelism
For models larger than a single GPU's VRAM, vLLM distributes the model across GPUs via tensor parallelism. The --tensor-parallel-size value must divide evenly into the model's attention head count — for most models, 2, 4, or 8 GPUs work.
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--max-model-len 32768
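In bfloat16 the 70B weights alone take roughly 140 GB (70 billion parameters × 2 bytes), so this command needs about 160 GB of combined VRAM, for example two 80 GB A100 or H100 cards. Two RTX 4090s (48 GB combined VRAM) can only serve a 4-bit quantized 70B checkpoint, as in the prerequisites table. Check GPU visibility first: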
nvidia-smi
Multi-GPU containers need extra shared memory for NCCL communication — handled in the Docker step below. If you're running bare-metal, vLLM manages this automatically.
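Two quick checks help when choosing --tensor-parallel-size: whether the GPU count divides the attention head count, and how much weight memory lands on each GPU. The helpers below are an illustrative sketch with made-up names. They count weights only, so real usage is higher once the KV cache and runtime overhead are added.

```python
def valid_tp_sizes(num_attention_heads, max_gpus=8):
    """Tensor-parallel sizes that divide the attention head count evenly."""
    return [n for n in range(1, max_gpus + 1) if num_attention_heads % n == 0]


def weight_gb_per_gpu(params_billion, bytes_per_param, tp_size):
    """Approximate weight memory per GPU in GB (weights only, no KV cache)."""
    return params_billion * bytes_per_param / tp_size


# Llama 3.1 70B has 64 attention heads.
print(valid_tp_sizes(64))
print(weight_gb_per_gpu(70, 2.0, 2))   # bfloat16 weights split across 2 GPUs
print(weight_gb_per_gpu(70, 0.5, 2))   # 4-bit weights split across 2 GPUs
```

For the 70B model this gives 70 GB per GPU in bfloat16 on two cards versus 17.5 GB with 4-bit weights, which is the difference between needing 80 GB data-center cards and fitting on 24 GB consumer cards.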
Step 6: Docker deployment
For repeatabl