This article was originally published on aifoss.dev
TL;DR: This guide turns a single-machine vLLM install into a team-facing API with authentication, nginx routing, and multiple models on separate ports. All of it runs on-prem, costs nothing beyond hardware, and is wire-compatible with the OpenAI Python SDK. The setup takes about 30 minutes and requires familiarity with Docker and nginx.
What you'll have running after this guide:
- A Docker Compose stack with vLLM v0.22.0 behind nginx, API key auth, and Prometheus monitoring
- Two models served on separate ports and unified under a single nginx endpoint
- An OpenAI Python client pointed at your local server, working without any code changes
Honest take: Skip this guide if you're running vLLM for personal use — the basic setup guide gets you serving in 10 minutes. Come back here when more than one person on your team needs API access or you need to run multiple models from one machine.
vLLM v0.22.0, Apache 2.0 license, released May 29, 2026.
What This Adds Over the Basic Setup
The basic vLLM setup gives you a vllm serve process on localhost with no auth and no process manager. That's fine for local experimentation.
A team-facing API needs more:
| Feature | Basic Setup | Production Setup (this guide) |
|---|---|---|
| Authentication | None — open to anyone on the network | Bearer token via --api-key flag |
| Process management | Manual (vllm serve in terminal) |
Docker Compose with restart policies |
| Multiple models | One terminal per model | Separate containers, nginx-routed |
| SSL termination | No | nginx (cert drop-in ready) |
| Monitoring | None | Prometheus /metrics on same port |
| Model swapping | Kill and restart | docker compose up -d --no-deps |
Prerequisites
- Docker Engine ≥ 23.0 with Compose V2 (
docker compose, notdocker-compose) - NVIDIA Container Toolkit ≥ 1.14
- NVIDIA driver ≥ 525, CUDA ≥ 12.1
- At least 24 GB VRAM per model in FP16; an RTX 4090 (24 GB) handles two 7B–8B models in INT4 simultaneously on a dual-GPU board
- A Hugging Face account with model access if you're serving gated models (Llama 3)
If you don't own the hardware yet, RunPod A100 or H100 instances are the fastest way to validate this setup before committing to a purchase. The configuration below runs unmodified on RunPod.
1. Project Layout
Create the working directory:
mkdir vllm-prod && cd vllm-prod
mkdir -p nginx/conf.d monitoring
Final structure:
vllm-prod/
├── docker-compose.yml
├── .env
├── nginx/
│ └── conf.d/
│ └── vllm.conf
└── monitoring/
└── prometheus.yml
2. Docker Compose Stack
Create docker-compose.yml:
version: "3.9"
services:
vllm-primary:
image: vllm/vllm-openai:v0.22.0
runtime: nvidia
restart: unless-stopped
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
- VLLM_API_KEY=${API_KEY}
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--api-key ${API_KEY}
--host 0.0.0.0
--port 8000
--gpu-memory-utilization 0.85
--max-model-len 32768
volumes:
- hf-cache:/root/.cache/huggingface
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["0"]
capabilities: [gpu]
networks:
- vllm-net
vllm-secondary:
image: vllm/vllm-openai:v0.22.0
runtime: nvidia
restart: unless-stopped
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
- VLLM_API_KEY=${API_KEY}
command: >
--model mistralai/Mistral-7B-Instruct-v0.3
--api-key ${API_KEY}
--host 0.0.0.0
--port 8001
--gpu-memory-utilization 0.85
--max-model-len 32768
volumes:
- hf-cache:/root/.cache/huggingface
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["1"]
capabilities: [gpu]
networks:
- vllm-net
nginx:
image: nginx:1.27-alpine
restart: unless-stopped
ports:
- "80:80"
volumes:
- ./nginx/conf.d:/etc/nginx/conf.d:ro
depends_on:
- vllm-primary
- vllm-secondary
networks:
- vllm-net
prometheus:
image: prom/prometheus:v3.0.0
restart: unless-stopped
ports:
- "9090:9090"
volumes:
- ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
networks:
- vllm-net
volumes:
hf-cache:
networks:
vllm-net:
driver: bridge
Note that vllm-primary and vllm-secondary expose no ports directly to the host — all external traffic enters through nginx on port 80. The vLLM containers are reachable only from within the vllm-net Docker network.
Create .env in the same directory:
HF_TOKEN=hf_yourtoken123
API_KEY=change-this-to-a-long-random-string
The API_KEY value is what clients send as Authorization: Bearer <key>. vLLM validates it natively via the --api-key flag — no custom middleware needed.
3. Nginx Config
Create nginx/conf.d/vllm.conf:
upstream llama {
server vllm-primary:8000;
}
upstream mistral {
server vllm-secondary:8001;
}
server {
listen 80;
server_name _;
# /llama/* → Llama 3.1 8B on port 8000
location /llama/ {
rewrite ^/llama/(.*) /$1 break;
proxy_pass http://llama;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_read_timeout 300s;
proxy_buffering off;
proxy_http_version 1.1;
proxy_set_header Connection "";
}
# /mistral/* → Mistral 7B on port 8001
location /mistral/ {
rewrite ^/mistral/(.*) /$1 break;
proxy_pass http://mistral;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_read_timeout 300s;
proxy_buffering off;
proxy_http_version 1.1;
proxy_set_header Connection "";
}
# Default → primary model
location / {
proxy_pass http://llama;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_read_timeout 300s;
proxy_buffering off;
proxy_http_version 1.1;
proxy_set_header Connection "";
}
}
Two things here that aren't obvious:
proxy_buffering off — without this, nginx holds the entire response in memory before forwarding it to the client. For streaming LLM output (Server-Sent Events), that means the user sees nothing until generation is complete. proxy_buffering off lets tokens stream through as they arrive.
proxy_read_timeout 300s — nginx's default is 60 seconds. Long-generation requests (large prompts, high max_tokens) will exceed this and get dropped mid-response. 300s covers most 8B model workloads; push to 600s if you're running 70B models.
4. Prometheus Monitoring
Create monitoring/prometheus.yml:
global:
scrape_interval: 15s
scrape_configs:
- job_name: vllm-primary
static_configs:
- targets: ['vllm-primary:8000']
metrics_path: /metrics
- job_name: vllm-secondary
static_configs:
- targets: ['vllm-secondary:8001']
metrics_path: /metrics
vLLM exposes Prometheus-formatted metrics at /metrics on the same port as the API — no separate process needed. Key metrics to monitor:
| Metric | What it tells you |
|---|---|
vllm:num_requests_running |
Active inference requests — watch for sustained saturation |
vllm:gpu_cache_usage_perc |
KV cache fill percentage — above 95% consistently means reduce --max-model-len
|
vllm:e2e_request_latency_seconds |
End-to-end latency histogram |
vllm:num_requests_waiting |
Queue depth — a rising value means the GPU is falling behind |
Verify the metrics endp
Top comments (0)