vLLM Production Setup 2026: Nginx, Auth, Multiple Models

#ai #llm #vllm #docker

This article was originally published on aifoss.dev

TL;DR: This guide turns a single-machine vLLM install into a team-facing API with authentication, nginx routing, and multiple models on separate ports. All of it runs on-prem, costs nothing beyond hardware, and is wire-compatible with the OpenAI Python SDK. The setup takes about 30 minutes and requires familiarity with Docker and nginx.

What you'll have running after this guide:

A Docker Compose stack with vLLM v0.22.0 behind nginx, API key auth, and Prometheus monitoring
Two models served on separate ports and unified under a single nginx endpoint
An OpenAI Python client pointed at your local server, working without any code changes

Honest take: Skip this guide if you're running vLLM for personal use — the basic setup guide gets you serving in 10 minutes. Come back here when more than one person on your team needs API access or you need to run multiple models from one machine.

vLLM v0.22.0, Apache 2.0 license, released May 29, 2026.

What This Adds Over the Basic Setup

The basic vLLM setup gives you a vllm serve process on localhost with no auth and no process manager. That's fine for local experimentation.

A team-facing API needs more:

Feature	Basic Setup	Production Setup (this guide)
Authentication	None — open to anyone on the network	Bearer token via `--api-key` flag
Process management	Manual (`vllm serve` in terminal)	Docker Compose with restart policies
Multiple models	One terminal per model	Separate containers, nginx-routed
SSL termination	No	nginx (cert drop-in ready)
Monitoring	None	Prometheus `/metrics` on same port
Model swapping	Kill and restart	`docker compose up -d --no-deps`

Prerequisites

Docker Engine ≥ 23.0 with Compose V2 (docker compose, not docker-compose)
NVIDIA Container Toolkit ≥ 1.14
NVIDIA driver ≥ 525, CUDA ≥ 12.1
At least 24 GB VRAM per model in FP16; an RTX 4090 (24 GB) handles two 7B–8B models in INT4 simultaneously on a dual-GPU board
A Hugging Face account with model access if you're serving gated models (Llama 3)

If you don't own the hardware yet, RunPod A100 or H100 instances are the fastest way to validate this setup before committing to a purchase. The configuration below runs unmodified on RunPod.

1. Project Layout

Create the working directory:

mkdir vllm-prod && cd vllm-prod
mkdir -p nginx/conf.d monitoring

Final structure:

vllm-prod/
├── docker-compose.yml
├── .env
├── nginx/
│   └── conf.d/
│       └── vllm.conf
└── monitoring/
    └── prometheus.yml

2. Docker Compose Stack

Create docker-compose.yml:

version: "3.9"

services:
  vllm-primary:
    image: vllm/vllm-openai:v0.22.0
    runtime: nvidia
    restart: unless-stopped
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${API_KEY}
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --api-key ${API_KEY}
      --host 0.0.0.0
      --port 8000
      --gpu-memory-utilization 0.85
      --max-model-len 32768
    volumes:
      - hf-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    networks:
      - vllm-net

  vllm-secondary:
    image: vllm/vllm-openai:v0.22.0
    runtime: nvidia
    restart: unless-stopped
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${API_KEY}
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --api-key ${API_KEY}
      --host 0.0.0.0
      --port 8001
      --gpu-memory-utilization 0.85
      --max-model-len 32768
    volumes:
      - hf-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    networks:
      - vllm-net

  nginx:
    image: nginx:1.27-alpine
    restart: unless-stopped
    ports:
      - "80:80"
    volumes:
      - ./nginx/conf.d:/etc/nginx/conf.d:ro
    depends_on:
      - vllm-primary
      - vllm-secondary
    networks:
      - vllm-net

  prometheus:
    image: prom/prometheus:v3.0.0
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    networks:
      - vllm-net

volumes:
  hf-cache:

networks:
  vllm-net:
    driver: bridge

Note that vllm-primary and vllm-secondary expose no ports directly to the host — all external traffic enters through nginx on port 80. The vLLM containers are reachable only from within the vllm-net Docker network.

Create .env in the same directory:

HF_TOKEN=hf_yourtoken123
API_KEY=change-this-to-a-long-random-string

The API_KEY value is what clients send as Authorization: Bearer <key>. vLLM validates it natively via the --api-key flag — no custom middleware needed.

3. Nginx Config

Create nginx/conf.d/vllm.conf:

upstream llama {
    server vllm-primary:8000;
}

upstream mistral {
    server vllm-secondary:8001;
}

server {
    listen 80;
    server_name _;

    # /llama/* → Llama 3.1 8B on port 8000
    location /llama/ {
        rewrite ^/llama/(.*) /$1 break;
        proxy_pass         http://llama;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        proxy_buffering    off;
        proxy_http_version 1.1;
        proxy_set_header   Connection "";
    }

    # /mistral/* → Mistral 7B on port 8001
    location /mistral/ {
        rewrite ^/mistral/(.*) /$1 break;
        proxy_pass         http://mistral;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        proxy_buffering    off;
        proxy_http_version 1.1;
        proxy_set_header   Connection "";
    }

    # Default → primary model
    location / {
        proxy_pass         http://llama;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        proxy_buffering    off;
        proxy_http_version 1.1;
        proxy_set_header   Connection "";
    }
}

Two things here that aren't obvious:

proxy_buffering off — without this, nginx holds the entire response in memory before forwarding it to the client. For streaming LLM output (Server-Sent Events), that means the user sees nothing until generation is complete. proxy_buffering off lets tokens stream through as they arrive.

proxy_read_timeout 300s — nginx's default is 60 seconds. Long-generation requests (large prompts, high max_tokens) will exceed this and get dropped mid-response. 300s covers most 8B model workloads; push to 600s if you're running 70B models.

4. Prometheus Monitoring

Create monitoring/prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: vllm-primary
    static_configs:
      - targets: ['vllm-primary:8000']
    metrics_path: /metrics

  - job_name: vllm-secondary
    static_configs:
      - targets: ['vllm-secondary:8001']
    metrics_path: /metrics

vLLM exposes Prometheus-formatted metrics at /metrics on the same port as the API — no separate process needed. Key metrics to monitor:

Metric	What it tells you
`vllm:num_requests_running`	Active inference requests — watch for sustained saturation
`vllm:gpu_cache_usage_perc`	KV cache fill percentage — above 95% consistently means reduce `--max-model-len`
`vllm:e2e_request_latency_seconds`	End-to-end latency histogram
`vllm:num_requests_waiting`	Queue depth — a rising value means the GPU is falling behind