DEV Community

Jovan Chan
Jovan Chan

Posted on • Originally published at aifoss.dev

vLLM Production Setup 2026: Nginx, Auth, Multiple Models

This article was originally published on aifoss.dev

TL;DR: This guide turns a single-machine vLLM install into a team-facing API with authentication, nginx routing, and multiple models on separate ports. All of it runs on-prem, costs nothing beyond hardware, and is wire-compatible with the OpenAI Python SDK. The setup takes about 30 minutes and requires familiarity with Docker and nginx.

What you'll have running after this guide:

  • A Docker Compose stack with vLLM v0.22.0 behind nginx, API key auth, and Prometheus monitoring
  • Two models served on separate ports and unified under a single nginx endpoint
  • An OpenAI Python client pointed at your local server, working without any code changes

Honest take: Skip this guide if you're running vLLM for personal use — the basic setup guide gets you serving in 10 minutes. Come back here when more than one person on your team needs API access or you need to run multiple models from one machine.

vLLM v0.22.0, Apache 2.0 license, released May 29, 2026.


What This Adds Over the Basic Setup

The basic vLLM setup gives you a vllm serve process on localhost with no auth and no process manager. That's fine for local experimentation.

A team-facing API needs more:

Feature Basic Setup Production Setup (this guide)
Authentication None — open to anyone on the network Bearer token via --api-key flag
Process management Manual (vllm serve in terminal) Docker Compose with restart policies
Multiple models One terminal per model Separate containers, nginx-routed
SSL termination No nginx (cert drop-in ready)
Monitoring None Prometheus /metrics on same port
Model swapping Kill and restart docker compose up -d --no-deps

Prerequisites

  • Docker Engine ≥ 23.0 with Compose V2 (docker compose, not docker-compose)
  • NVIDIA Container Toolkit ≥ 1.14
  • NVIDIA driver ≥ 525, CUDA ≥ 12.1
  • At least 24 GB VRAM per model in FP16; an RTX 4090 (24 GB) handles two 7B–8B models in INT4 simultaneously on a dual-GPU board
  • A Hugging Face account with model access if you're serving gated models (Llama 3)

If you don't own the hardware yet, RunPod A100 or H100 instances are the fastest way to validate this setup before committing to a purchase. The configuration below runs unmodified on RunPod.


1. Project Layout

Create the working directory:

mkdir vllm-prod && cd vllm-prod
mkdir -p nginx/conf.d monitoring
Enter fullscreen mode Exit fullscreen mode

Final structure:

vllm-prod/
├── docker-compose.yml
├── .env
├── nginx/
│   └── conf.d/
│       └── vllm.conf
└── monitoring/
    └── prometheus.yml
Enter fullscreen mode Exit fullscreen mode

2. Docker Compose Stack

Create docker-compose.yml:

version: "3.9"

services:
  vllm-primary:
    image: vllm/vllm-openai:v0.22.0
    runtime: nvidia
    restart: unless-stopped
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${API_KEY}
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --api-key ${API_KEY}
      --host 0.0.0.0
      --port 8000
      --gpu-memory-utilization 0.85
      --max-model-len 32768
    volumes:
      - hf-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    networks:
      - vllm-net

  vllm-secondary:
    image: vllm/vllm-openai:v0.22.0
    runtime: nvidia
    restart: unless-stopped
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${API_KEY}
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --api-key ${API_KEY}
      --host 0.0.0.0
      --port 8001
      --gpu-memory-utilization 0.85
      --max-model-len 32768
    volumes:
      - hf-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    networks:
      - vllm-net

  nginx:
    image: nginx:1.27-alpine
    restart: unless-stopped
    ports:
      - "80:80"
    volumes:
      - ./nginx/conf.d:/etc/nginx/conf.d:ro
    depends_on:
      - vllm-primary
      - vllm-secondary
    networks:
      - vllm-net

  prometheus:
    image: prom/prometheus:v3.0.0
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    networks:
      - vllm-net

volumes:
  hf-cache:

networks:
  vllm-net:
    driver: bridge
Enter fullscreen mode Exit fullscreen mode

Note that vllm-primary and vllm-secondary expose no ports directly to the host — all external traffic enters through nginx on port 80. The vLLM containers are reachable only from within the vllm-net Docker network.

Create .env in the same directory:

HF_TOKEN=hf_yourtoken123
API_KEY=change-this-to-a-long-random-string
Enter fullscreen mode Exit fullscreen mode

The API_KEY value is what clients send as Authorization: Bearer <key>. vLLM validates it natively via the --api-key flag — no custom middleware needed.


3. Nginx Config

Create nginx/conf.d/vllm.conf:

upstream llama {
    server vllm-primary:8000;
}

upstream mistral {
    server vllm-secondary:8001;
}

server {
    listen 80;
    server_name _;

    # /llama/* → Llama 3.1 8B on port 8000
    location /llama/ {
        rewrite ^/llama/(.*) /$1 break;
        proxy_pass         http://llama;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        proxy_buffering    off;
        proxy_http_version 1.1;
        proxy_set_header   Connection "";
    }

    # /mistral/* → Mistral 7B on port 8001
    location /mistral/ {
        rewrite ^/mistral/(.*) /$1 break;
        proxy_pass         http://mistral;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        proxy_buffering    off;
        proxy_http_version 1.1;
        proxy_set_header   Connection "";
    }

    # Default → primary model
    location / {
        proxy_pass         http://llama;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        proxy_buffering    off;
        proxy_http_version 1.1;
        proxy_set_header   Connection "";
    }
}
Enter fullscreen mode Exit fullscreen mode

Two things here that aren't obvious:

proxy_buffering off — without this, nginx holds the entire response in memory before forwarding it to the client. For streaming LLM output (Server-Sent Events), that means the user sees nothing until generation is complete. proxy_buffering off lets tokens stream through as they arrive.

proxy_read_timeout 300s — nginx's default is 60 seconds. Long-generation requests (large prompts, high max_tokens) will exceed this and get dropped mid-response. 300s covers most 8B model workloads; push to 600s if you're running 70B models.


4. Prometheus Monitoring

Create monitoring/prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: vllm-primary
    static_configs:
      - targets: ['vllm-primary:8000']
    metrics_path: /metrics

  - job_name: vllm-secondary
    static_configs:
      - targets: ['vllm-secondary:8001']
    metrics_path: /metrics
Enter fullscreen mode Exit fullscreen mode

vLLM exposes Prometheus-formatted metrics at /metrics on the same port as the API — no separate process needed. Key metrics to monitor:

Metric What it tells you
vllm:num_requests_running Active inference requests — watch for sustained saturation
vllm:gpu_cache_usage_perc KV cache fill percentage — above 95% consistently means reduce --max-model-len
vllm:e2e_request_latency_seconds End-to-end latency histogram
vllm:num_requests_waiting Queue depth — a rising value means the GPU is falling behind

Verify the metrics endp

Top comments (0)