Martin Palopoli
How I Deployed a RAG Engine to Production with Docker, Nginx and DigitalOcean

I deployed a full RAG engine (FastAPI + PostgreSQL + pgvector + Redis) on a 4GB RAM VPS for $24/month. This article covers the real deployment architecture: Docker multi-stage builds, PostgreSQL tuned for limited resources, Nginx as reverse proxy with SSE support, zero-downtime deploys with maintenance mode, automated backups and cron monitoring.


The Context

In the previous article I built a production RAG pipeline with hybrid search, cross-encoder reranking and semantic cache. Everything worked perfectly in local Docker.

The problem: getting it to production on a budget VPS without it exploding.

A RAG system isn't a typical CRUD app. It has:

  • Embedding models that consume ~500MB of RAM per worker
  • PostgreSQL with heavy extensions (pgvector + HNSW indexes)
  • SSE streaming that needs long-lived connections
  • Redis for rate limiting and cache
  • All of that competing for 4GB of RAM
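The ~500MB-per-worker figure is easy to sanity-check with back-of-envelope math. A hedged sketch — the ~118M parameter count for paraphrase-multilingual-MiniLM-L12-v2 is my assumption, and this counts fp32 weights only, ignoring the tokenizer, activations and Python overhead:

```python
# Rough fp32 weight footprint for the embedding model.
# ASSUMPTION: ~118M parameters (multilingual MiniLM-L12 with its large vocab).
params = 118_000_000
weight_mb = params * 4 / 1024**2  # 4 bytes per fp32 parameter
print(f"~{weight_mb:.0f} MB")     # ≈ 450 MB, consistent with ~500MB per worker
```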

Chosen Infrastructure

| Component  | Specification                            |
|------------|------------------------------------------|
| VPS        | DigitalOcean 4GB RAM / 2 vCPU / 80GB SSD |
| OS         | Ubuntu 24.04 LTS                         |
| Containers | Docker Compose (5 services)              |
| SSL        | Cloudflare (proxy + origin certificates) |
| DNS        | Cloudflare                               |
| Total cost | ~$24/month                               |

Why not Kubernetes? Because for a single VPS it's overkill. Docker Compose with restart policies and health checks covers 95% of what you need for a web service with a few thousand users.


Docker Compose: Dev vs Production

Development

# docker-compose.yml
services:
  db:
    image: pgvector/pgvector:pg16
    ports:
      - "5433:5432"
    environment:
      POSTGRES_DB: ragdb
      POSTGRES_USER: raguser
      POSTGRES_PASSWORD: localpass123
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U raguser -d ragdb"]
      interval: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  backend:
    build: ./backend
    ports:
      - "8000:8000"
    volumes:
      - ./backend/app:/app/app  # Hot reload
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started

  frontend:
    build: ./frontend
    ports:
      - "5173:5173"
    volumes:
      - ./frontend/src:/app/src  # Hot reload

Nothing surprising: exposed ports, volumes for hot reload, basic health checks.

Production: The Differences That Matter

# docker-compose.prod.yml
services:
  db:
    image: pgvector/pgvector:pg16
    container_name: app-db
    restart: always
    env_file: .env.production
    ports:
      - "127.0.0.1:5432:5432"  # Localhost only
    volumes:
      - pgdata:/var/lib/postgresql/data
    command: >
      postgres
        -c shared_buffers=128MB
        -c effective_cache_size=256MB
        -c max_connections=50
        -c work_mem=4MB
        -c maintenance_work_mem=64MB
        -c random_page_cost=1.1
        -c effective_io_concurrency=200
        -c wal_buffers=4MB
        -c checkpoint_completion_target=0.9
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U $$POSTGRES_USER -d $$POSTGRES_DB"]
      interval: 10s
      retries: 5

  redis:
    image: redis:7-alpine
    container_name: app-redis
    restart: always
    command: redis-server --maxmemory 64mb --maxmemory-policy allkeys-lru
    ports:
      - "127.0.0.1:6379:6379"  # Localhost only
    volumes:
      - redisdata:/data

  backend:
    build:
      context: ./backend
      dockerfile: Dockerfile
    container_name: app-backend
    restart: always
    env_file: .env.production
    ports:
      - "127.0.0.1:8000:8000"  # Localhost only, Nginx in front
    deploy:
      resources:
        limits:
          memory: 1536M
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started

  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile.prod  # Multi-stage with Nginx
    container_name: app-frontend
    restart: always
    ports:
      - "127.0.0.1:5173:5173"  # Localhost only
    deploy:
      resources:
        limits:
          memory: 64M

The Key Differences

1. Ports bound to localhost only

ports:
  - "127.0.0.1:8000:8000"  # ✅ Only Nginx can access
  # vs
  - "8000:8000"  # ❌ Open to the world

If you publish a port without 127.0.0.1, Docker modifies iptables and bypasses the system firewall. This is a classic mistake.
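The difference between the two bind addresses is visible with a plain socket — a minimal sketch, no Docker needed:

```python
import socket

def bind_and_report(host: str) -> str:
    """Bind a TCP socket to the given address and return what it bound to."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, 0))  # port 0 = let the OS pick a free port
    addr = s.getsockname()[0]
    s.close()
    return addr

# 127.0.0.1 is reachable only from the host itself; 0.0.0.0 listens on
# every interface — which is what a bare "8000:8000" publish amounts to.
print(bind_and_report("127.0.0.1"))  # 127.0.0.1
print(bind_and_report("0.0.0.0"))    # 0.0.0.0
```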

2. PostgreSQL tuned for 4GB

shared_buffers=128MB        # ~25% of RAM available for PG (~512MB)
effective_cache_size=256MB  # What the OS can cache
max_connections=50          # You don't need 100 with an async backend
work_mem=4MB               # Careful: multiplied by connection × sort ops
random_page_cost=1.1       # SSD, not spinning disk

PostgreSQL defaults assume a dedicated server with 1GB+ of RAM just for PG. On a shared VPS with 4 other services, you need to be conservative.
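The tuning above follows a simple heuristic you can reuse for other RAM budgets. A sketch, assuming PostgreSQL gets a ~512MB slice of the VPS — the function and its ratios are my rule of thumb, not official PostgreSQL guidance:

```python
def pg_settings(pg_ram_mb: int) -> dict:
    """Conservative Postgres settings for a given RAM slice (rule of thumb)."""
    return {
        "shared_buffers": f"{pg_ram_mb // 4}MB",        # ~25% of PG's slice
        "effective_cache_size": f"{pg_ram_mb // 2}MB",  # what the OS can cache for it
        "work_mem": "4MB",      # keep small: it multiplies per connection per sort
        "max_connections": 50,  # an async backend needs far fewer than 100
    }

print(pg_settings(512))
# {'shared_buffers': '128MB', 'effective_cache_size': '256MB', 'work_mem': '4MB', 'max_connections': 50}
```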

3. Redis with strict limits

maxmemory 64mb
maxmemory-policy allkeys-lru

Redis without maxmemory can grow indefinitely and trigger the OOM killer. With allkeys-lru, when it hits the limit, it evicts the least recently used keys instead of returning errors.
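allkeys-lru is just LRU eviction over the entire keyspace. A toy sketch of the behavior (illustrative only, not Redis internals):

```python
from collections import OrderedDict

class LRUCache:
    """Toy sketch of allkeys-lru: at capacity, evict the least recently used key."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict LRU instead of erroring

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # a read refreshes recency
        return self.data[key]

c = LRUCache(2)
c.set("a", 1)
c.set("b", 2)
c.get("a")       # "a" is now the most recently used
c.set("c", 3)    # over capacity → "b" (least recently used) is evicted
print(sorted(c.data))  # ['a', 'c']
```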

4. Memory limits on the backend

deploy:
  resources:
    limits:
      memory: 1536M

The backend with loaded embedding models uses ~800MB-1.2GB. The 1536MB limit gives it headroom without allowing a memory leak to consume the entire VPS.


Dockerfiles: Multi-Stage Builds

Backend: Pre-Download Models

# === Stage 1: Builder ===
FROM python:3.12-slim AS builder

WORKDIR /build
COPY requirements.txt .

# PyTorch CPU-only (saves ~1.5GB vs CUDA version)
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu \
    && pip install --no-cache-dir -r requirements.txt

# Pre-download embedding and cross-encoder models
# (single RUN with shell line continuations — Dockerfile RUN cannot span
# a raw multi-line string)
RUN python -c "from sentence_transformers import SentenceTransformer, CrossEncoder; \
    SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2'); \
    CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')"

# === Stage 2: Runtime ===
FROM python:3.12-slim AS runtime

WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends \
    libfontconfig1 curl ca-certificates && rm -rf /var/lib/apt/lists/*

# Copy dependencies and pre-downloaded models
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY --from=builder /root/.cache /root/.cache

COPY . .
RUN chmod +x start.sh

EXPOSE 8000
CMD ["./start.sh"]

Why pre-download models during build? Without it, the first request after each deploy takes 30-60 seconds while models download. With pre-download, the container starts ready to serve.

Why PyTorch CPU-only? The CUDA version weighs ~2GB extra. On a VPS without GPU it's dead weight.

Frontend: Build + Static Nginx

# === Stage 1: Build ===
FROM node:20-alpine AS build

WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# === Stage 2: Serve ===
FROM nginx:1.27-alpine

# Copy static build
COPY --from=build /app/dist /usr/share/nginx/html

# SPA config
COPY nginx.conf /etc/nginx/conf.d/default.conf

EXPOSE 5173
CMD ["nginx", "-g", "daemon off;"]

The production frontend does NOT run Vite. It's Nginx serving static files. The image goes from ~400MB (node + deps) to ~25MB (nginx alpine + dist).

Internal Frontend Nginx (SPA)

server {
    listen 5173;
    root /usr/share/nginx/html;
    index index.html;

    # Vite-hashed assets → aggressive cache
    location /assets/ {
        expires 1y;
        add_header Cache-Control "public, immutable";
    }

    # index.html → never cache (so new deploys are reflected)
    location = /index.html {
        add_header Cache-Control "no-cache";
    }

    # SPA fallback: all routes → index.html
    location / {
        try_files $uri $uri/ /index.html;
    }

    gzip on;
    gzip_types text/plain text/css application/json application/javascript;
}

Nginx: The Reverse Proxy That Connects Everything

# === Main HTTPS ===
server {
    listen 443 ssl http2;
    server_name your-domain.com;

    # Origin certificates (Cloudflare → VPS)
    ssl_certificate     /etc/ssl/certs/origin.pem;
    ssl_certificate_key /etc/ssl/private/origin.key;
    ssl_protocols TLSv1.2 TLSv1.3;

    client_max_body_size 50M;  # For document uploads

    # === Maintenance mode ===
    set $maintenance 0;
    if (-f /etc/nginx/maintenance.on) {
        set $maintenance 1;
    }

    # Health check always available (for monitoring)
    location = /api/v1/health {
        proxy_pass http://127.0.0.1:8000;
    }

    # If maintenance mode → 503
    if ($maintenance) {
        return 503;
    }

    # === API Backend ===
    location /api/ {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # SSE streaming: CRITICAL
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
        proxy_set_header Connection '';
        proxy_http_version 1.1;
        chunked_transfer_encoding off;
    }

    # === Widget JS (cacheable) ===
    location = /widget.js {
        proxy_pass http://127.0.0.1:8000;
        proxy_cache_valid 200 1h;
    }

    # === Frontend SPA ===
    location / {
        proxy_pass http://127.0.0.1:5173;
    }

    # === Gzip ===
    gzip on;
    gzip_comp_level 4;
    gzip_min_length 256;
    gzip_types text/plain text/css application/json application/javascript
               text/xml application/xml text/javascript image/svg+xml;

    # === Security headers ===
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header Referrer-Policy "strict-origin-when-cross-origin" always;
}

# === HTTP → HTTPS redirect ===
server {
    listen 80;
    server_name your-domain.com;
    return 301 https://$server_name$request_uri;
}

# === www → non-www ===
server {
    listen 443 ssl http2;
    server_name www.your-domain.com;
    ssl_certificate     /etc/ssl/certs/origin.pem;
    ssl_certificate_key /etc/ssl/private/origin.key;
    return 301 https://your-domain.com$request_uri;
}

# === 503 Maintenance page ===
# Note: error_page and the @maintenance location must live INSIDE the main
# HTTPS server block above — `location` is not valid at the http/top level.
error_page 503 @maintenance;
location @maintenance {
    default_type text/html;
    return 503 '<!DOCTYPE html>
    <html><head><meta charset="UTF-8"><title>Maintenance</title>
    <style>body{font-family:system-ui;display:flex;justify-content:center;
    align-items:center;min-height:100vh;background:#0f172a;color:#e2e8f0;
    text-align:center}h1{font-size:2rem}p{color:#94a3b8}</style></head>
    <body><div><h1>Under Maintenance</h1>
    <p>We will be back in a few minutes.</p></div></body></html>';
}

The SSE Block That Almost Broke Everything

location /api/ {
    proxy_buffering off;      # Nginx must NOT buffer the response
    proxy_cache off;          # Nor cache it
    proxy_read_timeout 300s;  # SSE can last minutes
    proxy_set_header Connection '';  # Disable upstream keep-alive
    proxy_http_version 1.1;   # HTTP/1.1 required for chunked
    chunked_transfer_encoding off;
}

Without these headers, Nginx buffers SSE tokens and sends them all at once at the end. The user sees "nothing nothing nothing... entire text at once". These 6 parameters are mandatory for real streaming through a reverse proxy.
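On the client side, "real streaming" just means `data:` events arrive in separate chunks instead of one final blob. A minimal sketch of parsing such a chunked SSE stream — the stand-in bytes are made up, not the actual API's output:

```python
def parse_sse(stream) -> list[str]:
    """Collect 'data:' payloads from an iterable of raw SSE byte chunks."""
    events, buf = [], ""
    for chunk in stream:
        buf += chunk.decode()
        # SSE events are terminated by a blank line
        while "\n\n" in buf:
            raw, buf = buf.split("\n\n", 1)
            for line in raw.splitlines():
                if line.startswith("data: "):
                    events.append(line[6:])
    return events

# Token boundaries don't align with chunk boundaries — the parser buffers.
chunks = [b"data: Hel", b"lo\n\ndata: world\n\n"]
print(parse_sse(chunks))  # ['Hello', 'world']
```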


The Deploy Script

#!/bin/bash
set -e

SERVER="user@server-ip"
SSH_KEY="~/.ssh/deploy_key"
PROJECT_DIR="/opt/my-app"
BACKUP_DIR="/opt/my-app/backups/db"
DEPLOY_MODE="${1:-full}"  # frontend | backend | full

ssh_cmd() {
    ssh -i "$SSH_KEY" "$SERVER" "$1"
}

echo "=== Deploy: $DEPLOY_MODE ==="

# 1. Push code
git push origin main

# 2. Pull on server
ssh_cmd "cd $PROJECT_DIR && git fetch origin && git reset --hard origin/main"

# 3. Database backup (backend/full only)
if [[ "$DEPLOY_MODE" != "frontend" ]]; then
    echo "Creating DB backup..."
    TIMESTAMP=$(date +%Y%m%d_%H%M%S)
    # expand $POSTGRES_* inside the container (where they are set), not on the host
    ssh_cmd "docker exec app-db sh -c 'pg_dump -U \$POSTGRES_USER \$POSTGRES_DB' \
        | gzip > $BACKUP_DIR/backup_${TIMESTAMP}.gz"

    # Enable maintenance mode
    ssh_cmd "touch /etc/nginx/maintenance.on && nginx -s reload"
    echo "Maintenance mode: ON"
fi

# 4. Rebuild and restart
case $DEPLOY_MODE in
    frontend)
        ssh_cmd "cd $PROJECT_DIR && docker compose -f docker-compose.prod.yml \
            build frontend && docker compose -f docker-compose.prod.yml \
            up -d frontend"
        ;;
    backend)
        ssh_cmd "cd $PROJECT_DIR && docker compose -f docker-compose.prod.yml \
            build backend && docker compose -f docker-compose.prod.yml \
            up -d backend"
        ;;
    full)
        ssh_cmd "cd $PROJECT_DIR && docker compose -f docker-compose.prod.yml \
            up -d --build"
        ;;
esac

# 5. Disable maintenance mode
if [[ "$DEPLOY_MODE" != "frontend" ]]; then
    sleep 15  # Wait for backend to load models
    ssh_cmd "rm -f /etc/nginx/maintenance.on && nginx -s reload"
    echo "Maintenance mode: OFF"
fi

# 6. Health check
echo "Checking health..."
MAX_RETRIES=30
for i in $(seq 1 $MAX_RETRIES); do
    STATUS=$(ssh_cmd "curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8000/api/v1/health")
    if [[ "$STATUS" == "200" ]]; then
        echo "Deploy successful. Health: OK"
        exit 0
    fi
    echo "Attempt $i/$MAX_RETRIES... (status: $STATUS)"
    sleep 5
done

echo "ERROR: Health check failed after $MAX_RETRIES attempts"
exit 1

Why Maintenance Mode?

The backend takes ~15 seconds to start: it runs Alembic migrations and then Uvicorn loads the embedding models. Without maintenance mode, during those 15 seconds Nginx returns 502 Bad Gateway.

With the /etc/nginx/maintenance.on file, Nginx returns a styled 503 page. The health check (/api/v1/health) is exempted so the script can verify when the backend is ready.


Backend Start Script

#!/bin/bash
set -e

echo "Running database migrations..."
alembic upgrade head

echo "Starting FastAPI server..."

# 1 worker = ~800MB with loaded models
# 2 workers = ~1.3GB (only if you have RAM to spare)
WORKERS="${UVICORN_WORKERS:-1}"

exec uvicorn app.main:app \
    --host 0.0.0.0 \
    --port 8000 \
    --workers "$WORKERS" \
    --log-level "${UVICORN_LOG_LEVEL:-info}" \
    "$@"

Why 1 worker? Each Uvicorn worker loads its own copy of the embedding models (~500MB). With 2 workers you're already at 1.3GB for backend alone. On a 4GB VPS with PostgreSQL, Redis and Nginx, 1 worker is the safe choice.

FastAPI is async, so 1 worker handles concurrency well — I/O operations (DB, LLM API, Redis) don't block the event loop.
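You can see why one async worker goes a long way with a toy benchmark: 100 simulated I/O waits of 100ms each finish in roughly 0.1s total, not 10s, because they overlap on the event loop. A sketch — real handlers await the DB/LLM/Redis instead of a sleep:

```python
import asyncio
import time

async def handle_request(i: int) -> int:
    await asyncio.sleep(0.1)  # stands in for an awaited DB/LLM/Redis call
    return i

async def main() -> float:
    t0 = time.perf_counter()
    results = await asyncio.gather(*(handle_request(i) for i in range(100)))
    elapsed = time.perf_counter() - t0
    print(f"{len(results)} requests in {elapsed:.2f}s")  # ≈ 0.1s, not 10s
    return elapsed

elapsed = asyncio.run(main())
```

The flip side: any CPU-bound work (like running the cross-encoder) does block the loop, which is when extra workers start paying for their RAM.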


Automated Maintenance with Cron

#!/bin/bash
# Run: crontab -e → 0 4 * * 0 /opt/my-app/scripts/maintenance.sh

LOG="/var/log/app-maintenance.log"
echo "=== Maintenance $(date) ===" >> "$LOG"

# 1. Clean orphaned Docker images (>7 days)
docker image prune -f --filter "until=168h" >> "$LOG" 2>&1

# 2. Clean stopped containers
docker container prune -f --filter "until=168h" >> "$LOG" 2>&1

# 3. Check disk usage
DISK_USAGE=$(df / --output=pcent | tail -1 | tr -dc '0-9')
if [ "$DISK_USAGE" -gt 80 ]; then
    echo "ALERT: Disk at ${DISK_USAGE}%" >> "$LOG"
fi

# 4. Check memory
MEM_USAGE=$(free | awk '/Mem:/ {printf "%.0f", $3/$2 * 100}')
if [ "$MEM_USAGE" -gt 90 ]; then
    echo "ALERT: RAM at ${MEM_USAGE}%" >> "$LOG"
fi

# 5. Container status
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Size}}" >> "$LOG"

# 6. Backup inventory
BACKUP_COUNT=$(ls /opt/my-app/backups/db/*.gz 2>/dev/null | wc -l)
LATEST=$(ls -t /opt/my-app/backups/db/*.gz 2>/dev/null | head -1)
echo "Backups: $BACKUP_COUNT files. Latest: $LATEST" >> "$LOG"

echo "=== Done ===" >> "$LOG"

This runs every Sunday at 4am. It cleans Docker garbage, checks disk and RAM, and logs everything. Simple but effective — it saved me twice from running out of disk due to accumulated Docker images.


Cloudflare + SSL: The Configuration

Internet → Cloudflare (proxy) → VPS (Nginx 443) → Docker containers

Setup

  1. DNS on Cloudflare: A record pointing to VPS, proxy enabled (orange cloud)
  2. SSL on Cloudflare: "Full (strict)" — encrypts both browser→Cloudflare and Cloudflare→VPS
  3. Origin certificate: Generated in Cloudflare Dashboard → SSL/TLS → Origin Server → Create Certificate
  4. On the VPS: Copy the certificate and key to /etc/ssl/certs/origin.pem and /etc/ssl/private/origin.key

Why not Let's Encrypt? With the Cloudflare proxy enabled, the HTTP-01 challenge gains moving parts (validation requests pass through Cloudflare, and renewal has to keep working every ~90 days). Cloudflare origin certificates last 15 years and are configured in 2 minutes.

Bonus: Cloudflare as Free CDN

With the proxy enabled, Cloudflare automatically caches static assets (JS, CSS, images). The VPS only receives API requests and HTML. This significantly reduces server traffic.


RAM Distribution

Here's how it looks on a 4GB VPS:

┌──────────────────────────────────────────┐
│              4GB RAM Total               │
├──────────────────────────────────────────┤
│  OS + Docker Engine      ~400MB          │
│  PostgreSQL              ~200-400MB      │
│  Backend (1 worker)      ~800MB-1.2GB    │
│  Redis                   ≤64MB           │
│  Nginx + Frontend        ~30MB           │
│  Free / Buffer           ~1.5-2GB        │
├──────────────────────────────────────────┤
│  Swap (2GB)              emergency       │
└──────────────────────────────────────────┘

The ~1.5-2GB free are for:

  • OS filesystem cache (helps PostgreSQL)
  • Traffic spikes
  • Maintenance operations (backups, builds)
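A quick sanity check that this budget actually fits, taking the upper end of each range from the diagram (the figures are the article's estimates):

```python
budget_mb = {
    "os_docker": 400,       # OS + Docker Engine
    "postgres": 400,        # upper end of ~200-400MB
    "backend": 1200,        # upper end of ~800MB-1.2GB (1 worker)
    "redis": 64,            # hard maxmemory cap
    "nginx_frontend": 30,
}
used = sum(budget_mb.values())
free = 4096 - used
print(f"used ~{used}MB, free ~{free}MB")  # ≈ 2GB of headroom at the upper end
```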

Tip: Always configure swap (2GB) as a safety net. Without swap, the OOM killer terminates processes without warning when RAM fills up.

fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab

Security Checklist

Before considering the deploy "ready":

  • [x] Docker ports on localhost only (127.0.0.1:port:port)
  • [x] Firewall active (ufw allow 22,80,443/tcp && ufw enable)
  • [x] SSH key-only (disable password auth in /etc/ssh/sshd_config)
  • [x] .env.production outside the repo (.gitignore)
  • [x] Secrets never in compose files (use env_file reference)
  • [x] DB without external port (only accessible via Docker network)
  • [x] Automated backups with periodic verification
  • [x] Health check in the deploy script
  • [x] Swap configured to prevent OOM kills

Real Numbers

| Metric                | Value                                          |
|-----------------------|------------------------------------------------|
| Build time (backend)  | ~3-4 min (first time), ~1 min (cached)         |
| Build time (frontend) | ~30s                                           |
| Full deploy time      | ~2-3 min                                       |
| Visible downtime      | 0s of errors (maintenance page during rebuild) |
| Backend image size    | ~2.1GB (includes PyTorch CPU + models)         |
| Frontend image size   | ~25MB (nginx + dist)                           |
| Stable RAM usage      | ~2-2.5GB of 4GB                                |
| Monthly cost          | ~$24 (VPS) + $0 (Cloudflare free)              |
| Uptime last month     | 99.8% (one outage for OS update)               |

Lessons Learned

1. Docker + Firewall = Be Careful

Docker modifies iptables directly. If you publish a port without 127.0.0.1, ufw deny won't block it. Always use 127.0.0.1: in production and let Nginx be the only entry point.

2. Pre-Download ML Models in the Build

If models download at runtime, the first cold start takes a minute. Worse: if HuggingFace has downtime, your deploy fails. Pre-downloading in the Dockerfile eliminates both problems.

3. 1 Uvicorn Worker Is Enough (If It's Async)

The temptation is to set 4 workers "just in case". But each one loads ~500MB of models. With async FastAPI, a single worker handles hundreds of concurrent requests. Scale workers only when you have evidence that CPU is the bottleneck.

4. Maintenance Mode > Zero-Downtime Rolling Deploys

For a single-node VPS, a rolling deploy requires complex orchestration. A 503 page for 15 seconds is infinitely simpler and nobody complains — especially if the health check guarantees automatic deactivation.

5. PostgreSQL Defaults Are for Dedicated Servers

PostgreSQL defaults assume it has all the RAM to itself. On a shared VPS with 4 other services, not tuning PostgreSQL is a guaranteed OOM kill. shared_buffers, effective_cache_size and max_connections are the first parameters to adjust.

6. Cloudflare Origin Certs > Let's Encrypt

With proxy enabled, Let's Encrypt is more complicated to configure and renew. Cloudflare origin certs last 15 years, are configured once and forgotten.


What's Next

  • Monitoring with Prometheus + Grafana: Latency, errors, and resource usage metrics (currently logs + cron only)
  • Offsite backup: Copy backups to an S3/R2 bucket instead of storing them only on the same VPS
  • Blue-green deploys: When traffic justifies a second VPS
  • CI/CD with GitHub Actions: Automate the deploy script (currently a manual ./deploy.sh backend)

Conclusion

Deploying a RAG system isn't like deploying a CRUD app. Embedding models consume real RAM, SSE streaming needs specific Nginx configuration, and PostgreSQL with pgvector needs careful tuning on limited servers.

But you also don't need Kubernetes or a $200/month cluster. A $24 VPS with well-configured Docker Compose, Nginx as reverse proxy, and deploy scripts with maintenance mode + health checks is enough to serve thousands of queries per day with consistent latencies.

Most importantly: measure first, scale later. Starting with 1 worker, 4GB of RAM and basic monitoring gives you all the information you need to make infrastructure decisions based on real data, not estimates.


This is the second article in the series. The first covers the RAG pipeline architecture. If it was useful, a like helps it reach more people. Questions about any part of the deploy? Drop a comment.
