I deployed a full RAG engine (FastAPI + PostgreSQL + pgvector + Redis) on a 4GB RAM VPS for $24/month. This article covers the real deployment architecture: Docker multi-stage builds, PostgreSQL tuned for limited resources, Nginx as a reverse proxy with SSE support, zero-downtime deploys with maintenance mode, and automated backups with cron-based monitoring.
The Context
In the previous article I built a production RAG pipeline with hybrid search, cross-encoder reranking and semantic cache. Everything worked perfectly in local Docker.
The problem: getting it to production on a budget VPS without it exploding.
A RAG system isn't a typical CRUD app. It has:
- Embedding models that consume ~500MB of RAM per worker
- PostgreSQL with heavy extensions (pgvector + HNSW indexes)
- SSE streaming that needs long-lived connections
- Redis for rate limiting and cache
- All of that competing for 4GB of RAM
Chosen Infrastructure
| Component | Specification |
|---|---|
| VPS | DigitalOcean 4GB RAM / 2 vCPU / 80GB SSD |
| OS | Ubuntu 24.04 LTS |
| Containers | Docker Compose (5 services) |
| SSL | Cloudflare (proxy + origin certificates) |
| DNS | Cloudflare |
| Total cost | ~$24/month |
Why not Kubernetes? Because for a single VPS it's overkill. Docker Compose with restart policies and health checks covers 95% of what you need for a web service with a few thousand users.
Docker Compose: Dev vs Production
Development
# docker-compose.yml
services:
db:
image: pgvector/pgvector:pg16
ports:
- "5433:5432"
environment:
POSTGRES_DB: ragdb
POSTGRES_USER: raguser
POSTGRES_PASSWORD: localpass123
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U raguser -d ragdb"]
interval: 5s
retries: 5
redis:
image: redis:7-alpine
ports:
- "6379:6379"
backend:
build: ./backend
ports:
- "8000:8000"
volumes:
- ./backend/app:/app/app # Hot reload
depends_on:
db:
condition: service_healthy
redis:
condition: service_started
frontend:
build: ./frontend
ports:
- "5173:5173"
volumes:
- ./frontend/src:/app/src # Hot reload
Nothing surprising: exposed ports, volumes for hot reload, basic health checks.
Production: The Differences That Matter
# docker-compose.prod.yml
services:
db:
image: pgvector/pgvector:pg16
container_name: app-db
restart: always
env_file: .env.production
ports:
- "127.0.0.1:5432:5432" # Localhost only
volumes:
- pgdata:/var/lib/postgresql/data
command: >
postgres
-c shared_buffers=128MB
-c effective_cache_size=256MB
-c max_connections=50
-c work_mem=4MB
-c maintenance_work_mem=64MB
-c random_page_cost=1.1
-c effective_io_concurrency=200
-c wal_buffers=4MB
-c checkpoint_completion_target=0.9
healthcheck:
test: ["CMD-SHELL", "pg_isready -U $$POSTGRES_USER -d $$POSTGRES_DB"]
interval: 10s
retries: 5
redis:
image: redis:7-alpine
container_name: app-redis
restart: always
command: redis-server --maxmemory 64mb --maxmemory-policy allkeys-lru
ports:
- "127.0.0.1:6379:6379" # Localhost only
volumes:
- redisdata:/data
backend:
build:
context: ./backend
dockerfile: Dockerfile
container_name: app-backend
restart: always
env_file: .env.production
ports:
- "127.0.0.1:8000:8000" # Localhost only, Nginx in front
deploy:
resources:
limits:
memory: 1536M
depends_on:
db:
condition: service_healthy
redis:
condition: service_started
frontend:
build:
context: ./frontend
dockerfile: Dockerfile.prod # Multi-stage with Nginx
container_name: app-frontend
restart: always
ports:
- "127.0.0.1:5173:5173" # Localhost only
deploy:
resources:
limits:
memory: 64M
The Key Differences
1. Ports bound to localhost only
ports:
- "127.0.0.1:8000:8000" # ✅ Only Nginx can access
# vs
- "8000:8000" # ❌ Open to the world
If you publish a port without 127.0.0.1, Docker modifies iptables and bypasses the system firewall. This is a classic mistake.
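One way to catch this before it bites is to grep the compose file for published ports that aren't prefixed with 127.0.0.1. This is a hypothetical mini-lint, not part of the article's setup; the sample file is written inline so the snippet is self-contained:

```shell
# Hypothetical lint: flag compose port mappings exposed to the world.
# A small sample file is generated here just for demonstration.
TMP=$(mktemp)
cat > "$TMP" <<'EOF'
services:
  backend:
    ports:
      - "127.0.0.1:8000:8000"
  db:
    ports:
      - "5432:5432"
EOF

# Any quoted mapping that doesn't start with 127.0.0.1 bypasses ufw via iptables
EXPOSED=$(grep -oE '"[0-9.:]+"' "$TMP" | grep -v '^"127\.0\.0\.1:' || true)
echo "world-exposed: $EXPOSED"
```

Run against your real docker-compose.prod.yml, an empty result means every published port is localhost-only.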
2. PostgreSQL tuned for 4GB
shared_buffers=128MB # ~25% of RAM available for PG (~512MB)
effective_cache_size=256MB # What the OS can cache
max_connections=50 # You don't need 100 with an async backend
work_mem=4MB # Careful: multiplied by connection × sort ops
random_page_cost=1.1 # SSD, not spinning disk
PostgreSQL defaults assume a dedicated server with 1GB+ of RAM just for PG. On a shared VPS with 4 other services, you need to be conservative.
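As a sanity check, the worst case for these settings is easy to bound: every connection running one sort at full work_mem at the same time, on top of shared_buffers. A quick back-of-envelope (the ~512MB slice is the article's own budget, not a hard limit):

```shell
# Worst-case PG memory for the settings above: all 50 connections each
# running one sort/hash operation at full work_mem, plus shared_buffers.
MAX_CONNECTIONS=50
WORK_MEM_MB=4
SHARED_BUFFERS_MB=128

WORST_CASE_MB=$((MAX_CONNECTIONS * WORK_MEM_MB + SHARED_BUFFERS_MB))
echo "worst case: ${WORST_CASE_MB}MB"   # 328MB, still inside a ~512MB slice
```

In practice you never hit the worst case, but if this number exceeds your budget, work_mem or max_connections is the knob to turn down.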
3. Redis with strict limits
maxmemory 64mb
maxmemory-policy allkeys-lru
Redis without maxmemory can grow indefinitely and trigger the OOM killer. With allkeys-lru, when it hits the limit, it evicts the least recently used keys instead of returning errors.
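To get a feel for what 64MB buys: assuming roughly 1KB per cached entry (an assumption for illustration, not a measurement; rate-limit counters are far smaller, semantic-cache entries with embeddings can be much larger):

```shell
# Rough capacity of the 64MB cap at an assumed ~1KB per entry.
MAXMEMORY_MB=64
ENTRY_KB=1

ENTRIES=$((MAXMEMORY_MB * 1024 / ENTRY_KB))
echo "~${ENTRIES} entries before LRU eviction kicks in"
```

Tens of thousands of entries is plenty for a rate limiter plus a modest semantic cache on a single-node app.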
4. Memory limits on the backend
deploy:
resources:
limits:
memory: 1536M
The backend with loaded embedding models uses ~800MB-1.2GB. The 1536MB limit gives it headroom without allowing a memory leak to consume the entire VPS.
Dockerfiles: Multi-Stage Builds
Backend: Pre-Download Models
# === Stage 1: Builder ===
FROM python:3.12-slim AS builder
WORKDIR /build
COPY requirements.txt .
# PyTorch CPU-only (saves ~1.5GB vs CUDA version)
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu \
&& pip install --no-cache-dir -r requirements.txt
# Pre-download embedding and cross-encoder models
RUN python -c "from sentence_transformers import SentenceTransformer, CrossEncoder; \
SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2'); \
CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')"
# === Stage 2: Runtime ===
FROM python:3.12-slim AS runtime
WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends \
libfontconfig1 curl ca-certificates && rm -rf /var/lib/apt/lists/*
# Copy dependencies and pre-downloaded models
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY --from=builder /root/.cache /root/.cache
COPY . .
RUN chmod +x start.sh
EXPOSE 8000
CMD ["./start.sh"]
Why pre-download models during build? Without it, the first request after each deploy takes 30-60 seconds while models download. With pre-download, the container starts ready to serve.
Why PyTorch CPU-only? The CUDA version weighs ~2GB extra. On a VPS without GPU it's dead weight.
Frontend: Build + Static Nginx
# === Stage 1: Build ===
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# === Stage 2: Serve ===
FROM nginx:1.27-alpine
# Copy static build
COPY --from=build /app/dist /usr/share/nginx/html
# SPA config
COPY nginx.conf /etc/nginx/conf.d/default.conf
EXPOSE 5173
CMD ["nginx", "-g", "daemon off;"]
The production frontend does NOT run Vite. It's Nginx serving static files. The image goes from ~400MB (node + deps) to ~25MB (nginx alpine + dist).
Internal Frontend Nginx (SPA)
server {
listen 5173;
root /usr/share/nginx/html;
index index.html;
# Vite-hashed assets → aggressive cache
location /assets/ {
expires 1y;
add_header Cache-Control "public, immutable";
}
# index.html → never cache (so new deploys are reflected)
location = /index.html {
add_header Cache-Control "no-cache";
}
# SPA fallback: all routes → index.html
location / {
try_files $uri $uri/ /index.html;
}
gzip on;
gzip_types text/plain text/css application/json application/javascript;
}
Nginx: The Reverse Proxy That Connects Everything
# === Main HTTPS ===
server {
listen 443 ssl http2;
server_name your-domain.com;
# Origin certificates (Cloudflare → VPS)
ssl_certificate /etc/ssl/certs/origin.pem;
ssl_certificate_key /etc/ssl/private/origin.key;
ssl_protocols TLSv1.2 TLSv1.3;
client_max_body_size 50M; # For document uploads
# === Maintenance mode ===
set $maintenance 0;
if (-f /etc/nginx/maintenance.on) {
set $maintenance 1;
}
# Keep the health check reachable: server-level "if" runs in the rewrite
# phase, before location matching, so the exemption must happen here too
if ($uri = /api/v1/health) {
set $maintenance 0;
}
# If maintenance mode → 503
if ($maintenance) {
return 503;
}
# Health check always available (for monitoring)
location = /api/v1/health {
proxy_pass http://127.0.0.1:8000;
}
# === API Backend ===
location /api/ {
proxy_pass http://127.0.0.1:8000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# SSE streaming: CRITICAL
proxy_buffering off;
proxy_cache off;
proxy_read_timeout 300s;
proxy_set_header Connection ''; # Don't send "Connection: close" upstream
proxy_http_version 1.1;
chunked_transfer_encoding off;
}
# === Widget JS (cacheable) ===
location = /widget.js {
proxy_pass http://127.0.0.1:8000;
proxy_cache_valid 200 1h;
}
# === Frontend SPA ===
location / {
proxy_pass http://127.0.0.1:5173;
}
# === Gzip ===
gzip on;
gzip_comp_level 4;
gzip_min_length 256;
gzip_types text/plain text/css application/json application/javascript
text/xml application/xml text/javascript image/svg+xml;
# === Security headers ===
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
}
# === HTTP → HTTPS redirect ===
server {
listen 80;
server_name your-domain.com;
return 301 https://$server_name$request_uri;
}
# === www → non-www ===
server {
listen 443 ssl http2;
server_name www.your-domain.com;
ssl_certificate /etc/ssl/certs/origin.pem;
ssl_certificate_key /etc/ssl/private/origin.key;
return 301 https://your-domain.com$request_uri;
}
# === 503 Maintenance page (these directives live inside the main HTTPS server block) ===
error_page 503 @maintenance;
location @maintenance {
default_type text/html;
return 503 '<!DOCTYPE html>
<html><head><meta charset="UTF-8"><title>Maintenance</title>
<style>body{font-family:system-ui;display:flex;justify-content:center;
align-items:center;min-height:100vh;background:#0f172a;color:#e2e8f0;
text-align:center}h1{font-size:2rem}p{color:#94a3b8}</style></head>
<body><div><h1>Under Maintenance</h1>
<p>We will be back in a few minutes.</p></div></body></html>';
}
The SSE Block That Almost Broke Everything
location /api/ {
proxy_buffering off; # Nginx must NOT buffer the response
proxy_cache off; # Nor cache it
proxy_read_timeout 300s; # SSE can last minutes
proxy_set_header Connection ''; # Clear header so upstream connection stays open
proxy_http_version 1.1; # HTTP/1.1 required for chunked
chunked_transfer_encoding off;
}
Without these headers, Nginx buffers SSE tokens and sends them all at once at the end. The user sees "nothing nothing nothing... entire text at once". These 6 parameters are mandatory for real streaming through a reverse proxy.
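You can feel the difference locally without nginx at all. In this sketch, `sort` stands in for a buffering hop (it must read all input before emitting anything), while the direct pipe delivers output as it is produced:

```shell
# A producer that emits "tokens" slowly, like an LLM streaming endpoint
emit() { for t in 1 2 3; do echo "token $t"; sleep 0.2; done; }

# Streaming path: the consumer gets the first token right away and exits
START=$(date +%s%N)
emit | head -n 1 > /dev/null
T_STREAM=$(( ($(date +%s%N) - START) / 1000000 ))

# "Buffering proxy" path: sort waits for EOF, so the first token only
# appears after the whole stream has been produced (~600ms here)
START=$(date +%s%N)
emit | sort | head -n 1 > /dev/null
T_BUFFERED=$(( ($(date +%s%N) - START) / 1000000 ))

echo "first token: streamed=${T_STREAM}ms buffered=${T_BUFFERED}ms"
```

Nginx with `proxy_buffering on` behaves like the `sort` hop: nothing reaches the browser until the response is complete.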
The Deploy Script
#!/bin/bash
set -e
SERVER="user@server-ip"
SSH_KEY="$HOME/.ssh/deploy_key"  # "~" would not expand inside quotes
PROJECT_DIR="/opt/my-app"
BACKUP_DIR="/opt/my-app/backups/db"
DEPLOY_MODE="${1:-full}" # frontend | backend | full
ssh_cmd() {
ssh -i "$SSH_KEY" "$SERVER" "$1"
}
echo "=== Deploy: $DEPLOY_MODE ==="
# 1. Push code
git push origin main
# 2. Pull on server
ssh_cmd "cd $PROJECT_DIR && git fetch origin && git reset --hard origin/main"
# 3. Database backup (backend/full only)
if [[ "$DEPLOY_MODE" != "frontend" ]]; then
echo "Creating DB backup..."
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
# Vars must expand inside the container (sh -c + single quotes), not on the host
ssh_cmd "docker exec app-db sh -c 'pg_dump -U \$POSTGRES_USER \$POSTGRES_DB' \
| gzip > $BACKUP_DIR/backup_${TIMESTAMP}.gz"
# Enable maintenance mode
ssh_cmd "touch /etc/nginx/maintenance.on && nginx -s reload"
echo "Maintenance mode: ON"
fi
# 4. Rebuild and restart
case $DEPLOY_MODE in
frontend)
ssh_cmd "cd $PROJECT_DIR && docker compose -f docker-compose.prod.yml \
build frontend && docker compose -f docker-compose.prod.yml \
up -d frontend"
;;
backend)
ssh_cmd "cd $PROJECT_DIR && docker compose -f docker-compose.prod.yml \
build backend && docker compose -f docker-compose.prod.yml \
up -d backend"
;;
full)
ssh_cmd "cd $PROJECT_DIR && docker compose -f docker-compose.prod.yml \
up -d --build"
;;
esac
# 5. Disable maintenance mode
if [[ "$DEPLOY_MODE" != "frontend" ]]; then
sleep 15 # Wait for backend to load models
ssh_cmd "rm -f /etc/nginx/maintenance.on && nginx -s reload"
echo "Maintenance mode: OFF"
fi
# 6. Health check
echo "Checking health..."
MAX_RETRIES=30
for i in $(seq 1 $MAX_RETRIES); do
STATUS=$(ssh_cmd "curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8000/api/v1/health")
if [[ "$STATUS" == "200" ]]; then
echo "Deploy successful. Health: OK"
exit 0
fi
echo "Attempt $i/$MAX_RETRIES... (status: $STATUS)"
sleep 5
done
echo "ERROR: Health check failed after $MAX_RETRIES attempts"
exit 1
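The mode selection at the top of the script is plain bash default expansion: `${1:-full}` means "the first argument, or full if none was given". The idiom in miniature (`pick_mode` is a hypothetical helper, just for illustration):

```shell
# Same idiom as DEPLOY_MODE="${1:-full}" in the deploy script
pick_mode() { echo "${1:-full}"; }

MODE_DEFAULT=$(pick_mode)           # no argument falls back to "full"
MODE_EXPLICIT=$(pick_mode backend)  # an explicit argument wins
echo "$MODE_DEFAULT / $MODE_EXPLICIT"
```

So `./deploy.sh` alone does a full rebuild, while `./deploy.sh backend` or `./deploy.sh frontend` limits the blast radius.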
Why Maintenance Mode?
The backend takes ~15 seconds to start: it runs Alembic migrations and then Uvicorn loads the embedding models. Without maintenance mode, during those 15 seconds Nginx returns 502 Bad Gateway.
With the /etc/nginx/maintenance.on file, Nginx returns a styled 503 page. The health check (/api/v1/health) is exempted so the script can verify when the backend is ready.
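The mechanism is nothing more than a file-existence test: nginx's `if (-f ...)` checks whether the flag file exists on each request, so toggling is an atomic touch/rm. The same pattern in shell, with a temp file standing in for /etc/nginx/maintenance.on:

```shell
# The flag-file toggle in miniature
FLAG=$(mktemp -u)    # path only, file not created yet

state() { if [ -f "$FLAG" ]; then echo maintenance; else echo live; fi; }

S1=$(state)                   # live
touch "$FLAG"; S2=$(state)    # maintenance
rm -f "$FLAG"; S3=$(state)    # live again
echo "$S1 -> $S2 -> $S3"
```

Because creating and removing a file is atomic, there is no window where the state is ambiguous, which is exactly why this ancient pattern still works.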
Backend Start Script
#!/bin/bash
set -e
echo "Running database migrations..."
alembic upgrade head
echo "Starting FastAPI server..."
# 1 worker = ~800MB with loaded models
# 2 workers = ~1.3GB (only if you have RAM to spare)
WORKERS="${UVICORN_WORKERS:-1}"
exec uvicorn app.main:app \
--host 0.0.0.0 \
--port 8000 \
--workers "$WORKERS" \
--log-level "${UVICORN_LOG_LEVEL:-info}" \
"$@"
Why 1 worker? Each Uvicorn worker loads its own copy of the embedding models (~500MB). With 2 workers you're already at 1.3GB for backend alone. On a 4GB VPS with PostgreSQL, Redis and Nginx, 1 worker is the safe choice.
FastAPI is async, so 1 worker handles concurrency well — I/O operations (DB, LLM API, Redis) don't block the event loop.
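The worker math above, made explicit. The ~300MB base and ~500MB-per-worker figures are rough estimates derived from the article's own numbers, not universal constants:

```shell
# Estimated backend RSS as a function of Uvicorn workers.
# BASE_MB and MODEL_MB are rough estimates; measure your own models.
BASE_MB=300    # app code, FastAPI, DB/Redis connections
MODEL_MB=500   # embedding + cross-encoder copies, loaded per worker

for W in 1 2 4; do
  echo "workers=$W -> ~$((BASE_MB + W * MODEL_MB))MB"
done
```

One worker lands around 800MB and two around 1.3GB, which matches the figures quoted earlier; four workers would blow straight past the 1536MB container limit.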
Automated Maintenance with Cron
#!/bin/bash
# Run: crontab -e → 0 4 * * 0 /opt/my-app/scripts/maintenance.sh
LOG="/var/log/app-maintenance.log"
echo "=== Maintenance $(date) ===" >> "$LOG"
# 1. Clean orphaned Docker images (>7 days)
docker image prune -f --filter "until=168h" >> "$LOG" 2>&1
# 2. Clean stopped containers
docker container prune -f --filter "until=168h" >> "$LOG" 2>&1
# 3. Check disk usage
DISK_USAGE=$(df / --output=pcent | tail -1 | tr -dc '0-9')
if [ "$DISK_USAGE" -gt 80 ]; then
echo "ALERT: Disk at ${DISK_USAGE}%" >> "$LOG"
fi
# 4. Check memory
MEM_USAGE=$(free | awk '/Mem:/ {printf "%.0f", $3/$2 * 100}')
if [ "$MEM_USAGE" -gt 90 ]; then
echo "ALERT: RAM at ${MEM_USAGE}%" >> "$LOG"
fi
# 5. Container status
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Size}}" >> "$LOG"
# 6. Backup inventory
BACKUP_COUNT=$(ls /opt/my-app/backups/db/*.gz 2>/dev/null | wc -l)
LATEST=$(ls -t /opt/my-app/backups/db/*.gz 2>/dev/null | head -1)
echo "Backups: $BACKUP_COUNT files. Latest: $LATEST" >> "$LOG"
echo "=== Done ===" >> "$LOG"
This runs every Sunday at 4am. It cleans Docker garbage, checks disk and RAM, and logs everything. Simple but effective — it saved me twice from running out of disk due to accumulated Docker images.
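One gap worth noting: the deploy script creates a backup on every backend deploy, but nothing prunes old ones. A sketch of a retention policy (keep the newest 14), demonstrated on a throwaway directory rather than the real backup path:

```shell
# Retention sketch: keep the 14 most recent dumps, delete the rest.
# Uses a temp dir for the demo; point it at /opt/my-app/backups/db in real use.
BACKUP_DIR=$(mktemp -d)
for i in $(seq -w 1 20); do touch "$BACKUP_DIR/backup_$i.gz"; done

# ls -t sorts newest-first; tail -n +15 selects everything past the 14th
ls -t "$BACKUP_DIR"/*.gz | tail -n +15 | xargs -r rm -f

KEPT=$(ls "$BACKUP_DIR"/*.gz | wc -l)
echo "kept $KEPT backups"
```

Dropping a line like this into the weekly maintenance script keeps the backup directory from quietly eating the 80GB disk.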
Cloudflare + SSL: The Configuration
Internet → Cloudflare (proxy) → VPS (Nginx 443) → Docker containers
Setup
- DNS on Cloudflare: A record pointing to VPS, proxy enabled (orange cloud)
- SSL on Cloudflare: "Full (strict)" — encrypts both browser→Cloudflare and Cloudflare→VPS
- Origin certificate: Generated in Cloudflare Dashboard → SSL/TLS → Origin Server → Create Certificate
- On the VPS: copy the certificate and key to /etc/ssl/certs/origin.pem and /etc/ssl/private/origin.key
Why not Let's Encrypt? With the Cloudflare proxy enabled, the HTTP-01 challenge gets fiddlier: Cloudflare sits in front of the domain and its redirects can interfere with validation (a DNS-01 challenge works, but that's more moving parts). Cloudflare origin certificates last 15 years and are configured in 2 minutes.
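Before pointing Nginx at the origin pair, it's worth checking that the certificate and key actually match; a mismatch only surfaces when nginx refuses to reload. The standard check compares modulus hashes. Here it runs on a throwaway self-signed pair so the snippet is self-contained; on the VPS you'd run the two comparison lines against origin.pem and origin.key:

```shell
# Generate a throwaway cert/key pair just to demonstrate the check
DIR=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=example.test" \
  -keyout "$DIR/origin.key" -out "$DIR/origin.pem" 2>/dev/null

# A cert and its key share the same RSA modulus; compare the hashes
CERT_MOD=$(openssl x509 -noout -modulus -in "$DIR/origin.pem" | openssl md5)
KEY_MOD=$(openssl rsa -noout -modulus -in "$DIR/origin.key" | openssl md5)

[ "$CERT_MOD" = "$KEY_MOD" ] && echo "cert and key match"
```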
Bonus: Cloudflare as Free CDN
With the proxy enabled, Cloudflare automatically caches static assets (JS, CSS, images). The VPS only receives API requests and HTML. This significantly reduces server traffic.
RAM Distribution
Here's how it looks on a 4GB VPS:
┌──────────────────────────────────────────┐
│ 4GB RAM Total │
├──────────────────────────────────────────┤
│ OS + Docker Engine ~400MB │
│ PostgreSQL ~200-400MB │
│ Backend (1 worker) ~800MB-1.2GB │
│ Redis ≤64MB │
│ Nginx + Frontend ~30MB │
│ Free / Buffer ~1.5-2GB │
├──────────────────────────────────────────┤
│ Swap (2GB) emergency │
└──────────────────────────────────────────┘
The ~1.5-2GB free are for:
- OS filesystem cache (helps PostgreSQL)
- Traffic spikes
- Maintenance operations (backups, builds)
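The same budget as arithmetic, using midpoint estimates from the diagram (these are the article's rough numbers, not guarantees for your workload):

```shell
# RAM budget check with rough midpoints from the diagram (MB)
OS_DOCKER=400; POSTGRES=400; BACKEND=1200; REDIS=64; NGINX_FRONT=30
TOTAL=4096

USED=$((OS_DOCKER + POSTGRES + BACKEND + REDIS + NGINX_FRONT))
FREE=$((TOTAL - USED))
echo "used=${USED}MB free=${FREE}MB"
```

Roughly 2GB used and 2GB free matches the "stable RAM usage" figure reported later; if your own numbers leave much less headroom, that's the signal to trim work_mem, the backend limit, or the model set.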
Tip: Always configure swap (2GB) as a safety net. Without swap, the OOM killer terminates processes without warning when RAM fills up.
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
Security Checklist
Before considering the deploy "ready":
- [x] Docker ports on localhost only (127.0.0.1:port:port)
- [x] Firewall active (ufw allow 22,80,443/tcp && ufw enable)
- [x] SSH key-only (disable password auth in /etc/ssh/sshd_config)
- [x] .env.production outside the repo (.gitignore)
- [x] Secrets never in compose files (use env_file reference)
- [x] DB without external port (only accessible via Docker network)
- [x] Automated backups with periodic verification
- [x] Health check in the deploy script
- [x] Swap configured to prevent OOM kills
Real Numbers
| Metric | Value |
|---|---|
| Build time (backend) | ~3-4 min (first time), ~1 min (cached) |
| Build time (frontend) | ~30s |
| Full deploy time | ~2-3 min |
| Visible downtime | 0s (maintenance page during rebuild) |
| Backend image size | ~2.1GB (includes PyTorch CPU + models) |
| Frontend image size | ~25MB (nginx + dist) |
| Stable RAM usage | ~2-2.5GB of 4GB |
| Monthly cost | ~$24 (VPS) + $0 (Cloudflare free) |
| Uptime last month | 99.8% (one outage for OS update) |
Lessons Learned
1. Docker + Firewall = Be Careful
Docker modifies iptables directly. If you publish a port without 127.0.0.1, ufw deny won't block it. Always use 127.0.0.1: in production and let Nginx be the only entry point.
2. Pre-Download ML Models in the Build
If models download at runtime, the first cold start takes a minute. Worse: if HuggingFace has downtime, your deploy fails. Pre-downloading in the Dockerfile eliminates both problems.
3. 1 Uvicorn Worker Is Enough (If It's Async)
The temptation is to set 4 workers "just in case". But each one loads ~500MB of models. With async FastAPI, a single worker handles hundreds of concurrent requests. Scale workers only when you have evidence that CPU is the bottleneck.
4. Maintenance Mode > Zero-Downtime Rolling Deploys
For a single-node VPS, a rolling deploy requires complex orchestration. A 503 page for 15 seconds is infinitely simpler and nobody complains — especially if the health check guarantees automatic deactivation.
5. PostgreSQL Defaults Are for Dedicated Servers
PostgreSQL defaults assume it has all the RAM to itself. On a shared VPS with 4 other services, not tuning PostgreSQL is a guaranteed OOM kill. shared_buffers, effective_cache_size and max_connections are the first parameters to adjust.
6. Cloudflare Origin Certs > Let's Encrypt
With proxy enabled, Let's Encrypt is more complicated to configure and renew. Cloudflare origin certs last 15 years, are configured once and forgotten.
What's Next
- Monitoring with Prometheus + Grafana: Latency, errors, and resource usage metrics (currently logs + cron only)
- Offsite backup: Copy backups to an S3/R2 bucket instead of storing them only on the same VPS
- Blue-green deploys: When traffic justifies a second VPS
- CI/CD with GitHub Actions: Automate the deploy script (currently a manual ./deploy.sh backend)
Conclusion
Deploying a RAG system isn't like deploying a CRUD app. Embedding models consume real RAM, SSE streaming needs specific Nginx configuration, and PostgreSQL with pgvector needs careful tuning on limited servers.
But you also don't need Kubernetes or a $200/month cluster. A $24 VPS with well-configured Docker Compose, Nginx as reverse proxy, and deploy scripts with maintenance mode + health checks is enough to serve thousands of queries per day with consistent latencies.
Most importantly: measure first, scale later. Starting with 1 worker, 4GB of RAM and basic monitoring gives you all the information you need to make infrastructure decisions based on real data, not estimates.
This is the second article in the series. The first covers the RAG pipeline architecture. If it was useful, a like helps it reach more people. Questions about any part of the deploy? Drop a comment.