Drawing on 15+ years of experience deploying containerized systems at scale across fullstack, AI/ML, IoT, and robotics domains
After architecting containerized deployments for everything from high-frequency trading platforms to autonomous robot fleets, I've learned that production Docker deployments require far more than just writing a Dockerfile. This comprehensive guide distills hard-won lessons from real-world deployments into actionable strategies for 2025 and beyond.
Table of Contents
- Modern Multi-Stage Build Patterns
- Security-First Container Design
- Health Checks and Self-Healing
- Environment Configuration & Secrets
- Production Logging & Observability
- Orchestration: Kubernetes vs Docker Swarm
- Domain-Specific Deployments
- Scaling Architecture Patterns
- CI/CD Integration & GitOps
- Monitoring & Troubleshooting
- Future-Proofing Your Deployments
Modern Multi-Stage Build Patterns {#modern-multi-stage-builds}
Multi-stage builds are no longer optional—they're fundamental to production deployments. Here's why and how to use them effectively:
The Problems Multi-Stage Builds Solve
- Image Bloat: Development dependencies shouldn't ship to production
- Attack Surface: Build tools are unnecessary security risks in runtime
- Reproducibility: Separate build from runtime for consistent deploys
Production-Ready Multi-Stage Pattern
# ========================================
# Stage 1: Build Environment
# ========================================
FROM node:20-alpine AS builder
# Install build dependencies only
RUN apk add --no-cache python3 make g++
WORKDIR /build
# Layer caching optimization: Copy dependency files first
COPY package*.json ./
COPY yarn.lock* ./
# Install ALL dependencies (including devDependencies)
RUN npm ci
# Copy source code
COPY . .
# Build application
RUN npm run build && \
npm prune --production
# ========================================
# Stage 2: Production Runtime
# ========================================
FROM node:20-alpine
# Security: Create non-root user
RUN addgroup -g 1001 -S nodejs && \
adduser -S nodejs -u 1001
# Install only runtime dependencies
RUN apk add --no-cache dumb-init
WORKDIR /app
# Copy only production artifacts
COPY --from=builder --chown=nodejs:nodejs /build/dist ./dist
COPY --from=builder --chown=nodejs:nodejs /build/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /build/package.json ./
# Switch to non-root user
USER nodejs
# Use dumb-init for proper signal handling
ENTRYPOINT ["dumb-init", "--"]
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
CMD node -e "require('http').get('http://localhost:3000/health', (r) => { process.exit(r.statusCode === 200 ? 0 : 1) })"
EXPOSE 3000
CMD ["node", "dist/index.js"]
Advanced Multi-Stage Techniques
For Python/ML Applications:
# Build stage with full conda environment
FROM continuumio/miniconda3:latest AS builder
WORKDIR /build
COPY environment.yml .
RUN conda env create -f environment.yml && \
conda clean -afy
# Production stage with minimal runtime
FROM python:3.11-slim
COPY --from=builder /opt/conda/envs/myenv /opt/conda/envs/myenv
ENV PATH="/opt/conda/envs/myenv/bin:$PATH"
WORKDIR /app
COPY . .
CMD ["python", "app.py"]
Key Lessons:
- Always use specific version tags, never latest
- Order layers by change frequency (dependencies before code)
- Use .dockerignore aggressively (node_modules, .git, tests, etc.); a sample follows this list
- Consider distroless or scratch images for maximum security
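A starting-point .dockerignore for a Node.js project (adjust the entries to your repo layout):
# .dockerignore
node_modules
npm-debug.log
.git
.env*
coverage
tests
Dockerfile
docker-compose*.yml
README.md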
Security-First Container Design {#security-first-design}
Security must be baked in from the start. Here's my battle-tested security stack:
1. Base Image Selection & Scanning
# Use Trivy for vulnerability scanning
trivy image --severity HIGH,CRITICAL myapp:latest
# Use Grype for additional coverage
grype myapp:latest
# Integrate into CI/CD
docker build -t myapp:${CI_COMMIT_SHA} .
trivy image --exit-code 1 --severity CRITICAL myapp:${CI_COMMIT_SHA}
Tool Selection (2025):
- Trivy: Best open-source scanner, fast, comprehensive (OS packages + app dependencies)
- Grype: Excellent SBOM-driven scanning
- Snyk: Enterprise choice with fix suggestions and CI/CD integrations
- Docker Scout: Native Docker integration, real-time insights
2. Non-Root User Pattern
# WRONG - Running as root
FROM ubuntu:22.04
COPY app /app
CMD ["/app/server"]
# CORRECT - Non-root with proper permissions
FROM ubuntu:22.04
RUN groupadd -r appuser && \
useradd -r -g appuser -u 1001 appuser && \
mkdir /app && \
chown -R appuser:appuser /app
COPY --chown=appuser:appuser app /app
USER appuser
WORKDIR /app
CMD ["./server"]
3. Read-Only Root Filesystem
# docker-compose.yml
services:
api:
image: myapp:latest
read_only: true
tmpfs:
- /tmp:noexec,nosuid,size=100m
volumes:
- ./data:/app/data
security_opt:
- no-new-privileges:true
cap_drop:
- ALL
cap_add:
- NET_BIND_SERVICE
4. Secrets Management
NEVER do this:
# WRONG!
ENV DB_PASSWORD=mysecretpassword
ENV API_KEY=abc123
Production Pattern:
# Using Docker Swarm secrets
version: '3.8'
services:
app:
image: myapp:latest
environment:
- NODE_ENV=production
- DATABASE_URL_FILE=/run/secrets/db_url
secrets:
- db_url
- api_key
deploy:
replicas: 3
secrets:
db_url:
external: true
api_key:
external: true
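Note that the _FILE convention isn't resolved by Docker itself; your application has to read the file. A minimal Node.js helper, sketched as an illustration (the readSecret name and fallback order are my own convention, not a Docker API):
// secrets.js - resolve a value from <NAME>_FILE, falling back to a plain env var
const fs = require('fs');

function readSecret(name, fallback = undefined) {
  // e.g. DATABASE_URL_FILE=/run/secrets/db_url, as set in the compose file above
  const filePath = process.env[`${name}_FILE`];
  if (filePath) {
    // Swarm/K8s mount secrets as files; trim the trailing newline
    return fs.readFileSync(filePath, 'utf8').trim();
  }
  // Fall back to a plain environment variable (local development only)
  return process.env[name] ?? fallback;
}

module.exports = { readSecret };
// Usage: const databaseUrl = readSecret('DATABASE_URL');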
For Kubernetes:
apiVersion: v1
kind: Secret
metadata:
name: app-secrets
type: Opaque
stringData:
database-url: "postgresql://..."
api-key: "..."
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
spec:
template:
spec:
containers:
- name: app
envFrom:
- secretRef:
name: app-secrets
Enterprise Pattern: Use External Secret Managers
# Using External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
name: vault-backend
spec:
provider:
vault:
server: "https://vault.company.com"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "app-role" # required: Vault role bound to the workload's service account
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: app-secrets
spec:
secretStoreRef:
name: vault-backend
target:
name: app-secrets
data:
- secretKey: database-url
remoteRef:
key: secret/data/app/database
property: url
5. Image Signing & Verification
# Sign images with Cosign (2025 standard)
cosign sign --key cosign.key myregistry/myapp:v1.0
# Verify before deployment
cosign verify --key cosign.pub myregistry/myapp:v1.0
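If you'd rather not manage long-lived keys, Sigstore keyless signing is worth considering; a sketch assuming GitHub Actions as the OIDC identity provider (adjust the identity regexp to your org):
# Keyless signing: a short-lived certificate is issued against your CI's OIDC identity
cosign sign --yes myregistry/myapp:v1.0
# Verification pins the expected identity and issuer
cosign verify \
  --certificate-identity-regexp "https://github.com/myorg/.*" \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  myregistry/myapp:v1.0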
Health Checks and Self-Healing {#health-checks}
Proper health checks are the difference between 99.9% and 99.99% uptime.
Dockerfile Health Checks
# Basic HTTP health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1
# Advanced health check with dependencies
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD curl -f http://localhost:8080/health/ready || exit 1
Application-Level Health Endpoints
// Express.js health check pattern
// (assumes db and redis clients plus checkExternalServices() are defined elsewhere)
const express = require('express');
const app = express();
let isReady = false; // flip to true once async initialization completes
// Liveness: Is the application running?
app.get('/health/live', (req, res) => {
res.status(200).json({ status: 'alive', timestamp: Date.now() });
});
// Readiness: Is the application ready to serve traffic?
app.get('/health/ready', async (req, res) => {
try {
// Check database connection
await db.ping();
// Check Redis connection
await redis.ping();
// Check external API dependencies
await checkExternalServices();
res.status(200).json({
status: 'ready',
timestamp: Date.now(),
dependencies: { db: 'ok', cache: 'ok', apis: 'ok' }
});
} catch (error) {
res.status(503).json({
status: 'not ready',
error: error.message,
timestamp: Date.now()
});
}
});
// Startup: Has initialization completed?
app.get('/health/startup', (req, res) => {
if (isReady) {
res.status(200).json({ status: 'started' });
} else {
res.status(503).json({ status: 'starting' });
}
});
Kubernetes Probes (Production Pattern)
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
spec:
replicas: 3
template:
spec:
containers:
- name: app
image: myapp:v1.0
ports:
- containerPort: 8080
# Startup probe: Gives app time to initialize
startupProbe:
httpGet:
path: /health/startup
port: 8080
failureThreshold: 30
periodSeconds: 10
# Liveness probe: Restart if unhealthy
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 0
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
# Readiness probe: Remove from service if not ready
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
successThreshold: 1
failureThreshold: 3
Critical Insight: Separate liveness from readiness. Liveness failures restart pods; readiness failures just remove them from load balancers. A dependency failure should affect readiness, not liveness.
Environment Configuration & Secrets {#configuration-management}
Configuration management makes or breaks production deployments. Here's the hierarchy I use, ordered from highest to lowest precedence (a small loader sketch follows the list):
Configuration Hierarchy
1. Secrets (never in code or config files)
2. Environment variables (deployment-specific)
3. Config files (mounted as volumes)
4. Application defaults (in code)
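A minimal loader that applies this order, sketched in Node.js (the file locations and loadConfig name are illustrative, not a standard API):
// config.js - resolve settings in order: secret file > env var > config file > default
const fs = require('fs');

function loadConfig() {
  // 4. Application defaults (in code)
  const defaults = { port: 8080, logLevel: 'info' };

  // 3. Config file (mounted as a volume), if present
  try {
    Object.assign(defaults, JSON.parse(fs.readFileSync('/app/config/production.json', 'utf8')));
  } catch { /* no config file mounted; keep defaults */ }

  return {
    port: Number(process.env.PORT) || defaults.port,           // 2. env var
    logLevel: process.env.LOG_LEVEL || defaults.logLevel,      // 2. env var
    // 1. Secret (mounted file), never baked into the image
    databaseUrl: fs.existsSync('/run/secrets/db_url')
      ? fs.readFileSync('/run/secrets/db_url', 'utf8').trim()
      : process.env.DATABASE_URL,
  };
}

module.exports = { loadConfig };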
Docker Compose Production Pattern
version: '3.8'
services:
api:
image: ${REGISTRY}/myapp:${VERSION}
environment:
- NODE_ENV=production
- LOG_LEVEL=${LOG_LEVEL:-info}
- DATABASE_URL=${DATABASE_URL}
env_file:
- .env.production
secrets:
- db_password
- jwt_secret
configs:
- source: app_config
target: /app/config/production.yml
deploy:
replicas: 3
restart_policy:
condition: on-failure
delay: 5s
max_attempts: 3
window: 120s
update_config:
parallelism: 1
delay: 10s
failure_action: rollback
monitor: 30s
resources:
limits:
cpus: '2'
memory: 2G
reservations:
cpus: '1'
memory: 1G
secrets:
db_password:
external: true
jwt_secret:
external: true
configs:
app_config:
file: ./config/production.yml
Kubernetes ConfigMap + Secret Pattern
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
data:
app.yml: |
server:
port: 8080
timeout: 30s
features:
newFeature: true
logging:
level: info
---
apiVersion: apps/v1
kind: Deployment
spec:
template:
spec:
containers:
- name: app
volumeMounts:
- name: config
mountPath: /app/config
readOnly: true
- name: secrets
mountPath: /app/secrets
readOnly: true
volumes:
- name: config
configMap:
name: app-config
- name: secrets
secret:
secretName: app-secrets
Production Logging & Observability {#logging-observability}
Logging is not optional. Here's my production stack:
Structured Logging Pattern
// Winston configuration for production
const winston = require('winston');
const { ElasticsearchTransport } = require('winston-elasticsearch');
const uuid = require('uuid'); // used by the correlation middleware below
const logger = winston.createLogger({
level: process.env.LOG_LEVEL || 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.errors({ stack: true }),
winston.format.json()
),
defaultMeta: {
service: 'myapp',
version: process.env.VERSION,
environment: process.env.NODE_ENV
},
transports: [
// Console for Docker logs
new winston.transports.Console({
format: winston.format.combine(
winston.format.colorize(),
winston.format.simple()
)
}),
// Elasticsearch for centralized logging
new ElasticsearchTransport({
level: 'info',
clientOpts: {
node: process.env.ELASTICSEARCH_URL,
auth: {
username: process.env.ES_USER,
password: process.env.ES_PASSWORD
}
}
})
],
exceptionHandlers: [
new winston.transports.File({ filename: 'exceptions.log' })
],
rejectionHandlers: [
new winston.transports.File({ filename: 'rejections.log' })
]
});
// Request correlation middleware
app.use((req, res, next) => {
req.id = req.headers['x-request-id'] || uuid.v4();
req.logger = logger.child({ requestId: req.id });
next();
});
Docker Logging Configuration
# docker-compose.yml
services:
api:
image: myapp:latest
logging:
driver: "json-file"
options:
max-size: "10m"
max-file: "3"
labels: "service,environment"
labels:
service: "api"
environment: "production"
Production Observability Stack (2025)
version: '3.8'
services:
# Application
myapp:
image: myapp:latest
environment:
- OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
- OTEL_SERVICE_NAME=myapp
- OTEL_RESOURCE_ATTRIBUTES=environment=production,version=${VERSION}
depends_on:
- otel-collector
# OpenTelemetry Collector
otel-collector:
image: otel/opentelemetry-collector-contrib:latest
command: ["--config=/etc/otel-collector-config.yml"]
volumes:
- ./otel-collector-config.yml:/etc/otel-collector-config.yml
ports:
- "4317:4317" # OTLP gRPC receiver
- "4318:4318" # OTLP HTTP receiver
# Prometheus (Metrics)
prometheus:
image: prom/prometheus:latest
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
ports:
- "9090:9090"
# Grafana (Visualization)
grafana:
image: grafana/grafana:latest
environment:
- GF_SECURITY_ADMIN_PASSWORD=secret
volumes:
- grafana-data:/var/lib/grafana
- ./grafana/dashboards:/etc/grafana/provisioning/dashboards
- ./grafana/datasources:/etc/grafana/provisioning/datasources
ports:
- "3000:3000"
# Loki (Logs)
loki:
image: grafana/loki:latest
ports:
- "3100:3100"
volumes:
- ./loki-config.yml:/etc/loki/local-config.yml
- loki-data:/loki
# Tempo (Traces)
tempo:
image: grafana/tempo:latest
command: [ "-config.file=/etc/tempo.yml" ]
volumes:
- ./tempo.yml:/etc/tempo.yml
- tempo-data:/tmp/tempo
# Jaeger (Alternative distributed tracing)
jaeger:
image: jaegertracing/all-in-one:latest
environment:
- COLLECTOR_OTLP_ENABLED=true
ports:
- "16686:16686" # Jaeger UI
- "14268:14268" # Collector HTTP
- "4317:4317" # OTLP gRPC
volumes:
prometheus-data:
grafana-data:
loki-data:
tempo-data:
Application Instrumentation
// OpenTelemetry instrumentation
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces',
}),
metricReader: new PrometheusExporter({
port: 9464,
}),
serviceName: 'myapp',
});
sdk.start();
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('Tracing terminated'))
.catch((error) => console.log('Error terminating tracing', error))
.finally(() => process.exit(0));
});
Orchestration: Kubernetes vs Docker Swarm {#orchestration-choice}
The eternal question. Here's my decision framework after deploying both in production:
Decision Matrix
| Factor | Kubernetes | Docker Swarm |
|---|---|---|
| Team Size | 5+ engineers | 2-4 engineers |
| Complexity | High (steep learning curve) | Low (Docker-native) |
| Ecosystem | Massive (70%+ market share) | Limited but stable |
| Multi-cloud | Excellent | Limited |
| Resource Overhead | Higher | Lower |
| Advanced Features | StatefulSets, Jobs, CronJobs, Custom Resources | Basic orchestration |
| Community Support | Extensive | Limited |
| Best For | Large-scale, complex deployments | Small-medium deployments |
When to Choose Kubernetes
- Scale: Running 50+ services or 100+ containers
- Multi-cloud: Deploying across AWS, GCP, Azure
- Advanced patterns: Need service mesh, GitOps, custom operators
- Team expertise: Engineers familiar with K8s
- Ecosystem: Need Helm charts, operators, CNCF tools
When to Choose Docker Swarm
- Simplicity: Small team, straightforward deployment
- Docker-native: Already using Docker Compose
- Resource-constrained: Edge deployments, small clusters
- Quick deployment: Need to ship fast without K8s complexity
- Learning curve: Team new to orchestration
Docker Swarm Production Setup
# Initialize swarm
docker swarm init --advertise-addr <MANAGER-IP>
# Add workers
docker swarm join --token <WORKER-TOKEN> <MANAGER-IP>:2377
# Deploy stack
docker stack deploy -c docker-compose.yml myapp
# Scale service
docker service scale myapp_api=5
# Rolling update
docker service update --image myapp:v2 myapp_api
# Monitor
docker service ls
docker service ps myapp_api
Kubernetes Production Setup (K3s for Edge/IoT)
# Install K3s (lightweight K8s)
curl -sfL https://get.k3s.io | sh -
# Deploy application
kubectl apply -f deployment.yml
# Scale
kubectl scale deployment myapp --replicas=5
# Rolling update
kubectl set image deployment/myapp app=myapp:v2
# Monitor
kubectl get pods
kubectl top pods
kubectl logs -f deployment/myapp
Hybrid Approach: K3s/K8s at Edge, K8s in Cloud
# Edge K3s cluster (resource-constrained)
apiVersion: v1
kind: Namespace
metadata:
name: edge-production
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: edge-processor
namespace: edge-production
spec:
replicas: 2
template:
spec:
containers:
- name: processor
image: myapp:edge
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "200m"
nodeSelector:
node-role.kubernetes.io/edge: "true"
tolerations:
- key: "node-role.kubernetes.io/edge"
operator: "Exists"
effect: "NoSchedule"
Domain-Specific Deployments {#domain-specific}
Fullstack Applications {#fullstack}
Frontend + Backend + Database Pattern
version: '3.8'
services:
# Frontend (React/Next.js)
frontend:
build:
context: ./frontend
dockerfile: Dockerfile.prod
ports:
- "80:80"
- "443:443"
depends_on:
- backend
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
- certbot-certs:/etc/letsencrypt
- certbot-webroot:/var/www/certbot
deploy:
replicas: 2
resources:
limits:
cpus: '0.5'
memory: 256M
# Backend (Node.js/Python/Go)
backend:
image: ${REGISTRY}/backend:${VERSION}
environment:
- NODE_ENV=production
- DATABASE_URL=postgresql://postgres:5432/mydb
- REDIS_URL=redis://redis:6379
depends_on:
- db
- redis
deploy:
replicas: 3
resources:
limits:
cpus: '1'
memory: 1G
restart_policy:
condition: on-failure
# Database (PostgreSQL)
db:
image: postgres:16-alpine
environment:
POSTGRES_PASSWORD_FILE: /run/secrets/db_password
volumes:
- postgres-data:/var/lib/postgresql/data
- ./init.sql:/docker-entrypoint-initdb.d/init.sql
secrets:
- db_password
deploy:
placement:
constraints:
- node.labels.db == true
# Cache (Redis)
redis:
image: redis:7-alpine
command: redis-server --appendonly yes
volumes:
- redis-data:/data
# Background Jobs (Celery/Bull)
worker:
image: ${REGISTRY}/backend:${VERSION}
command: celery -A app.celery worker --loglevel=info
depends_on:
- redis
- db
deploy:
replicas: 2
volumes:
postgres-data:
redis-data:
certbot-certs:
certbot-webroot:
secrets:
db_password:
external: true
Nginx Configuration for Production
# nginx.conf
upstream backend {
    least_conn;
    # A single entry is enough here: Docker's internal DNS/VIP already
    # load-balances across all backend replicas
    server backend:8080 max_fails=3 fail_timeout=30s;
}
server {
listen 80;
server_name example.com www.example.com;
# Redirect HTTP to HTTPS
return 301 https://$host$request_uri;
}
server {
listen 443 ssl http2;
server_name example.com www.example.com;
ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;
# Modern SSL configuration
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
ssl_prefer_server_ciphers on;
# Security headers
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
# Static files
location /static {
alias /usr/share/nginx/html/static;
expires 1y;
add_header Cache-Control "public, immutable";
}
# API proxy
location /api {
proxy_pass http://backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_connect_timeout 30s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;
}
# SPA fallback
location / {
root /usr/share/nginx/html;
try_files $uri $uri/ /index.html;
}
}
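One gap worth calling out: the compose file above mounts certbot volumes, but something still has to obtain and renew the certificates, and the HTTP server block needs an ACME challenge location for webroot validation. A sketch using the official certbot image (volume names match the compose file, though they may carry your compose project prefix):
# Renew existing certificates (initial issuance uses "certonly" with the same volumes)
docker run --rm \
  -v certbot-certs:/etc/letsencrypt \
  -v certbot-webroot:/var/www/certbot \
  certbot/certbot renew --webroot -w /var/www/certbot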
AI/ML Model Serving {#ai-ml}
GPU-Accelerated ML Deployment
# Dockerfile for PyTorch/TensorFlow with GPU
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
# Install Python and dependencies
RUN apt-get update && apt-get install -y \
python3.11 \
python3-pip \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Install ML frameworks
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
# Copy model and application
COPY models/ ./models/
COPY app.py .
# Non-root user
RUN useradd -m -u 1001 mluser && \
chown -R mluser:mluser /app
USER mluser
# Expose API
EXPOSE 8000
# Run with Gunicorn + Uvicorn workers
CMD ["gunicorn", "app:app", \
"--workers", "4", \
"--worker-class", "uvicorn.workers.UvicornWorker", \
"--bind", "0.0.0.0:8000", \
"--timeout", "120"]
Kubernetes ML Deployment with GPU
apiVersion: apps/v1
kind: Deployment
metadata:
name: ml-inference
spec:
replicas: 2
template:
spec:
containers:
- name: model-server
image: myregistry/ml-model:v1.0
ports:
- containerPort: 8000
resources:
requests:
memory: "4Gi"
cpu: "2"
nvidia.com/gpu: 1
limits:
memory: "8Gi"
cpu: "4"
nvidia.com/gpu: 1
env:
- name: MODEL_PATH
value: "/models/my-model"
- name: BATCH_SIZE
value: "32"
volumeMounts:
- name: models
mountPath: /models
readOnly: true
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 30
readinessProbe:
httpGet:
path: /ready
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
volumes:
- name: models
persistentVolumeClaim:
claimName: model-storage
nodeSelector:
accelerator: nvidia-tesla-t4
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
FastAPI ML Serving Pattern
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch
import numpy as np
from typing import List
import logging
app = FastAPI()
# Load model at startup
model = None
@app.on_event("startup")
async def load_model():
global model
model = torch.load('/models/my-model.pth')
model.eval()
logging.info("Model loaded successfully")
class PredictionRequest(BaseModel):
data: List[List[float]]
class PredictionResponse(BaseModel):
predictions: List[float]
confidence: List[float]
@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
try:
input_tensor = torch.tensor(request.data, dtype=torch.float32)
with torch.no_grad():
output = model(input_tensor)
predictions = output.argmax(dim=1).tolist()
confidence = torch.softmax(output, dim=1).max(dim=1).values.tolist()
return PredictionResponse(
predictions=predictions,
confidence=confidence
)
except Exception as e:
logging.error(f"Prediction error: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/health")
async def health():
return {"status": "healthy", "model_loaded": model is not None}
@app.get("/metrics")
async def metrics():
# Prometheus metrics endpoint
return {"requests_total": 1000, "avg_latency_ms": 45}
MLOps Pipeline with Model Registry
version: '3.8'
services:
# MLflow for experiment tracking
mlflow:
image: ghcr.io/mlflow/mlflow:latest
command: mlflow server --host 0.0.0.0 --backend-store-uri postgresql://mlflow:password@db:5432/mlflow --default-artifact-root s3://mlflow-artifacts
ports:
- "5000:5000"
environment:
- AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
- AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    depends_on:
      - db # assumes a "db" Postgres service defined elsewhere in this stack
# Model serving
model-server:
image: myregistry/ml-model:${MODEL_VERSION}
environment:
- MLFLOW_TRACKING_URI=http://mlflow:5000
- MODEL_NAME=my-production-model
- MODEL_STAGE=Production
depends_on:
- mlflow
    deploy:
      replicas: 3
      resources:
        reservations:
          devices:
            # Compose expresses GPUs as device reservations;
            # "nvidia.com/gpu: 1" under limits is Kubernetes syntax, not Compose
            - driver: nvidia
              count: 1
              capabilities: [gpu]
IoT & Edge Computing {#iot-edge}
Edge Deployment with K3s
# Dockerfile for ARM64 edge devices
FROM arm64v8/python:3.11-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
libgpiod2 \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Run with resource constraints
CMD ["python3", "edge_processor.py"]
IoT Stack with MQTT
version: '3.8'
services:
# MQTT Broker (Eclipse Mosquitto)
mqtt:
image: eclipse-mosquitto:2
ports:
- "1883:1883"
- "9001:9001"
volumes:
- ./mosquitto.conf:/mosquitto/config/mosquitto.conf
- mosquitto-data:/mosquitto/data
- mosquitto-logs:/mosquitto/log
# IoT Gateway
gateway:
image: myregistry/iot-gateway:latest
environment:
- MQTT_BROKER=mqtt://mqtt:1883
- DEVICE_ID=${DEVICE_ID}
- CLOUD_ENDPOINT=${CLOUD_ENDPOINT}
depends_on:
- mqtt
devices:
- "/dev/ttyUSB0:/dev/ttyUSB0"
privileged: true
deploy:
resources:
limits:
cpus: '0.5'
memory: 256M
# Edge Analytics
analytics:
image: myregistry/edge-analytics:latest
environment:
- MQTT_BROKER=mqtt://mqtt:1883
- INFLUXDB_URL=http://influxdb:8086
depends_on:
- mqtt
- influxdb
# Time-series Database
influxdb:
image: influxdb:2.7-alpine
ports:
- "8086:8086"
volumes:
- influxdb-data:/var/lib/influxdb2
    environment:
      # influxdb:2.x is configured via DOCKER_INFLUXDB_INIT_* (the 1.x INFLUXDB_* vars are ignored)
      - DOCKER_INFLUXDB_INIT_MODE=setup
      - DOCKER_INFLUXDB_INIT_USERNAME=admin
      - DOCKER_INFLUXDB_INIT_PASSWORD=changeme-use-a-secret
      - DOCKER_INFLUXDB_INIT_ORG=iot
      - DOCKER_INFLUXDB_INIT_BUCKET=iot_data
# Grafana for visualization
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
volumes:
- grafana-data:/var/lib/grafana
depends_on:
- influxdb
volumes:
mosquitto-data:
mosquitto-logs:
influxdb-data:
grafana-data:
Edge Computing Best Practices
# edge_processor.py - Optimized for resource-constrained devices
import paho.mqtt.client as mqtt
import json
import logging
from collections import deque
import time

class EdgeProcessor:
    def __init__(self, broker_host='mqtt', broker_port=1883):
        self.mqtt_client = mqtt.Client()
        self.mqtt_client.connect(broker_host, broker_port)
        self.buffer = deque(maxlen=1000)  # Circular buffer caps memory use
        self.batch_size = 100
        self.last_upload = time.time()

    def is_valid(self, data):
        # Drop malformed readings at the edge instead of shipping them upstream
        return isinstance(data, dict) and 'timestamp' in data and 'value' in data

    def detect_anomaly(self, value, threshold=3.0):
        # Stand-in for lightweight edge inference (e.g., a z-score or tiny model)
        return abs(value) > threshold

    def process_sensor_data(self, data):
        # Edge processing: filter noise, aggregate, compress
        if self.is_valid(data):
            self.buffer.append(self.preprocess(data))
        # Batch upload to cloud
        if len(self.buffer) >= self.batch_size or \
                time.time() - self.last_upload > 300:  # 5 min
            self.upload_batch()

    def preprocess(self, data):
        # Run lightweight inference on edge
        return {
            'timestamp': data['timestamp'],
            'value': data['value'],
            'anomaly': self.detect_anomaly(data['value'])
        }

    def upload_batch(self):
        if self.buffer:
            batch = list(self.buffer)
            self.mqtt_client.publish('cloud/data', json.dumps(batch))
            logging.info("Uploaded batch of %d readings", len(batch))
            self.buffer.clear()
            self.last_upload = time.time()
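Wiring it up takes only a few more lines; the topic names here are hypothetical:
# Hypothetical wiring: subscribe to local sensor topics and feed the processor
processor = EdgeProcessor()

def on_message(client, userdata, msg):
    processor.process_sensor_data(json.loads(msg.payload))

processor.mqtt_client.on_message = on_message
processor.mqtt_client.subscribe('sensors/#')
processor.mqtt_client.loop_forever()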
Robotics Systems (ROS/ROS2) {#robotics}
ROS2 Docker Deployment
# Dockerfile for ROS2 Humble
FROM ros:humble-ros-base-jammy
# Install dependencies
RUN apt-get update && apt-get install -y \
ros-humble-navigation2 \
ros-humble-slam-toolbox \
ros-humble-robot-localization \
python3-colcon-common-extensions \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /ros2_ws
# Copy workspace
COPY src/ src/
# Build ROS2 workspace
RUN . /opt/ros/humble/setup.sh && \
colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release
# Setup entrypoint
COPY ./ros_entrypoint.sh /
RUN chmod +x /ros_entrypoint.sh
ENTRYPOINT ["/ros_entrypoint.sh"]
CMD ["ros2", "launch", "my_robot", "robot.launch.py"]
Multi-Robot Fleet Management
version: '3.8'
services:
# ROS Master / Discovery Server
ros2-discovery:
image: ros:humble
    command: bash -c "ros2 daemon start && sleep infinity" # keep PID 1 alive after the daemon forks
network_mode: host
environment:
- ROS_DOMAIN_ID=0
# Robot 1
robot1:
image: myregistry/robot:v1.0
environment:
- ROBOT_ID=robot1
- ROS_DOMAIN_ID=0
- ROBOT_NAMESPACE=/robot1
devices:
- /dev/video0:/dev/video0
- /dev/ttyACM0:/dev/ttyACM0
privileged: true
network_mode: host
# Robot 2
robot2:
image: myregistry/robot:v1.0
environment:
- ROBOT_ID=robot2
- ROS_DOMAIN_ID=0
- ROBOT_NAMESPACE=/robot2
devices:
- /dev/video1:/dev/video1
- /dev/ttyACM1:/dev/ttyACM1
privileged: true
network_mode: host
# Fleet Manager
fleet-manager:
image: myregistry/fleet-manager:latest
ports:
- "8080:8080"
environment:
- ROS_DOMAIN_ID=0
network_mode: host
depends_on:
- ros2-discovery
# Visualization (RViz)
rviz:
image: myregistry/robot:v1.0
command: ros2 run rviz2 rviz2
environment:
- DISPLAY=$DISPLAY
- ROS_DOMAIN_ID=0
volumes:
- /tmp/.X11-unix:/tmp/.X11-unix:rw
network_mode: host
Scaling Architecture Patterns {#scaling-patterns}
Horizontal Pod Autoscaling (HPA)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: myapp-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: myapp
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 30
- type: Pods
value: 4
periodSeconds: 30
selectPolicy: Max
Vertical Pod Autoscaling (VPA)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: myapp-vpa
spec:
targetRef:
apiVersion: "apps/v1"
kind: Deployment
name: myapp
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: '*'
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 4
memory: 8Gi
controlledResources: ["cpu", "memory"]
Cluster Autoscaling (Cloud Providers)
# AWS EKS Node Group with autoscaling
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: production-cluster
region: us-west-2
managedNodeGroups:
- name: general-purpose
instanceType: t3.xlarge
minSize: 3
maxSize: 10
desiredCapacity: 5
volumeSize: 100
ssh:
allow: false
labels:
role: general
tags:
nodegroup-role: general
iam:
withAddonPolicies:
autoScaler: true
cloudWatch: true
ebs: true
- name: gpu-nodes
instanceType: g4dn.xlarge
minSize: 0
maxSize: 5
desiredCapacity: 0
volumeSize: 200
labels:
accelerator: nvidia-tesla-t4
taints:
- key: nvidia.com/gpu
value: "true"
effect: NoSchedule
Service Mesh for Advanced Traffic Management
# Istio VirtualService for canary deployment
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: myapp
spec:
hosts:
- myapp.example.com
http:
- match:
- headers:
user-agent:
regex: ".*Mobile.*"
route:
- destination:
host: myapp
subset: v2
weight: 100
- route:
- destination:
host: myapp
subset: v1
weight: 90
- destination:
host: myapp
subset: v2
weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: myapp
spec:
host: myapp
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 50
http2MaxRequests: 100
outlierDetection:
consecutiveErrors: 5
interval: 30s
baseEjectionTime: 30s
maxEjectionPercent: 50
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
Database Scaling Patterns
# PostgreSQL with streaming replication
# (the POSTGRES_REPLICATION_* / POSTGRES_MASTER_SERVICE variables below assume a
#  Bitnami-style image; the official postgres image needs manual replication setup)
version: '3.8'
services:
postgres-primary:
image: postgres:16-alpine
environment:
POSTGRES_PASSWORD_FILE: /run/secrets/db_password
POSTGRES_REPLICATION_MODE: master
POSTGRES_REPLICATION_USER: replicator
POSTGRES_REPLICATION_PASSWORD_FILE: /run/secrets/repl_password
volumes:
- postgres-primary-data:/var/lib/postgresql/data
deploy:
placement:
constraints:
- node.labels.db.primary == true
postgres-replica1:
image: postgres:16-alpine
environment:
POSTGRES_PASSWORD_FILE: /run/secrets/db_password
POSTGRES_REPLICATION_MODE: slave
POSTGRES_MASTER_SERVICE: postgres-primary
POSTGRES_REPLICATION_USER: replicator
POSTGRES_REPLICATION_PASSWORD_FILE: /run/secrets/repl_password
volumes:
- postgres-replica1-data:/var/lib/postgresql/data
depends_on:
- postgres-primary
# Read-only connection pooler
pgbouncer:
image: pgbouncer/pgbouncer:latest
environment:
- DATABASES_HOST=postgres-primary
- DATABASES_PORT=5432
- DATABASES_DBNAME=mydb
- PGBOUNCER_POOL_MODE=transaction
- PGBOUNCER_MAX_CLIENT_CONN=1000
- PGBOUNCER_DEFAULT_POOL_SIZE=25
ports:
- "6432:6432"
CI/CD Integration & GitOps {#cicd-gitops}
GitHub Actions CI/CD Pipeline
# .github/workflows/deploy.yml
name: Build and Deploy
on:
push:
branches: [ main, develop ]
pull_request:
branches: [ main ]
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run tests
run: |
docker compose -f docker-compose.test.yml up --abort-on-container-exit
docker compose -f docker-compose.test.yml down
security-scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
      - name: Build image
        run: docker build -t ${{ env.IMAGE_NAME }}:${{ github.sha }} .
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
image-ref: ${{ env.IMAGE_NAME }}:${{ github.sha }}
format: 'sarif'
output: 'trivy-results.sarif'
severity: 'CRITICAL,HIGH'
exit-code: '1'
- name: Upload Trivy results to GitHub Security
uses: github/codeql-action/upload-sarif@v2
if: always()
with:
sarif_file: 'trivy-results.sarif'
build-and-push:
needs: [test, security-scan]
runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      id-token: write # required for Cosign keyless (OIDC) signing
steps:
- uses: actions/checkout@v4
- name: Log in to Container Registry
uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=ref,event=branch
type=ref,event=pr
type=semver,pattern={{version}}
type=semver,pattern={{major}}.{{minor}}
type=sha,prefix={{branch}}-
      - name: Build and push Docker image
        id: build # exposes outputs.digest for the signing step below
        uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
      - name: Install Cosign
        uses: sigstore/cosign-installer@v3
      - name: Sign image with Cosign
        run: |
          cosign sign --yes ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build.outputs.digest }}
deploy-staging:
needs: build-and-push
if: github.ref == 'refs/heads/develop'
runs-on: ubuntu-latest
environment: staging
steps:
- name: Deploy to staging
run: |
kubectl set image deployment/myapp \
app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:develop-${{ github.sha }} \
--namespace=staging
deploy-production:
needs: build-and-push
if: github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
environment: production
steps:
- name: Deploy to production
run: |
kubectl set image deployment/myapp \
app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
--namespace=production
- name: Wait for rollout
run: |
kubectl rollout status deployment/myapp --namespace=production --timeout=5m
- name: Run smoke tests
run: |
curl -f https://api.example.com/health || (kubectl rollout undo deployment/myapp --namespace=production && exit 1)
GitOps with ArgoCD
# argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: myapp
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/myorg/myapp-k8s-manifests
targetRevision: HEAD
path: overlays/production
kustomize:
images:
- myregistry/myapp:v1.2.3
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
allowEmpty: false
syncOptions:
- CreateNamespace=true
- PrunePropagationPolicy=foreground
- PruneLast=true
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
revisionHistoryLimit: 10
Blue-Green Deployment Strategy
# Blue-Green with Kubernetes
apiVersion: v1
kind: Service
metadata:
name: myapp
spec:
selector:
app: myapp
version: blue # Switch to 'green' for deployment
ports:
- port: 80
targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp-blue
spec:
replicas: 3
selector:
matchLabels:
app: myapp
version: blue
template:
metadata:
labels:
app: myapp
version: blue
spec:
containers:
- name: app
image: myapp:v1.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp-green
spec:
replicas: 3
selector:
matchLabels:
app: myapp
version: green
template:
metadata:
labels:
app: myapp
version: green
spec:
containers:
- name: app
image: myapp:v2.0
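With both Deployments running, the cutover is a one-line selector patch on the Service; roll back the same way by pointing the selector back at blue:
# Switch live traffic from blue to green atomically
kubectl patch service myapp \
  -p '{"spec":{"selector":{"app":"myapp","version":"green"}}}'
# Verify, then scale down blue: kubectl scale deployment myapp-blue --replicas=0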
Monitoring & Troubleshooting {#monitoring}
Prometheus Monitoring Setup
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
rule_files:
- /etc/prometheus/alerts/*.yml
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'docker'
docker_sd_configs:
- host: unix:///var/run/docker.sock
relabel_configs:
- source_labels: [__meta_docker_container_name]
target_label: container
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
Alert Rules
# alerts.yml
groups:
- name: application_alerts
interval: 30s
rules:
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
- alert: HighMemoryUsage
expr: |
(
container_memory_usage_bytes{name!=""}
/
container_spec_memory_limit_bytes{name!=""}
) > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} high memory usage"
description: "Memory usage is {{ $value | humanizePercentage }}"
- alert: PodCrashLooping
expr: |
rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
- alert: DeploymentReplicasMismatch
expr: |
kube_deployment_spec_replicas
!=
kube_deployment_status_replicas_available
for: 10m
labels:
severity: warning
annotations:
summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch"
Grafana Dashboards (Provisioned)
# grafana/dashboards/app-dashboard.json (simplified)
{
"dashboard": {
"title": "Application Metrics",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (service)"
}
]
},
{
"title": "Error Rate",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)"
}
]
},
{
"title": "Response Time (p95)",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))"
}
]
}
]
}
}
Distributed Tracing
// OpenTelemetry tracing setup
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'myapp',
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.VERSION,
environment: process.env.NODE_ENV,
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces',
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': {
enabled: false,
},
}),
],
});
sdk.start();
Debugging Containers
# Essential debugging commands
# View logs
docker logs -f <container-id>
kubectl logs -f deployment/myapp
kubectl logs -f deployment/myapp --previous # Previous container
# Execute commands in container
docker exec -it <container-id> /bin/sh
kubectl exec -it deployment/myapp -- /bin/sh
# Check resource usage
docker stats
kubectl top pods
kubectl top nodes
# Describe resources
kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'
# Debug networking
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- /bin/bash
# Inside debug pod:
# nslookup myservice
# curl myservice:8080/health
# tcpdump -i any port 8080
# Port forwarding
kubectl port-forward deployment/myapp 8080:8080
# Copy files from container
kubectl cp <pod-name>:/app/logs ./local-logs
# View cluster info
kubectl cluster-info dump
Future-Proofing Your Deployments {#future-trends}
Emerging Trends for 2025-2027
1. WebAssembly (Wasm) Containers
# Future: Wasm-based microVMs
FROM scratch
COPY --from=build /app/main.wasm /
CMD ["/main.wasm"]
2. eBPF for Observability
- Deep kernel-level insights without code changes
- Better security and network monitoring
- Tools: Cilium, Falco, Pixie (a Falco rule sketch follows this list)
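To make the model concrete, here's a classic Falco-style rule that flags interactive shells spawned inside containers (a sketch; tune the condition to your base images):
- rule: Terminal shell in container
  desc: An interactive shell was spawned inside a container
  condition: container and proc.name in (bash, sh) and proc.tty != 0
  output: "Shell in container (user=%user.name container=%container.name cmd=%proc.cmdline)"
  priority: WARNING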
3. Platform Engineering & Internal Developer Platforms
# Backstage + Kubernetes for self-service
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
name: myapp
spec:
type: service
lifecycle: production
owner: team-backend
system: core-platform
providesApis:
- myapi-v1
consumesApis:
- auth-api
- payment-api
4. Green Computing & Carbon-Aware Scheduling
# Schedule workloads based on carbon intensity
apiVersion: v1
kind: Pod
spec:
schedulerName: carbon-aware-scheduler
nodeSelector:
carbon-intensity: low
5. AI-Driven Operations (AIOps)
- Predictive scaling based on ML models (a simple Prometheus-based precursor is sketched after this list)
- Anomaly detection in metrics/logs
- Automated incident response
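Full ML-driven operations are still maturing, but simple statistical precursors work today. For example, a Prometheus alert using predict_linear as a stand-in for predictive scaling (metric name assumes node_exporter):
# Alert if the filesystem is projected to fill within 4 hours
- alert: DiskWillFillIn4Hours
  expr: predict_linear(node_filesystem_avail_bytes{job="node"}[1h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning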
Checklist for Production Readiness
- [ ] Multi-stage builds with minimal base images
- [ ] Non-root users in all containers
- [ ] Vulnerability scanning in CI/CD
- [ ] Secrets managed externally (Vault, AWS Secrets Manager)
- [ ] Health checks (liveness, readiness, startup)
- [ ] Resource requests and limits defined
- [ ] Horizontal and vertical autoscaling configured
- [ ] Monitoring and alerting set up
- [ ] Distributed tracing implemented
- [ ] Structured logging with correlation IDs
- [ ] Backup and disaster recovery plan
- [ ] Blue-green or canary deployment strategy
- [ ] Network policies defined
- [ ] Pod Security Standards enforced
- [ ] GitOps workflow established
- [ ] Documentation for runbooks and incident response
Conclusion
Production Docker deployments in 2025 require a holistic approach that goes far beyond writing Dockerfiles. Success comes from:
- Security by default: Non-root users, minimal images, vulnerability scanning
- Observability first: Metrics, logs, traces from day one
- Scale-ready architecture: Design for horizontal scaling with stateless services
- Automation everywhere: CI/CD, GitOps, auto-scaling, self-healing
- Domain-specific optimizations: Tailor your approach for fullstack, AI/ML, IoT, or robotics
The container orchestration landscape continues to evolve, but these fundamental principles remain constant. Whether you choose Kubernetes for its ecosystem or Docker Swarm for simplicity, focus on building resilient, observable, and secure systems.
Remember: The best deployment strategy is one that your team can actually maintain. Start simple, measure everything, and iterate based on real production data.
Stay curious, keep learning, and happy deploying!
Originally published at padawanabhi.de