Drawing on 15+ years of experience deploying containerized systems at scale across fullstack, AI/ML, IoT, and robotics domains
After architecting containerized deployments for everything from high-frequency trading platforms to autonomous robot fleets, I've learned that production Docker deployments require far more than just writing a Dockerfile. This comprehensive guide distills hard-won lessons from real-world deployments into actionable strategies for 2025 and beyond.
Table of Contents
- Modern Multi-Stage Build Patterns
- Security-First Container Design
- Health Checks and Self-Healing
- Environment Configuration & Secrets
- Production Logging & Observability
- Orchestration: Kubernetes vs Docker Swarm
- Domain-Specific Deployments
- Scaling Architecture Patterns
- CI/CD Integration & GitOps
- Monitoring & Troubleshooting
- Future-Proofing Your Deployments
Modern Multi-Stage Build Patterns {#modern-multi-stage-builds}
Multi-stage builds are no longer optional—they're fundamental to production deployments. Here's why and how to use them effectively:
The Problems Multi-Stage Builds Solve
- Image Bloat: Development dependencies shouldn't ship to production
- Attack Surface: Build tools are unnecessary security risks in runtime
- Reproducibility: Separate build from runtime for consistent deploys
Production-Ready Multi-Stage Pattern
# ========================================
# Stage 1: Build Environment
# ========================================
FROM node:20-alpine AS builder
# Install build dependencies only
RUN apk add --no-cache python3 make g++
WORKDIR /build
# Layer caching optimization: Copy dependency files first
COPY package*.json ./
COPY yarn.lock* ./
# Install ALL dependencies (including devDependencies)
RUN npm ci
# Copy source code
COPY . .
# Build application
RUN npm run build && \
npm prune --production
# ========================================
# Stage 2: Production Runtime
# ========================================
FROM node:20-alpine
# Security: Create non-root user
RUN addgroup -g 1001 -S nodejs && \
adduser -S nodejs -u 1001
# Install only runtime dependencies
RUN apk add --no-cache dumb-init
WORKDIR /app
# Copy only production artifacts
COPY --from=builder --chown=nodejs:nodejs /build/dist ./dist
COPY --from=builder --chown=nodejs:nodejs /build/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /build/package.json ./
# Switch to non-root user
USER nodejs
# Use dumb-init for proper signal handling
ENTRYPOINT ["dumb-init", "--"]
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
CMD node -e "require('http').get('http://localhost:3000/health', (r) => { process.exit(r.statusCode === 200 ? 0 : 1) })"
EXPOSE 3000
CMD ["node", "dist/index.js"]
Advanced Multi-Stage Techniques
For Python/ML Applications:
# Build stage with full conda environment
FROM continuumio/miniconda3:latest AS builder
WORKDIR /build
COPY environment.yml .
RUN conda env create -f environment.yml && \
conda clean -afy
# Production stage with minimal runtime
FROM python:3.11-slim
COPY --from=builder /opt/conda/envs/myenv /opt/conda/envs/myenv
ENV PATH="/opt/conda/envs/myenv/bin:$PATH"
WORKDIR /app
COPY . .
CMD ["python", "app.py"]
Key Lessons:
- Always use specific version tags, never latest
- Order layers by change frequency (dependencies before code)
- Use .dockerignore aggressively (node_modules, .git, tests, etc.); a sample follows this list
- Consider distroless or scratch images for maximum security
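A starting-point .dockerignore for a Node.js project (adjust the entries to your repo layout):
# .dockerignore
node_modules
npm-debug.log
.git
.env*
coverage
tests
Dockerfile
docker-compose*.yml
README.md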
Security-First Container Design {#security-first-design}
Security must be baked in from the start. Here's my battle-tested security stack:
1. Base Image Selection & Scanning
# Use Trivy for vulnerability scanning
trivy image --severity HIGH,CRITICAL myapp:latest
# Use Grype for additional coverage
grype myapp:latest
# Integrate into CI/CD
docker build -t myapp:${CI_COMMIT_SHA} .
trivy image --exit-code 1 --severity CRITICAL myapp:${CI_COMMIT_SHA}
Tool Selection (2025):
- Trivy: Best open-source scanner, fast, comprehensive (OS packages + app dependencies)
- Grype: Excellent SBOM-driven scanning
- Snyk: Enterprise choice with fix suggestions and CI/CD integrations
- Docker Scout: Native Docker integration, real-time insights
2. Non-Root User Pattern
# WRONG - Running as root
FROM ubuntu:22.04
COPY app /app
CMD ["/app/server"]
# CORRECT - Non-root with proper permissions
FROM ubuntu:22.04
RUN groupadd -r appuser && \
useradd -r -g appuser -u 1001 appuser && \
mkdir /app && \
chown -R appuser:appuser /app
COPY --chown=appuser:appuser app /app
USER appuser
WORKDIR /app
CMD ["./server"]
3. Read-Only Root Filesystem
# docker-compose.yml
services:
api:
image: myapp:latest
read_only: true
tmpfs:
- /tmp:noexec,nosuid,size=100m
volumes:
- ./data:/app/data
security_opt:
- no-new-privileges:true
cap_drop:
- ALL
cap_add:
- NET_BIND_SERVICE
4. Secrets Management
NEVER do this:
# WRONG!
ENV DB_PASSWORD=mysecretpassword
ENV API_KEY=abc123
Production Pattern:
# Using Docker Swarm secrets
version: '3.8'
services:
app:
image: myapp:latest
environment:
- NODE_ENV=production
- DATABASE_URL_FILE=/run/secrets/db_url
secrets:
- db_url
- api_key
deploy:
replicas: 3
secrets:
db_url:
external: true
api_key:
external: true
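Note that the _FILE convention isn't resolved by Docker itself; your application has to read the file. A minimal Node.js helper, sketched as an illustration (the readSecret name and fallback order are my own convention, not a Docker API):
// secrets.js - resolve a value from <NAME>_FILE, falling back to a plain env var
const fs = require('fs');

function readSecret(name, fallback = undefined) {
  // e.g. DATABASE_URL_FILE=/run/secrets/db_url, as set in the compose file above
  const filePath = process.env[`${name}_FILE`];
  if (filePath) {
    // Swarm/K8s mount secrets as files; trim the trailing newline
    return fs.readFileSync(filePath, 'utf8').trim();
  }
  // Fall back to a plain environment variable (local development only)
  return process.env[name] ?? fallback;
}

module.exports = { readSecret };
// Usage: const databaseUrl = readSecret('DATABASE_URL');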
For Kubernetes:
apiVersion: v1
kind: Secret
metadata:
name: app-secrets
type: Opaque
stringData:
database-url: "postgresql://..."
api-key: "..."
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
spec:
template:
spec:
containers:
- name: app
envFrom:
- secretRef:
name: app-secrets
Enterprise Pattern: Use External Secret Managers
# Using External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
name: vault-backend
spec:
provider:
vault:
server: "https://vault.company.com"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "app-role" # required: Vault role bound to the workload's service account
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: app-secrets
spec:
secretStoreRef:
name: vault-backend
target:
name: app-secrets
data:
- secretKey: database-url
remoteRef:
key: secret/data/app/database
property: url
5. Image Signing & Verification
# Sign images with Cosign (2025 standard)
cosign sign --key cosign.key myregistry/myapp:v1.0
# Verify before deployment
cosign verify --key cosign.pub myregistry/myapp:v1.0
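If you'd rather not manage long-lived keys, Sigstore keyless signing is worth considering; a sketch assuming GitHub Actions as the OIDC identity provider (adjust the identity regexp to your org):
# Keyless signing: a short-lived certificate is issued against your CI's OIDC identity
cosign sign --yes myregistry/myapp:v1.0
# Verification pins the expected identity and issuer
cosign verify \
  --certificate-identity-regexp "https://github.com/myorg/.*" \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  myregistry/myapp:v1.0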
Health Checks and Self-Healing {#health-checks}
Proper health checks are the difference between 99.9% and 99.99% uptime.
Dockerfile Health Checks
# Basic HTTP health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1
# Advanced health check with dependencies
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD curl -f http://localhost:8080/health/ready || exit 1
Application-Level Health Endpoints
// Express.js health check pattern
// (assumes db and redis clients plus checkExternalServices() are defined elsewhere)
const express = require('express');
const app = express();
let isReady = false; // flip to true once async initialization completes
// Liveness: Is the application running?
app.get('/health/live', (req, res) => {
res.status(200).json({ status: 'alive', timestamp: Date.now() });
});
// Readiness: Is the application ready to serve traffic?
app.get('/health/ready', async (req, res) => {
try {
// Check database connection
await db.ping();
// Check Redis connection
await redis.ping();
// Check external API dependencies
await checkExternalServices();
res.status(200).json({
status: 'ready',
timestamp: Date.now(),
dependencies: { db: 'ok', cache: 'ok', apis: 'ok' }
});
} catch (error) {
res.status(503).json({
status: 'not ready',
error: error.message,
timestamp: Date.now()
});
}
});
// Startup: Has initialization completed?
app.get('/health/startup', (req, res) => {
if (isReady) {
res.status(200).json({ status: 'started' });
} else {
res.status(503).json({ status: 'starting' });
}
});
Kubernetes Probes (Production Pattern)
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
spec:
replicas: 3
template:
spec:
containers:
- name: app
image: myapp:v1.0
ports:
- containerPort: 8080
# Startup probe: Gives app time to initialize
startupProbe:
httpGet:
path: /health/startup
port: 8080
failureThreshold: 30
periodSeconds: 10
# Liveness probe: Restart if unhealthy
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 0
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
# Readiness probe: Remove from service if not ready
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
successThreshold: 1
failureThreshold: 3
Critical Insight: Separate liveness from readiness. Liveness failures restart pods; readiness failures just remove them from load balancers. A dependency failure should affect readiness, not liveness.
Environment Configuration & Secrets {#configuration-management}
Configuration management makes or breaks production deployments. Here's the hierarchy I use, ordered from highest to lowest precedence (a small loader sketch follows the list):
Configuration Hierarchy
1. Secrets (never in code or config files)
2. Environment variables (deployment-specific)
3. Config files (mounted as volumes)
4. Application defaults (in code)
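A minimal loader that applies this order, sketched in Node.js (the file locations and loadConfig name are illustrative, not a standard API):
// config.js - resolve settings in order: secret file > env var > config file > default
const fs = require('fs');

function loadConfig() {
  // 4. Application defaults (in code)
  const defaults = { port: 8080, logLevel: 'info' };

  // 3. Config file (mounted as a volume), if present
  try {
    Object.assign(defaults, JSON.parse(fs.readFileSync('/app/config/production.json', 'utf8')));
  } catch { /* no config file mounted; keep defaults */ }

  return {
    port: Number(process.env.PORT) || defaults.port,           // 2. env var
    logLevel: process.env.LOG_LEVEL || defaults.logLevel,      // 2. env var
    // 1. Secret (mounted file), never baked into the image
    databaseUrl: fs.existsSync('/run/secrets/db_url')
      ? fs.readFileSync('/run/secrets/db_url', 'utf8').trim()
      : process.env.DATABASE_URL,
  };
}

module.exports = { loadConfig };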
Docker Compose Production Pattern
version: '3.8'
services:
api:
image: ${REGISTRY}/myapp:${VERSION}
environment:
- NODE_ENV=production
- LOG_LEVEL=${LOG_LEVEL:-info}
- DATABASE_URL=${DATABASE_URL}
env_file:
- .env.production
secrets:
- db_password
- jwt_secret
configs:
- source: app_config
target: /app/config/production.yml
deploy:
replicas: 3
restart_policy:
condition: on-failure
delay: 5s
max_attempts: 3
window: 120s
update_config:
parallelism: 1
delay: 10s
failure_action: rollback
monitor: 30s
resources:
limits:
cpus: '2'
memory: 2G
reservations:
cpus: '1'
memory: 1G
secrets:
db_password:
external: true
jwt_secret:
external: true
configs:
app_config:
file: ./config/production.yml
Kubernetes ConfigMap + Secret Pattern
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
data:
app.yml: |
server:
port: 8080
timeout: 30s
features:
newFeature: true
logging:
level: info
---
apiVersion: apps/v1
kind: Deployment
spec:
template:
spec:
containers:
- name: app
volumeMounts:
- name: config
mountPath: /app/config
readOnly: true
- name: secrets
mountPath: /app/secrets
readOnly: true
volumes:
- name: config
configMap:
name: app-config
- name: secrets
secret:
secretName: app-secrets
Production Logging & Observability {#logging-observability}
Logging is not optional. Here's my production stack:
Structured Logging Pattern
// Winston configuration for production
const winston = require('winston');
const { ElasticsearchTransport } = require('winston-elasticsearch');
const uuid = require('uuid'); // used by the correlation middleware below
const logger = winston.createLogger({
level: process.env.LOG_LEVEL || 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.errors({ stack: true }),
winston.format.json()
),
defaultMeta: {
service: 'myapp',
version: process.env.VERSION,
environment: process.env.NODE_ENV
},
transports: [
// Console for Docker logs
new winston.transports.Console({
format: winston.format.combine(
winston.format.colorize(),
winston.format.simple()
)
}),
// Elasticsearch for centralized logging
new ElasticsearchTransport({
level: 'info',
clientOpts: {
node: process.env.ELASTICSEARCH_URL,
auth: {
username: process.env.ES_USER,
password: process.env.ES_PASSWORD
}
}
})
],
exceptionHandlers: [
new winston.transports.File({ filename: 'exceptions.log' })
],
rejectionHandlers: [
new winston.transports.File({ filename: 'rejections.log' })
]
});
// Request correlation middleware
app.use((req, res, next) => {
req.id = req.headers['x-request-id'] || uuid.v4();
req.logger = logger.child({ requestId: req.id });
next();
});
Docker Logging Configuration
# docker-compose.yml
services:
api:
image: myapp:latest
logging:
driver: "json-file"
options:
max-size: "10m"
max-file: "3"
labels: "service,environment"
labels:
service: "api"
environment: "production"
Production Observability Stack (2025)
version: '3.8'
services:
# Application
myapp:
image: myapp:latest
environment:
- OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
- OTEL_SERVICE_NAME=myapp
- OTEL_RESOURCE_ATTRIBUTES=environment=production,version=${VERSION}
depends_on:
- otel-collector
# OpenTelemetry Collector
otel-collector:
image: otel/opentelemetry-collector-contrib:latest
command: ["--config=/etc/otel-collector-config.yml"]
volumes:
- ./otel-collector-config.yml:/etc/otel-collector-config.yml
ports:
- "4317:4317" # OTLP gRPC receiver
- "4318:4318" # OTLP HTTP receiver
# Prometheus (Metrics)
prometheus:
image: prom/prometheus:latest
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
ports:
- "9090:9090"
# Grafana (Visualization)
grafana:
image: grafana/grafana:latest
environment:
- GF_SECURITY_ADMIN_PASSWORD=secret
volumes:
- grafana-data:/var/lib/grafana
- ./grafana/dashboards:/etc/grafana/provisioning/dashboards
- ./grafana/datasources:/etc/grafana/provisioning/datasources
ports:
- "3000:3000"
# Loki (Logs)
loki:
image: grafana/loki:latest
ports:
- "3100:3100"
volumes:
- ./loki-config.yml:/etc/loki/local-config.yml
- loki-data:/loki
# Tempo (Traces)
tempo:
image: grafana/tempo:latest
command: [ "-config.file=/etc/tempo.yml" ]
volumes:
- ./tempo.yml:/etc/tempo.yml
- tempo-data:/tmp/tempo
# Jaeger (Alternative distributed tracing)
jaeger:
image: jaegertracing/all-in-one:latest
environment:
- COLLECTOR_OTLP_ENABLED=true
ports:
- "16686:16686" # Jaeger UI
- "14268:14268" # Collector HTTP
- "4317:4317" # OTLP gRPC
volumes:
prometheus-data:
grafana-data:
loki-data:
tempo-data:
Application Instrumentation
// OpenTelemetry instrumentation
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces',
}),
metricReader: new PrometheusExporter({
port: 9464,
}),
serviceName: 'myapp',
});
sdk.start();
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('Tracing terminated'))
.catch((error) => console.log('Error terminating tracing', error))
.finally(() => process.exit(0));
});
Orchestration: Kubernetes vs Docker Swarm {#orchestration-choice}
The eternal question. Here's my decision framework after deploying both in production:
Decision Matrix
| Factor | Kubernetes | Docker Swarm |
|---|---|---|
| Team Size | 5+ engineers | 2-4 engineers |
| Complexity | High (steep learning curve) | Low (Docker-native) |
| Ecosystem | Massive (70%+ market share) | Limited but stable |
| Multi-cloud | Excellent | Limited |
| Resource Overhead | Higher | Lower |
| Advanced Features | StatefulSets, Jobs, CronJobs, Custom Resources | Basic orchestration |
| Community Support | Extensive | Limited |
| Best For | Large-scale, complex deployments | Small-medium deployments |
When to Choose Kubernetes
- Scale: Running 50+ services or 100+ containers
- Multi-cloud: Deploying across AWS, GCP, Azure
- Advanced patterns: Need service mesh, GitOps, custom operators
- Team expertise: Engineers familiar with K8s
- Ecosystem: Need Helm charts, operators, CNCF tools
When to Choose Docker Swarm
- Simplicity: Small team, straightforward deployment
- Docker-native: Already using Docker Compose
- Resource-constrained: Edge deployments, small clusters
- Quick deployment: Need to ship fast without K8s complexity
- Learning curve: Team new to orchestration
Docker Swarm Production Setup
# Initialize swarm
docker swarm init --advertise-addr <MANAGER-IP>
# Add workers
docker swarm join --token <WORKER-TOKEN> <MANAGER-IP>:2377
# Deploy stack
docker stack deploy -c docker-compose.yml myapp
# Scale service
docker service scale myapp_api=5
# Rolling update
docker service update --image myapp:v2 myapp_api
# Monitor
docker service ls
docker service ps myapp_api
Kubernetes Production Setup (K3s for Edge/IoT)
# Install K3s (lightweight K8s)
curl -sfL https://get.k3s.io | sh -
# Deploy application
kubectl apply -f deployment.yml
# Scale
kubectl scale deployment myapp --replicas=5
# Rolling update
kubectl set image deployment/myapp app=myapp:v2
# Monitor
kubectl get pods
kubectl top pods
kubectl logs -f deployment/myapp
Hybrid Approach: K3s/K8s at Edge, K8s in Cloud
# Edge K3s cluster (resource-constrained)
apiVersion: v1
kind: Namespace
metadata:
name: edge-production
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: edge-processor
namespace: edge-production
spec:
replicas: 2
template:
spec:
containers:
- name: processor
image: myapp:edge
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "200m"
nodeSelector:
node-role.kubernetes.io/edge: "true"
tolerations:
- key: "node-role.kubernetes.io/edge"
operator: "Exists"
effect: "NoSchedule"
Domain-Specific Deployments {#domain-specific}
Fullstack Applications {#fullstack}
Frontend + Backend + Database Pattern
version: '3.8'
services:
# Frontend (React/Next.js)
frontend:
build:
context: ./frontend
dockerfile: Dockerfile.prod
ports:
- "80:80"
- "443:443"
depends_on:
- backend
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
- certbot-certs:/etc/letsencrypt
- certbot-webroot:/var/www/certbot
deploy:
replicas: 2
resources:
limits:
cpus: '0.5'
memory: 256M
# Backend (Node.js/Python/Go)
backend:
image: ${REGISTRY}/backend:${VERSION}
environment:
- NODE_ENV=production
- DATABASE_URL=postgresql://postgres:5432/mydb
- REDIS_URL=redis://redis:6379
depends_on:
- db
- redis
deploy:
replicas: 3
resources:
limits:
cpus: '1'
memory: 1G
restart_policy:
condition: on-failure
# Database (PostgreSQL)
db:
image: postgres:16-alpine
environment:
POSTGRES_PASSWORD_FILE: /run/secrets/db_password
volumes:
- postgres-data:/var/lib/postgresql/data
- ./init.sql:/docker-entrypoint-initdb.d/init.sql
secrets:
- db_password
deploy:
placement:
constraints:
- node.labels.db == true
# Cache (Redis)
redis:
image: redis:7-alpine
command: redis-server --appendonly yes
volumes:
- redis-data:/data
# Background Jobs (Celery/Bull)
worker:
image: ${REGISTRY}/backend:${VERSION}
command: celery -A app.celery worker --loglevel=info
depends_on:
- redis
- db
deploy:
replicas: 2
volumes:
postgres-data:
redis-data:
certbot-certs:
certbot-webroot:
secrets:
db_password:
external: true
Nginx Configuration for Production
# nginx.conf
upstream backend {
    least_conn;
    # A single entry is enough here: Docker's internal DNS/VIP already
    # load-balances across all backend replicas
    server backend:8080 max_fails=3 fail_timeout=30s;
}
server {
listen 80;
server_name example.com www.example.com;
# Redirect HTTP to HTTPS
return 301 https://$host$request_uri;
}
server {
listen 443 ssl http2;
server_name example.com www.example.com;
ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;
# Modern SSL configuration
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
ssl_prefer_server_ciphers on;
# Security headers
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
# Static files
location /static {
alias /usr/share/nginx/html/static;
expires 1y;
add_header Cache-Control "public, immutable";
}
# API proxy
location /api {
proxy_pass http://backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_connect_timeout 30s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;
}
# SPA fallback
location / {
root /usr/share/nginx/html;
try_files $uri $uri/ /index.html;
}
}
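One gap worth calling out: the compose file above mounts certbot volumes, but something still has to obtain and renew the certificates, and the HTTP server block needs an ACME challenge location for webroot validation. A sketch using the official certbot image (volume names match the compose file, though they may carry your compose project prefix):
# Renew existing certificates (initial issuance uses "certonly" with the same volumes)
docker run --rm \
  -v certbot-certs:/etc/letsencrypt \
  -v certbot-webroot:/var/www/certbot \
  certbot/certbot renew --webroot -w /var/www/certbot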
AI/ML Model Serving {#ai-ml}
GPU-Accelerated ML Deployment
# Dockerfile for PyTorch/TensorFlow with GPU
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
# Install Python and dependencies
RUN apt-get update && apt-get install -y \
python3.11 \
python3-pip \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Install ML frameworks
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
# Copy model and application
COPY models/ ./models/
COPY app.py .
# Non-root user
RUN useradd -m -u 1001 mluser && \
chown -R mluser:mluser /app
USER mluser
# Expose API
EXPOSE 8000
# Run with Gunicorn + Uvicorn workers
CMD ["gunicorn", "app:app", \
"--workers", "4", \
"--worker-class", "uvicorn.workers.UvicornWorker", \
"--bind", "0.0.0.0:8000", \
"--timeout", "120"]
Kubernetes ML Deployment with GPU
apiVersion: apps/v1
kind: Deployment
metadata:
name: ml-inference
spec:
replicas: 2
template:
spec:
containers:
- name: model-server
image: myregistry/ml-model:v1.0
ports:
- containerPort: 8000
resources:
requests:
memory: "4Gi"
cpu: "2"
nvidia.com/gpu: 1
limits:
memory: "8Gi"
cpu: "4"
nvidia.com/gpu: 1
env:
- name: MODEL_PATH
value: "/models/my-model"
- name: BATCH_SIZE
value: "32"
volumeMounts:
- name: models
mountPath: /models
readOnly: true
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 30
readinessProbe:
httpGet:
path: /ready
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
volumes:
- name: models
persistentVolumeClaim:
claimName: model-storage
nodeSelector:
accelerator: nvidia-tesla-t4
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
FastAPI ML Serving Pattern
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch
import numpy as np
from typing import List
import logging
app = FastAPI()
# Load model at startup
model = None
@app.on_event("startup")
async def load_model():
global model
model = torch.load('/models/my-model.pth')
model.eval()
logging.info("Model loaded successfully")
class PredictionRequest(BaseModel):
data: List[List[float]]
class PredictionResponse(BaseModel):
predictions: List[float]
confidence: List[float]
@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
try:
input_tensor = torch.tensor(request.data, dtype=torch.float32)
with torch.no_grad():
output = model(input_tensor)
predictions = output.argmax(dim=1).tolist()
confidence = torch.softmax(output, dim=1).max(dim=1).values.tolist()
return PredictionResponse(
predictions=predictions,
confidence=confidence
)
except Exception as e:
logging.error(f"Prediction error: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/health")
async def health():
return {"status": "healthy", "model_loaded": model is not None}
@app.get("/metrics")
async def metrics():
# Prometheus metrics endpoint
return {"requests_total": 1000, "avg_latency_ms": 45}
MLOps Pipeline with Model Registry
version: '3.8'
services:
# MLflow for experiment tracking
mlflow:
image: ghcr.io/mlflow/mlflow:latest
command: mlflow server --host 0.0.0.0 --backend-store-uri postgresql://mlflow:password@db:5432/mlflow --default-artifact-root s3://mlflow-artifacts
ports:
- "5000:5000"
environment:
- AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
- AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    depends_on:
      - db # assumes a "db" Postgres service defined elsewhere in this stack
# Model serving
model-server:
image: myregistry/ml-model:${MODEL_VERSION}
environment:
- MLFLOW_TRACKING_URI=http://mlflow:5000
- MODEL_NAME=my-production-model
- MODEL_STAGE=Production
depends_on:
- mlflow
    deploy:
      replicas: 3
      resources:
        reservations:
          devices:
            # Compose expresses GPUs as device reservations;
            # "nvidia.com/gpu: 1" under limits is Kubernetes syntax, not Compose
            - driver: nvidia
              count: 1
              capabilities: [gpu]
IoT & Edge Computing {#iot-edge}
Edge Deployment with K3s
# Dockerfile for ARM64 edge devices
FROM arm64v8/python:3.11-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
libgpiod2 \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Run with resource constraints
CMD ["python3", "edge_processor.py"]
IoT Stack with MQTT
version: '3.8'
services:
# MQTT Broker (Eclipse Mosquitto)
mqtt:
image: eclipse-mosquitto:2
ports:
- "1883:1883"
- "9001:9001"
volumes:
- ./mosquitto.conf:/mosquitto/config/mosquitto.conf
- mosquitto-data:/mosquitto/data
- mosquitto-logs:/mosquitto/log
# IoT Gateway
gateway:
image: myregistry/iot-gateway:latest
environment:
- MQTT_BROKER=mqtt://mqtt:1883
- DEVICE_ID=${DEVICE_ID}
- CLOUD_ENDPOINT=${CLOUD_ENDPOINT}
depends_on:
- mqtt
devices:
- "/dev/ttyUSB0:/dev/ttyUSB0"
privileged: true
deploy:
resources:
limits:
cpus: '0.5'
memory: 256M
# Edge Analytics
analytics:
image: myregistry/edge-analytics:latest
environment:
- MQTT_BROKER=mqtt://mqtt:1883
- INFLUXDB_URL=http://influxdb:8086
depends_on:
- mqtt
- influxdb
# Time-series Database
influxdb:
image: influxdb:2.7-alpine
ports:
- "8086:8086"
volumes:
- influxdb-data:/var/lib/influxdb2
    environment:
      # influxdb:2.x is configured via DOCKER_INFLUXDB_INIT_* (the 1.x INFLUXDB_* vars are ignored)
      - DOCKER_INFLUXDB_INIT_MODE=setup
      - DOCKER_INFLUXDB_INIT_USERNAME=admin
      - DOCKER_INFLUXDB_INIT_PASSWORD=changeme-use-a-secret
      - DOCKER_INFLUXDB_INIT_ORG=iot
      - DOCKER_INFLUXDB_INIT_BUCKET=iot_data
# Grafana for visualization
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
volumes:
- grafana-data:/var/lib/grafana
depends_on:
- influxdb
volumes:
mosquitto-data:
mosquitto-logs:
influxdb-data:
grafana-data:
Edge Computing Best Practices
# edge_processor.py - Optimized for resource-constrained devices
import paho.mqtt.client as mqtt
import json
import logging
from collections import deque
import time

class EdgeProcessor:
    def __init__(self, broker_host='mqtt', broker_port=1883):
        self.mqtt_client = mqtt.Client()
        self.mqtt_client.connect(broker_host, broker_port)
        self.buffer = deque(maxlen=1000)  # Circular buffer caps memory use
        self.batch_size = 100
        self.last_upload = time.time()

    def is_valid(self, data):
        # Drop malformed readings at the edge instead of shipping them upstream
        return isinstance(data, dict) and 'timestamp' in data and 'value' in data

    def detect_anomaly(self, value, threshold=3.0):
        # Stand-in for lightweight edge inference (e.g., a z-score or tiny model)
        return abs(value) > threshold

    def process_sensor_data(self, data):
        # Edge processing: filter noise, aggregate, compress
        if self.is_valid(data):
            self.buffer.append(self.preprocess(data))
        # Batch upload to cloud
        if len(self.buffer) >= self.batch_size or \
                time.time() - self.last_upload > 300:  # 5 min
            self.upload_batch()

    def preprocess(self, data):
        # Run lightweight inference on edge
        return {
            'timestamp': data['timestamp'],
            'value': data['value'],
            'anomaly': self.detect_anomaly(data['value'])
        }

    def upload_batch(self):
        if self.buffer:
            batch = list(self.buffer)
            self.mqtt_client.publish('cloud/data', json.dumps(batch))
            logging.info("Uploaded batch of %d readings", len(batch))
            self.buffer.clear()
            self.last_upload = time.time()
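Wiring it up takes only a few more lines; the topic names here are hypothetical:
# Hypothetical wiring: subscribe to local sensor topics and feed the processor
processor = EdgeProcessor()

def on_message(client, userdata, msg):
    processor.process_sensor_data(json.loads(msg.payload))

processor.mqtt_client.on_message = on_message
processor.mqtt_client.subscribe('sensors/#')
processor.mqtt_client.loop_forever()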
Robotics Systems (ROS/ROS2) {#robotics}
ROS2 Docker Deployment
# Dockerfile for ROS2 Humble
FROM ros:humble-ros-base-jammy
# Install dependencies
RUN apt-get update && apt-get install -y \
ros-humble-navigation2 \
ros-humble-slam-toolbox \
ros-humble-robot-localization \
python3-colcon-common-extensions \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /ros2_ws
# Copy workspace
COPY src/ src/
# Build ROS2 workspace
RUN . /opt/ros/humble/setup.sh && \
colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release
# Setup entrypoint
COPY ./ros_entrypoint.sh /
RUN chmod +x /ros_entrypoint.sh
ENTRYPOINT ["/ros_entrypoint.sh"]
CMD ["ros2", "launch", "my_robot", "robot.launch.py"]
Multi-Robot Fleet Management
version: '3.8'
services:
# ROS Master / Discovery Server
ros2-discovery:
image: ros:humble
    command: bash -c "ros2 daemon start && sleep infinity" # keep PID 1 alive after the daemon forks
network_mode: host
environment:
- ROS_DOMAIN_ID=0
# Robot 1
robot1:
image: myregistry/robot:v1.0
environment:
- ROBOT_ID=robot1
- ROS_DOMAIN_ID=0
- ROBOT_NAMESPACE=/robot1
devices:
- /dev/video0:/dev/video0
- /dev/ttyACM0:/dev/ttyACM0
privileged: true
network_mode: host
# Robot 2
robot2:
image: myregistry/robot:v1.0
environment:
- ROBOT_ID=robot2
- ROS_DOMAIN_ID=0
- ROBOT_NAMESPACE=/robot2
devices:
- /dev/video1:/dev/video1
- /dev/ttyACM1:/dev/ttyACM1
privileged: true
network_mode: host
# Fleet Manager
fleet-manager:
image: myregistry/fleet-manager:latest
ports:
- "8080:8080"
environment:
- ROS_DOMAIN_ID=0
network_mode: host
depends_on:
- ros2-discovery
# Visualization (RViz)
rviz:
image: myregistry/robot:v1.0
command: ros2 run rviz2 rviz2
environment:
- DISPLAY=$DISPLAY
- ROS_DOMAIN_ID=0
volumes:
- /tmp/.X11-unix:/tmp/.X11-unix:rw
network_mode: host
Scaling Architecture Patterns {#scaling-patterns}
Horizontal Pod Autoscaling (HPA)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: myapp-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: myapp
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 30
- type: Pods
value: 4
periodSeconds: 30
selectPolicy: Max
Vertical Pod Autoscaling (VPA)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: myapp-vpa
spec:
targetRef:
apiVersion: "apps/v1"
kind: Deployment
name: myapp
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: '*'
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 4
memory: 8Gi
controlledResources: ["cpu", "memory"]
Cluster Autoscaling (Cloud Providers)
# AWS EKS Node Group with autoscaling
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: production-cluster
region: us-west-2
managedNodeGroups:
- name: general-purpose
instanceType: t3.xlarge
minSize: 3
maxSize: 10
desiredCapacity: 5
volumeSize: 100
ssh:
allow: false
labels:
role: general
tags:
nodegroup-role: general
iam:
withAddonPolicies:
autoScaler: true
cloudWatch: true
ebs: true
- name: gpu-nodes
instanceType: g4dn.xlarge
minSize: 0
maxSize: 5
desiredCapacity: 0
volumeSize: 200
labels:
accelerator: nvidia-tesla-t4
taints:
- key: nvidia.com/gpu
value: "true"
effect: NoSchedule
Service Mesh for Advanced Traffic Management
# Istio VirtualService for canary deployment
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: myapp
spec:
hosts:
- myapp.example.com
http:
- match:
- headers:
user-agent:
regex: ".*Mobile.*"
route:
- destination:
host: myapp
subset: v2
weight: 100
- route:
- destination:
host: myapp
subset: v1
weight: 90
- destination:
host: myapp
subset: v2
weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: myapp
spec:
host: myapp
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 50
http2MaxRequests: 100
outlierDetection:
consecutiveErrors: 5
interval: 30s
baseEjectionTime: 30s
maxEjectionPercent: 50
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
Database Scaling Patterns
# PostgreSQL with streaming replication
# (the POSTGRES_REPLICATION_* / POSTGRES_MASTER_SERVICE variables below assume a
#  Bitnami-style image; the official postgres image needs manual replication setup)
version: '3.8'
services:
postgres-primary:
image: postgres:16-alpine
environment:
POSTGRES_PASSWORD_FILE: /run/secrets/db_password
POSTGRES_REPLICATION_MODE: master
POSTGRES_REPLICATION_USER: replicator
POSTGRES_REPLICATION_PASSWORD_FILE: /run/secrets/repl_password
volumes:
- postgres-primary-data:/var/lib/postgresql/data
deploy:
placement:
constraints:
- node.labels.db.primary == true
postgres-replica1:
image: postgres:16-alpine
environment:
POSTGRES_PASSWORD_FILE: /run/secrets/db_password
POSTGRES_REPLICATION_MODE: slave
POSTGRES_MASTER_SERVICE: postgres-primary
POSTGRES_REPLICATION_USER: replicator
POSTGRES_REPLICATION_PASSWORD_FILE: /run/secrets/repl_password
volumes:
- postgres-replica1-data:/var/lib/postgresql/data
depends_on:
- postgres-primary
# Read-only connection pooler
pgbouncer:
image: pgbouncer/pgbouncer:latest
environment:
- DATABASES_HOST=postgres-primary
- DATABASES_PORT=5432
- DATABASES_DBNAME=mydb
- PGBOUNCER_POOL_MODE=transaction
- PGBOUNCER_MAX_CLIENT_CONN=1000
- PGBOUNCER_DEFAULT_POOL_SIZE=25
ports:
- "6432:6432"
CI/CD Integration & GitOps {#cicd-gitops}
GitHub Actions CI/CD Pipeline
# .github/workflows/deploy.yml
name: Build and Deploy
on:
push:
branches: [ main, develop ]
pull_request:
branches: [ main ]
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run tests
run: |
docker compose -f docker-compose.test.yml up --abort-on-container-exit
docker compose -f docker-compose.test.yml down
security-scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
      - name: Build image
        run: docker build -t ${{ env.IMAGE_NAME }}:${{ github.sha }} .
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
image-ref: ${{ env.IMAGE_NAME }}:${{ github.sha }}
format: 'sarif'
output: 'trivy-results.sarif'
severity: 'CRITICAL,HIGH'
exit-code: '1'
- name: Upload Trivy results to GitHub Security
uses: github/codeql-action/upload-sarif@v2
if: always()
with:
sarif_file: 'trivy-results.sarif'
build-and-push:
needs: [test, security-scan]
runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      id-token: write # required for Cosign keyless (OIDC) signing
steps:
- uses: actions/checkout@v4
- name: Log in to Container Registry
uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=ref,event=branch
type=ref,event=pr
type=semver,pattern={{version}}
type=semver,pattern={{major}}.{{minor}}
type=sha,prefix={{branch}}-
      - name: Build and push Docker image
        id: build # exposes outputs.digest for the signing step below
        uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
      - name: Install Cosign
        uses: sigstore/cosign-installer@v3
      - name: Sign image with Cosign
        run: |
          cosign sign --yes ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build.outputs.digest }}
deploy-staging:
needs: build-and-push
if: github.ref == 'refs/heads/develop'
runs-on: ubuntu-latest
environment: staging
steps:
- name: Deploy to staging
run: |
kubectl set image deployment/myapp \
app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:develop-${{ github.sha }} \
--namespace=staging
deploy-production:
needs: build-and-push
if: github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
environment: production
steps:
- name: Deploy to production
run: |
kubectl set image deployment/myapp \
app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
--namespace=production
- name: Wait for rollout
run: |
kubectl rollout status deployment/myapp --namespace=production --timeout=5m
- name: Run smoke tests
run: |
curl -f https://api.example.com/health || (kubectl rollout undo deployment/myapp --namespace=production && exit 1)
GitOps with ArgoCD
# argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: myapp
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/myorg/myapp-k8s-manifests
targetRevision: HEAD
path: overlays/production
kustomize:
images:
- myregistry/myapp:v1.2.3
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
allowEmpty: false
syncOptions:
- CreateNamespace=true
- PrunePropagationPolicy=foreground
- PruneLast=true
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
revisionHistoryLimit: 10
Blue-Green Deployment Strategy
# Blue-Green with Kubernetes
apiVersion: v1
kind: Service
metadata:
name: myapp
spec:
selector:
app: myapp
version: blue # Switch to 'green' for deployment
ports:
- port: 80
targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp-blue
spec:
replicas: 3
selector:
matchLabels:
app: myapp
version: blue
template:
metadata:
labels:
app: myapp
version: blue
spec:
containers:
- name: app
image: myapp:v1.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp-green
spec:
replicas: 3
selector:
matchLabels:
app: myapp
version: green
template:
metadata:
labels:
app: myapp
version: green
spec:
containers:
- name: app
image: myapp:v2.0
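With both Deployments running, the cutover is a one-line selector patch on the Service; roll back the same way by pointing the selector back at blue:
# Switch live traffic from blue to green atomically
kubectl patch service myapp \
  -p '{"spec":{"selector":{"app":"myapp","version":"green"}}}'
# Verify, then scale down blue: kubectl scale deployment myapp-blue --replicas=0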
Monitoring & Troubleshooting {#monitoring}
Prometheus Monitoring Setup
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
rule_files:
- /etc/prometheus/alerts/*.yml
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'docker'
docker_sd_configs:
- host: unix:///var/run/docker.sock
relabel_configs:
- source_labels: [__meta_docker_container_name]
target_label: container
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
Alert Rules
# alerts.yml
groups:
- name: application_alerts
interval: 30s
rules:
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
- alert: HighMemoryUsage
expr: |
(
container_memory_usage_bytes{name!=""}
/
container_spec_memory_limit_bytes{name!=""}
) > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} high memory usage"
description: "Memory usage is {{ $value | humanizePercentage }}"
- alert: PodCrashLooping
expr: |
rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
- alert: DeploymentReplicasMismatch
expr: |
kube_deployment_spec_replicas
!=
kube_deployment_status_replicas_available
for: 10m
labels:
severity: warning
annotations:
summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch"
Grafana Dashboards (Provisioned)
# grafana/dashboards/app-dashboard.json (simplified)
{
"dashboard": {
"title": "Application Metrics",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (service)"
}
]
},
{
"title": "Error Rate",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)"
}
]
},
{
"title": "Response Time (p95)",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))"
}
]
}
]
}
}
Distributed Tracing
// OpenTelemetry tracing setup
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'myapp',
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.VERSION,
environment: process.env.NODE_ENV,
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces',
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': {
enabled: false,
},
}),
],
});
sdk.start();
Debugging Containers
# Essential debugging commands
# View logs
docker logs -f <container-id>
kubectl logs -f deployment/myapp
kubectl logs -f deployment/myapp --previous # Previous container
# Execute commands in container
docker exec -it <container-id> /bin/sh
kubectl exec -it deployment/myapp -- /bin/sh
# Check resource usage
docker stats
kubectl top pods
kubectl top nodes
# Describe resources
kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'
# Debug networking
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- /bin/bash
# Inside debug pod:
# nslookup myservice
# curl myservice:8080/health
# tcpdump -i any port 8080
# Port forwarding
kubectl port-forward deployment/myapp 8080:8080
# Copy files from container
kubectl cp <pod-name>:/app/logs ./local-logs
# View cluster info
kubectl cluster-info dump
Future-Proofing Your Deployments {#future-trends}
Emerging Trends for 2025-2027
1. WebAssembly (Wasm) Containers
# Future: Wasm-based microVMs
FROM scratch
COPY --from=build /app/main.wasm /
CMD ["/main.wasm"]
2. eBPF for Observability
- Deep kernel-level insights without code changes
- Better security and network monitoring
- Tools: Cilium, Falco, Pixie (a Falco rule sketch follows this list)
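To make the model concrete, here's a classic Falco-style rule that flags interactive shells spawned inside containers (a sketch; tune the condition to your base images):
- rule: Terminal shell in container
  desc: An interactive shell was spawned inside a container
  condition: container and proc.name in (bash, sh) and proc.tty != 0
  output: "Shell in container (user=%user.name container=%container.name cmd=%proc.cmdline)"
  priority: WARNING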
3. Platform Engineering & Internal Developer Platforms
# Backstage + Kubernetes for self-service
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
name: myapp
spec:
type: service
lifecycle: production
owner: team-backend
system: core-platform
providesApis:
- myapi-v1
consumesApis:
- auth-api
- payment-api
4. Green Computing & Carbon-Aware Scheduling
# Schedule workloads based on carbon intensity
apiVersion: v1
kind: Pod
spec:
schedulerName: carbon-aware-scheduler
nodeSelector:
carbon-intensity: low
5. AI-Driven Operations (AIOps)
- Predictive scaling based on ML models (a simple Prometheus-based precursor is sketched after this list)
- Anomaly detection in metrics/logs
- Automated incident response
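Full ML-driven operations are still maturing, but simple statistical precursors work today. For example, a Prometheus alert using predict_linear as a stand-in for predictive scaling (metric name assumes node_exporter):
# Alert if the filesystem is projected to fill within 4 hours
- alert: DiskWillFillIn4Hours
  expr: predict_linear(node_filesystem_avail_bytes{job="node"}[1h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning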
Checklist for Production Readiness
- [ ] Multi-stage builds with minimal base images
- [ ] Non-root users in all containers
- [ ] Vulnerability scanning in CI/CD
- [ ] Secrets managed externally (Vault, AWS Secrets Manager)
- [ ] Health checks (liveness, readiness, startup)
- [ ] Resource requests and limits defined
- [ ] Horizontal and vertical autoscaling configured
- [ ] Monitoring and alerting set up
- [ ] Distributed tracing implemented
- [ ] Structured logging with correlation IDs
- [ ] Backup and disaster recovery plan
- [ ] Blue-green or canary deployment strategy
- [ ] Network policies defined
- [ ] Pod Security Standards enforced
- [ ] GitOps workflow established
- [ ] Documentation for runbooks and incident response
Conclusion
Production Docker deployments in 2025 require a holistic approach that goes far beyond writing Dockerfiles. Success comes from:
- Security by default: Non-root users, minimal images, vulnerability scanning
- Observability first: Metrics, logs, traces from day one
- Scale-ready architecture: Design for horizontal scaling with stateless services
- Automation everywhere: CI/CD, GitOps, auto-scaling, self-healing
- Domain-specific optimizations: Tailor your approach for fullstack, AI/ML, IoT, or robotics
The container orchestration landscape continues to evolve, but these fundamental principles remain constant. Whether you choose Kubernetes for its ecosystem or Docker Swarm for simplicity, focus on building resilient, observable, and secure systems.
Remember: The best deployment strategy is one that your team can actually maintain. Start simple, measure everything, and iterate based on real production data.
Stay curious, keep learning, and happy deploying!
Originally published at padawanabhi.de