David Nwosu

Part 10: Scaling, Failure & Operating Like a Real Company

Series: From "Just Put It on a Server" to Production DevOps

Reading time: 20 minutes

Level: Intermediate to Advanced


The Journey So Far

We've come a long way:

Part 1: Deployed manually (SSH + npm start)

Part 2: Added process management (PM2)

Part 3: Containerized with Docker

Part 4: Orchestrated locally (Docker Compose)

Part 5: Broke things to feel the pain

Part 6: Moved to Kubernetes

Part 7: Automated infrastructure (Terraform)

Part 8: Packaged apps (Helm)

Part 9: Automated deployments (Argo CD)

Now what? You're still not done.

Your CTO walks over:

"Great! Now make it handle 10x traffic, cost 50% less, survive disasters, and keep hackers out."

This is Part 10. Production reality.


Scaling: Handling Real Load

Horizontal Pod Autoscaling (HPA)

Problem: Traffic spikes at 9 AM. 3 API pods can't handle it. Users see 500 errors.

Solution: Auto-scale based on CPU/memory usage.

# infrastructure/k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60

What this does:

  • Starts with 3 replicas
  • Scales up when CPU > 70% or memory > 80%
  • Max 20 replicas (prevents runaway costs)
  • Scale up gradually (50% every 60s)
  • Scale down slowly (1 pod every 60s, wait 5 min before scaling down)
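
The controller's core math is simple enough to sketch. Here is a hedged, illustrative TypeScript version of the HPA formula (desired = ceil(current × currentMetric / targetMetric), clamped to the min/max above); the function name is mine, not a Kubernetes API:

```typescript
// Illustrative sketch of the HPA scaling formula -- not a Kubernetes API.
function desiredReplicas(
  current: number,     // current replica count
  currentUtil: number, // observed average utilization, e.g. 140 (%)
  targetUtil: number,  // target from the HPA spec, e.g. 70 (%)
  min = 3,
  max = 20,
): number {
  const desired = Math.ceil(current * (currentUtil / targetUtil));
  return Math.min(max, Math.max(min, desired));
}

// 3 pods at 140% CPU against a 70% target -> HPA asks for 6 pods
console.log(desiredReplicas(3, 140, 70)); // 6
// Already under target -> stays at minReplicas
console.log(desiredReplicas(3, 50, 70));  // 3
```

This is why the `behavior` section matters: the raw formula can double your fleet in one step, and the scale-up policy caps how fast that happens.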

Apply:

kubectl apply -f infrastructure/k8s/hpa.yaml

Watch it work:

# Generate load (note the -- separating kubectl flags from the container command)
kubectl run -it --rm load-generator --image=busybox --restart=Never -- /bin/sh
# Inside pod:
while true; do wget -q -O- http://api-service:3000/api/v1/health; done

# Watch scaling
kubectl get hpa -w

Queue-Based Worker Scaling

Problem: 10,000 events arrive in 1 minute. Worker can't keep up.

Solution: Scale workers based on queue length.

Use KEDA (Kubernetes Event-Driven Autoscaling):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
  - type: redis
    metadata:
      address: redis:6379
      listName: signal-processing-queue
      listLength: "10"

What this does:

  • Scales workers when queue has >10 items
  • 1 worker per 10 jobs
  • Max 50 workers
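
The sizing rule KEDA applies to the Redis trigger above can be sketched as plain arithmetic: one replica per `listLength` (10) queued items, clamped to the min/max counts. The function name here is illustrative, not a KEDA API:

```typescript
// Illustrative sketch of how KEDA sizes the worker deployment for the
// Redis list trigger above -- not an actual KEDA API.
function workerReplicas(queueLength: number, perReplica = 10, min = 2, max = 50): number {
  const desired = Math.ceil(queueLength / perReplica);
  return Math.min(max, Math.max(min, desired));
}

console.log(workerReplicas(0));     // 2  (never below minReplicaCount)
console.log(workerReplicas(120));   // 12
console.log(workerReplicas(10000)); // 50 (capped at maxReplicaCount)
```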

Observability: Know What's Happening

Structured Logging

Bad logging:

console.log('Event received');
console.log('Error:', err);

Good logging:

import logger from './logger';

logger.info('Event received', {
  accountId: event.accountId,
  eventType: event.eventType,
  userId: event.userId,
  timestamp: Date.now(),
});

logger.error('Database connection failed', {
  error: err.message,
  stack: err.stack,
  retries: 3,
  service: 'api',
});

Why? You can search, filter, and aggregate structured logs.
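
What does that `./logger` module look like? A dependency-free stand-in is sketched below (the real one in this series uses Winston, per the checklist at the end); the point is one JSON object per line, so Loki or Elasticsearch can index every field:

```typescript
// Minimal stand-in for the `./logger` module above (the real one is
// Winston). One JSON object per line = machine-searchable logs.
type Meta = Record<string, unknown>;

function formatLog(level: 'info' | 'warn' | 'error', message: string, meta: Meta = {}): string {
  return JSON.stringify({ level, message, timestamp: new Date().toISOString(), ...meta });
}

const logger = {
  info: (msg: string, meta?: Meta) => console.log(formatLog('info', msg, meta)),
  error: (msg: string, meta?: Meta) => console.error(formatLog('error', msg, meta)),
};

logger.info('Event received', { accountId: 'acc_123', eventType: 'signal.created' });
```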

Health Checks

Liveness probe: Is the container alive?

livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

Readiness probe: Should we send traffic?

readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2

Implement health endpoints:

// services/api/src/health.controller.ts
@Get('/health')
async liveness() {
  // Basic check: is the server running?
  return { status: 'ok', timestamp: Date.now() };
}

@Get('/health/ready')
async readiness() {
  // Deep check: can we serve traffic?
  const checks = await Promise.all([
    this.checkDatabase(),
    this.checkRedis(),
    this.checkElasticsearch(),
  ]);

  const allHealthy = checks.every(c => c.healthy);
  if (!allHealthy) {
    // Throwing maps to HTTP 503 (ServiceUnavailableException is from
    // @nestjs/common), so Kubernetes stops routing traffic to this pod.
    // Returning a plain object would always answer 200 and the probe
    // would never fail.
    throw new ServiceUnavailableException({ status: 'not ready', checks });
  }

  return { status: 'ready', checks };
}

private async checkDatabase() {
  try {
    await this.db.query('SELECT 1');
    return { service: 'database', healthy: true };
  } catch (err) {
    return { service: 'database', healthy: false, error: err.message };
  }
}
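
`checkRedis()` and `checkElasticsearch()` can follow the same shape as `checkDatabase()`, ideally wrapped in a timeout so one hung dependency can't stall the whole readiness probe. A hedged sketch (`withTimeout` and `runCheck` are illustrative helpers, not part of any framework):

```typescript
// Illustrative helpers for dependency checks -- not framework APIs.
type Check = { service: string; healthy: boolean; error?: string };

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) => {
      const t = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
      t.unref?.(); // don't keep the process alive just for the timer
    }),
  ]);
}

async function runCheck(service: string, probe: () => Promise<unknown>): Promise<Check> {
  try {
    await withTimeout(probe(), 2000); // probe budget well under periodSeconds
    return { service, healthy: true };
  } catch (err) {
    return { service, healthy: false, error: (err as Error).message };
  }
}

// Usage idea: runCheck('redis', () => this.redis.ping())
```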

Metrics with Prometheus

Expose metrics endpoint:

// services/api/src/metrics.controller.ts
import { Counter, Histogram, register } from 'prom-client';

const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
});

@Get('/metrics')
async metrics() {
  return register.metrics();
}

// Middleware to track metrics
export function metricsMiddleware(req, res, next) {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestsTotal.inc({ method: req.method, route: req.route?.path || 'unknown', status: res.statusCode });
    httpRequestDuration.observe({ method: req.method, route: req.route?.path || 'unknown' }, duration);
  });

  next();
}

Security: Defense in Depth

RBAC (Role-Based Access Control)

Don't give everyone cluster-admin.

# infrastructure/k8s/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer-role
  namespace: sspp-prod
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: sspp-prod
subjects:
- kind: User
  name: developer@company.com
roleRef:
  kind: Role
  name: developer-role
  apiGroup: rbac.authorization.k8s.io

Principle: Give minimum permissions needed. Developers can read logs, not delete production.

Network Policies

Don't let every pod talk to every pod.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-policy
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api
    - podSelector:
        matchLabels:
          app: worker
    ports:
    - protocol: TCP
      port: 5432

Result: Only API and Worker can connect to PostgreSQL. Random pods can't steal data.

Secrets Management

Don't hardcode passwords.

apiVersion: v1
kind: Secret
metadata:
  name: database-credentials
type: Opaque
stringData:
  DATABASE_URL: postgresql://user:password@postgres:5432/sspp

Use in pods:

env:
- name: DATABASE_URL
  valueFrom:
    secretKeyRef:
      name: database-credentials
      key: DATABASE_URL

Better: Use external secrets management (AWS Secrets Manager, HashiCorp Vault, Sealed Secrets).
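
Why externalize? Because Kubernetes Secrets are base64-encoded, not encrypted: anyone who can read the Secret object can recover the value (which is also why the RBAC rules above matter). The decode that `kubectl get secret database-credentials -o jsonpath='{.data.DATABASE_URL}' | base64 -d` performs is trivial, shown here with Node's Buffer:

```typescript
// base64 is encoding, not encryption -- decoding needs no key at all.
const value = 'postgresql://user:password@postgres:5432/sspp';

const encoded = Buffer.from(value).toString('base64');           // what .data holds
const decoded = Buffer.from(encoded, 'base64').toString('utf8'); // what kubectl shows

console.log(decoded === value); // true
```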

Image Security

Don't run as root.

FROM node:18-alpine

# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

# Set ownership
WORKDIR /app
COPY --chown=nodejs:nodejs package*.json ./
RUN npm ci --omit=dev
COPY --chown=nodejs:nodejs . .

# Run as non-root
USER nodejs
EXPOSE 3000
CMD ["node", "dist/main.js"]

Scan images:

# Use Trivy to scan for vulnerabilities
trivy image davidbrown77/sspp-api:latest

Cost Optimization

Right-Size Resources

Don't over-provision.

resources:
  requests:
    cpu: "100m"      # Minimum needed
    memory: "256Mi"
  limits:
    cpu: "1000m"     # Maximum allowed
    memory: "1Gi"

How to find right size:

  1. Monitor actual usage with Prometheus
  2. Set requests to P95 usage
  3. Set limits to P99 + 20% buffer
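
The three steps above reduce to a percentile calculation. A hedged sketch with inline sample data (in practice you would pull these samples from Prometheus, e.g. `container_memory_working_set_bytes`):

```typescript
// Sizing rule from the steps above: requests = P95, limits = P99 + 20%.
// Sample data is made up; real numbers come from Prometheus.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// e.g. per-pod memory samples in Mi over a week
const usageMi = [...Array(18).fill(200), 250, 400];

const requestMi = percentile(usageMi, 95);                 // P95 -> requests
const limitMi = Math.ceil(percentile(usageMi, 99) * 1.2);  // P99 + 20% -> limits

console.log({ requestMi, limitMi }); // { requestMi: 250, limitMi: 480 }
```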

Use Spot Instances

For non-critical workloads:

# infrastructure/terraform/main.tf
resource "linode_lke_cluster" "sspp" {
  label = "sspp-prod"
  k8s_version = "1.28"
  region = "us-east"

  pool {
    type  = "g6-standard-4"
    count = 3
    # Steady-state pool for the API and stateful services
  }

  pool {
    type  = "g6-standard-2"
    count = 5
    # Pool for batch/worker jobs. Note: Linode has no spot pricing;
    # on AWS or GCP, back a pool like this with spot/preemptible nodes
    # (typically 60-90% cheaper than on-demand).
  }
}

Autoscale Down

Don't run empty clusters overnight.

# Scale down non-prod environments at night
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa-dev
spec:
  minReplicas: 1  # HPA can't reach 0 without the alpha HPAScaleToZero gate
  maxReplicas: 10

To reach zero replicas outside business hours, use KEDA (which supports scale-to-zero) or a scheduled CronJob that patches replica counts. Then let Karpenter or cluster-autoscaler remove the emptied nodes.


Disaster Recovery

Database Backups

Automate backups with CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: postgres:15
            command:
            - /bin/bash
            - -c
            - |
              # Capture the timestamp once so the dump and the upload
              # reference the same file. (The image must ship both
              # pg_dump and the aws CLI.)
              TS=$(date +%Y%m%d-%H%M%S)
              pg_dump "$DATABASE_URL" | gzip > /backups/backup-$TS.sql.gz
              # Upload to S3/Object Storage
              aws s3 cp /backups/backup-$TS.sql.gz s3://sspp-backups/
            env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: database-credentials
                  key: DATABASE_URL
          restartPolicy: OnFailure

Test restores regularly:

# Every quarter, restore backup to test environment
# (plain-SQL dumps restore with psql; pg_restore only reads custom-format dumps)
gunzip -c backup-20251222.sql.gz | psql sspp_test

GitOps Disaster Recovery

Why GitOps helps:

Your entire cluster state is in Git. If cluster is destroyed:

# Recreate cluster with Terraform
cd infrastructure/terraform
terraform apply

# ArgoCD syncs everything from Git
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
argocd app create sspp --repo https://github.com/daviesbrown/sspp --path infrastructure/k8s --dest-server https://kubernetes.default.svc --dest-namespace sspp-prod
argocd app sync sspp

Recovery time: 15-20 minutes (mostly waiting for Terraform/ArgoCD).


Alerts: Know Before Users Do

Prometheus AlertManager

# infrastructure/k8s/alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sspp-alerts
spec:
  groups:
  - name: sspp
    interval: 30s
    rules:
    - alert: HighErrorRate
      expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate detected"
        description: "{{ $value | humanizePercentage }} of requests are failing"

    - alert: HighMemoryUsage
      expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Container using >90% memory"

    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 15m
      labels:
        severity: critical
      annotations:
        summary: "Pod is crash looping"

Send to Slack/PagerDuty:

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
data:
  alertmanager.yml: |
    route:
      receiver: 'slack'
    receivers:
    - name: 'slack'
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

When NOT to Scale

Scaling is not always the answer.

Optimize First

Before adding 10 more API pods:

  1. Profile the code - Is there an N+1 query?
  2. Add caching - Redis for hot data
  3. Optimize queries - Add indexes
  4. Use CDN - Offload static assets
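
Step 1 deserves a concrete picture. Below is an in-memory illustration of the N+1 pattern; the data and shapes are made up, and in a real service each lookup would be a SQL round-trip:

```typescript
// Made-up data to illustrate N+1; each "query" stands in for a SQL round-trip.
type Order = { id: number; userId: number };
type User = { id: number; name: string };

const users: User[] = [{ id: 1, name: 'Ada' }, { id: 2, name: 'Lin' }];
const orders: Order[] = [{ id: 10, userId: 1 }, { id: 11, userId: 2 }, { id: 12, userId: 1 }];

let queries = 0;
const findUser = (id: number) => { queries++; return users.find(u => u.id === id); };
const fetchUsers = (ids: number[]) => { queries++; return users.filter(u => ids.includes(u.id)); };

// N+1: one "query" per order
orders.forEach(o => findUser(o.userId));
const naiveQueries = queries;   // one per order

// Batched: one "query" for all distinct user ids, then Map lookups
queries = 0;
const ids = Array.from(new Set(orders.map(o => o.userId)));
const byId = new Map(fetchUsers(ids).map((u): [number, User] => [u.id, u]));
orders.forEach(o => byId.get(o.userId));
const batchedQueries = queries; // constant, regardless of order count

console.log({ naiveQueries, batchedQueries });
```

With 3 orders the difference is 3 queries vs 1; with 10,000 events per minute it is the difference between a saturated database and an idle one.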

Real example: Company spent $50k/month on compute. Fixed one database query. Cost dropped to $15k/month.

Know Your Limits

Your database doesn't scale horizontally as easily as your stateless services:

  • Postgres has connection limits (~100-300)
  • Redis is single-threaded
  • Elasticsearch needs careful tuning

Solution: Connection pooling, read replicas, sharding (when necessary).
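
The connection limit interacts badly with autoscaling: every API pod opens its own pool, so HPA scale-out multiplies connections. A back-of-envelope check (numbers are illustrative; Postgres defaults to max_connections = 100):

```typescript
// Every pod brings its own pool, so scale-out multiplies connections.
function connectionsNeeded(maxPods: number, poolSizePerPod: number): number {
  return maxPods * poolSizePerPod;
}

const maxConnections = 100;               // Postgres default
const needed = connectionsNeeded(20, 10); // HPA maxReplicas = 20, pool of 10

console.log(needed > maxConnections
  ? `${needed} connections needed: add PgBouncer or shrink per-pod pools`
  : `${needed} connections fit`);
```

This is why a pooler like PgBouncer usually arrives before sharding does.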


DevOps is Decision-Making Under Uncertainty

You've learned the tools:

  • ✅ Docker, Kubernetes, Terraform, Helm, Argo CD
  • ✅ Monitoring, logging, alerting
  • ✅ Security, scaling, cost optimization

But tools don't make decisions. You do.

Production questions you'll face:

  • Should we scale now or optimize code first?
  • Is 99.9% uptime enough, or do we need 99.99%?
  • Do we need multi-region, or is one region acceptable?
  • Should we use managed PostgreSQL or run our own?
  • Is this alert worth waking someone up at 3 AM?

There are no perfect answers. Only tradeoffs.

Your job: Make informed tradeoffs based on:

  • Business requirements
  • Team size
  • Budget constraints
  • Risk tolerance

The Complete Architecture

What we built:

┌─────────────────────────────────────────────────────────┐
│                     USER REQUEST                         │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
              ┌──────────────┐
              │ Load Balancer│  (Linode NodeBalancer)
              │   (HTTPS)    │
              └──────┬───────┘
                     │
      ┌──────────────┼──────────────┐
      │              │              │
      ▼              ▼              ▼
 ┌────────┐    ┌────────┐    ┌────────┐
 │ API Pod│    │ API Pod│    │ API Pod│  (Horizontal Pod Autoscaler)
 └────┬───┘    └────┬───┘    └────┬───┘
      │             │             │
      └──────────┬──┴─────────────┘
                 │
        ┌────────┴────────┐
        │                 │
        ▼                 ▼
   ┌─────────┐       ┌─────────┐
   │  Redis  │       │Postgres │  (StatefulSet)
   │ (Queue) │       │   (DB)  │
   └────┬────┘       └─────────┘
        │
        ▼
   ┌──────────┐
   │  Worker  │  (KEDA Autoscaler)
   │   Pods   │
   └────┬─────┘
        │
        ├──────────────┬───────────────┐
        ▼              ▼               ▼
   ┌─────────┐   ┌──────────┐   ┌──────────────┐
   │Postgres │   │  Redis   │   │Elasticsearch │
   └─────────┘   └──────────┘   └──────────────┘

────────────────────────────────────────────────────────
              OBSERVABILITY & CONTROL
────────────────────────────────────────────────────────
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  Prometheus  │  │    Grafana   │  │  Argo CD     │
│  (Metrics)   │  │ (Dashboards) │  │  (GitOps)    │
└──────────────┘  └──────────────┘  └──────────────┘

────────────────────────────────────────────────────────
              INFRASTRUCTURE LAYER
────────────────────────────────────────────────────────
┌──────────────────────────────────────────────────────┐
│         Kubernetes Cluster (Linode LKE)               │
│  • 3-10 nodes (autoscaling)                          │
│  • Multiple availability zones                        │
│  • Managed control plane                             │
└──────────────────────────────────────────────────────┘
                     │
                     ▼
            ┌────────────────┐
            │   Terraform    │  (Infrastructure as Code)
            └────────────────┘

The Journey Recap

Part 1: Started with SSH and npm start (pain: everything dies)

Part 2: Added PM2 (pain: still manual, environment drift)

Part 3: Containerized with Docker (pain: local only)

Part 4: Local orchestration with Docker Compose (pain: no auto-recovery)

Part 5: Broke things intentionally (pain: realized we need orchestration)

Part 6: Kubernetes fundamentals (pain: too much YAML, manual clicks)

Part 7: Terraform for infrastructure (pain: still manual deploys)

Part 8: Helm for packaging (pain: no auto-sync)

Part 9: Argo CD for GitOps (pain: no visibility, security, cost control)

Part 10: Production hardening (scaling, observability, security, cost)

You're now running production SaaS infrastructure.


Try It Yourself

Complete production checklist:

  • [ ] Horizontal Pod Autoscaling configured
  • [ ] Resource limits on all containers
  • [ ] Health checks (liveness + readiness)
  • [ ] Structured logging with Winston
  • [ ] Metrics endpoint exposed
  • [ ] Prometheus + Grafana dashboards
  • [ ] Alerts configured (Slack/PagerDuty)
  • [ ] RBAC roles defined
  • [ ] Network policies applied
  • [ ] Secrets externalized
  • [ ] Database backups automated
  • [ ] Disaster recovery tested
  • [ ] Cost monitoring enabled
  • [ ] Security scanning in CI/CD
  • [ ] Documentation up to date

If you can check all boxes, you're production-ready.


What's Next?

You've completed the journey. You now understand:

  • Why tools exist (not just how to use them)
  • How to think about tradeoffs
  • What "production-ready" really means
  • How to operate, not just deploy

Where to go from here:

  1. Multi-region deployments - Latency, disaster recovery
  2. Service mesh - Istio, Linkerd for advanced networking
  3. Chaos engineering - Intentional failure testing
  4. FinOps - Advanced cost optimization
  5. Compliance - SOC 2, HIPAA, GDPR for SaaS

But you have the foundation. Everything else builds on what you've learned.


Final Thoughts

DevOps is not about tools. It's about:

  • Understanding problems before solutions
  • Making systems reliable, not perfect
  • Balancing speed, cost, and safety
  • Operating with empathy (for users, teammates, on-call engineers)

You started with SSH and npm start.

You ended with GitOps-powered Kubernetes.

That's production DevOps.


Previous: Part 9: GitOps with Argo CD

About the Author

I built this 10-part series to demonstrate real DevOps thinking for my Proton.ai application. Every tool was introduced only after experiencing the pain it solves.

This is how production SaaS is built.

If you're hiring for DevOps/Platform roles and want someone who understands infrastructure (not just follows tutorials), let's talk.


Thank you for reading.

If this series helped you, please:

  • ⭐ Star the GitHub repository
  • Share with others learning DevOps
  • Open issues with questions or feedback
  • Consider hiring me if you need this expertise on your team

From manual deployment to production-grade GitOps. You made it. 🚀
