Series: From "Just Put It on a Server" to Production DevOps
Reading time: 20 minutes
Level: Intermediate to Advanced
The Journey So Far
We've come a long way:
Part 1: Deployed manually (SSH + npm start)
Part 2: Added process management (PM2)
Part 3: Containerized with Docker
Part 4: Orchestrated locally (Docker Compose)
Part 5: Broke things to feel the pain
Part 6: Moved to Kubernetes
Part 7: Automated infrastructure (Terraform)
Part 8: Packaged apps (Helm)
Part 9: Automated deployments (Argo CD)
Now what? You're still not done.
Your CTO walks over:
"Great! Now make it handle 10x traffic, cost 50% less, survive disasters, and keep hackers out."
This is Part 10. Production reality.
Scaling: Handling Real Load
Horizontal Pod Autoscaling (HPA)
Problem: Traffic spikes at 9 AM. 3 API pods can't handle it. Users see 500 errors.
Solution: Auto-scale based on CPU/memory usage.
# infrastructure/k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Pods
value: 1
periodSeconds: 60
What this does:
- Starts with 3 replicas
- Scales up when average CPU > 70% or average memory > 80% across pods
- Max 20 replicas (prevents runaway costs)
- Scale up gradually (50% every 60s)
- Scale down slowly (1 pod every 60s, wait 5 min before scaling down)
Apply:
kubectl apply -f infrastructure/k8s/hpa.yaml
Watch it work:
# Generate load
kubectl run -it --rm load-generator --image=busybox --restart=Never -- /bin/sh
# Inside pod:
while true; do wget -q -O- http://api-service:3000/api/v1/health; done
# Watch scaling
kubectl get hpa -w
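One prerequisite: the HPA reads CPU and memory from the Kubernetes metrics API, so metrics-server must be running. If kubectl top pods comes back empty, install it first:

# metrics-server supplies the utilization numbers the HPA acts on
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl top pods  # should show live CPU/memory once it's up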
Queue-Based Worker Scaling
Problem: 10,000 events arrive in 1 minute. Worker can't keep up.
Solution: Scale workers based on queue length.
Use KEDA (Kubernetes Event-Driven Autoscaling):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: worker-scaler
spec:
scaleTargetRef:
name: worker
minReplicaCount: 2
maxReplicaCount: 50
triggers:
- type: redis
metadata:
address: redis:6379
listName: signal-processing-queue
listLength: "10"
What this does:
- Scales workers when queue has >10 items
- Targets roughly one worker per 10 queued items (replicas = ceil(queue length / 10))
- Max 50 workers
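The ScaledObject does nothing until the KEDA operator itself is installed; the project's Helm chart is the usual route:

# Install the KEDA operator from the official chart
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace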
Observability: Know What's Happening
Structured Logging
Bad logging:
console.log('Event received');
console.log('Error:', err);
Good logging:
import logger from './logger';
logger.info('Event received', {
accountId: event.accountId,
eventType: event.eventType,
userId: event.userId,
timestamp: Date.now(),
});
logger.error('Database connection failed', {
error: err.message,
stack: err.stack,
retries: 3,
service: 'api',
});
Why? You can search, filter, and aggregate structured logs.
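The ./logger module imported above isn't shown here, so as a minimal sketch: a Winston logger (the library named in the checklist at the end of this post) that emits JSON to stdout, where the cluster's log collector picks it up:

// services/api/src/logger.ts — minimal sketch, assuming Winston
import { createLogger, format, transports } from 'winston';

const logger = createLogger({
  level: process.env.LOG_LEVEL || 'info',
  // JSON lines so a log pipeline (Loki, Elasticsearch, CloudWatch) can index fields
  format: format.combine(format.timestamp(), format.json()),
  defaultMeta: { service: process.env.SERVICE_NAME || 'api' },
  transports: [new transports.Console()],
});

export default logger;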
Health Checks
Liveness probe: Is the container alive?
livenessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 3
Readiness probe: Should we send traffic?
readinessProbe:
httpGet:
path: /health/ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 2
Implement health endpoints:
// services/api/src/health.controller.ts
@Get('/health')
async liveness() {
// Basic check: is the server running?
return { status: 'ok', timestamp: Date.now() };
}
@Get('/health/ready')
async readiness(@Res() res: Response) {
  // Deep check: can we serve traffic?
  const checks = await Promise.all([
    this.checkDatabase(),
    this.checkRedis(),
    this.checkElasticsearch(),
  ]);
  const allHealthy = checks.every(c => c.healthy);
  // Return 503 so the readiness probe actually fails when a dependency is down
  res.status(allHealthy ? 200 : 503).json({ status: allHealthy ? 'ready' : 'not ready', checks });
}
private async checkDatabase() {
try {
await this.db.query('SELECT 1');
return { service: 'database', healthy: true };
} catch (err) {
return { service: 'database', healthy: false, error: err.message };
}
}
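checkRedis and checkElasticsearch follow the same shape as checkDatabase. A sketch of the Redis one, assuming an ioredis client injected as this.redis:

private async checkRedis() {
  try {
    await this.redis.ping(); // resolves with 'PONG' when Redis is reachable
    return { service: 'redis', healthy: true };
  } catch (err) {
    return { service: 'redis', healthy: false, error: err.message };
  }
}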
Metrics with Prometheus
Expose metrics endpoint:
// services/api/src/metrics.controller.ts
import { Counter, Histogram, register } from 'prom-client';
const httpRequestsTotal = new Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'route', 'status'],
});
const httpRequestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration',
labelNames: ['method', 'route'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
});
@Get('/metrics')
@Header('Content-Type', register.contentType)  // serve Prometheus's text exposition format
async metrics() {
  return register.metrics(); // resolves to the full registry dump
}
// Middleware to track metrics
export function metricsMiddleware(req, res, next) {
const start = Date.now();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
httpRequestsTotal.inc({ method: req.method, route: req.route?.path || 'unknown', status: res.statusCode });
httpRequestDuration.observe({ method: req.method, route: req.route?.path || 'unknown' }, duration);
});
next();
}
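Exposing /metrics is only half the job — Prometheus still has to scrape it. With the Prometheus Operator, a ServiceMonitor wires that up. A sketch, assuming the API Service carries the label app: api and names its 3000 port http:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-metrics
spec:
  selector:
    matchLabels:
      app: api     # assumed label on the api Service
  endpoints:
    - port: http   # assumed name of the Service port exposing :3000
      path: /metrics
      interval: 30s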
Security: Defense in Depth
RBAC (Role-Based Access Control)
Don't give everyone cluster-admin.
# infrastructure/k8s/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: developer-role
namespace: sspp-prod
rules:
- apiGroups: [""]
resources: ["pods", "pods/log"]
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: developer-binding
namespace: sspp-prod
subjects:
  - kind: User
    name: developer@company.com
    apiGroup: rbac.authorization.k8s.io
roleRef:
kind: Role
name: developer-role
apiGroup: rbac.authorization.k8s.io
Principle: Give minimum permissions needed. Developers can read logs, not delete production.
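Verify the binding before anyone hits a wall, using impersonation:

# Confirm the role grants exactly what you intend
kubectl auth can-i get pods -n sspp-prod --as developer@company.com            # yes
kubectl auth can-i delete deployments -n sspp-prod --as developer@company.com  # no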
Network Policies
Don't let every pod talk to every pod.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: postgres-policy
spec:
podSelector:
matchLabels:
app: postgres
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: api
- podSelector:
matchLabels:
app: worker
ports:
- protocol: TCP
port: 5432
Result: Only API and Worker can connect to PostgreSQL. Random pods can't steal data.
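One caveat: a NetworkPolicy only constrains the pods it selects — everything else stays wide open. To make "deny by default, allow explicitly" the namespace-wide rule, pair it with a default-deny policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}  # empty selector = every pod in the namespace
  policyTypes:
    - Ingress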
Secrets Management
Don't hardcode passwords.
apiVersion: v1
kind: Secret
metadata:
name: database-credentials
type: Opaque
stringData:
DATABASE_URL: postgresql://user:password@postgres:5432/sspp
Use in pods:
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: database-credentials
key: DATABASE_URL
Better: Use external secrets management (AWS Secrets Manager, HashiCorp Vault, Sealed Secrets).
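As a sketch of what "externalized" looks like with the External Secrets Operator — the store name and remote key below are illustrative, not from this repo:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets           # assumed ClusterSecretStore backed by AWS Secrets Manager
    kind: ClusterSecretStore
  target:
    name: database-credentials  # the Kubernetes Secret it materializes
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/sspp/database-url  # illustrative path in the external store

The password never touches Git; the operator syncs it into the cluster and refreshes it on the interval.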
Image Security
Don't run as root.
FROM node:18-alpine
# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
adduser -S nodejs -u 1001
# Set ownership
WORKDIR /app
COPY --chown=nodejs:nodejs package*.json ./
RUN npm ci --omit=dev
COPY --chown=nodejs:nodejs . .
# Run as non-root
USER nodejs
EXPOSE 3000
CMD ["node", "dist/main.js"]
Scan images:
# Use Trivy to scan for vulnerabilities
trivy image davidbrown77/sspp-api:latest
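In CI you'd make the scan a gate rather than a report:

# Fail the pipeline on HIGH/CRITICAL findings instead of just printing them
trivy image --severity HIGH,CRITICAL --exit-code 1 davidbrown77/sspp-api:latest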
Cost Optimization
Right-Size Resources
Don't over-provision.
resources:
requests:
cpu: "100m" # Minimum needed
memory: "256Mi"
limits:
cpu: "1000m" # Maximum allowed
memory: "1Gi"
How to find right size:
- Monitor actual usage with Prometheus (see the query sketch below)
- Set requests to P95 usage
- Set limits to P99 + 20% buffer
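For example, a PromQL sketch for the api container's P95 CPU over the past week, assuming the standard cAdvisor metrics the kubelet exposes:

# 95th percentile of 5m-averaged CPU usage (in cores) over 7 days
quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{container="api"}[5m])[7d:5m]
)

Feed the result into requests; repeat with container_memory_usage_bytes for memory.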
Use Spot Instances
For non-critical, interruption-tolerant workloads (on clouds that offer them):
# infrastructure/terraform/main.tf
resource "linode_lke_cluster" "sspp" {
label = "sspp-prod"
k8s_version = "1.28"
region = "us-east"
pool {
type = "g6-standard-4"
count = 3
    # Primary pool for API workloads
  }
  pool {
    type  = "g6-standard-2"
    count = 5
    # Cheaper pool for batch jobs. Note: Linode doesn't sell spot capacity;
    # on AWS/GCP/Azure you'd mark a pool like this as spot/preemptible,
    # typically 60-80% cheaper than on-demand for interruptible work
}
}
Autoscale Down
Don't run empty clusters overnight.
# Scale down non-prod environments at night
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa-dev
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  # Stock Kubernetes enforces minReplicas >= 1; scaling to zero through
  # the HPA requires the HPAScaleToZero feature gate
  minReplicas: 0   # scale to zero outside business hours
  maxReplicas: 10
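Feature gates aside, a simpler and widely used pattern for dev clusters is a CronJob that scales the Deployment directly. A sketch — the scaler ServiceAccount and its RBAC binding are assumed, not shown:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-dev
spec:
  schedule: "0 20 * * 1-5"  # 8 PM on weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler  # assumed: bound to a Role allowing deployments/scale
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command: ["kubectl", "scale", "deployment/api", "--replicas=0", "-n", "sspp-dev"]
          restartPolicy: OnFailure

A mirror-image CronJob scales back up before the workday starts.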
Use Karpenter or cluster-autoscaler to remove unused nodes.
Disaster Recovery
Database Backups
Automate backups with CronJob:
apiVersion: batch/v1
kind: CronJob
metadata:
name: postgres-backup
spec:
schedule: "0 2 * * *" # 2 AM daily
jobTemplate:
spec:
template:
spec:
containers:
- name: backup
image: postgres:15
            command:
              - /bin/bash
              - -c
              - |
                # Capture the filename once so the dump and the upload match
                FILE=/backups/backup-$(date +%Y%m%d-%H%M%S).sql.gz
                pg_dump "$DATABASE_URL" | gzip > "$FILE"
                # Upload to S3/Object Storage (the stock postgres image ships
                # without the aws CLI -- bake it into a custom image)
                aws s3 cp "$FILE" s3://sspp-backups/
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: database-credentials
key: DATABASE_URL
restartPolicy: OnFailure
Test restores regularly:
# Every quarter, restore a backup into a test environment.
# pg_dump's plain-SQL output is restored with psql, not pg_restore:
gunzip -c backup-20251222.sql.gz | psql -d sspp_test
GitOps Disaster Recovery
Why GitOps helps:
Your entire cluster state is in Git. If cluster is destroyed:
# Recreate cluster with Terraform
cd infrastructure/terraform
terraform apply
# ArgoCD syncs everything from Git
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
argocd app create sspp --repo https://github.com/daviesbrown/sspp --path infrastructure/k8s --dest-server https://kubernetes.default.svc --dest-namespace sspp-prod
argocd app sync sspp
Recovery time: 15-20 minutes (mostly waiting for Terraform/ArgoCD).
Alerts: Know Before Users Do
Prometheus AlertManager
# infrastructure/k8s/alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: sspp-alerts
spec:
groups:
- name: sspp
interval: 30s
rules:
      - alert: HighErrorRate
        # Error *ratio*: 5xx responses as a share of all requests
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "{{ $value | humanizePercentage }} of requests are failing"
      - alert: HighMemoryUsage
        # Only consider containers that set a limit; a zero limit would
        # otherwise divide to +Inf and fire spuriously
        expr: container_memory_usage_bytes{container!=""} / (container_spec_memory_limit_bytes{container!=""} > 0) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container using >90% of its memory limit"
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
labels:
severity: critical
annotations:
summary: "Pod is crash looping"
Send to Slack/PagerDuty:
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
data:
alertmanager.yml: |
route:
receiver: 'slack'
receivers:
- name: 'slack'
slack_configs:
        - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'  # or api_url_file pointing at a mounted Secret
channel: '#alerts'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
When NOT to Scale
Scaling is not always the answer.
Optimize First
Before adding 10 more API pods:
- Profile the code - Is there an N+1 query?
- Add caching - Redis for hot data
- Optimize queries - Add indexes
- Use CDN - Offload static assets
Real example: Company spent $50k/month on compute. Fixed one database query. Cost dropped to $15k/month.
Know Your Limits
Your database doesn't scale horizontally as easily as your stateless services:
- Postgres has connection limits (~100-300)
- Redis is single-threaded
- Elasticsearch needs careful tuning
Solution: Connection pooling, read replicas, sharding (when necessary).
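Connection pooling is the cheapest fix of the three. A minimal sketch with node-postgres — the numbers are illustrative starting points, not tuned values:

// services/api/src/db.ts — pooled Postgres access (sketch)
import { Pool } from 'pg';

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20,                        // stay well under Postgres's max_connections
  idleTimeoutMillis: 30_000,      // return idle clients to the server
  connectionTimeoutMillis: 5_000, // fail fast instead of queueing forever
});

// Each query checks a client out of the pool and returns it automatically
export const query = (text: string, params?: unknown[]) => pool.query(text, params);

Do the math against the HPA above: 20 pods at max: 20 is 400 potential connections — past Postgres defaults. That's when PgBouncer in front of the database earns its keep.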
DevOps is Decision-Making Under Uncertainty
You've learned the tools:
- ✅ Docker, Kubernetes, Terraform, Helm, Argo CD
- ✅ Monitoring, logging, alerting
- ✅ Security, scaling, cost optimization
But tools don't make decisions. You do.
Production questions you'll face:
- Should we scale now or optimize code first?
- Is 99.9% uptime enough, or do we need 99.99%?
- Do we need multi-region, or is one region acceptable?
- Should we use managed PostgreSQL or run our own?
- Is this alert worth waking someone up at 3 AM?
There are no perfect answers. Only tradeoffs.
Your job: Make informed tradeoffs based on:
- Business requirements
- Team size
- Budget constraints
- Risk tolerance
The Complete Architecture
What we built:
┌─────────────────────────────────────────────────────────┐
│ USER REQUEST │
└────────────────────┬────────────────────────────────────┘
│
▼
┌──────────────┐
│ Load Balancer│ (Linode NodeBalancer)
│ (HTTPS) │
└──────┬───────┘
│
┌──────────────┼──────────────┐
│ │ │
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│ API Pod│ │ API Pod│ │ API Pod│ (Horizontal Pod Autoscaler)
└────┬───┘ └────┬───┘ └────┬───┘
│ │ │
└──────────┬──┴─────────────┘
│
┌────────┴────────┐
│ │
▼ ▼
┌─────────┐ ┌─────────┐
│ Redis │ │Postgres │ (StatefulSet)
│ (Queue) │ │ (DB) │
└────┬────┘ └─────────┘
│
▼
┌──────────┐
│ Worker │ (KEDA Autoscaler)
│ Pods │
└────┬─────┘
│
├──────────────┬───────────────┐
▼ ▼ ▼
┌─────────┐ ┌──────────┐ ┌──────────────┐
│Postgres │ │ Redis │ │Elasticsearch │
└─────────┘ └──────────┘ └──────────────┘
────────────────────────────────────────────────────────
OBSERVABILITY & CONTROL
────────────────────────────────────────────────────────
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Prometheus │ │ Grafana │ │ Argo CD │
│ (Metrics) │ │ (Dashboards) │ │ (GitOps) │
└──────────────┘ └──────────────┘ └──────────────┘
────────────────────────────────────────────────────────
INFRASTRUCTURE LAYER
────────────────────────────────────────────────────────
┌──────────────────────────────────────────────────────┐
│ Kubernetes Cluster (Linode LKE) │
│ • 3-10 nodes (autoscaling) │
│ • Multiple availability zones │
│ • Managed control plane │
└──────────────────────────────────────────────────────┘
│
▼
┌────────────────┐
│ Terraform │ (Infrastructure as Code)
└────────────────┘
The Journey Recap
Part 1: Started with SSH and npm start (pain: everything dies)
Part 2: Added PM2 (pain: still manual, environment drift)
Part 3: Containerized with Docker (pain: local only)
Part 4: Local orchestration with Docker Compose (pain: no auto-recovery)
Part 5: Broke things intentionally (pain: realized we need orchestration)
Part 6: Kubernetes fundamentals (pain: too much YAML, manual clicks)
Part 7: Terraform for infrastructure (pain: still manual deploys)
Part 8: Helm for packaging (pain: no auto-sync)
Part 9: Argo CD for GitOps (pain: no visibility, security, cost control)
Part 10: Production hardening (scaling, observability, security, cost)
You're now running production SaaS infrastructure.
Try It Yourself
Complete production checklist:
- [ ] Horizontal Pod Autoscaling configured
- [ ] Resource limits on all containers
- [ ] Health checks (liveness + readiness)
- [ ] Structured logging with Winston
- [ ] Metrics endpoint exposed
- [ ] Prometheus + Grafana dashboards
- [ ] Alerts configured (Slack/PagerDuty)
- [ ] RBAC roles defined
- [ ] Network policies applied
- [ ] Secrets externalized
- [ ] Database backups automated
- [ ] Disaster recovery tested
- [ ] Cost monitoring enabled
- [ ] Security scanning in CI/CD
- [ ] Documentation up to date
If you can check all boxes, you're production-ready.
What's Next?
You've completed the journey. You now understand:
- Why tools exist (not just how to use them)
- How to think about tradeoffs
- What "production-ready" really means
- How to operate, not just deploy
Where to go from here:
- Multi-region deployments - Latency, disaster recovery
- Service mesh - Istio, Linkerd for advanced networking
- Chaos engineering - Intentional failure testing
- FinOps - Advanced cost optimization
- Compliance - SOC 2, HIPAA, GDPR for SaaS
But you have the foundation. Everything else builds on what you've learned.
Final Thoughts
DevOps is not about tools. It's about:
- Understanding problems before solutions
- Making systems reliable, not perfect
- Balancing speed, cost, and safety
- Operating with empathy (for users, teammates, on-call engineers)
You started with SSH and npm start.
You ended with GitOps-powered Kubernetes.
That's production DevOps.
Previous: Part 9: GitOps with Argo CD
About the Author
I built this 10-part series to demonstrate real DevOps thinking for my Proton.ai application. Every tool was introduced only after experiencing the pain it solves.
This is how production SaaS is built.
If you're hiring for DevOps/Platform roles and want someone who understands infrastructure (not just follows tutorials), let's talk.
- GitHub: @daviesbrown
- LinkedIn: David Nwosu
- Portfolio: github.com/daviesbrown/sspp
Thank you for reading.
If this series helped you, please:
- ⭐ Star the GitHub repository
- Share with others learning DevOps
- Open issues with questions or feedback
- Consider hiring me if you need this expertise on your team
From manual deployment to production-grade GitOps. You made it. 🚀