If you're still deploying code by SSHing into a server and running git pull, this guide is for you.
Modern deployment isn't magic. It's automation. Consistency. Confidence.
Let's build that.
Part 1: The Pipeline – From Code to Production
Every production application needs this flow:
Code → Build → Test → Deploy → Monitor
↓
Feedback & Rollback if needed
Each stage is automated. When something fails, you catch it before users do.
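To make "you catch it before users do" concrete, here is a toy model of a fail-fast pipeline. It is a sketch, not how any CI system is implemented: `runPipeline` and the stage functions are hypothetical stand-ins for real build, test, and deploy commands.

```javascript
// Toy fail-fast pipeline: run stages in order, stop at the first failure.
// Stage names follow the flow above; the stage functions are placeholders.
async function runPipeline(stages) {
  const completed = [];
  for (const { name, run } of stages) {
    try {
      await run();
      completed.push(name);
    } catch (err) {
      // Later stages never run, so a failing test can never reach deploy.
      return { ok: false, failedAt: name, completed, reason: err.message };
    }
  }
  return { ok: true, failedAt: null, completed, reason: null };
}

// Example: the test stage fails, so deploy is never attempted.
const exampleStages = [
  { name: 'build', run: async () => {} },
  { name: 'test', run: async () => { throw new Error('1 test failed'); } },
  { name: 'deploy', run: async () => {} },
];

runPipeline(exampleStages).then((result) => {
  console.log(result.failedAt, result.completed);
});
```

The point of the design is that deploy is unreachable unless every earlier stage resolved successfully.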
Stage 1: Code Commit (Git Workflow)
Your repository structure matters. A lot.
my-app/
├── .github/workflows/ # CI/CD pipelines
├── src/ # Source code
├── tests/ # Unit & integration tests
├── docker/ # Dockerfile & related
├── k8s/ # Kubernetes manifests
├── terraform/ # Infrastructure as code
├── README.md
└── .gitignore
Branch strategy that works:
main (production)
↑ (merge only via PR)
develop (staging)
↑ (merge feature branches)
feature/new-dashboard (your work)
feature/user-auth (teammate's work)
hotfix/critical-bug (urgent fix)
Rule: Never push to main directly. Always go through develop, create a pull request, get code review, run automated tests.
```bash
# You're working on a feature
git checkout -b feature/new-dashboard
git commit -m "feat(dashboard): add charts"
git push origin feature/new-dashboard

# Create PR on GitHub/GitLab
# Automated tests run
# Code review happens
# Merge to develop
# Automated deploy to staging
# Test on staging
# When ready, merge develop → main
# Automated deploy to production
```
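The commit message in the example, `feat(dashboard): add charts`, follows a `type(scope): subject` shape. The guide doesn't say how (or whether) that shape is enforced, so treat the following as one possible check you might run from a commit hook or a CI step. The function name and the list of allowed types are assumptions.

```javascript
// Hypothetical commit-message check for the "type(scope): subject" shape
// used in the example above. The list of allowed types is an assumption.
const ALLOWED_TYPES = ['feat', 'fix', 'docs', 'refactor', 'test', 'chore'];

function parseCommitMessage(message) {
  const firstLine = message.split('\n')[0];
  const match = /^([a-z]+)(?:\(([^)]+)\))?: (.+)$/.exec(firstLine);
  if (!match) return null;
  const [, type, scope = null, subject] = match;
  if (!ALLOWED_TYPES.includes(type)) return null;
  return { type, scope, subject };
}

console.log(parseCommitMessage('feat(dashboard): add charts'));
console.log(parseCommitMessage('fixed stuff')); // does not match the shape
```

A `null` result is the signal to reject the commit or fail the check.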
Stage 2: Build (Docker)
Stop deploying Python/Node/Go directly. Use containers.
```dockerfile
# Dockerfile
FROM node:18-alpine
WORKDIR /app

# Copy package files
COPY package*.json ./

# Install dependencies
RUN npm ci

# Copy source
COPY src ./src

# Health check
HEALTHCHECK --interval=30s --timeout=5s \
  CMD node -e "require('http').get('http://localhost:3000/health', (r) => {if (r.statusCode !== 200) throw new Error(r.statusCode)})"

# Expose port
EXPOSE 3000

# Start app
CMD ["npm", "start"]
```
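The `HEALTHCHECK` above only works if the app actually answers on `/health`. The guide doesn't show that endpoint, so here is a minimal sketch using Node's built-in `http` module. Port 3000 and the `/health` path come from the Dockerfile; the JSON body and the function names are illustrative choices.

```javascript
// Minimal /health endpoint for the HEALTHCHECK above to call.
// Port 3000 and the /health path come from the Dockerfile; the response
// body and the function names are illustrative.
const http = require('http');

function handleRequest(req, res) {
  if (req.method === 'GET' && req.url === '/health') {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ status: 'ok' }));
    return;
  }
  res.writeHead(404, { 'Content-Type': 'text/plain' });
  res.end('Not found');
}

// Call startServer() from your entry point to begin serving.
function startServer(port = 3000) {
  return http.createServer(handleRequest).listen(port);
}

module.exports = { handleRequest, startServer };
```

In a real service the handler would usually also check its dependencies (database, cache) before answering 200, so that an unhealthy container is reported as such.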
Why Docker:
✅ "Works on my machine" → "Works everywhere"
✅ Locks the version of every dependency, including the runtime and OS packages
✅ Easy to scale (run 10 copies)
✅ Security isolation
✅ Simple rollback (just switch image version)
Build optimization:
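```dockerfile
# Bad: Bloated image
FROM node:18
COPY . .
RUN npm install
RUN npm run build
EXPOSE 3000
CMD ["npm", "start"]
# Result: ~500MB image

# Good: Multi-stage build
# Note: if you depend on native modules, build on the same base image as the
# final stage; binaries compiled on node:18 (glibc) may not load on Alpine (musl).
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY src ./src
RUN npm run build
# Drop devDependencies before node_modules is copied into the final image
RUN npm prune --omit=dev

FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY package.json .
EXPOSE 3000
CMD ["npm", "start"]
# Result: ~120MB image
```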
Stage 3: Test (Automated)
Your CI pipeline runs tests automatically:
```yaml
# .github/workflows/ci.yml
name: CI Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [develop]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          # User and database must match DATABASE_URL below
          POSTGRES_USER: user
          POSTGRES_PASSWORD: testpass
          POSTGRES_DB: testdb
        ports:
          - 5432:5432 # Make the service reachable on localhost
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
      - name: Install dependencies
        run: npm ci
      - name: Run linting
        run: npm run lint
      - name: Run unit tests
        run: npm run test:unit
      - name: Run integration tests
        run: npm run test:integration
        env:
          DATABASE_URL: postgres://user:testpass@localhost:5432/testdb
      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage/lcov.info

  build:
    needs: test # Only run if tests pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build Docker image
        run: |
          docker build -t myapp:${{ github.sha }} .
          docker tag myapp:${{ github.sha }} myapp:latest
      - name: Push to registry
        run: |
          # Pass the token on stdin so it never appears in the process list
          echo "${{ secrets.REGISTRY_TOKEN }}" | docker login -u "${{ secrets.REGISTRY_USER }}" --password-stdin
          docker push myapp:${{ github.sha }}
```
What this does:
Runs linting (catch style issues)
Runs unit tests (catch logic errors)
Runs integration tests with real database (catch integration issues)
Only if ALL pass, build Docker image
Push to registry
One failing test = no deploy. That's the point.
Stage 4: Deploy (Multiple Strategies)
Blue-Green Deployment (Safest)
Blue (current): v1.2.3 (users hitting this)
Green (new): v1.3.0 (being deployed)
Steps:
- Deploy v1.3.0 to green
- Run smoke tests on green
- If good: Switch traffic from blue → green
- If bad: Switch back to blue (instant rollback)
- Keep blue running for 1 hour (safety net)
Canary Deployment (Progressive)
Version 1.2.3: 95% of traffic
Version 1.3.0: 5% of traffic
Monitor:
- Error rates
- Response times
- Business metrics
If all good, shift traffic:
- 1.3.0: 25% → 50% → 100%
If problems appear, rollback immediately
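The canary loop above boils down to one decision repeated at each traffic step: promote, or roll back. A minimal sketch of that decision follows. The traffic steps come from the text; the thresholds (twice the stable error rate, 1.5x the stable p95 latency) and the function name are assumptions chosen for illustration, and business metrics are left out.

```javascript
// Hypothetical canary gate: compare the canary's metrics with the stable
// version and decide the next traffic split. The thresholds (2x error rate,
// 1.5x p95 latency) are illustrative assumptions, not recommendations.
const TRAFFIC_STEPS = [5, 25, 50, 100]; // percentages from the steps above

function nextCanaryStep(currentPercent, stable, canary) {
  const errorsOk = canary.errorRate <= Math.max(stable.errorRate * 2, 0.001);
  const latencyOk = canary.p95Ms <= stable.p95Ms * 1.5;
  if (!errorsOk || !latencyOk) {
    return { action: 'rollback', percent: 0 };
  }
  const index = TRAFFIC_STEPS.indexOf(currentPercent);
  if (index === -1) throw new Error(`Unknown traffic step: ${currentPercent}`);
  if (index === TRAFFIC_STEPS.length - 1) {
    return { action: 'done', percent: 100 };
  }
  return { action: 'promote', percent: TRAFFIC_STEPS[index + 1] };
}

const stable = { errorRate: 0.01, p95Ms: 200 };
console.log(nextCanaryStep(5, stable, { errorRate: 0.012, p95Ms: 210 }));
console.log(nextCanaryStep(25, stable, { errorRate: 0.05, p95Ms: 210 }));
```

Comparing against the stable version rather than a fixed number means the gate still works when the baseline itself is having a bad day.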
Rolling Deployment (Traditional)
Deploy gradually:
- Take 1 instance down, deploy new version
- Bring it up
- Repeat for next instance
- Users never experience full downtime
Downsides: Temporarily running mixed versions (harder to debug)
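The "mixed versions" downside is easier to see in a toy simulation. The sketch below replaces instances one at a time and records the fleet after each step; the function name, instance count, and version strings are made up for illustration.

```javascript
// Toy rolling update: replace instances one at a time and record the fleet
// after each step. Instance counts and version strings are made up.
function rollingUpdate(instances, newVersion) {
  const fleet = [...instances];
  const snapshots = [];
  for (let i = 0; i < fleet.length; i += 1) {
    fleet[i] = newVersion; // take one instance down, bring it back on the new version
    snapshots.push([...fleet]);
  }
  return snapshots;
}

const steps = rollingUpdate(['v1.2.3', 'v1.2.3', 'v1.2.3'], 'v1.3.0');
steps.forEach((fleet, i) => console.log(`step ${i + 1}:`, fleet.join(', ')));
```

Every snapshot except the last contains both versions, which is exactly the window in which a request can hit either one.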
Part 2: Container Orchestration (Kubernetes Essentials)
You don't need to be a Kubernetes expert. You need to know:
Basic Concepts
Pod = Smallest unit (like a container wrapper)
Service = How pods talk to each other + expose to outside
Deployment = How you define what you want running
ConfigMap = Configuration (not secrets)
Secret = Passwords, API keys, etc. (base64-encoded by default; encryption at rest has to be enabled separately)
Simple Kubernetes Deployment
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3 # Run 3 copies
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:latest
          ports:
            - containerPort: 3000
          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
          # Resource limits
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          # Environment variables
          env:
            - name: LOG_LEVEL
              value: "info"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: database-url
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  type: LoadBalancer # Expose to internet
  selector:
    app: myapp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 3000
```
What happens:
Kubernetes creates 3 pods running your app
If one crashes, it's replaced automatically
Load balancer distributes traffic
Rolling updates: New version gradually replaces old
Auto-scaling
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
Auto-scales from 3 to 10 pods based on CPU/memory usage.
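How does the autoscaler pick a number between 3 and 10? The Kubernetes documentation describes the core calculation as `desired = ceil(currentReplicas * currentMetric / targetMetric)`, with the largest result winning when several metrics are configured. The sketch below reproduces only that arithmetic. It deliberately ignores the tolerance band and stabilization windows the real controller applies, and the function name is hypothetical.

```javascript
// Simplified version of the replica calculation documented for the
// Kubernetes HorizontalPodAutoscaler:
//   desired = ceil(currentReplicas * currentUtilization / targetUtilization)
// With several metrics, the largest result wins. Tolerance and stabilization
// windows are deliberately left out of this sketch.
function desiredReplicas(currentReplicas, metrics, { min = 3, max = 10 } = {}) {
  const proposals = metrics.map(({ current, target }) =>
    Math.ceil(currentReplicas * (current / target))
  );
  const desired = Math.max(...proposals);
  return Math.min(max, Math.max(min, desired));
}

// 3 pods at 95% CPU (target 70%) and 60% memory (target 80%)
console.log(desiredReplicas(3, [
  { current: 95, target: 70 },
  { current: 60, target: 80 },
]));
```

In this example CPU asks for 5 pods and memory for 3, so the autoscaler would move toward 5.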
Part 3: Monitoring & Alerting
Deployed code that isn't monitored is just waiting to fail silently.
The Three Pillars of Observability
- Logs (What happened)
```javascript
// Structured logging
logger.info('User login', {
  userId: user.id,
  timestamp: new Date(),
  ipAddress: req.ip,
  duration: 245 // ms
});

// Output:
// {"level":"info","message":"User login","userId":"123","timestamp":"2024-05-20T...","ipAddress":"192.168.1.1","duration":245}
```
- Metrics (What's the state)
```javascript
// Application metrics
// Assumes the prom-client package, whose Histogram API matches this usage
const { Histogram } = require('prom-client');

const httpDuration = new Histogram({
  name: 'http_request_duration_ms',
  help: 'Duration of HTTP requests in ms',
  labelNames: ['method', 'route', 'status_code'],
  // The default buckets are sized for seconds, so set millisecond buckets
  buckets: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
});

app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    // req.route is undefined when no route matched (for example, a 404)
    const route = req.route ? req.route.path : 'unmatched';
    httpDuration.labels(req.method, route, res.statusCode).observe(duration);
  });
  next();
});
```
- Traces (How requests flow)
```javascript
// Distributed tracing
const span = tracer.startSpan('database.query');
span.setTag('query', sql);
let result;
try {
  result = await db.query(sql);
} finally {
  span.finish(); // Always close the span, even if the query throws
}

// Shows: Request → Service A → Service B → Database
// Plus: Time spent at each step
```
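If "time spent at each step" feels abstract, a toy tracer makes it tangible. The sketch below is not the API of Jaeger, Zipkin, or any other tracer. It is a hypothetical in-process recorder that keeps a name, a parent, and a duration per span. Real tracers also propagate context across service boundaries, which this leaves out.

```javascript
// Toy in-process tracer: records a name, parent, and duration for each span.
// Real tracers also propagate context between services; this sketch only
// illustrates "time spent at each step".
function createTracer(now = () => Date.now()) {
  const finished = [];
  function startSpan(name, parent = null) {
    const startedAt = now();
    return {
      name,
      finish() {
        finished.push({
          name,
          parent: parent ? parent.name : null,
          durationMs: now() - startedAt,
        });
      },
    };
  }
  return { startSpan, finished };
}

const demoTracer = createTracer();
const requestSpan = demoTracer.startSpan('GET /users');
const querySpan = demoTracer.startSpan('database.query', requestSpan);
querySpan.finish();
requestSpan.finish();
console.log(demoTracer.finished);
```

The clock is injectable (`now`) so the recorder can be tested with fixed timestamps instead of real time.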
Setting Up Alerts That Matter
```yaml
# Prometheus alert rules
groups:
  - name: app
    rules:
      - alert: HighErrorRate
        # Share of requests that returned 5xx over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
      - alert: DatabaseConnectionPoolExhausted
        expr: db_connections_active / db_connections_max > 0.9
        for: 2m
        annotations:
          summary: "Database connection pool 90% full"
      - alert: HighLatency
        # histogram_quantile needs the per-bucket rates, grouped by "le"
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_ms_bucket[5m]))) > 1000
        for: 5m
        annotations:
          summary: "p95 latency > 1 second"
```
Key principle: Alert on symptoms, not causes.
❌ Alert: "CPU > 80%"
✅ Alert: "p95 latency > 1s" (high CPU matters only if users see it)
❌ Alert: "Disk 85% full"
✅ Alert: "Disk full in 24 hours at current rate" (gives time to act)
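The "disk full in 24 hours" alert is just linear extrapolation: measure how fast usage is growing and divide the remaining space by that rate. Prometheus has `predict_linear()` for this idea. The standalone sketch below shows the arithmetic with two samples, and its function names and sample shape are assumptions.

```javascript
// Linear extrapolation behind "disk full in 24 hours at current rate".
// Takes two usage samples and estimates the hours left until the disk is
// full. Function names and the sample shape are illustrative.
function hoursUntilFull(older, newer, capacityBytes) {
  const elapsedHours = (newer.timeMs - older.timeMs) / 3_600_000;
  if (elapsedHours <= 0) throw new Error('samples must be in time order');
  const growthPerHour = (newer.usedBytes - older.usedBytes) / elapsedHours;
  if (growthPerHour <= 0) return Infinity; // not growing: never fills
  return (capacityBytes - newer.usedBytes) / growthPerHour;
}

function shouldAlert(older, newer, capacityBytes, thresholdHours = 24) {
  return hoursUntilFull(older, newer, capacityBytes) <= thresholdHours;
}

// 80 GB used two hours ago, 82 GB now, on a 100 GB disk
const twoHoursAgo = { timeMs: 0, usedBytes: 80e9 };
const rightNow = { timeMs: 2 * 3_600_000, usedBytes: 82e9 };
console.log(hoursUntilFull(twoHoursAgo, rightNow, 100e9), shouldAlert(twoHoursAgo, rightNow, 100e9));
```

A disk at 85% that isn't growing never fires this alert, while one at 60% that is filling quickly does, which is the behaviour you want.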
Part 4: Rollback & Recovery
Everything fails. What matters is how fast you recover.
Automated Rollback
```yaml
# GitHub Actions
- name: Deploy to production
  run: kubectl set image deployment/myapp myapp=myapp:v1.3.0
- name: Wait for rollout
  run: kubectl rollout status deployment/myapp --timeout=5m
- name: Run smoke tests
  run: npm run test:smoke
- name: If tests fail, rollback
  if: failure()
  run: kubectl rollout undo deployment/myapp
```
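The same control flow, outside of a CI system, can be sketched as a small function. The four steps are injected as functions so the logic can be read and tested without a cluster. In practice they would shell out to `kubectl` and the smoke test suite. All names here are hypothetical.

```javascript
// Sketch of the deploy -> wait -> verify -> rollback flow from the workflow
// above. The steps are injected so the control flow can be exercised without
// a cluster; all names are hypothetical.
async function deployWithRollback({ deploy, waitForRollout, smokeTest, rollback }) {
  try {
    await deploy();
    await waitForRollout();
    await smokeTest();
    return { status: 'deployed' };
  } catch (err) {
    // Mirrors "if: failure()": any failed step triggers the rollback.
    await rollback();
    return { status: 'rolled_back', reason: err.message };
  }
}

// Example run where the smoke tests fail.
deployWithRollback({
  deploy: async () => console.log('deploying'),
  waitForRollout: async () => console.log('waiting for rollout'),
  smokeTest: async () => { throw new Error('smoke test failed'); },
  rollback: async () => console.log('rolling back'),
}).then((result) => console.log(result.status));
```

Note that if `rollback` itself throws, the error propagates to the caller, which is the moment the runbook takes over.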
Database Rollback
```bash
# If you deployed a database migration that breaks things

# Option 1: Have a down migration (safe)
npm run migrate:down
npm run migrate:up # New fixed version

# Option 2: Restore from a snapshot (creates a new instance)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier prod-db-restored \
  --db-snapshot-identifier prod-db-2024-05-20-03-00
```
The Runbook
Create this before you need it:
```markdown
# Incident Runbook: Database is Slow

## Symptoms
- p95 latency > 5s
- Users complain app is slow
- CPU on database high

## Immediate Actions (0-5 min)
- Check if we can scale database (vertical scale)
- Find and kill long-running queries (on PostgreSQL, see pg_stat_activity and pg_terminate_backend)
- If the slow query came from a recent deploy, roll back

## Root Cause (5-30 min)
- Check recent deployments: When did this start?
- Check slow query log
- Check if query plan changed

## Resolution
- Option A: Optimize query (add index)
- Option B: Rollback problematic deploy
- Option C: Scale database

## Prevention
- Add query time monitoring
- Add alert for p95 latency
- Load test before deploy
```
Part 5: Common Mistakes (And How to Avoid Them)
Mistake 1: Deploying Changes That Are Too Big
❌ 10,000 lines of code changed in one deploy
✅ 200-500 lines per deploy
Reason: When a small deploy breaks something, only a handful of changes could have caused it.
Mistake 2: No Rollback Plan
❌ "We're committed now"
✅ Every deploy has a rollback procedure
Mistake 3: Testing Manually
❌ "We'll test in staging by hand"
✅ Automated tests run before every deploy
Mistake 4: Ignoring Logs/Metrics
❌ "The app is running, who cares about logs?"
✅ Structured logging and metrics from day 1
Mistake 5: Same Config Everywhere
❌ Production and staging use same database
✅ Separate infrastructure, separate secrets, separate configs
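One way to keep environments apart is to have the app read everything environment-specific from environment variables and refuse to start when one is missing. The sketch below assumes that approach. `DATABASE_URL` and `LOG_LEVEL` match the Kubernetes manifest earlier in this guide, while `APP_ENV`, the function name, and the example URL are assumptions.

```javascript
// Hypothetical config loader: every environment supplies its own values
// through environment variables, and the app refuses to start if a required
// one is missing. APP_ENV is an assumed variable name.
function loadConfig(env = process.env) {
  const required = ['APP_ENV', 'DATABASE_URL'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
  return Object.freeze({
    appEnv: env.APP_ENV,
    databaseUrl: env.DATABASE_URL,
    logLevel: env.LOG_LEVEL || 'info',
  });
}

// Example with an explicit environment object (values are made up).
const stagingConfig = loadConfig({
  APP_ENV: 'staging',
  DATABASE_URL: 'postgres://staging-db.internal/app',
});
console.log(stagingConfig.appEnv, stagingConfig.logLevel);
```

Failing at startup is the point: a staging pod that silently falls back to the production database is the mistake this section warns about.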
The DevOps Checklist
Before you call it "production-ready":
✅ CI/CD Pipeline: Every commit triggers tests & build
✅ Automated Tests: Unit + integration tests pass before deploy
✅ Containerized: Docker image with multi-stage build
✅ Orchestrated: Runs on Kubernetes or managed service
✅ Health Checks: Liveness & readiness probes configured
✅ Monitoring: Logs, metrics, traces all flowing
✅ Alerts: Meaningful alerts (not noise)
✅ Rollback Plan: Can recover in < 5 minutes
✅ Secrets Management: Passwords never in code
✅ Documentation: Runbooks for common issues
Tools You'll Use
CI/CD: GitHub Actions, GitLab CI, Jenkins
Containerization: Docker, Podman
Orchestration: Kubernetes, Docker Swarm, AWS ECS
Monitoring: Prometheus, Datadog, New Relic
Logging: ELK Stack, Splunk, CloudWatch
Tracing: Jaeger, Zipkin, Datadog
Next Steps
Set up CI/CD: Start with GitHub Actions (free)
Containerize your app: Write a Dockerfile
Deploy to staging: Use Docker Compose locally, Kubernetes in the cloud
Add monitoring: Start with basic metrics
Create runbooks: Document how to handle failures
Practice rollbacks: Actually execute a rollback (in staging first)
Resources
GitHub Actions Docs
Kubernetes Documentation
Docker Best Practices
Prometheus Monitoring
Master DevOps Practices
At Vector Skill Academy, we teach DevOps the way production teams do it. Automation. Consistency. Reliability.
Explore our DevOps & Deployment program