
Zero-Downtime Deployments: Blue-Green vs Canary vs Rolling Updates

Introduction

In today's always-on world, downtime is no longer acceptable. Users expect your services to be available 24/7, and even a few minutes of downtime can result in lost revenue, damaged reputation, and frustrated customers. Yet software still needs to be updated, bugs fixed, and features deployed.

This is where zero-downtime deployment strategies become critical. These patterns allow you to deploy new versions of your application while keeping your service fully available to users. In this comprehensive guide, we'll explore the three most popular zero-downtime deployment strategies—Blue-Green, Canary, and Rolling Updates—along with when to use each approach.

Why Zero-Downtime Deployments Matter

Before diving into specific strategies, let's understand the business impact of deployment downtime:

Financial Impact

For e-commerce sites, downtime directly translates to lost revenue. Amazon reportedly loses $66,000 per minute of downtime. Even for smaller businesses, unavailability during peak hours can be devastating.

User Experience

Modern users are impatient. Studies show that 40% of users abandon a website that takes more than 3 seconds to load. Imagine the abandonment rate when your site is completely down during a deployment.

Competitive Advantage

Companies that can deploy multiple times per day without downtime can iterate faster, respond to market changes quicker, and fix bugs immediately. This agility is a significant competitive advantage.

Team Morale

When deployments require scheduled maintenance windows and weekend work, they create stress and reduce quality of life for engineering teams. Zero-downtime strategies make it possible to ship during business hours with confidence.

Blue-Green Deployments

How It Works

Blue-Green deployment maintains two identical production environments called "Blue" and "Green." At any time, one environment serves production traffic while the other is idle.

The deployment process:

  1. Blue environment serves production traffic (v1.0)
  2. Deploy new version (v2.0) to idle Green environment
  3. Test thoroughly on Green environment
  4. Switch router/load balancer to point to Green
  5. Blue becomes idle, ready for the next deployment

# Example: Blue-Green deployment with AWS ECS

# Blue Task Definition (Current Production)
resource "aws_ecs_task_definition" "blue" {
  family = "myapp-blue"
  container_definitions = jsonencode([{
    name  = "app"
    image = "myapp:v1.0"
    # ... other settings
  }])
}

# Green Task Definition (New Version)
resource "aws_ecs_task_definition" "green" {
  family = "myapp-green"
  container_definitions = jsonencode([{
    name  = "app"
    image = "myapp:v2.0"
    # ... other settings
  }])
}

# Application Load Balancer Target Groups
resource "aws_lb_target_group" "blue" {
  name     = "myapp-blue-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 10
  }
}

resource "aws_lb_target_group" "green" {
  name     = "myapp-green-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 10
  }
}

# Listener rule to switch between blue and green
resource "aws_lb_listener" "main" {
  load_balancer_arn = aws_lb.main.arn
  port              = "80"
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = var.active_target_group # Switch this to toggle
  }
}
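
Because the listener's default action is driven by a variable, the cutover itself can be a plain Terraform apply. A minimal sketch, assuming the configuration above also exposes the two target group ARNs as outputs (the output names blue_tg_arn and green_tg_arn are hypothetical):

# Cut over to Green by re-pointing the listener's default action
terraform apply -var "active_target_group=$(terraform output -raw green_tg_arn)"

# Instant rollback: re-apply with the Blue target group ARN
terraform apply -var "active_target_group=$(terraform output -raw blue_tg_arn)"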

Advantages

Instant Rollback: If issues arise, simply switch back to the previous environment. Rollback takes seconds, not minutes.

Full Testing in Production Environment: You can test the new version in a production-identical environment before switching traffic.

Simple to Understand: The concept is straightforward—two environments, one active.

Database Migrations: You have time to run migrations before switching traffic.

Disadvantages

Resource Intensive: Requires 2x infrastructure resources—you're maintaining two complete production environments.

Database Challenges: If your new version requires schema changes, both environments must support both old and new schemas during the transition.

Cost: Running duplicate infrastructure can be expensive, especially for large applications.

State Management: Handling in-flight requests and session state during the switch requires careful planning.

When to Use Blue-Green

  • You have critical deployments where instant rollback is essential
  • Your infrastructure costs are manageable for 2x capacity
  • You need comprehensive testing in production before switching
  • You have relatively stateless applications
  • You deploy infrequently but need high confidence

Implementation Example with Kubernetes

# Blue Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
  labels:
    app: myapp
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: app
        image: myapp:v1.0
        ports:
        - containerPort: 8080

---
# Green Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
  labels:
    app: myapp
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: app
        image: myapp:v2.0
        ports:
        - containerPort: 8080

---
# Service - Switch between blue and green
apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  selector:
    app: myapp
    version: blue  # Change to 'green' to switch traffic
  ports:
  - port: 80
    targetPort: 8080
  type: LoadBalancer
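
The switch itself is just an update to the Service selector. A minimal sketch using kubectl, assuming the Service and labels defined above:

# Point the Service at the Green deployment (the selector change is merged in)
kubectl patch service myapp-service \
  -p '{"spec":{"selector":{"app":"myapp","version":"green"}}}'

# Instant rollback: point the selector back at Blue
kubectl patch service myapp-service \
  -p '{"spec":{"selector":{"app":"myapp","version":"blue"}}}'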

Canary Deployments

How It Works

Canary deployments gradually roll out changes to a small subset of users before making them available to everyone. The name comes from "canary in a coal mine"—the canary (small user subset) detects problems before the entire user base is affected.

The deployment process:

  1. Deploy new version to a small percentage of servers (e.g., 5%)
  2. Route a small percentage of traffic to the new version
  3. Monitor metrics (error rates, latency, user behavior)
  4. If metrics are healthy, gradually increase traffic to new version
  5. If problems detected, immediately route all traffic back to old version
  6. Eventually, 100% of traffic goes to new version

# Example: Canary deployment with Kubernetes and Flagger

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  # Deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  # Service configuration
  service:
    port: 80
    targetPort: 8080
  # Canary analysis
  analysis:
    # Schedule interval (default 60s)
    interval: 1m
    # Max number of failed checks before rollback
    threshold: 5
    # Max traffic percentage routed to canary
    maxWeight: 50
    # Canary increment step
    stepWeight: 5
    # Metrics for canary analysis
    metrics:
    - name: request-success-rate
      # Minimum req success rate (non 5xx responses)
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      # Maximum req duration P99
      thresholdRange:
        max: 500
      interval: 1m
    - name: error-rate
      # Custom metric - requires a matching MetricTemplate in the cluster
      templateRef:
        name: error-rate
      # Maximum error rate
      thresholdRange:
        max: 1
      interval: 1m
    # Webhooks for custom metrics
    webhooks:
    - name: load-test
      url: http://flagger-loadtester/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary/"
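
Flagger drives the analysis loop from that Canary resource, so progress can be watched with kubectl. A brief sketch, assuming Flagger and its CRDs are installed and the Canary above lives in the production namespace:

# Watch the canary advance through its analysis steps (weight, status, last transition)
kubectl -n production get canary myapp --watch

# Inspect events: promotions, failed checks, automatic rollbacks
kubectl -n production describe canary myapp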

Advantages

Risk Mitigation: Problems affect only a small percentage of users, limiting blast radius.

Real User Testing: Unlike staging or synthetic tests, the canary is validated against real production traffic and real users.

Data-Driven Decisions: Automated rollback based on metrics removes emotion from deployment decisions.

Gradual Migration: Particularly useful for major architectural changes or new features.

Resource Efficient: Requires only a small percentage of additional capacity.

Disadvantages

Complexity: Requires sophisticated traffic routing and monitoring infrastructure.

Monitoring Requirements: Need robust metrics and alerting to make data-driven decisions.

Slower Rollouts: Full deployment can take hours or days depending on your strategy.

User Experience Inconsistency: Some users see the new version while others see the old version.

Session Affinity Challenges: Users switching between versions mid-session can cause issues.

When to Use Canary

  • You have strong monitoring and observability infrastructure
  • You want to minimize risk for critical applications
  • You can tolerate gradual rollouts
  • You have the infrastructure to support sophisticated traffic routing
  • You deploy frequently and want automated confidence

Canary Deployment Metrics

Key metrics to monitor during canary deployments:

// Example monitoring configuration

const canaryMetrics = {
  // Error rates
  errorRate: {
    threshold: 1.0, // 1% error rate
    comparison: 'canary vs baseline',
    action: 'rollback if exceeded'
  },

  // Latency
  p99Latency: {
    threshold: 500, // milliseconds
    comparison: 'canary P99 latency',
    action: 'rollback if exceeded'
  },

  // Success rate
  successRate: {
    threshold: 99.0, // 99% success
    comparison: 'canary success rate',
    action: 'rollback if below'
  },

  // Custom business metrics
  conversionRate: {
    threshold: -5.0, // -5% change
    comparison: 'canary vs baseline',
    action: 'rollback if decreased by more than 5%'
  },

  // Infrastructure metrics
  cpuUsage: {
    threshold: 80, // 80% CPU
    comparison: 'canary CPU usage',
    action: 'rollback if exceeded'
  },

  memoryUsage: {
    threshold: 90, // 90% memory
    comparison: 'canary memory usage',
    action: 'rollback if exceeded'
  }
};

Rolling Deployments

How It Works

Rolling deployments gradually replace instances of the old version with the new version, one (or a few) at a time. This is the default deployment strategy for many platforms including Kubernetes.

The deployment process:

  1. Start with N instances of v1.0 running
  2. Stop one instance of v1.0
  3. Start one instance of v2.0
  4. Wait for health checks to pass
  5. Repeat steps 2-4 until all instances are v2.0

# Kubernetes Rolling Update Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      # Maximum number of pods unavailable during update
      maxUnavailable: 1
      # Maximum number of pods created over desired replica count
      maxSurge: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: myapp:v2.0
        ports:
        - containerPort: 8080
        # Readiness probe - traffic only sent when ready
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 3
        # Liveness probe - restart if unhealthy
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
          failureThreshold: 3
        # Lifecycle hooks
        lifecycle:
          preStop:
            exec:
              # Graceful shutdown - drain connections
              command: ["/bin/sh", "-c", "sleep 15"]

Advantages

Resource Efficient: Doesn't require extra infrastructure—uses existing capacity.

Native Support: Built into most orchestration platforms (Kubernetes, ECS, etc.).

Gradual Rollout: Problems affect fewer users than a big-bang deployment.

Simple Configuration: Easy to set up with minimal infrastructure changes.

Cost Effective: No need for duplicate environments.

Disadvantages

Mixed Versions: Old and new versions run simultaneously during rollout.

Slower Rollback: Rolling back means performing another rolling update with the previous version, which takes roughly as long as the original rollout.

Session Issues: Users might hit different versions on subsequent requests.

Potential Downtime: If maxUnavailable is set too high, capacity might drop below requirements.

Database Migrations: Both versions must work with the same database schema.

When to Use Rolling Updates

  • You have limited infrastructure budget
  • Your application supports mixed version deployments
  • You deploy frequently and need simplicity
  • Your platform provides native rolling update support
  • You have good health checks and monitoring

Advanced Rolling Update Patterns

# Example: Rolling update with pre-deployment validation

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 20
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # Never reduce capacity
      maxSurge: 5        # Add 5 pods at a time
  minReadySeconds: 30    # Wait 30s before considering pod ready
  progressDeadlineSeconds: 600  # Fail deployment after 10min
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
        version: v2.0
    spec:
      containers:
      - name: app
        image: myapp:v2.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
        # Startup probe - for slow-starting apps
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
          failureThreshold: 30
          periodSeconds: 10
        # Readiness probe
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          successThreshold: 1
          failureThreshold: 3
        # Liveness probe
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
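
A rolling update can also be paused and resumed by hand while it is in flight, which is the low-tech version of the canary-style pauses shown later. A short sketch against the myapp Deployment above:

# Pause the rollout partway through to inspect metrics on the new pods
kubectl rollout pause deployment/myapp

# Continue once metrics look healthy...
kubectl rollout resume deployment/myapp

# ...or roll back to the previous ReplicaSet
kubectl rollout undo deployment/myapp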

Comparison Matrix

| Feature | Blue-Green | Canary | Rolling |
| --- | --- | --- | --- |
| Rollback Speed | Instant (seconds) | Fast (minutes) | Slow (rolling back) |
| Resource Cost | High (2x) | Medium (1.1-1.2x) | Low (1x) |
| Complexity | Low | High | Low |
| Risk Level | Low | Very Low | Medium |
| Testing in Prod | Yes (before switch) | Yes (with real users) | Limited |
| Mixed Versions | No | Yes | Yes |
| Monitoring Required | Basic | Advanced | Basic |
| Database Migrations | Easier | Complex | Complex |
| User Impact on Failure | None (if caught before switch) | 1-10% | 10-50% |

Hybrid Strategies

In practice, many teams combine these strategies:

Blue-Green Canary

Deploy to green environment, route 5% of traffic to green for testing, then switch 100% once validated.

# Example: Blue-Green Canary with weighted routing

# 95% traffic to Blue (current)
Blue Target Group Weight: 95

# 5% traffic to Green (canary)
Green Target Group Weight: 5

# After validation, switch to 100% Green
Blue Target Group Weight: 0
Green Target Group Weight: 100
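
On AWS, that weighted split maps onto an ALB listener with a weighted forward action. A hedged sketch using the AWS CLI; the listener and target group ARNs are placeholders:

# Send 95% of traffic to Blue and 5% to Green (canary)
aws elbv2 modify-listener \
  --listener-arn arn:aws:elasticloadbalancing:region:account:listener/app/myapp/abc/def \
  --default-actions '[{
    "Type": "forward",
    "ForwardConfig": {
      "TargetGroups": [
        {"TargetGroupArn": "arn:aws:elasticloadbalancing:region:account:targetgroup/myapp-blue-tg/abc", "Weight": 95},
        {"TargetGroupArn": "arn:aws:elasticloadbalancing:region:account:targetgroup/myapp-green-tg/def", "Weight": 5}
      ]
    }
  }]'

# After validation, repeat the call with weights 0 / 100 to complete the switch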

Rolling Canary

Use rolling updates but pause when 10% of instances are updated, monitor metrics, then continue or rollback.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    canary:
      steps:
      - setWeight: 10  # 10% traffic to new version
      - pause: {duration: 5m}  # Monitor for 5 minutes
      - setWeight: 25
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {duration: 5m}
      - setWeight: 75
      - pause: {duration: 5m}
      # Automatic rollback if metrics fail
      analysis:
        templates:
        - templateName: success-rate
        - templateName: error-rate
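
Argo Rollouts ships a kubectl plugin for observing and intervening in those paused steps. A brief sketch, assuming the kubectl-argo-rollouts plugin is installed:

# Watch the rollout move through its canary steps
kubectl argo rollouts get rollout myapp --watch

# Skip the current pause and promote to the next step
kubectl argo rollouts promote myapp

# Abort and shift all traffic back to the stable version
kubectl argo rollouts abort myapp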

Database Migration Strategies

Zero-downtime deployments become challenging when database schema changes are involved.

Backward-Compatible Changes

Make all database changes backward-compatible:

  1. Adding columns: Old code ignores new columns
  2. Adding tables: Old code doesn't use new tables
  3. Expanding constraints: Change VARCHAR(50) to VARCHAR(100)

The Expand-Migrate-Contract Pattern

# Phase 1: Expand (add new column)
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

# Phase 2: Migrate (dual writes)
# Deploy code that writes to both first_name/last_name AND full_name

# Phase 3: Backfill
UPDATE users 
SET full_name = CONCAT(first_name, ' ', last_name)
WHERE full_name IS NULL;

# Phase 4: Deploy code that only uses full_name

# Phase 5: Contract (remove old columns)
ALTER TABLE users 
DROP COLUMN first_name,
DROP COLUMN last_name;

Monitoring and Observability

Successful zero-downtime deployments require comprehensive monitoring:

Key Metrics to Track

// Essential deployment metrics

const deploymentMetrics = {
  // Application metrics
  requestRate: 'requests per second',
  errorRate: '5xx errors per second',
  latencyP50: '50th percentile latency',
  latencyP95: '95th percentile latency',
  latencyP99: '99th percentile latency',

  // Infrastructure metrics
  cpuUtilization: 'CPU usage percentage',
  memoryUtilization: 'Memory usage percentage',
  diskIO: 'Disk I/O operations',
  networkIO: 'Network I/O bandwidth',

  // Business metrics
  activeUsers: 'Current active users',
  conversionRate: 'Purchase/signup rate',
  revenuePerMinute: 'Revenue generation rate',

  // Database metrics
  queryLatency: 'Database query latency',
  connectionPoolUtilization: 'DB connection usage',
  deadlocks: 'Database deadlock count'
};

Automated Rollback Criteria

# Example: Automated rollback logic

def should_rollback(metrics, baseline):
    """
    Determine if deployment should be automatically rolled back
    """
    rollback_criteria = [
        # Error rate increased by more than 50%
        metrics['error_rate'] > baseline['error_rate'] * 1.5,

        # P99 latency increased by more than 100%
        metrics['p99_latency'] > baseline['p99_latency'] * 2.0,

        # Success rate dropped below 99%
        metrics['success_rate'] < 99.0,

        # CPU usage above 90%
        metrics['cpu_usage'] > 90,

        # Memory usage above 95%
        metrics['memory_usage'] > 95,

        # Business metric: conversion rate dropped by more than 10%
        metrics['conversion_rate'] < baseline['conversion_rate'] * 0.9
    ]

    return any(rollback_criteria)

Common Pitfalls and How to Avoid Them

1. Insufficient Health Checks

Problem: Simple health checks that only verify the process is running, not that it's functioning correctly.

Solution: Implement comprehensive health checks:

// Example: Comprehensive health check endpoint

app.get('/health/ready', async (req, res) => {
  const checks = {
    database: await checkDatabaseConnection(),
    cache: await checkRedisConnection(),
    externalAPI: await checkExternalDependencies(),
    diskSpace: await checkDiskSpace(),
    memory: checkMemoryUsage()
  };

  const allHealthy = Object.values(checks).every(check => check.healthy);

  res.status(allHealthy ? 200 : 503).json({
    status: allHealthy ? 'healthy' : 'unhealthy',
    checks,
    timestamp: new Date().toISOString()
  });
});

2. Ignoring Connection Draining

Problem: Terminating instances immediately, causing in-flight requests to fail.

Solution: Implement graceful shutdown:

// Graceful shutdown example

process.on('SIGTERM', async () => {
  console.log('SIGTERM received, starting graceful shutdown');

  // Safety net: force exit if shutdown hangs
  const forceExit = setTimeout(() => {
    console.log('Forcing shutdown after timeout');
    process.exit(1);
  }, 30000); // 30 second timeout

  // Stop accepting new requests and wait for in-flight requests to finish
  await new Promise((resolve) => server.close(resolve));
  console.log('HTTP server closed');

  // Close database connections
  await database.close();

  // Close other resources
  await cache.disconnect();

  clearTimeout(forceExit);
  console.log('Graceful shutdown complete');
  process.exit(0);
});

3. Not Testing Rollback Procedures

Problem: Discovering rollback procedures don't work when you need them most.

Solution: Regularly test rollbacks in staging/production:

#!/bin/bash
# test-rollback.sh
# Include rollback tests in your deployment pipeline

set -e

# Deploy new version
kubectl set image deployment/myapp app=myapp:v2.0
kubectl rollout status deployment/myapp

# Run smoke tests
./run-smoke-tests.sh

# Intentionally rollback to test the process
kubectl rollout undo deployment/myapp
kubectl rollout status deployment/myapp

# Verify old version works
./run-smoke-tests.sh

echo "Rollback test successful!"

4. Missing Feature Flags

Problem: Can't disable problematic features without redeploying.

Solution: Use feature flags for all new features:

// Feature flag example with the LaunchDarkly Node.js server-side SDK

const LaunchDarkly = require('@launchdarkly/node-server-sdk');

// In production, await client.waitForInitialization() before serving traffic
const client = LaunchDarkly.init(SDK_KEY);

app.get('/api/data', async (req, res) => {
  const user = { key: req.user.id };

  // Check feature flag
  const useNewAlgorithm = await client.variation(
    'new-algorithm',
    user,
    false // default value
  );

  if (useNewAlgorithm) {
    return res.json(await getDataWithNewAlgorithm());
  } else {
    return res.json(await getDataWithOldAlgorithm());
  }
});

Conclusion

Zero-downtime deployments are no longer optional—they're a requirement for modern applications. The three main strategies each have their place:

  • Blue-Green: Use when you need instant rollback and can afford 2x infrastructure
  • Canary: Use when you need maximum safety and have sophisticated monitoring
  • Rolling: Use when you need resource efficiency and have good health checks

Many organizations use hybrid approaches, combining strategies to get the best of multiple worlds. The key is understanding your specific requirements:

  • What's your acceptable risk level?
  • What's your infrastructure budget?
  • How sophisticated is your monitoring?
  • How frequently do you deploy?
  • What's your rollback time requirement?

Start simple with rolling deployments, then evolve to more sophisticated strategies as your needs grow. Invest heavily in monitoring and observability—you can't manage what you can't measure.

Need help implementing zero-downtime deployments? InstaDevOps provides expert consulting and implementation services for deployment strategies, CI/CD pipelines, and infrastructure automation. Contact us for a free consultation.


Need Help with Your DevOps Infrastructure?

At InstaDevOps, we specialize in helping startups and scale-ups build production-ready infrastructure without the overhead of a full-time DevOps team.

Our Services:

  • 🏗️ AWS Consulting - Cloud architecture, cost optimization, and migration
  • ☸️ Kubernetes Management - Production-ready clusters and orchestration
  • 🚀 CI/CD Pipelines - Automated deployment pipelines that just work
  • 📊 Monitoring & Observability - See what's happening in your infrastructure

Special Offer: Get a free DevOps audit - 50+ point checklist covering security, performance, and cost optimization.

📅 Book a Free 15-Min Consultation

Originally published at instadevops.com
