Mustafa ERBAY

Posted on May 30 • Originally published at mustafaerbay.com.tr

3 Deploy Strategies for CI/CD: Cost and Efficiency Analysis

#career #cicd #deploy #maliyet

Taking software live isn't just about copying code to a server; it's a process that directly affects operational cost, user experience, and business continuity. Having deployed systems of varying scales across different projects over the years, I've repeatedly seen that there's no such thing as "the best" deploy strategy; every choice comes with its own price.

In this post, I will analyze three fundamental deploy strategies I've frequently encountered and implemented in my CI/CD processes, focusing on cost and efficiency based on my experiences. My aim is not just to delve into technical details but also to illustrate with concrete examples how these strategies reflect on your business and career, and what trade-offs you need to consider.

The Importance of Deploy Strategy Selection and Its Relationship with Cost

A poorly chosen deploy strategy can lead to significant costs that are initially unseen but accumulate over time. These costs include not only server rent but also developer time, engineering effort, lost customers, and brand reputation. I personally experienced a situation in a production ERP where a simple deploy error caused shipments to halt for 15 minutes, leading to dozens of trucks waiting and millions of liras in potential losses. This is why deploy strategy selection is a topic that concerns not just the DevOps team but all business units.

Every deploy we make is essentially an act of risk management. The investment we make to minimize this risk is directly related to the strategy we choose. For example, an application that experiences constant downtime requires continuous operational intervention, forcing the engineering team to dedicate valuable time to bug fixing. This reduces the effort that could be spent on developing new features, weakening the product's competitiveness in the long run. For me, the efficiency of a strategy is measured not only by how fast we can deploy but also by how quickly we can detect potential errors and revert with minimal cost.

ℹ️ Hidden Costs

When choosing a deploy strategy, you should consider not only the initial setup cost but also the impact of potential outages on your business, the Mean Time To Recovery (MTTR) in case of an error, and the time your engineering team will spend managing these processes. These "hidden costs" can result in much higher bills than expected in the long run.
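To make these hidden costs comparable across strategies, a back-of-the-envelope model can help. The sketch below is only an illustration: the function name, its parameters, and every number in the usage note are assumptions of mine, not measurements from any of the projects mentioned here.

```python
def expected_annual_deploy_cost(
    deploys_per_year: int,
    failure_rate: float,             # share of deploys that go wrong (0.0 - 1.0)
    mttr_minutes: float,             # mean time to recovery after a bad deploy
    outage_cost_per_minute: float,   # business cost of one minute of outage
    planned_downtime_minutes: float = 0.0,   # downtime every deploy causes by design
    extra_infra_cost_per_year: float = 0.0,  # e.g. a second environment
) -> float:
    """Rough yearly cost: planned downtime + failed-deploy downtime + extra infrastructure."""
    planned = deploys_per_year * planned_downtime_minutes * outage_cost_per_minute
    unplanned = deploys_per_year * failure_rate * mttr_minutes * outage_cost_per_minute
    return planned + unplanned + extra_infra_cost_per_year


# Hypothetical comparison: same deploy frequency and failure rate, different strategies.
recreate_cost = expected_annual_deploy_cost(
    100, 0.05, mttr_minutes=15, outage_cost_per_minute=200, planned_downtime_minutes=1
)
blue_green_cost = expected_annual_deploy_cost(
    100, 0.05, mttr_minutes=1, outage_cost_per_minute=200, extra_infra_cost_per_year=12_000
)
```

With these made-up inputs the simple strategy works out to 35,000 per year and the duplicated environment to 13,000, which shows how a shorter MTTR can outweigh a visibly higher infrastructure bill. Change the outage cost per minute and the ranking can flip.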

1. Recreate Strategy: The Price of Simplicity

Recreate, as the name suggests, is a deployment method that involves completely shutting down existing application instances, then creating and launching new ones from scratch with the new version. This is the simplest of deploy strategies and is typically preferred for small projects or in development/test environments.

How It Works

All currently running application instances are stopped.
Existing instances are completely deleted.
New instances are created and started with the new version of the application.

Advantages and Disadvantages

Advantages:

Simplicity: It's the easiest strategy to configure and manage. It doesn't require extra tooling or complex routing.
Easy to Understand: It has a flow that everyone in the team can easily grasp.
Resource Consumption: Generally doesn't require extra server resources (because it shuts down the old and starts the new).

Disadvantages:

Downtime: This is its biggest drawback. The application is completely inaccessible from the moment the old version is shut down until the new one has started. This downtime can last minutes, depending on the application's startup time and the complexity of database migrations. In the initial MVP versions of my side products, I saw users abruptly leaving, and sometimes even experiencing data loss, due to these outages.
User Experience: Downtime directly hurts the user experience. It's unacceptable, especially for systems requiring continuous access like e-commerce sites or financial applications.
Risk: If the new version fails to start, the application can remain down for an extended period. Rolling back repeats the same stop-and-start cycle, so it costs the same downtime again.

Real-World Scenario and My Perspective

I use the Recreate strategy for the backend of a small project or my personal website, i.e., in scenarios where it handles a few hundred requests a day and a few minutes of downtime is acceptable. For example, a simple deploy flow like the one below might suffice for a basic service managed with docker-compose:

#!/bin/bash
set -euo pipefail # Abort at the first failing step instead of continuing
echo "Stopping application and deleting old images..."
docker-compose down --rmi all
echo "Building new images..."
docker-compose build
echo "Starting application..."
docker-compose up -d
echo "Deploy complete. Application will be accessible in a few seconds."

If the application takes 10-15 seconds to start up, this flow can lead to a total downtime of 30-60 seconds, and longer if the build is slow, because the images are built while the application is stopped. While this duration might be acceptable for a low-traffic blog site or an internal tool, it's definitely not suitable for a high-traffic platform. In my opinion, Recreate is a starting point for the learning phase or for very low-budget, low-criticality projects, but it's a strategy that should be abandoned as soon as possible for any application with growth ambitions.
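The downtime of a Recreate deploy is simply the sum of everything that happens while no instance is running. The sketch below makes that arithmetic explicit. The function name and the timings are hypothetical, and the build_before_stop variant (building the images before stopping the old containers) is a variation I am assuming for comparison, not part of the script above.

```python
def recreate_downtime_seconds(
    stop_s: float,
    build_s: float,
    startup_s: float,
    build_before_stop: bool = False,
) -> float:
    """Upper bound on the outage caused by one Recreate deploy."""
    if build_before_stop:
        # Images are already built, so only stop + startup happen during the outage.
        return stop_s + startup_s
    # Stop, then build, then start: the build time is part of the outage.
    return stop_s + build_s + startup_s


# Hypothetical timings: 5 s to stop, 30 s to build, 15 s to start.
as_scripted = recreate_downtime_seconds(5, 30, 15)
build_first = recreate_downtime_seconds(5, 30, 15, build_before_stop=True)
```

With these example timings the outage is 50 seconds as scripted and 20 seconds when the build runs first. The strategy is still Recreate in both cases; only the length of the gap changes.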

2. Rolling Update: A Balanced Approach

Rolling Update is a more sophisticated strategy developed to overcome the downtime disadvantage of Recreate. In this method, existing application instances are gradually replaced with new ones. The application remains accessible during the deploy, but for a short period, both old and new versions might run simultaneously.

How It Works

One or more instances of the new version are launched.
Once these new instances are deemed healthy, one or more instances of the old version are stopped.
This process continues until all old instances are replaced with the new version.
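The three steps above can be traced with a small simulation. This is a minimal sketch of the idea, not how any particular orchestrator implements it; the function name and the batch parameter are made up for illustration, and health checking is reduced to a comment.

```python
def rolling_update(replicas: int, batch: int = 1) -> list[tuple[int, int]]:
    """Return the (old, new) instance counts after each step of a rolling update."""
    old, new = replicas, 0
    steps = [(old, new)]
    while old > 0:
        n = min(batch, old)
        new += n                  # launch new instances and wait until they are healthy
        steps.append((old, new))
        old -= n                  # only then stop the same number of old instances
        steps.append((old, new))
    return steps


for old, new in rolling_update(3):
    print(f"old={old} new={new} total={old + new}")
```

In this "start new first" variant the total never falls below the desired replica count, at the price of temporarily running one extra instance per batch. Every intermediate step has old and new instances serving side by side, which is where the version skew discussed below comes from.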

Advantages and Disadvantages

Advantages:

Reduced Downtime: The application generally remains accessible during the deploy. Users experience no downtime or very brief, unnoticeable momentary interruptions.
Safer Rollback: If an issue is detected, the deploy can be paused, and rolling back to the old version is easier because some of the old instances might still be running.
Resource Efficiency: It doesn't require a duplicate infrastructure like Blue/Green, which makes it more cost-effective.

Disadvantages:

Version Skew: Since old and new versions run simultaneously during the deploy, these two versions must be compatible with each other. Changes in APIs or database schemas can complicate this. While updating backend services for a large Turkish e-commerce site, it took us 20 minutes to realize that some users' shopping carts were corrupted due to schema differences between the old and new versions during PATCH operations.
Slower Rollout: Gradual replacement can make the entire deploy process take longer.
Complex Health Checks: Robust health checks are required to accurately determine if new instances are "healthy."

Real-World Scenario and My Perspective

For most medium-scale and critical applications, Rolling Update offers a good balance between cost and reliability. Kubernetes Deployment objects use the Rolling Update strategy by default, and we can fine-tune this process with maxSurge and maxUnavailable parameters.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1       # How many new instances can be created in addition to existing ones
      maxUnavailable: 1 # How many instances can be unavailable at the same time
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app-container
        image: my-registry/my-app:v2.0 # New image version
        ports:
        - containerPort: 8080
        livenessProbe: # Checks if the application is running
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        readinessProbe: # Checks if the application is ready to receive requests
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 3

In the Kubernetes configuration above, with maxSurge: 1 and maxUnavailable: 1 on 3 replicas, the total number of instances can rise to 4 during the rollout, while the number of available instances never drops below 2. This keeps most of the application's capacity online during the deploy. However, to mitigate the risk of version skew, I always make sure to design my APIs to be backward compatible. I also always perform database schema changes in two phases: first the additive changes that both versions can work with, then, once no old instance is running, the breaking changes. [Related: database schema migration strategies]
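The bounds implied by maxSurge and maxUnavailable can be computed directly. The helper below is a sketch with names of my own choosing. It follows the documented Kubernetes rule that a percentage for maxSurge is rounded up and a percentage for maxUnavailable is rounded down; it does not talk to a cluster.

```python
import math


def _resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a percentage string like '25%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas: int, max_surge=1, max_unavailable=1) -> tuple[int, int]:
    """Return (maximum total pods, minimum available pods) during a rolling update."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    return replicas + surge, replicas - unavailable


print(rollout_bounds(3, 1, 1))           # the Deployment above
print(rollout_bounds(10, "25%", "25%"))  # the Kubernetes defaults on 10 replicas
```

For the Deployment above this gives at most 4 pods in total and at least 2 available. The first number tells you how much spare capacity the cluster needs during a deploy, the second how much traffic the remaining pods must be able to absorb.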

⚠️ Health Checks Are Critical

The most critical point in Rolling Update is the health checks, livenessProbe and readinessProbe. The readinessProbe decides whether an instance receives traffic, while a failing livenessProbe causes the container to be restarted. If an instance mistakenly reports itself as ready and receives traffic, users might encounter errors. Therefore, it's crucial that these checks accurately reflect the application's true state, and even act like an integration test.
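On the application side, the two probe paths used in the Deployment above (/health and /ready) could be served roughly as follows. This is a standard-library sketch under assumptions: check_database is a placeholder I invented, and a real service would plug in its own dependency checks.

```python
from http.server import BaseHTTPRequestHandler


def check_database() -> bool:
    """Hypothetical dependency check; replace with a real ping to your database."""
    return True


READINESS_CHECKS = {"database": check_database}


def probe_status(path: str, checks=READINESS_CHECKS) -> tuple[int, str]:
    """Map a probe path to an HTTP status code and a short body."""
    if path == "/health":
        # Liveness: the process is up and able to answer at all.
        return 200, "alive"
    if path == "/ready":
        # Readiness: every dependency must pass before traffic is accepted.
        failed = [name for name, check in checks.items() if not check()]
        if failed:
            return 503, "not ready: " + ", ".join(failed)
        return 200, "ready"
    return 404, "not found"


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe_status(self.path)
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, serve the handler with http.server.HTTPServer(("", 8080), ProbeHandler).serve_forever(). Keeping the liveness check cheap and putting dependency checks only behind /ready avoids restarting a healthy container just because its database is briefly unreachable.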

3. Blue/Green Deployment: The Goal of Zero Downtime

Blue/Green deployment, as its name suggests, is based on the principle of running two identical production environments in parallel: "blue" and "green." While one environment (e.g., "blue") handles live traffic, the other environment ("green") is updated with the new version. After testing is complete, traffic is switched to the green environment, and the blue environment remains on standby or is updated for the next deploy. This strategy offers near-zero downtime and instant rollback capability.

How It Works

While the "blue" environment processes live traffic, the "green" environment is set up with the new application version and tested.
Once all tests are successful, all live traffic is redirected to the "green" environment by changing a load balancer or DNS settings.
The "blue" environment is no longer active. It can be kept as a backup or updated to take the place of the "green" environment for the next deploy.

Advantages and Disadvantages

Advantages:

Zero Downtime: Users experience no downtime as traffic is directly routed to the new environment.
Instant Rollback: If an issue is detected in the "green" environment, traffic can be instantly redirected back to the "blue" environment. This is one of the fastest rollback mechanisms.
Isolated Testing: The new version can be thoroughly tested in an isolated environment with real traffic simulations before going live.

Disadvantages:

Infrastructure Cost: The biggest disadvantage is the requirement to continuously maintain two fully-fledged production environments. This means double the server, database, and network resources. When deploying a critical financial module for an internal banking platform, I performed a risk-benefit analysis to justify this cost. The risk was so high that cost became a secondary concern.
Stateful Services Management: Managing stateful services like databases is complex. It requires carefully synchronizing databases between the two environments or managing migrations.
Complex Configuration: Load balancer, DNS, and network configurations are more complex.

Real-World Scenario and My Perspective

Blue/Green is widely used in sectors where business continuity is critical, such as finance, healthcare, or telecommunications. In my own production ERP, we set up a Blue/Green-like structure for updates to core modules. The main challenge here was designing database migrations to be seamless and backward compatible. Typically, for this type of deploy, an Nginx configuration like the one below or cloud provider load balancer features are used:

# Nginx conf example for Blue/Green
upstream blue_servers {
    server 192.168.1.10; # Server IPs in the Blue environment
    server 192.168.1.11;
}

upstream green_servers {
    server 192.168.1.20; # Server IPs in the Green environment
    server 192.168.1.21;
}

server {
    listen 80;
    server_name myapp.example.com;

    location / {
        # We define the environment to which live traffic will be directed here.
        # During deploy, this 'include' line is changed or updated by automated tools.
        include /etc/nginx/current_target_env.conf;
    }
}

In the current_target_env.conf file, traffic is redirected by changing the line from proxy_pass http://blue_servers; to proxy_pass http://green_servers; (or back again) and then reloading Nginx, for example with nginx -s reload, so that the change takes effect. This is a very practical method for someone like me who enjoys bare-metal and container hybrid deployments.
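The switch itself is easy to automate. The sketch below assumes the file layout from the Nginx example above and that the script runs with enough privileges to write the file and signal Nginx. The function names are mine; nginx -t and nginx -s reload are the standard commands for testing and reloading the configuration.

```python
import os
import subprocess
import tempfile

# Upstream names taken from the Nginx example above.
UPSTREAMS = {"blue": "blue_servers", "green": "green_servers"}


def write_target(conf_path: str, color: str) -> str:
    """Atomically rewrite the include file so it points at the chosen environment."""
    line = f"proxy_pass http://{UPSTREAMS[color]};\n"
    directory = os.path.dirname(os.path.abspath(conf_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(line)
    os.replace(tmp_path, conf_path)  # atomic rename within the same directory
    return line


def switch_traffic(conf_path: str, color: str) -> None:
    """Point Nginx at 'blue' or 'green', validate the config, then reload."""
    write_target(conf_path, color)
    subprocess.run(["nginx", "-t"], check=True)            # refuse to reload a broken config
    subprocess.run(["nginx", "-s", "reload"], check=True)  # start routing to the new target
```

A call such as switch_traffic("/etc/nginx/current_target_env.conf", "green") performs the cutover, and calling it again with "blue" is the rollback. If the validation step fails, the running Nginx keeps serving with its old configuration because no reload has happened yet.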

💡 Strategies for Stateful Services

When performing Blue/Green with stateful services like databases, we usually keep the database common. I always design database schema changes for the new version to be backward compatible. That is, the new version should be able to understand both the old and new schemas. Migrations are performed either before the traffic switch or during the traffic switch in a way that both the old and new versions can work compatibly.
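In application code, "understanding both schemas" usually means a tolerant reader. The example below is purely illustrative: the column names unit_price and unit_price_cents are hypothetical and stand in for whatever field a migration renames or retypes.

```python
from decimal import Decimal


def read_cart_item(row: dict) -> dict:
    """Normalize a cart row written by either the old or the new application version."""
    if "unit_price_cents" in row:
        # New schema (hypothetical): price stored as an integer number of cents.
        cents = int(row["unit_price_cents"])
    else:
        # Old schema (hypothetical): price stored as a decimal string such as "19.99".
        cents = int(Decimal(row["unit_price"]) * 100)
    return {
        "sku": row["sku"],
        "quantity": int(row.get("quantity", 1)),
        "unit_price_cents": cents,
    }


old_row = {"sku": "A1", "unit_price": "19.99"}
new_row = {"sku": "A1", "quantity": 2, "unit_price_cents": 1999}
print(read_cart_item(old_row))
print(read_cart_item(new_row))
```

As long as the new version reads rows this way and the old column is not dropped until the old environment is retired, both blue and green can share one database during the switch.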

4. Canary Deployment and A/B Testing: Controlled Risk and Optimization

Canary deployment takes the controlled transition feature of Blue/Green even further. The new version is first rolled out to a small percentage of users (typically 1-5%). Feedback and metrics from this "canary group" are carefully monitored. If everything goes well, the new version is gradually rolled out to more users. This is an approach that maximizes risk management.

How It Works

The new version of the application ("canary") is deployed to a small group of servers or a user segment.
The performance, error rates, and user experience metrics of this canary group are closely monitored.
If the canary version is stable and meets expectations, it is gradually rolled out to more users.
This process is completed when all users have transitioned to the new version.
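When the canary group is a user segment rather than a set of servers, the assignment is often done by hashing a stable identifier. The following is one common way to do it, sketched under my own naming; it is not tied to any specific load balancer or feature-flag product.

```python
import hashlib


def bucket(user_id: str, buckets: int = 10_000) -> int:
    """Map a user ID to a stable bucket number in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def in_canary(user_id: str, percent: float) -> bool:
    """True if this user belongs to the first `percent` percent of buckets."""
    return bucket(user_id) < percent * 100  # 10,000 buckets = steps of 0.01 %


# The same user always gets the same answer for the same percentage.
print(in_canary("user-42", 5))
```

Because the bucket depends only on the user ID, a user who is in the canary group at 5% stays in it when the rollout grows to 25% or 50%. That keeps the experience consistent for early users and makes their metrics comparable over time.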

Advantages and Disadvantages

Advantages:

Minimum Risk: Ensures that only a small group of users is affected in case of a major issue. This minimizes the damage caused by an error. When testing a new algorithm for my Android spam blocker application, I used this method to ensure that potential errors reached very few users.
Real-World Testing: The new version of the application is tested with real users and real traffic. This helps uncover edge cases that might not be caught in test environments.
A/B Testing Integration: Canary deploy can also be used for A/B testing. We can present different feature sets to different user groups to measure which version performs better.

Disadvantages:

Complex Monitoring: Metrics, logs, and traces need to be monitored in great detail and in real-time. There is a high risk of alert fatigue.
Traffic Routing Complexity: Advanced load balancer or service mesh capabilities are required to segment users into specific percentages.
Longer Rollout Time: The gradual transition can make the entire deploy process take longer.

Real-World Scenario and My Perspective

Canary deploy is ideal for large-scale, critical applications that are constantly adding new features. For example, when uploading a new version of a mobile application to the Play Store, it's common to first roll it out to 5% of users, check for errors, and then gradually expand the rollout.

On the traffic routing side, reverse proxies like Nginx or service mesh solutions like Istio can be used. For example, in Nginx, it's possible to segment traffic into percentages using the split_clients module:

# Nginx conf example for Canary Deployment
http {
    # Percentage-based separation: the client IP is hashed, so a given
    # client consistently lands in the same group.
    split_clients "${remote_addr}" $canary_traffic {
        5%   canary_servers; # 5% of traffic to canary
        *    green_servers;  # The rest to the main environment
    }

    # Alternatively, cookie-based separation can be done (e.g. for A/B testing).
    # A map assigns no percentages; it only routes by the cookie value.
    map $cookie_ab_test $backend_pool {
        default green_servers; # No matching cookie: main environment (Green)
        "~^blue" blue_servers; # Cookie value starting with "blue": Blue environment
    }

    upstream blue_servers { server 192.168.1.10; }
    upstream green_servers { server 192.168.1.20; }
    upstream canary_servers { server 192.168.1.30; } # Canary environment

    server {
        listen 80;
        server_name myapp.example.com;

        location / {
            # proxy_pass http://$backend_pool; # Cookie-based example
            proxy_pass http://$canary_traffic; # split_clients-based example
        }
    }
}

This complexity also brings a significant monitoring burden. Real-time monitoring of metrics, logs, and traces with tools like Prometheus, Grafana, and ELK Stack is vital. Otherwise, we might fail to notice an issue in a small canary group and mistakenly roll it out to all users. [Related: monitoring your systems with observability]

🔥 Alert Fatigue and False Positives

In Canary deploy, since a large number of metrics are monitored, false positives or "alert fatigue" can occur. This can lead to real problems being overlooked. Therefore, it's important to optimize alarms by setting meaningful thresholds only for critical metrics and to establish automatic "healing" mechanisms.
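One way to keep alerts meaningful is to reduce the canary decision to a single critical metric compared against the stable version. The sketch below does that for the error rate. The function name, the thresholds, and their defaults are illustrative assumptions; the request counts would come from whatever monitoring system is in use.

```python
def canary_verdict(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    min_requests: int = 500,        # do not judge on too little traffic
    max_ratio: float = 2.0,         # tolerate up to 2x the baseline error rate
    absolute_floor: float = 0.001,  # and never alarm below 0.1 % errors
) -> str:
    """Return 'wait', 'promote', or 'rollback' for the current canary window."""
    if canary_requests < min_requests:
        return "wait"
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests
    allowed = max(baseline_rate * max_ratio, absolute_floor)
    return "rollback" if canary_rate > allowed else "promote"
```

The minimum sample size guards against false positives from a handful of requests, and the absolute floor prevents a rollback when the baseline happens to have zero errors and the canary has one. A "rollback" verdict is the natural trigger for an automatic healing step such as setting the canary share back to 0%.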

Critical Factors Considered in Strategy Selection and My Approach

Choosing the right deploy strategy depends on the application's specific needs, budget, team maturity, and the criticality of the business. There is no single "one-size-fits-all" solution. Here are some critical factors I consider when selecting a deploy strategy for my projects:

Acceptable Downtime Tolerance:
- Hours/Minutes: Internal tools, personal websites (Recreate).
- Seconds: Most commercial applications, background services (Rolling Update).
- Zero: Finance, healthcare, e-commerce (Blue/Green, Canary).
Infrastructure Cost: The additional resource requirements (servers, load balancers, storage) introduced by each strategy. Blue/Green and Canary increase infrastructure costs.
Rollback Time and Complexity: How quickly and safely can we revert to the old version in case of an error? Blue/Green and Canary are very advantageous in this regard.
Team Maturity and Competence: The team's ability to manage, monitor, and troubleshoot complex deploy processes. Recreate requires the least expertise, while Canary demands the most.
Application Criticality Level: The impact of the application on business processes. The deploy requirements for a financial system differ from those of a blog site.
Test Coverage and Automation: How well the new version is tested. Comprehensive automated tests can support more aggressive deploy strategies.

In my opinion, when choosing a deploy strategy, it's essential to find the "most suitable" rather than the "best." My general approach is to start with Rolling Update whenever possible, and then transition to Blue/Green or Canary as the application's criticality and traffic volume increase. This transition allows me to optimize infrastructure costs and enables the team to gradually adapt to the processes.
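The downtime tolerance tiers listed above can be turned into a first-pass rule of thumb. The function below is a deliberately simplified sketch: the 60-second cutoff, the parameter names, and the fallback to Rolling Update when a duplicate environment is not affordable are my assumptions, and the other factors (team maturity, test coverage) are left out.

```python
def suggest_strategy(
    tolerable_downtime_seconds: float,
    duplicate_infra_affordable: bool = False,
    fine_grained_routing: bool = False,
) -> str:
    """First-pass suggestion based on downtime tolerance and available infrastructure."""
    if tolerable_downtime_seconds >= 60:
        return "Recreate"        # minutes or hours of downtime are acceptable
    if tolerable_downtime_seconds > 0 or not duplicate_infra_affordable:
        return "Rolling Update"  # seconds at most, or no budget for a second environment
    # Zero tolerance and budget available: choose by routing capability.
    return "Canary" if fine_grained_routing else "Blue/Green"


print(suggest_strategy(300))            # internal tool
print(suggest_strategy(5))              # typical commercial application
print(suggest_strategy(0, True, True))  # critical system with a service mesh
```

Treat the result as a starting point for the discussion, not as the decision itself; the remaining factors in the list can easily override it.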

ℹ️ Decision Matrix

A simple decision matrix like the one below can guide deploy strategy selection:

| Feature / Strategy | Recreate | Rolling Update | Blue/Green | Canary
