Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

Choosing a Deploy Strategy in CI/CD Pipeline Optimization

The biggest mistake technical teams make when choosing a deployment strategy is focusing solely on theoretical "zero-downtime" promises while ignoring the operational and financial bill. Everyone keeps talking about blue-green or canary deployments, but nobody discusses how much CPU, memory, and engineering hours these architectures consume in the background. If your budget isn't infinite and you don't want to wake up to database deadlocks at midnight, you must be pragmatic when choosing your deployment strategy.

Throughout my 20 years of field experience, I have implemented, broken, and rebuilt each of these strategies multiple times on both bare-metal servers and hybrid container deployments. In this article, I lay out the cost and risk analyses I gathered directly from the field (stuff you won't find in theoretical textbooks), complete with concrete figures and configurations. My goal is to save you from unnecessary infrastructure costs and help you choose the most suitable, least headache-inducing method for your project.

The Illusion in Deployment Strategies: Is Blue-Green Always the Savior?

Blue-green deployment relies on running two identical production environments (blue and green) and routing traffic from one to the other at the load balancer level. It sounds fantastic on paper: risk is near zero because once the new version (green) is ready, you switch traffic there with a single flip. However, this approach roughly doubles the cost of your application tier. If you have a production environment with 32 GB RAM and 16 vCPUs, you have to keep an equivalent amount of resources on standby just so they are available during deployment.

In one of our production ERP projects, running on PostgreSQL 15 and a FastAPI backend architecture, we decided to use blue-green deployment. Our server costs increased by exactly 100% the following month because the green environment had to remain up and running constantly. Moreover, it doesn't just stop with the application servers; both environments must safely access the same database without exhausting the connection pools. The biggest challenge we faced during this process was exceeding PostgreSQL connection limits during the transition phase when both environments were active.

Below is a simple yet effective upstream configuration I used on Nginx to dynamically route traffic between the blue and green environments during this transition:

# /etc/nginx/conf.d/upstream.conf
upstream backend_servers {
    # Blue environment (Active)
    server 10.0.1.10:8000 max_fails=3 fail_timeout=10s;

    # Green environment (New version - standby during transition)
    server 10.0.1.20:8000 backup;
}

server {
    listen 80;
    server_name api.uretimerp.local;

    location / {
        proxy_pass http://backend_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_next_upstream error timeout http_502 http_503;
    }
}

In this configuration, the backup parameter prevents traffic from going to the green environment as long as the blue environment is healthy. The moment we disable the blue environment (for example, by stopping the service or changing the proxy definition via Nginx reload), all traffic instantly shifts to green. This method is fast, but it can exhaust the database connection pool in seconds during the transition if you aren't using PgBouncer. Therefore, blue-green requires optimizing not just server costs, but also database connection limits.

⚠️ Connection Pool Warning

During the "warm-up" phase of blue-green transitions, where both environments hit the database simultaneously, reaching the PostgreSQL max_connections limit is highly likely. To prevent this, you should absolutely place a connection pooler like PgBouncer in between.
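
To make the warning concrete, here is a minimal sketch of the arithmetic behind it. The instance count and pool size are hypothetical; the defaults of 100 for max_connections and 3 reserved superuser slots are PostgreSQL's stock settings, so substitute your own values.

```python
def peak_connections(instances_per_env: int, pool_size_per_instance: int,
                     environments: int = 2) -> int:
    """Worst-case connection count while every environment is live at once."""
    return instances_per_env * pool_size_per_instance * environments


def fits_connection_budget(peak: int, max_connections: int = 100,
                           reserved: int = 3) -> bool:
    """True if the peak fits into max_connections minus the reserved slots."""
    return peak <= max_connections - reserved


# Hypothetical sizing: 4 app instances per environment, 15 pooled connections each.
steady = peak_connections(4, 15, environments=1)   # only blue is running
warm_up = peak_connections(4, 15, environments=2)  # blue and green overlap

print(steady, fits_connection_budget(steady))    # 60 True
print(warm_up, fits_connection_budget(warm_up))  # 120 False
```

A setup that is comfortable in steady state can blow past the limit the moment green warms up, which is exactly the window where a pooler earns its keep.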

Canary Deployment Cost Analysis: The Hidden CPU and Memory Toll of Traffic Splitting

Canary deployment is the strategy of releasing a new version to a small subset of users (like 5% or 10%) first, monitoring error rates, and rolling it out gradually. This approach is perfect for risk management, especially in high-traffic systems. However, the hidden cost of this method emerges in the infrastructure's observability and routing layers. To split traffic, you need to set up an advanced L7 load balancer or a Service Mesh architecture, which significantly increases system complexity and resource consumption.

When I tried the canary strategy on the backend infrastructure of a high-traffic side project I developed, I observed a 25% increase in the CPU consumption of Envoy proxies. This is because routing based on headers, cookies, or IP addresses for every incoming request requires substantial processing power. Additionally, to measure the performance of the canary release, you need to separate logs and metrics coming from both versions in real-time. This spikes the number of log lines written per second on Grafana Loki or Elasticsearch, inviting disk space emergencies.
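
Once the metrics of the two versions are separated, the promote-or-rollback decision itself is simple. The sketch below is one possible gate, not the logic of any particular APM tool; the names `VersionStats` and `canary_verdict` and all thresholds are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    requests: int
    errors: int  # e.g. the number of HTTP 5xx responses in the window

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(stable: VersionStats, canary: VersionStats,
                   min_requests: int = 500, max_ratio: float = 2.0,
                   abs_floor: float = 0.01) -> str:
    """Return 'wait', 'promote' or 'rollback' for the canary release."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic yet to judge
    # Tolerate up to max_ratio times the stable error rate, but never
    # fail the canary for staying under the absolute floor.
    allowed = max(stable.error_rate * max_ratio, abs_floor)
    return "promote" if canary.error_rate <= allowed else "rollback"


stable = VersionStats(requests=9000, errors=18)
print(canary_verdict(stable, VersionStats(requests=1000, errors=3)))
print(canary_verdict(stable, VersionStats(requests=1000, errors=40)))
```

The minimum request count matters more than it looks: with 5% of traffic, a canary on a quiet service can take a long time to produce a sample worth trusting.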

The following Envoy configuration snippet shows how we route 10% of traffic to the canary (v2) release and 90% to the stable (v1) release:

# envoy-route-config.yaml
route_config:
  name: local_route
  virtual_hosts:
    - name: backend
      domains: ["*"]
      routes:
        - match: { prefix: "/" }
          route:
            weighted_clusters:
              clusters:
                - name: backend_v1
                  weight: 90
                - name: backend_v2
                  weight: 10
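              # Optional: total_weight is deprecated in recent Envoy releases, which use the sum of the cluster weights instead.
              total_weight: 100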

When you use this type of weighted routing, encountering edge cases is inevitable. For instance, if a user's first request goes to v1 and the second falls into v2, you might experience session or state inconsistencies. If you aren't working with a stateless architecture, you must enable sticky session mechanisms, which increases memory (RAM) usage on the load balancer. While canary looks very safe in theory, the cost of APM tools and logging infrastructure to monitor it often makes it too expensive for small and medium-sized projects.
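
One way to avoid the v1/v2 ping-pong without storing session state on the load balancer is deterministic bucketing: hash a stable user identifier and derive the version from it. This is a minimal application-level sketch; `assign_version`, the salt value, and the use of a user ID as the key are assumptions, and the cluster names are borrowed from the Envoy snippet above.

```python
import hashlib


def assign_version(user_id: str, canary_percent: int = 10,
                   salt: str = "release-2") -> str:
    """Map a user to the same backend on every request.

    The salt is hypothetical: changing it per release reshuffles which
    users end up in the canary group.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    return "backend_v2" if bucket < canary_percent else "backend_v1"


# The same user always gets the same answer, with no lookup table in RAM.
print(assign_version("user-42") == assign_version("user-42"))  # True

canary_users = sum(assign_version(f"user-{i}") == "backend_v2" for i in range(10_000))
print(canary_users)  # close to 1000, i.e. roughly 10% of users
```

The trade-off is that the split is per user rather than per request, so a handful of heavy users can skew the actual traffic share away from the nominal percentage.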

Rolling Update and Database Schema Mismatch: The Zero-Downtime Lie

The rolling update strategy is the safest haven for teams using Docker Compose or Kubernetes. The system promises zero downtime by shutting down old containers one by one and replacing them with new ones. However, this approach completely breaks when you make a backward-incompatible change in the database schema. During deployment, both the old code (v1) and the new code (v2) run simultaneously. If you deleted or renamed a column in the database for the v2 release, the v1 containers that are still alive will start throwing errors.

I once saw a major Turkish e-commerce site unable to receive orders for about 45 minutes due to a similar database schema mismatch. The solution is to design the software development process according to the "Expand and Contract" pattern. In other words, if you want to change a column, you cannot do it in a single deployment. First, you add the new column (expand), deploy the code to write to both columns, migrate the old data, and then delete the old column in a subsequent deployment (contract).
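
On the application side, the "write to both columns" phase can look like the sketch below. It uses an in-memory SQLite database purely so the example runs anywhere; the real system in this article is PostgreSQL, and the helper names `create_order` and `read_address` are hypothetical.

```python
import sqlite3


def create_order(conn: sqlite3.Connection, address: str) -> int:
    """Transitional write path: populate both the old and the new column."""
    cur = conn.execute(
        "INSERT INTO siparisler (teslimat_adresi, teslimat_adresi_yeni) VALUES (?, ?)",
        (address, address),
    )
    conn.commit()
    return cur.lastrowid


def read_address(conn: sqlite3.Connection, order_id: int) -> str:
    """Prefer the new column, fall back to the old one for rows not yet migrated."""
    row = conn.execute(
        "SELECT COALESCE(teslimat_adresi_yeni, teslimat_adresi) "
        "FROM siparisler WHERE id = ?",
        (order_id,),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE siparisler ("
    "id INTEGER PRIMARY KEY, teslimat_adresi TEXT, teslimat_adresi_yeni TEXT)"
)
# A row written by the old code: only the old column is filled.
conn.execute("INSERT INTO siparisler (teslimat_adresi) VALUES ('Old row, Ankara')")
new_id = create_order(conn, "New row, Izmir")

print(read_address(conn, 1), "|", read_address(conn, new_id))
```

Because the old column keeps receiving every write, containers still running the previous version continue to work for the whole duration of the rollout.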

The following SQL series shows how to safely execute this expand-and-contract migration step by step on PostgreSQL:

-- STEP 1: Add the new column (Must be nullable or have a default value)
ALTER TABLE siparisler ADD COLUMN teslimat_adresi_yeni TEXT;

-- STEP 2: Deploy the application code (Make it write to both the old and new columns)
-- At this stage, v1 and v2 code can run simultaneously without throwing errors.

-- STEP 3: Migrate old data to the new column in the background.
-- Run this statement repeatedly until it reports "UPDATE 0"; small batches keep
-- transactions and row locks short.
-- Assumption: siparisler has a primary key column named id; the batch size is illustrative.
UPDATE siparisler
SET teslimat_adresi_yeni = teslimat_adresi
WHERE id IN (
    SELECT id
    FROM siparisler
    WHERE teslimat_adresi_yeni IS NULL
      AND teslimat_adresi IS NOT NULL
    LIMIT 10000
);

-- STEP 4: Deploy the new code to read from and write to only the new column.
-- STEP 5: Safely drop the old column (Contract stage)
ALTER TABLE siparisler DROP COLUMN teslimat_adresi;

This process removes most of the schema-related deployment risk, but it at least triples development and testing time. Deploying with "zero downtime" is not just about writing rollingUpdate: maxSurge: 1 in your pipeline; it requires a disciplined schema migration strategy at the database level. If you cannot afford this, scheduling a 5-minute planned maintenance window at midnight is much more pragmatic and safer than managing this complex process.
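
STEP 3 of the migration above asks for the backfill to run in batches. A minimal sketch of such a driver loop follows; it runs against in-memory SQLite so it is self-contained, whereas the production database here is PostgreSQL. The `id` primary key, the function name `backfill_in_batches`, and the batch size are assumptions.

```python
import sqlite3


def backfill_in_batches(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Copy the old column into the new one in small transactions.

    Returns the total number of rows migrated.
    """
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE siparisler
            SET teslimat_adresi_yeni = teslimat_adresi
            WHERE id IN (
                SELECT id FROM siparisler
                WHERE teslimat_adresi_yeni IS NULL
                  AND teslimat_adresi IS NOT NULL
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE siparisler ("
    "id INTEGER PRIMARY KEY, teslimat_adresi TEXT, teslimat_adresi_yeni TEXT)"
)
demo.executemany(
    "INSERT INTO siparisler (teslimat_adresi) VALUES (?)",
    [(f"Adres {i}",) for i in range(2500)],
)
print(backfill_in_batches(demo, batch_size=1000))  # 2500
```

The extra `IS NOT NULL` condition is what lets the loop terminate: rows whose old value is empty would otherwise be selected again on every pass.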

Resource Consumption and Infrastructure Bill: Pipeline Runner Economy in CI/CD

The cost of deployment strategies is not limited to production servers; the resources consumed by CI/CD runners preparing those deployment packages (artifacts or Docker images) are also significant invoice items. Especially in Docker-based build processes, building images from scratch on every deploy is a waste of time and quickly fills up the runner server's disk. Docker disk space issues are a nightmare that every system administrator using self-hosted runners has encountered at least once.

I use self-hosted GitLab Runners on VPS servers hosting my own side projects and client projects. In the beginning, I saw pipelines crashing weekly due to disk space issues. The reason was accumulated old build caches, stopped containers, and dangling images. To solve this problem, I applied cgroup limits and enabled a systemd timer that performs an automatic cleanup every night.

Below, I share the systemd unit and timer configuration I use on self-hosted runner servers to prevent disk space emergencies by running every night at 03:00:

# /etc/systemd/system/docker-cleanup.service
[Unit]
Description=Docker Disk Cleanup Service
After=docker.service

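[Service]
# systemd type names are case-sensitive: it must be "oneshot", not "OneShot"
Type=oneshot
# Caution: -a removes every unused image, not only dangling ones
ExecStart=/usr/bin/docker system prune -af --volumes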

# /etc/systemd/system/docker-cleanup.timer
[Unit]
Description=Run Docker Cleanup Daily at 3 AM

[Timer]
OnCalendar=*-*-* 03:00:00
Persistent=true

[Install]
WantedBy=timers.target

Thanks to this simple automation, we got rid of up to 150 GB of useless Docker cache accumulating weekly. Additionally, by using multi-stage builds in pipelines and properly configuring the Docker layer cache mechanism, I reduced build times from 12 minutes to under 2 minutes. Optimizing pipeline runners not only increases developer productivity but also visibly reduces the monthly bill paid to cloud providers.
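
A nightly timer is a blunt instrument, so it can help to also check disk usage before a build starts. The sketch below is a hypothetical pre-build guard, not part of my runner setup as described above; the 80% threshold and the function names are assumptions.

```python
import shutil


def usage_percent(total_bytes: int, used_bytes: int) -> float:
    """Disk usage as a percentage of total capacity."""
    return 100.0 * used_bytes / total_bytes


def needs_cleanup(path: str = "/", threshold_percent: float = 80.0) -> bool:
    """True if the filesystem holding `path` is at or above the threshold.

    On a runner you would typically point this at Docker's data directory.
    """
    usage = shutil.disk_usage(path)
    return usage_percent(usage.total, usage.used) >= threshold_percent


if __name__ == "__main__":
    if needs_cleanup("/"):
        print("Disk is filling up: run the cleanup before starting the build.")
    else:
        print("Disk usage is fine.")
```

Wired into the first job of a pipeline, a check like this turns a mid-build "no space left on device" crash into an early, readable failure.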

Error Detection and Rollback Dynamics: The Speed vs. Safety Trade-off

When a deployment fails, the most important metric is Mean Time to Recover (MTTR), the time it takes to get the system back up and running. Your deployment strategy directly determines this rollback duration. In a blue-green architecture, rollback consists of routing traffic back to the old environment (blue) at the load balancer level, which takes milliseconds. In a rolling update, the old containers are already gone: the previous Docker image has to be started again (and pulled from the registry if it is no longer cached on the host), which can take 5-10 minutes depending on the size of the application.

However, there is a highly critical edge case here: the database state. If you notice an error and roll back 10 minutes after the new deployment (v2) starts running and writing data in the new format to the database, the old code (v1) might crash when trying to read this new data. Therefore, while rolling back code (code rollback) is very easy, rolling back data (data rollback) is incredibly difficult and sometimes requires manual intervention.
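
One defensive measure is a "tolerant reader": before v2 ships, the old code is taught to accept the new data format, so a code rollback does not trip over rows v2 has already written. The formats below are invented for illustration only (v1 stores a plain-text address, v2 stores a JSON object with `street` and `city`); they are not taken from the ERP project.

```python
import json


def read_delivery_address(raw: str) -> str:
    """Return a printable address from either the legacy or the new format."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return raw  # legacy plain-text value written by v1
    if isinstance(value, dict):
        # Hypothetical v2 structure: {"street": ..., "city": ...}
        parts = [value.get("street"), value.get("city")]
        return ", ".join(p for p in parts if p)
    return raw  # valid JSON but not an object, e.g. a bare number


print(read_delivery_address("Ataturk Cad. 5, Ankara"))
print(read_delivery_address('{"street": "Ataturk Cad. 5", "city": "Ankara"}'))
```

This only helps if the tolerant version is already in production before v2 goes out, which is one more reason format changes need two deployments instead of one.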

The following bash script is a simple rollback automation showing how to automatically revert to the previous stable version when the new version's health check keeps failing in a Docker Compose-based deployment:

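#!/bin/bash
# rollback.sh - Automatic Rollback Script
set -euo pipefail

APP_NAME="uretim_backend"
HEALTH_CHECK_URL="http://localhost:8000/health"
MAX_RETRIES=5
DELAY=10
# Assumption: docker-compose.prod.yml references the image tag as ${IMAGE_TAG},
# and the pipeline passes the last known-good tag in PREVIOUS_TAG.
PREVIOUS_TAG="${PREVIOUS_TAG:?Set PREVIOUS_TAG to the last stable image tag}"

echo "Checking health status of the new version..."

for ((i=1; i<=MAX_RETRIES; i++)); do
    # curl prints 000 when the connection fails; "|| true" keeps set -e from aborting
    STATUS_CODE=$(curl -o /dev/null -s -w "%{http_code}" --max-time 5 "$HEALTH_CHECK_URL" || true)
    if [ "$STATUS_CODE" = "200" ]; then
        echo "Health check successful (HTTP 200)."
        exit 0
    fi
    echo "Attempt $i failed (HTTP $STATUS_CODE). Waiting $DELAY seconds..."
    sleep "$DELAY"
done

echo "Critical Error: New version is unstable! Initiating rollback..."
# Pull and restart the previous image tag
export IMAGE_TAG="$PREVIOUS_TAG"
docker compose -f docker-compose.prod.yml pull --quiet "$APP_NAME"
docker compose -f docker-compose.prod.yml up -d --no-deps "$APP_NAME"

# Print recent Docker daemon errors from the systemd journal into the pipeline log
journalctl -u docker.service --since "5 minutes ago" | grep -i "error" || true
echo "Rollback completed. Exiting with 1 so the pipeline marks the deploy as failed."
exit 1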

This kind of automation is a lifesaver, especially during midnight deploys. But remember, if you changed the database schema, this script cannot protect you from data inconsistency. That's why the rule is always: deploy the database first and ensure it is backward compatible, then deploy the application code.

My Choice and Pragmatic Decision Matrix: What Do I Use and When?

So, in light of all these trade-offs, which strategy should we choose? My clear position on this is: choose whatever fits the size of the project, the team's budget, and most importantly, your operational capacity. Trying to set up Kubernetes and run canary deployments with a 2-person team just because it looks "cool" is a recipe for disaster. For most medium-sized projects and side products, a rolling update or a simple "planned maintenance" window is far better than the cost and complexity that blue-green brings.

In the table below, you can see the pragmatic decision matrix I created based on my long years of experience:

| Strategy | Extra Hardware Cost | Operational Complexity | Rollback Time | Best Suited Scenario |
|---|---|---|---|---|
| Recreate (With Downtime) | Zero (No extra server needed) | Very Low | Low (Seconds) | Internal tools, test environments, low-budget projects |
| Rolling Update | 10-20% (Temporary overhead) | Medium | Medium (Minutes) | Stateless web applications, budget-constrained projects |
| Blue-Green | 100% (Identical standby environment) | High | Very Fast (Milliseconds) | Critical financial systems, database-independent services |
| Canary | 20-30% (Monitoring & routing overhead) | Very High | Fast (Milliseconds) | Very high-traffic B2C platforms, microservices |
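
To turn the hardware cost column into money, you can apply the percentages to your current bill. The overhead ranges below are copied from the matrix; the base bill of 400 per month and the function name `extra_monthly_cost` are hypothetical.

```python
# Extra hardware cost per strategy as (low, high) fractions of the base bill,
# taken from the decision matrix above.
OVERHEAD_RANGE = {
    "recreate": (0.00, 0.00),
    "rolling": (0.10, 0.20),
    "blue_green": (1.00, 1.00),
    "canary": (0.20, 0.30),
}


def extra_monthly_cost(strategy: str, base_monthly_cost: float) -> tuple[float, float]:
    """Return the (low, high) estimate of additional monthly spend."""
    low, high = OVERHEAD_RANGE[strategy]
    return (round(base_monthly_cost * low, 2), round(base_monthly_cost * high, 2))


base = 400.0  # hypothetical monthly server bill
for name in OVERHEAD_RANGE:
    print(name, extra_monthly_cost(name, base))
```

The numbers cover hardware only; the engineering hours behind the "Operational Complexity" column are usually the larger item and do not show up on the cloud invoice.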

In my own projects, I generally prefer the Rolling Update strategy. If there is a critical schema change in the database, I put the system into maintenance mode for 1 minute during the lowest traffic hour of the night (usually between 04:00 and 05:00), run the migration, and then bring up the new containers. This approach costs me no additional hardware and eliminates the data inconsistency risks that would otherwise keep me awake at night.

I experienced a similar trade-off during a VPS migration process before, and I saw firsthand how choosing the most reliable and simple method instead of the most complex tool pays off in the long run. Remember, the best deployment strategy is the one your team knows best and can intervene in the fastest when an error occurs.
