
Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

Zero Downtime Deployment: An Unnecessary Burden for Simple Projects?

In the software world, the concept of "Zero Downtime Deployment" (ZDD) is almost treated like a holy grail. Everyone wants it, everyone aims for it. But in my own experience, I've seen this: not every project truly needs it. Especially for small and medium-sized businesses, the cost of setting up and maintaining such an architecture can far outweigh its benefits.

For a production ERP or a large e-commerce site, downtime is obviously unacceptable. Even seconds can mean lost revenue. However, for my own side projects or a small internal application for a client, a few minutes of downtime isn't the end of the world. In fact, sometimes this "break" can even signal to users that an update is happening. In this post, I'll explain when I apply these complex deployment strategies and when I decide to just let it go with an "it'll be fine."

What is Zero Downtime Deployment, and What Isn't It?

Zero downtime deployment, as the name suggests, is the process of deploying a new software version without users experiencing any service interruption. Its primary goal is to ensure the application is always accessible during deployment. There are different ways to achieve this: rolling updates, blue-green deployment, canary deployment, and more. Each has its own advantages and disadvantages.

In my experience, ZDD typically requires multiple application instances and a load balancer. The new version runs alongside the old version, and traffic is gradually directed to the new version. So, what isn't ZDD? It doesn't magically fix poorly written code or make a faulty database migration invisible. In fact, it can make such issues even more complex because having multiple versions active simultaneously means more potential points of failure.

ℹ️ Rolling Update Example

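Consider a simple rolling update in which I update multiple service instances defined in a Compose file. I usually set replicas to 2 or more in my docker-compose.yml file so that one container keeps handling traffic while another is being updated. Note that the update_config settings under deploy are honored when the file is deployed to Docker Swarm (docker stack deploy); a plain docker-compose up recreates containers without performing a rolling update. This approach also doesn't cover database schema changes or more complex state management.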

version: '3.8'
services:
  web_app:
    image: my_app:v1.0
    ports:
      - "80:80"
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first
      restart_policy:
        condition: on-failure

In the example above, I've defined three replicas for the web_app service. The update_config section updates one replica at a time (parallelism: 1), waits 10 seconds between replicas (delay: 10s), and starts each new container before stopping the old one (order: start-first), so the remaining replicas keep serving traffic throughout. This can provide a near-zero downtime experience as long as the application itself is stateless. However, when you introduce state management or database interactions, this simple approach becomes insufficient.
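To make the start-first ordering concrete, here is a minimal Python sketch that simulates the update sequence instead of calling Docker. The function name and the (old, new) snapshot format are illustrative assumptions; only replicas: 3, parallelism: 1, and order: start-first come from the config above. It also assumes every new instance becomes healthy immediately, which a real deployment cannot guarantee.

```python
def simulate_start_first(replicas: int, parallelism: int = 1) -> list[tuple[int, int]]:
    """Simulate a start-first rolling update.

    Returns (old_running, new_running) snapshots taken after every
    start or stop step, so the serving capacity can be inspected.
    """
    old, new = replicas, 0
    snapshots = [(old, new)]
    while old > 0:
        batch = min(parallelism, old)
        new += batch  # start-first: new instances come up first...
        snapshots.append((old, new))
        old -= batch  # ...and only then are the old ones stopped
        snapshots.append((old, new))
    return snapshots


steps = simulate_start_first(replicas=3, parallelism=1)
min_capacity = min(old + new for old, new in steps)
print(steps)
print("minimum serving instances:", min_capacity)
```

In this simulation the number of running instances never drops below the configured replica count, at the price of briefly running one extra instance.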

Rolling updates work quite well when your applications are stateless and each instance operates independently. However, in many projects, especially enterprise software, I encounter situations where the application's state or database schema changes. This is where the complexity of zero downtime deployment grows sharply.

Cost Calculation: What Goes into a Simple Project?

When you decide to implement zero downtime deployment, you quickly realize it's not just a matter of "setting it up and it's done." Especially for simple projects, it can become an unnecessary burden. I generally categorize this burden into three main areas: infrastructure cost, complexity cost, and human cost.

Infrastructure cost means additional servers and load balancers. For blue-green deployment, you might need to allocate twice the resources of your existing application. For example, for a side project running on my own VPS, 2GB RAM and 2 vCPUs might be sufficient, but for zero downtime deployment, I'd need to allocate at least double. This could increase monthly bills by 100%. While this cost might be negligible for an internal platform of a bank, it's a significant expense for a personal project or an SME.

⚠️ Resource Usage Comparison

Let's consider a simple web application. While a single instance might use 1GB RAM and 1 vCPU, for blue-green deployment, you might temporarily need double that, i.e., 2GB RAM and 2 vCPUs. This directly impacts operational costs.

  • Single Instance: 1GB RAM, 1 vCPU
  • Blue-Green Deployment (temporary): 2GB RAM, 2 vCPU

This increase can make a noticeable difference, especially in cloud services with hourly or per-second billing. My VPS bill, which was around $20 per month for a project, could have risen to $40 with this strategy. That kind of jump can strain the budget of a simple project.
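The comparison can be written down as simple arithmetic. The sketch below uses the $20 and $40 figures from the text; the number of deploys, their duration, and the cost per minute of downtime are hypothetical inputs that you would replace with your own estimates.

```python
def monthly_zdd_premium(single_cost: float, zdd_cost: float) -> float:
    """Extra monthly spend caused by the zero downtime setup."""
    return zdd_cost - single_cost


def downtime_cost(minutes_per_month: float, cost_per_minute: float) -> float:
    """Estimated monthly cost of accepting planned downtime."""
    return minutes_per_month * cost_per_minute


# Figures from the text: $20/month single instance, $40/month with blue-green.
premium = monthly_zdd_premium(20.0, 40.0)

# Hypothetical inputs: four 2-minute deploys a month, $0.50 lost per minute.
accepted = downtime_cost(minutes_per_month=8, cost_per_minute=0.50)

print(f"ZDD premium: ${premium:.2f}/month")
print(f"Accepted downtime cost: ${accepted:.2f}/month")
print("ZDD pays off" if accepted > premium else "Planned downtime is cheaper")
```

The calculation deliberately ignores the complexity and human costs, which often weigh more than the infrastructure bill.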

Complexity cost covers everything from changes in your CI/CD pipeline to developing rollback strategies. You need to automate the deployment process and set up automatic rollback mechanisms for error situations. All of this means additional development and testing time. On top of that, there's the extra workload of ensuring API or database compatibility between the current and new versions. This can significantly slow down development speed, especially for small teams.

Finally, there's the human cost: the time developers and operations teams spend setting up, understanding, troubleshooting, and maintaining these complex systems. This varies with the project's size and the team's experience, but it often takes longer than expected. Last month, I spent 3 days setting up blue-green deployment for a simple application, including infrastructure automation and tests. In the end, I realized I had spent on infrastructure the time I should have dedicated to the application's primary functionality. This is a trade-off I try to avoid in simple projects.

My "It'll Be Fine" Philosophy and Minor Downtimes

In my 20 years of field experience, I've learned that aiming for 100% uptime isn't always realistic. Especially for small, internally used, or personal projects, accepting a certain level of downtime can be a smart strategy to avoid unnecessary engineering and costs. I call this the "it'll be fine" philosophy.

Consider a reporting tool I developed for a client's accounting department. This tool is only used during business hours and at specific intervals. A 5-minute downtime at midnight or on a weekend doesn't affect anyone. In fact, a short downtime can even help users notice that a new version of the application is available. Users might think, "Ah, the application seems to be updating," and access the new features a few minutes later. The important thing is that this downtime is communicated in advance and its duration is reasonable.

💡 Acceptable Downtime Management for Simple Projects

When updating the backend of my own side projects, I usually follow these steps:

  1. Maintenance Page: I enable a simple maintenance page (maintenance.html) via Nginx.
  2. Stop Application: I stop the current service with sudo systemctl stop my_app.service.
  3. Deploy: I pull the new code, perform necessary migrations, and install the new version of the application.
  4. Start Application: I start the service again with sudo systemctl start my_app.service.
  5. Remove Maintenance Page: I revert the Nginx configuration.

This process typically takes between 30 seconds and 2 minutes. Performing this during evening hours after 9 PM or on weekends doesn't negatively impact my users.

# Nginx config for maintenance page
# Temporarily uncomment this block during deployment
# (the location blocks must sit inside the server block)
# server {
#     listen 80;
#     server_name example.com;
#     return 503;
#     error_page 503 @maintenance;
#
#     location @maintenance {
#         root /var/www/html;
#         rewrite ^(.*)$ /maintenance.html break;
#     }
#
#     location / {
#         proxy_pass http://localhost:8000;
#         # ... other proxy settings
#     }
# }

The Nginx configuration above lets me answer every request with a 503 status and a simple HTML page during deployment. This lets me work comfortably in the background while displaying a friendly message to users.
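The five steps can be strung together in a small runner. This is a minimal sketch, not a finished deploy tool: the unit name my_app.service comes from the steps above, while the /usr/local/bin/maintenance toggle and /usr/local/bin/deploy_my_app scripts are hypothetical placeholders for however you switch the Nginx config and install the new version.

```python
import subprocess

# Hypothetical commands: the unit name comes from the steps above, while the
# maintenance toggle script and deploy script are placeholders for your own.
DEPLOY_STEPS = [
    ("enable maintenance page", ["sudo", "/usr/local/bin/maintenance", "on"]),
    ("stop application", ["sudo", "systemctl", "stop", "my_app.service"]),
    ("deploy new version", ["/usr/local/bin/deploy_my_app"]),
    ("start application", ["sudo", "systemctl", "start", "my_app.service"]),
    ("disable maintenance page", ["sudo", "/usr/local/bin/maintenance", "off"]),
]


def run_deploy(steps, dry_run=True):
    """Run each step in order and stop at the first failure.

    With dry_run=True the commands are only printed, never executed.
    Returns the list of step names that completed.
    """
    completed = []
    for name, command in steps:
        print(f"[{name}] {' '.join(command)}")
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # which leaves the maintenance page up for manual inspection.
            subprocess.run(command, check=True)
        completed.append(name)
    return completed


done = run_deploy(DEPLOY_STEPS, dry_run=True)
```

Stopping at the first failed step is a deliberate choice here: if the deploy step breaks, users keep seeing the maintenance page rather than a half-updated application.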

Managing user expectations is critical here. If I'm releasing an update for a mobile app (like the spam blocker I developed on the Android side), the Play Store process itself can take several hours or even days. In such cases, a short downtime in the backend won't affect the overall user experience because the frontend is already updating gradually. The key is to make these decisions based on the nature of the project, the user base, and the business criticality level. Trying to fit every project into the same mold often creates more problems.

Database Migrations and the Backward Compatibility Conundrum

One of the biggest challenges of zero downtime deployment is managing database schema changes. When you update application code and run both old and new versions simultaneously, both must be compatible with the same database. This brings the concept of backward compatibility to the forefront and makes things incredibly complex.

When I worked on a production ERP, I frequently encountered this issue. When we developed a new feature, we often needed to add a new column or change the type of an existing column. When aiming for zero downtime, we had to break these changes into atomic and backward-compatible steps. For example, to add a new column, I would follow these steps:

  1. Add the new column as nullable. (The old application ignores this column.)
  2. Deploy the new version of the application. (Both old and new applications can run.)
  3. The new application starts writing data to the new column. (The old application still writes to the old column or leaves it blank.)
  4. Perform data synchronization (migrate data from the old column to the new column).
  5. Completely disable the old application.
  6. Make the new column non-nullable or drop the old column.

🔥 PostgreSQL Schema Change and Backward Compatibility

In one scenario, I needed to add a new column named sku_code to the products table. However, since the old application didn't recognize this column, I couldn't add it directly as NOT NULL.

Incorrect Approach (Causes Downtime):

ALTER TABLE products ADD COLUMN sku_code VARCHAR(50) NOT NULL UNIQUE;

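On a table that already contains rows, this command fails because the existing rows have no sku_code value. Even on an empty table it takes an exclusive lock while it runs, and afterwards the old application's INSERT statements start failing because they don't supply the new column.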

Backward-Compatible Approach (For Zero Downtime):

-- Step 1: Add the column as nullable
ALTER TABLE products ADD COLUMN sku_code VARCHAR(50);

-- Step 2: Deploy the new application version.
-- The new application starts populating the new column.
-- The old application ignores sku_code.

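-- Step 3: Backfill existing rows (required before Step 4).
-- generate_sku_code_from_name is a placeholder for your own logic.
-- UPDATE products SET sku_code = generate_sku_code_from_name(name) WHERE sku_code IS NULL;

-- Step 4: After the old application is completely removed, enforce the constraints.
-- Build the unique index without blocking writes, then attach it as a constraint.
-- (CREATE INDEX CONCURRENTLY cannot run inside a transaction block.)
CREATE UNIQUE INDEX CONCURRENTLY unique_sku_code ON products (sku_code);
ALTER TABLE products ADD CONSTRAINT unique_sku_code UNIQUE USING INDEX unique_sku_code;
-- SET NOT NULL scans the table under an exclusive lock; acceptable on small tables.
ALTER TABLE products ALTER COLUMN sku_code SET NOT NULL;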

This multi-step process, while safe, extends development and deployment time. For a simple project, this effort is usually not worth it. In my personal projects, I don't go into this much detail; I just do a quick downtime and directly run ALTER TABLE.
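The sequence can be tried out locally with a minimal, runnable sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for PostgreSQL, so the locking behavior is not represented, and the 'SKU-' || id backfill rule is a placeholder assumption rather than a real SKU scheme.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO products (name) VALUES ('Blue Widget')")

# Step 1 (expand): add the column as nullable.
conn.execute("ALTER TABLE products ADD COLUMN sku_code TEXT")

# The old application does not know about sku_code; its INSERT still works.
conn.execute("INSERT INTO products (name) VALUES ('Red Widget')")

# Step 2: the new application writes the column explicitly.
conn.execute("INSERT INTO products (name, sku_code) VALUES ('Green Widget', 'SKU-GREEN')")

# Step 3 (backfill): placeholder rule, derive a SKU from the row id.
conn.execute("UPDATE products SET sku_code = 'SKU-' || id WHERE sku_code IS NULL")

# Step 4 (contract): only safe once no NULLs remain; then enforce uniqueness.
remaining_nulls = conn.execute(
    "SELECT COUNT(*) FROM products WHERE sku_code IS NULL"
).fetchone()[0]
conn.execute("CREATE UNIQUE INDEX unique_sku_code ON products (sku_code)")

rows = conn.execute("SELECT id, name, sku_code FROM products ORDER BY id").fetchall()
print(remaining_nulls, rows)
```

The point to notice is the middle of the script: between the expand and contract steps, writes from both the "old" and the "new" application succeed against the same table.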

This "add column, deploy new app, drop old column" dance is a necessity, especially for complex database schemas and high-throughput systems. However, it also introduces a significant engineering burden. Turning every database change into such a multi-step plan slows down development speed and increases the potential for errors. In simple projects, accepting a few minutes of downtime during database migrations completely eliminates this complexity and allows the team to move faster. I see similar trade-offs in monolith vs. microservice choices; the most sophisticated solution isn't always the best.

Automation and Observability Burden

Zero downtime deployment doesn't stop at infrastructure and database compatibility. To successfully implement such a strategy, you need robust automation and a comprehensive observability infrastructure. Without these, zero downtime deployment becomes a gamble and can lead to bigger problems instead of preventing downtime.

Automation means your CI/CD pipeline runs flawlessly. Code changes must be automatically tested, built, packaged into images, and deployed to the target environment. Complex strategies like rolling updates or blue-green deployments must be correctly orchestrated within this pipeline. When I was developing an ERP for a manufacturing firm, setting up and stabilizing this pipeline took us months. We integrated automatic health checks, rollback mechanisms, and notifications at every step. For a simple project, trying to set up this level of automation often feels like "swatting a fly with an elephant gun."

ℹ️ Prometheus Health Check Example

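One of the most critical steps in zero downtime deployment is quickly detecting whether the newly deployed application is healthy. As an example, let me show how an application's health can be monitored with Prometheus. Let's say your application returns 200 OK at the /health endpoint. Prometheus itself only scrapes endpoints that expose metrics in its text format, so a plain health endpoint is typically probed through the Blackbox Exporter, which reports the result as the probe_success metric. The exporter address below (blackbox-exporter:9115) is an assumption; adjust it to your setup.

# prometheus.yml config snippet
scrape_configs:
  - job_name: 'my_app_health_checks'
    metrics_path: '/probe'
    params:
      module: [http_2xx]  # module must be defined in the exporter's blackbox.yml
    static_configs:
      - targets: ['http://my-app-instance-1:8000/health', 'http://my-app-instance-2:8000/health']
    relabel_configs:
      # Pass the health URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter instead of the application itself
      - target_label: __address__
        replacement: 'blackbox-exporter:9115'

With this configuration, Prometheus regularly checks the health status of each application instance. If a new instance starts returning an error code from the /health endpoint during a deployment, an alert fires, and an automatic rollback process can be triggered from it (for example, through an Alertmanager webhook).

# alert.rules.yml snippet for a failing health check
groups:
  - name: my_app_health_alerts
    rules:
      - alert: ApplicationHealthCheckFailed
        # probe_success is 1 when /health answered with a 2xx status, 0 otherwise
        expr: probe_success{job="my_app_health_checks"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Application health check failed ({{ $labels.instance }})"
          description: "The application's health check has been failing for over 1 minute. Check the deployment status."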

This type of alert system enhances deployment safety but requires additional time for setup and maintenance. For a small project, setting up this level of observability infrastructure is often overkill.
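For a small project, the same idea can be approximated without a monitoring stack. The sketch below polls the /health endpoint a few times after a deploy and decides whether to roll back. The function names, the thresholds, and the host name in the usage comment are illustrative assumptions, and what "rolling back" actually does is left to your own deploy script.

```python
import time
import urllib.error
import urllib.request


def http_health_check(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def should_roll_back(check, attempts: int = 5, max_failures: int = 2,
                     interval: float = 0.0) -> bool:
    """Probe the new version several times and decide whether to roll back.

    `check` is any zero-argument callable returning True when healthy.
    """
    failures = 0
    for _ in range(attempts):
        if not check():
            failures += 1
            if failures > max_failures:
                return True
        time.sleep(interval)
    return False


# Usage against a real instance (hypothetical host name):
# should_roll_back(lambda: http_health_check("http://my-app-instance-1:8000/health"), interval=5)

# Self-contained demonstration with fake checks:
print(should_roll_back(lambda: True))
print(should_roll_back(lambda: False))
```

Tolerating a couple of failed probes before rolling back avoids reacting to a single slow response while the new version is still warming up.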

Observability, on the other hand, means continuously monitoring your application's performance and health. Through metrics, logs, and traces, you need to be able to see quickly whether the newly deployed version is performing worse than the old one or producing unexpected errors. This information allows you to make a quick rollback decision when a problem arises. For my own side projects, I get by with basic log tracking and a few fundamental metrics, while for enterprise projects, I use detailed tracing and error budgeting. Setting up that level of complexity for a small project can require more effort than the project itself. Last month, while configuring Redis memory eviction policy settings on a VPS, I noticed a performance regression late because I hadn't collected the right metrics. This reminded me once again of the importance of observability, but it also showed me that not every project needs the same level of detail.

Alternative Approaches: Practical Solutions for Simple Projects

Not every project needs zero downtime deployment; that's my firm stance. Especially for small and medium-sized projects, avoiding the burden of ZDD while still providing a good user experience is possible with more practical and simpler alternatives. These alternatives keep costs low while making the deployment process manageable and predictable.

The first approach is to deploy with planned downtime. This is very effective for internally used applications, or during time windows when users in a specific geographical region are not active. Users are informed in advance ("Maintenance work will be performed on Saturday night between 02:00 and 03:00") and the application remains unavailable during this period. This way, the deployment team works with less stress, and there's no need to set up complex ZDD mechanisms. When updating the backend of my financial calculators, I use this method; a 10-minute downtime during night hours doesn't bother anyone.

💡 User Notification with a Simple Maintenance Page

Using Nginx to display a simple maintenance page is one of the easiest ways to inform users during planned downtimes.

# /etc/nginx/sites-available/my_app
server {
    listen 80;
    server_name example.com;

    # Uncomment the following lines to enable maintenance mode
    # return 503;
    # error_page 503 @maintenance;

    # location @maintenance {
    #     root /var/www/html; # Directory where the maintenance page is located
    #     rewrite ^(.*)$ /maintenance.html break;
    # }

    location / {
        proxy_pass http://localhost:8000; # Address where the application is running
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

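Uncommenting the return 503; and error_page lines together with the location @maintenance block before deployment, then commenting them out again afterwards, lets you enable and disable maintenance mode in seconds. Each change takes effect after reloading Nginx (for example, sudo nginx -t && sudo systemctl reload nginx). A simple HTML page at /var/www/html/maintenance.html is sufficient.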

The second alternative is to focus on fast rollback mechanisms. If a problem occurs during deployment, you should be able to quickly revert to the previous stable version. This is usually achieved by keeping deployed images (like Docker images) or virtual machine snapshots ready. Before making a new deployment, saving the old version's components (application code, database backup, configuration) as a "rollback point" can prevent potential disasters. When updating my Docker Compose applications running on my VPS, I always keep the previous docker-compose.yml file and application images ready. This way, if a problem occurs, I can revert with a few docker-compose down && docker-compose up -d commands.
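One way to keep a "rollback point" ready is to record every deployed image tag and always know which one came before. This is a minimal sketch under stated assumptions: the history list, the v1.1 tag, and the APP_TAG variable (which docker-compose.yml would reference as image: my_app:${APP_TAG}) are hypothetical and not part of the setup described above.

```python
def record_release(history: list[str], tag: str, keep: int = 5) -> list[str]:
    """Append a deployed image tag and keep only the most recent `keep` entries."""
    return (history + [tag])[-keep:]


def rollback_target(history: list[str]) -> str:
    """Return the tag that was deployed immediately before the current one."""
    if len(history) < 2:
        raise ValueError("no previous release to roll back to")
    return history[-2]


def rollback_commands(tag: str) -> list[str]:
    """Build the shell commands for reverting to `tag`.

    APP_TAG is a hypothetical variable substituted into docker-compose.yml.
    """
    return [
        "docker-compose down",
        f"APP_TAG={tag} docker-compose up -d",
    ]


history: list[str] = []
history = record_release(history, "v1.0")
history = record_release(history, "v1.1")

target = rollback_target(history)
commands = rollback_commands(target)
print(target)
print("\n".join(commands))
```

In practice the history would live in a small file next to docker-compose.yml, and a database backup taken before the deploy belongs to the same rollback point.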

These approaches significantly reduce the complexity and cost associated with zero downtime deployment. By focusing on the project's actual needs, it's possible to avoid unnecessary engineering and still achieve a reliable, manageable deployment process. Especially for projects requiring rapid iteration, this simplicity offers me more flexibility.

Conclusion: Using the Right Tool for the Right Job

Zero downtime deployment is, without a doubt, a powerful and important strategy. It's indispensable for large-scale, business-critical systems. However, what I've seen in my 20 years of field experience is that not every project needs this level of complexity. Especially for small and medium-sized projects, the cost of setting up and maintaining zero downtime deployment can far exceed the benefits it brings.

The cost calculations, backward compatibility challenges brought by database migrations, and the necessity of comprehensive automation and observability infrastructure create unnecessary burdens for simple projects. These burdens slow down development speed, increase operational complexity, and strain budgets. My "it'll be fine" philosophy comes into play in such situations. It's possible to manage deployment processes without significantly impacting users and without performing unnecessary engineering, using planned downtimes, simple maintenance pages, or fast rollback mechanisms.

In summary, instead of always trying to implement the most sophisticated solution, it's best to adopt a pragmatic approach by considering the project's real needs, user expectations, and available resources. Zero downtime deployment is the right tool, but not for every job. In my next post, I'll discuss the VLAN tagging complexities I encountered while doing network segmentation and the insidious problems they brought.
