DEV Community

Poojan

Originally published at poojan.technokari.com

Zero-Downtime Deployments with PM2 and NGINX

When your platform processes meal orders across busy institutional cafeterias, downtime isn't just inconvenient: people don't get fed. High availability is a hard operational requirement, not just a product target.

The original deployment process was: SSH into the server, git pull, npm install, and pm2 restart all. This caused a 10–30 second outage on every deploy, during which our API dropped orders and left terminals hanging.

Here is exactly how I set up graceful, zero-downtime deployments for our Node.js server cluster using PM2's native clustering and NGINX upstream proxies.


1. PM2 Cluster Configuration

The foundation of zero-downtime is simple: never let the number of active app instances drop to zero. PM2's cluster mode makes it straightforward to scale your process across all cores:

// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'mealpe-api',
    script: './dist/server.js',
    instances: 'max', // Scale to all available CPU cores
    exec_mode: 'cluster',
    max_memory_restart: '500M',
    listen_timeout: 10000, // Max 10s for a new process to come up before the reload proceeds
    kill_timeout: 5000 // 5s between SIGINT and the follow-up SIGKILL
  }]
};

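  • instances: 'max': Spawns one worker process per CPU core.
  • listen_timeout: The maximum time (in milliseconds) PM2 waits for a new process to start listening before it forces the reload to proceed. On its own it does not wait for a database connection; for that, PM2 needs wait_ready: true and the app must call process.send('ready') once its dependencies are up.
  • kill_timeout: The time PM2 waits after sending the shutdown signal before it sends SIGKILL, here giving each process 5 seconds to finish in-flight requests.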
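
If a worker should only count as online once its database pool or other dependencies are up, PM2's wait_ready option can be paired with an explicit ready message. The following is a minimal sketch, assuming wait_ready: true is added to the app entry in ecosystem.config.js. connectDependencies is a hypothetical placeholder for the real startup work, and signalReady and start are illustrative names, not part of the original server.js.

```javascript
const http = require('node:http');

// Hypothetical placeholder for real startup work (DB pool warm-up, cache priming).
async function connectDependencies() {
  return true;
}

// Sends the 'ready' message PM2 waits for when wait_ready is enabled.
// process.send only exists when the process has an IPC channel (as under PM2),
// so a plain `node server.js` run simply skips the message.
function signalReady(proc = process) {
  if (typeof proc.send === 'function') {
    proc.send('ready');
    return true;
  }
  return false;
}

// Boot order: dependencies first, then listen, then report readiness.
async function start(port = 3000, proc = process) {
  await connectDependencies();
  const server = http.createServer((req, res) => res.end('ok'));
  await new Promise((resolve) => server.listen(port, resolve));
  signalReady(proc);
  return server;
}
```

Calling start() at the bottom of the entry file would boot the worker. With this in place, listen_timeout acts as the upper bound on how long PM2 waits for the ready message.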

2. Transitioning to Graceful Reloads

Instead of calling pm2 restart, which kills and restarts the processes and leaves a window with nothing listening, we switch to pm2 reload.

Reload performs a rolling update: it spawns a new instance, waits for it to come online, then shuts down an old one. It repeats this process by process, so API capacity never drops below the normal instance count:

#!/bin/bash
# Production deploy script (deploy.sh)
set -e

echo "Pulling latest branch code..."
git pull origin main

echo "Installing production-only dependencies..."
npm ci --omit=dev

echo "Executing rolling reload..."
pm2 reload ecosystem.config.js --update-env

echo "Deploy successfully completed!"
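
To make the difference between restart and reload concrete, here is a toy model that only counts live workers at each step. It follows the description in this section and does not call PM2 or reproduce its internals; restartAll, rollingReload, and lowest are illustrative names.

```javascript
// Toy model only: tracks how many workers are live after each step.

// Restart: every old worker is stopped before any replacement is up.
function restartAll(workerCount) {
  const liveCounts = [workerCount];
  liveCounts.push(0);           // all old workers killed
  liveCounts.push(workerCount); // replacements finish booting
  return liveCounts;
}

// Rolling reload: bring one replacement online, then retire one old worker,
// and repeat once per worker.
function rollingReload(workerCount) {
  const liveCounts = [workerCount];
  for (let i = 0; i < workerCount; i += 1) {
    liveCounts.push(workerCount + 1); // new worker online next to the old ones
    liveCounts.push(workerCount);     // one old worker retired
  }
  return liveCounts;
}

const lowest = (counts) => Math.min(...counts);

console.log(lowest(restartAll(4)));    // 0
console.log(lowest(rollingReload(4))); // 4
```

The restart path passes through zero live workers, which is the outage window; the rolling path never goes below the starting count.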


3. Implementing Application Graceful Shutdowns

PM2 sends a SIGINT signal to your process before shutting it down, and follows up with SIGKILL once kill_timeout elapses. If your application doesn't handle SIGINT, it terminates instantly, dropping all connections mid-transaction.

You must catch the SIGINT event, close the HTTP port to block new inbound traffic, finish active queries, and release database pools:

// Graceful shutdown hook in server.js (`server` is the HTTP server, `db` the database pool)
process.on('SIGINT', () => {
  console.log('SIGINT signal received. Starting graceful shutdown sequence...');

  // Stop the HTTP server from accepting new socket sessions
  server.close(async () => {
    console.log('HTTP server successfully closed.');

    try {
      // Release database connection pools cleanly
      await db.end();
      console.log('Database pools released. Exiting cleanly.');
      process.exit(0);
    } catch (err) {
      console.error('Error during database teardown:', err);
      process.exit(1);
    }
  });

  // Force exit after 4 seconds if connections hang; this must fire before
  // PM2's 5-second kill_timeout, or PM2 sends SIGKILL first
  setTimeout(() => {
    console.warn('Forced shutdown: connections did not close in time.');
    process.exit(1);
  }, 4000).unref();
});
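
One detail worth sketching: server.close() only finishes once open connections end, and idle keep-alive sockets (such as the ones NGINX holds open to its upstream) can delay that. The helper below is a hedged sketch, not the original server.js code. closeWithDeadline is a hypothetical name, and it assumes Node 18.2 or later for closeIdleConnections (newer Node versions close idle connections inside close() already).

```javascript
const http = require('node:http');

// Resolves 'closed' if the server drains in time, 'timeout' otherwise.
function closeWithDeadline(server, deadlineMs) {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve('timeout'), deadlineMs);

    // Stop accepting new connections; the callback runs once existing ones end.
    server.close(() => {
      clearTimeout(timer);
      resolve('closed');
    });

    // Drop idle keep-alive sockets so they do not hold close() open.
    // Guarded because the method is missing on older Node versions.
    if (typeof server.closeIdleConnections === 'function') {
      server.closeIdleConnections();
    }
  });
}
```

Inside a SIGINT handler this could be awaited with a deadline shorter than kill_timeout, for example closeWithDeadline(server, 4000), before releasing the database pool.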


4. Configuring NGINX Load-Balancing

NGINX is our front-door gateway. The key configuration for zero-downtime routing is defining an upstream pool and telling NGINX when to pass a failed request to another server in that pool:

upstream mealpe_backend {
    server 127.0.0.1:3000;
    keepalive 64; # Keep connection channels open to reduce latency
}

server {
    listen 443 ssl http2;
    server_name api.mealpe.in;

    location / {
        proxy_pass http://mealpe_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # On an error, timeout, 502, or 503, pass the request to the next server in the upstream group
        proxy_next_upstream error timeout http_502 http_503;
        proxy_connect_timeout 3s;
        proxy_read_timeout 15s;
    }
}

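
proxy_next_upstream tells NGINX to pass a failed request to the next server in the upstream group when it hits a connection error, a timeout, or a 502/503 response. Two caveats apply. First, the retry needs another server in the group: with the single 127.0.0.1:3000 entry above (PM2's cluster workers all share that one port) there is no next server to try. Second, NGINX does not retry non-idempotent requests such as POST once they have been sent upstream unless non_idempotent is added, and retrying an order submission risks duplicating it. In this setup, the rolling reload and the graceful shutdown are what keep clients unaware of a deploy; proxy_next_upstream becomes a real safety net once the upstream lists more than one server.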


5. Webhook CI/CD Automation

To trigger deploys securely, we run a lightweight deployment daemon on our EC2 instance that exposes an authenticated webhook route. When a PR merges into main on GitHub, our GitHub Actions pipeline runs the tests, builds the TypeScript bundle, and calls the webhook:

# GitHub Actions CI deployment step
- name: Trigger Server Deploy Webhook
  run: |
    curl --fail -X POST \
      -H "Authorization: Bearer ${{ secrets.DEPLOY_SECRET }}" \
      https://api.mealpe.in/webhooks/deploy


The daemon compares the bearer token (a shared secret) against the expected value in constant time before running anything. Only if the token matches does it run deploy.sh locally on the server.
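
The daemon's code is not shown here, so the following is only a sketch of what a constant-time bearer check can look like in Node.js, using crypto.timingSafeEqual. isAuthorized is a hypothetical function name, and hashing both values first is one way (an assumption, not necessarily what the daemon does) to give timingSafeEqual the equal-length buffers it requires.

```javascript
const crypto = require('node:crypto');

// Returns true only when the Authorization header carries the expected secret.
function isAuthorized(authorizationHeader, expectedSecret) {
  if (typeof authorizationHeader !== 'string') return false;

  const prefix = 'Bearer ';
  if (!authorizationHeader.startsWith(prefix)) return false;
  const presented = authorizationHeader.slice(prefix.length);

  // timingSafeEqual throws on buffers of different lengths, so compare
  // fixed-size SHA-256 digests instead of the raw strings.
  const digest = (value) => crypto.createHash('sha256').update(value).digest();
  return crypto.timingSafeEqual(digest(presented), digest(expectedSecret));
}

console.log(isAuthorized('Bearer s3cret', 's3cret')); // true
console.log(isAuthorized('Bearer wrong', 's3cret'));  // false
```

A plain === comparison can return sooner the earlier two strings differ, which leaks timing information; the constant-time comparison avoids that.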


The Outcome

  • 0 seconds of user-facing downtime recorded.
  • 45-second deployment pipeline from git merge to production availability.
  • Confidence to deploy minor changes and hotfixes safely during standard hours.
