
Blue/Green Node.js Deploys with NGINX

Justin ・12 min read

I recently faced a situation where I needed to deploy Node.js apps to my own servers¹. When I started this endeavor, I tried to find helpful material so that I didn't have to figure it all out myself, but all I could find was "use NGINX" and "probably use pm2." Those were helpful recommendations, but they still left a lot of details for me to figure out. In this post, I'll discuss the issues I faced and the solutions I chose, in the hope that it helps someone else facing similar problems in the future.

We'll cover the following topics:

  • Requirements
  • Implementation
  • Verifying Host Keys
  • Remotely Executing a Deploy Script on the VMs
  • Managing the Node.js Processes with PM2
  • Blue/Green Deploys
  • Parallel Deploys
  • Reusable Private GitHub Action
  • Scrubbing Secrets in GitHub Action Logs

Requirements

  • Zero-downtime deploys. I could easily justify to management that it's too complicated and we must have a maintenance window, but zero-downtime deploys are expected these days, especially for front-end apps. For my own sake (my pride and my conscience), I wanted to make this happen.
  • Automatically deploy whenever the master branch is updated. I don't know how common this is but I've been doing this for years with Heroku and I can't imagine any other way of developing. Manually triggering deploys feels archaic.
  • Deploy to existing machines. The deploy targets would be a set of production VMs that are currently in use. I did not have the option of using new VMs and swapping out the old ones.

Implementation

We already used GitHub Actions to run tests against all PRs so I figured we'd also use them to trigger deploys when the master branch is updated.

Conceptually, I imagined the process would look something like this:

  • A push to master triggers a deploy
  • Connect to all deploy targets (servers) and run a script that installs and runs the new code
  • Divert traffic from the old code to the new code
  • Cleanup the old code

It took me 3-4 days to get from that high-level outline to the final implementation. I'll explain where I ended up and why I made certain choices.

Verifying Host Keys

One of the first issues I ran into was verifying the host keys. When you first ssh into a machine, a prompt asks you whether you trust the remote server's key. But I was running this in a script so I needed to avoid that prompt. You can disable it, but that's considered dangerous because of potential man-in-the-middle attacks. An alternative is to use ssh-keyscan to automatically add the remote keys to your trusted list.

ssh-keyscan "$IP" >> ~/.ssh/known_hosts

But I don't see how that's any more secure. Either way, you're blindly trusting the IP. What are the alternatives? Perhaps you could manually run ssh-keyscan once for each host and then store the result in a config that then gets added to known_hosts.
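That "scan once, pin forever" idea might look something like the sketch below. The key line is a placeholder; in practice it would come from running ssh-keyscan once out of band and storing the output in a CI secret.

```shell
# Sketch: pin a previously captured host key instead of trusting whatever
# ssh-keyscan returns at deploy time. The key below is a placeholder.
KNOWN_HOSTS_FILE="$(mktemp)"   # stand-in for ~/.ssh/known_hosts
PINNED_KEY="203.0.113.10 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5PLACEHOLDER"
echo "$PINNED_KEY" >> "$KNOWN_HOSTS_FILE"
grep -c "203.0.113.10" "$KNOWN_HOSTS_FILE"   # → 1
```

A man-in-the-middle would then have to present the pinned key rather than any key, which is the property the interactive prompt was protecting.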

Remotely Executing a Deploy Script on the VMs

I had a list of IPs that were deploy targets and an SSH key. Somehow I needed to run a set of commands on the VMs that would actually perform the deploy. The set of commands started small so I began by using appleboy/ssh-action.

      - name: SSH Commands
        uses: appleboy/ssh-action@v0.1.3
        env:
          GH_TOKEN: ${{ secrets.GH_TOKEN }}
        with:
          host: ${{ secrets.DEPLOY_IP }}
          username: ${{ secrets.DEPLOY_USERNAME }}
          key: ${{ secrets.SSH_KEY }}
          script_stop: true
          envs: GH_TOKEN
          script: |
            cd /srv/bg
            git clone --depth 1 "https://${GH_TOKEN}@github.com/Org/Repo.git"
            cd bg-web
            npm i
            npm run build
            npm run start

But my short list of commands quickly grew and I soon desired to maintain a bash script that would be remotely executed. So I switched to something like this:

      - name: Deploy
        run: | 
          KEY_FILE=$(mktemp)
          echo "${{ secrets.SSH_KEY }}" > "$KEY_FILE"
          ssh -i $KEY_FILE ubuntu@${{ secrets.DEPLOY_IP }} -- < deploy.sh

That worked well. I particularly enjoyed having syntax highlighting while working on the deploy script. But eventually I wanted more, such as logging output of the deploy script to a temporary log file and passing env vars to the script. I decided to just copy the deploy script onto the VM before executing. I already had an SSH key available which made this easy with scp:

# Transfer the deploy script onto the VM so that we can execute it later.
# If we have previously deployed to the VM, an older version of the script will be there and be overwritten with the latest version.
scp -i $KEY_FILE /scripts/deploy.sh ubuntu@$IP:~/

# Execute the deploy script and save the logs to a temp file.
ssh -i $KEY_FILE ubuntu@$IP "tmpfile=\$(mktemp /tmp/deploy.XXXX); echo \"Deploy log for $IP saved in \$tmpfile\"; GH_TOKEN=$GH_TOKEN IP=$IP REPO=$REPO bash deploy.sh > \$tmpfile 2>&1"

That's what I ended up with. The only thing I don't like about it is the list of environment variables (the list is actually a lot longer in the version I'm using). If you know of a better way, please let me know.
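One alternative I've considered (a sketch, untested against the real workflow) is to collect the variables into an env file, copy it next to deploy.sh, and source it on the remote side:

```shell
# Sketch: write the env vars to a file instead of inlining them on the
# ssh command line. Variable names match the ones used above.
ENV_FILE="$(mktemp)"
cat > "$ENV_FILE" <<EOF
GH_TOKEN=$GH_TOKEN
IP=$IP
REPO=$REPO
EOF

# Then (not run here): copy it over and source it before the script runs.
# scp -i "$KEY_FILE" "$ENV_FILE" ubuntu@$IP:~/deploy.env
# ssh -i "$KEY_FILE" ubuntu@$IP 'set -a; . ~/deploy.env; set +a; bash deploy.sh'
```

The `set -a` / `set +a` pair exports everything sourced in between, so deploy.sh sees the variables without a long prefix on the command line.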

Managing the Node.js Processes with PM2

Node.js is single-threaded, which means you need to run multiple instances of the same process in order to use all available CPU cores. Typically this is done with the Cluster API. I've used it before and I didn't want to use it again. You have to set up a master file that spawns processes and manages their lifecycle, handles errors, respawns processes that die, and so on. Instead of handling all that myself, I chose to use pm2. Now clustering an app is as simple as:

pm2 start -i max --name $PROCESS_NAME $START_COMMAND

Later, when I need to cleanup the old code, I can use pm2 list to find any processes which don't match the new $PROCESS_NAME and kill them with pm2 delete. More on that in the next section.

Blue/Green Deploys

A blue/green deployment is one way to achieve zero-downtime deployments by spinning up a new server then routing traffic to it before retiring the old server. However, I didn't have the affordance of using a new server so I had to accomplish the same thing on an existing server.

Traffic would come in on port 80 or 443. Binding to those ports requires root privileges. But you don't want your web app to have root privileges. So you can either use iptables to redirect port 80 to your app, or you can use NGINX. We chose NGINX because it offers much more in the way of HTTP configuration which we anticipate needing in the future (SSL certificates, headers, etc.).

We start off with a conf file in /etc/nginx/sites-enabled that looks like this:

server {
  listen 80;
  server_name domain.com;
  location / {
    proxy_pass http://localhost:3000;
  }
}

Later, when we deploy new code, port 3000 is already in use, so we need a different port. We could swap back and forth between ports 3000 and 3001, but keeping track of which port is currently in use requires state and feels fragile. So I opted to randomly generate a port each time, then check that it isn't currently in use.

# Picks a random number between 3000 and 3999.
function random-number {
  echo $(( (RANDOM % 1000) + 3000 ))
}

# Pick a random port between 3000 and 3999 that isn't currently being used.
PORT=$(random-number)
while lsof -i -P -n | grep -q ":$PORT "
do
  PORT=$(random-number)
done

echo "Ready to deploy on port $PORT"

I also used the port number in the directory where I installed the code (to make sure there weren't any conflicts with previous installations) and to identify the processes when registering them with pm2.
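Concretely, the port ends up naming everything for a release, roughly like this (the app name bg-web and the /srv/bg layout come from my setup, not a requirement):

```shell
# The random port doubles as a namespace for the release: it names the
# install directory and the pm2 process name, so old and new never collide.
PORT=3456                      # example value from the random-port step
DEPLOY_DIR="/srv/bg/$PORT"     # where this release's code gets installed
PROCESS_NAME="bg-web-$PORT"    # what pm2 registers the processes as
echo "$DEPLOY_DIR"             # → /srv/bg/3456
echo "$PROCESS_NAME"           # → bg-web-3456
```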

Now we update the NGINX conf:

cat << EOF | sudo tee /etc/nginx/sites-enabled/site.conf > /dev/null
server {
  listen 80;
  server_name domain.com;
  location / {
    proxy_pass http://localhost:$PORT;
  }
}
EOF

Though the configuration file has changed, NGINX isn't yet aware of it. We can tell it to reload the file by sending the reload signal:

sudo nginx -s reload

The NGINX docs say that this is supposed to happen gracefully:

It starts new worker processes, and sends messages to old worker processes requesting them to shut down gracefully. Old worker processes close listen sockets and continue to service old clients. After all clients are serviced, old worker processes are shut down.

That's wonderful. It takes care of gracefully transferring traffic so that we don't have to. However, it doesn't emit a signal when the transfer is done. So how do we know when we can retire and cleanup the old code?

One way is by watching traffic to your processes. But that sounds complicated to me. There are multiple processes. How do I know when traffic is done going to all of them? If you have any ideas here I'd love to hear. But I went with a different solution.

I realized that NGINX has a fixed number of worker processes (which appears to be tied to the number of CPU cores). The paragraph I quoted above about reloading says it starts new workers alongside the old ones, so during a reload you have twice the usual number of workers. Therefore I figured I could count the worker processes before the reload and then wait until the count returned to normal. It worked.

function nginx-workers {
  echo $(ps -ef | grep "nginx: worker process" | grep -v grep | wc -l)
}

# Reload (instead of restart) should keep traffic going and gracefully transfer
# between the old server and the new server.
# http://nginx.org/en/docs/beginners_guide.html#control
echo "Reloading nginx..."
numWorkerProcesses=$(nginx-workers)
sudo nginx -s reload

# Wait for the old nginx workers to be retired before we kill the old server.
while [ $(nginx-workers) -ne $numWorkerProcesses ]
do
  sleep 1;
done;

# Ready to retire the old code

It's not 100% zero-downtime. I did load testing and found about a second of downtime. I don't know whether that's because I'm still killing the old processes too early or because NGINX is refusing connections. I tried adding more sleep after the loop to make sure all connections had drained and terminated, but it didn't help at all. I also noticed that the errors during the load test were about not being able to establish a connection (as opposed to the connection being terminated early), which leads me to believe the NGINX reload isn't 100% graceful. But it's all good enough for now.

Now we're ready to cleanup the old code:

# Delete old processes from PM2. We're assuming that traffic has ceased to the
# old server at this point.
# These commands get the list of existing processes, pare it down to a unique
# list of processes, and then delete all but the new one.
pm2 list | grep -o -P "$PROCESS_NAME-\d+" | uniq | while IFS=$'\n' read process; do
  if [[ $process != $PROCESS_NAME-*$PORT ]];
  then
    pm2 delete $process
  fi
done

# Delete old files from the server. The only directory that needs to remain
# is the new directory for the new server. So we loop through a list of all
# directories in the deploy location (currently /srv/bg) and delete all
# except for the new one.
echo "Deleting old directories..."
for olddir in /srv/bg/*; do
  if [[ $olddir != /srv/bg/$PORT ]];
  then
    echo "Deleting $olddir"
    rm -rf $olddir
  else
    echo "Saving $olddir"
  fi
done;

Parallel Deploys

I first got the blue/green deploy working on one machine. I figured it'd be easy to change so that it works on multiple machines by looping over a list of IP addresses. It probably would've been easy if I had done the deploys serially, but I wanted to do the deploys in parallel to reduce the time spent on the deploy. I was hoping I could just background the ssh command with ssh &. But I got an error message about how that was wrong. Searching the internet revealed a host of alternatives that didn't work or that didn't easily provide a child process ID (more later on why we need that). I finally ended up just creating another bash script that contained the scp and ssh commands. Then I could easily background the execution of that bash script.

# Turn the list of IPs into an array
IPS=( $DEPLOY_IPS )
for IP in "${IPS[@]}"; do
  echo "Preparing to connect to $IP"
  # Here's that list of env vars again
  KEY_FILE=$KEY_FILE GH_TOKEN=$GH_TOKEN IP=$IP REPO=$GITHUB_REPOSITORY bash /scripts/connect.sh &
done

So I ended up with this trio of scripts:

deploy-manager.sh -> connect.sh -> deploy.sh

But how do I know when the deploys are done and how will I know if one of them fails? I found a nice solution on the Unix & Linux StackExchange website. You just collect the child process IDs, then wait on all of them to make sure their exit codes are 0.
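That pattern looks roughly like this, a minimal sketch with short sleeps standing in for the backgrounded connect.sh runs:

```shell
# Sketch: collect the PID of each backgrounded deploy, then wait on each
# one individually so a non-zero exit from any child fails the whole job.
pids=()
for ip in 10.0.0.1 10.0.0.2 10.0.0.3; do   # placeholder IPs
  ( sleep 0.1; exit 0 ) &                  # stand-in for `bash connect.sh &`
  pids+=("$!")
done

status=0
for pid in "${pids[@]}"; do
  wait "$pid" || status=1                  # wait returns the child's exit code
done
echo "overall status: $status"             # → overall status: 0
```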

What do you do if the deploy fails on one machine but succeeds on another? I haven't addressed that problem yet. Any ideas?

Reusable Private GitHub Action

After I got this all working in one repo with multiple deploy targets, I decided to move it into a private GitHub Action so that it could be shared across multiple Node.js apps. I expected this to be easy because I already had all the working code. But as always, I was wrong.

First, GitHub doesn't officially support private actions, but you can get around it with a handy solution.

GitHub offers two implementation choices for custom actions: Node.js or Docker. I have written Node.js actions before and I didn't enjoy the experience as much as I had hoped. It requires you to commit bundled code to your repo because it doesn't install dependencies for you. You can probably get away without using deps if you work hard at it, but it's even more inconvenient to not use @actions/core. It also feels wrong to write a node script that just executes a bash script. So I decided to create a Docker action.

I assumed that all I needed was a vanilla Dockerfile that would execute the deploy-manager.sh script. But I quickly ran into problems. My scripts were developed to execute on the GitHub workflow runners. I specified ubuntu-latest and assumed it was a pretty vanilla install. But it turns out they install tons of software and unfortunately don't make it available in the Docker container. Luckily, all I needed to install was openssh-server. Here's my final Dockerfile:

FROM ubuntu:18.04

RUN apt update && apt install -y openssh-server

COPY scripts/*.sh /scripts/

ENTRYPOINT ["/scripts/deploy-manager.sh"]

I ran into another problem. Host key verification started failing when I switched to the Docker action. It's because Docker GitHub Actions run as root, while I had developed the scripts running as the user ubuntu. Users have their own known_hosts file located at ~/.ssh/known_hosts, but for root I needed to modify the global file located at /etc/ssh/ssh_known_hosts.
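The fix can be expressed as a small branch on the effective user (a sketch; inside the Docker action the root branch is the one taken):

```shell
# Root (the Docker action case) uses the global file; a normal user
# like ubuntu uses their own known_hosts.
if [ "$(id -u)" -eq 0 ]; then
  KNOWN_HOSTS=/etc/ssh/ssh_known_hosts
else
  KNOWN_HOSTS="$HOME/.ssh/known_hosts"
fi
echo "Host keys will go in $KNOWN_HOSTS"
# ssh-keyscan "$IP" >> "$KNOWN_HOSTS"   # not run here
```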

I was pleased to learn Docker, but I might reevaluate the decision to use it. Is it better to build a container every time an action runs or to commit bundled code to your action repo? 😬

Scrubbing Secrets in GitHub Action Logs

If you want to have custom environment variables in GitHub Workflows, your only option is to use Secrets. One of my secrets stores the list of IPs for the deploy targets. But it's not really something I need to keep private, and it's often useful in debug logs.

GitHub scrubs action logs to automatically redact secrets. Since my IPs were in a list and I was only printing one, I figured it wouldn't be redacted. But it was! They must be doing partial matching on the secrets (I wonder what minimum match length they use). To get around this, I used an $UNSECRET_IP variable, which is $IP with all the dots replaced with dashes. Sure enough, it wasn't redacted.

UNSECRET_IP=$(echo $IP | tr . -)

Conclusion

That's a lot of work, and it doesn't even handle partial deploy failures, rollbacks, or log management. I imagine I'll spend quite a bit of time maintaining this creation. It has cemented my belief in the value of PaaS providers. I'd much rather pay someone to do this for me and to do it much better than I can.


  1. I prefer using PaaS providers like Heroku, Netlify, and Vercel so that I don't have to do everything discussed here 😂. 


Discussion


I think there was a great learning experience in all of this, but here are some resources to consider researching if you haven't:

  • Ansible - makes managing remote servers easier, especially the bit about running a script remotely
  • Haproxy - reverse proxy to use instead of nginx, might be better at swapping backends gracefully
  • Kubernetes (potentially too heavy for this use, but handles basically everything you built here)

Also just a note from the outside about the port switching: your solution chooses a random port and checks if it is available. You could apply this logic to the original idea of swapping between 2 ports. Port A is the default, check if Port A is available. If it is, use it. If it isn't, use Port B. No state to keep track of, you're just dynamically choosing whether to use the port or to use an alternate one.
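Something like this sketch (the port-in-use check would shell out to lsof or ss against live sockets, as in the post):

```shell
# Sketch of the two-port toggle: prefer port A, fall back to port B.
# No state file needed; the live sockets ARE the state.
port-in-use() {
  lsof -i -P -n 2>/dev/null | grep -q ":$1 "
}

PORT=3000
if port-in-use "$PORT"; then
  PORT=3001
fi
echo "Deploying on port $PORT"
```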

 

Thank you. That's helpful. I particularly like your idea about how to get rid of the random port generation. That would remove the need for cleaning up the old code (because I would only ever have two copies of it on the machine, instead of up to 1,000) and would open the path to quick rollbacks.

 

That's a great point about the quick rollbacks. It would make the blue/green nature even better.

 

Did you check CapRover (caprover.com)? It's basically a self-hosted Heroku 🤷‍♂️. I like it. You can even manage a cluster with it.

 

Holy smokes! I haven't heard of that before. I gotta try it out. Thank you!

 

Jordan already provided a list of tools that can improve the process. I just want to add a few points about your process:

  • if there is a daemon manager (service or systemctl), it's better to use it instead of nginx -s SIGNAL; otherwise the manager won't know anything about such reloads and won't be able to interact with the running process anymore.
  • checking when to remove the old code directory: since you know the old port, why not just check whether there are open connections to it? No connections, remove the code
  • the 1-second delay: when you do the switch, are you sure the new deploy is ready to serve traffic? Perhaps it makes sense to make a readiness request before the switch. In any case, you can use tcpdump (or anything similar) to record network traffic, with port/protocol filtering, and figure out which backend (old or new port) a failed request was made to.
  • to run tasks in parallel, you can use parallel-ssh
  • speaking of one failed host: there are basically only two options if your scripts remove the old code and run in parallel on several hosts. You can re-deploy to all hosts or just to the failed one. Even a rollback here is a re-deploy. By the way, you can keep several old release folders so you stop worrying about the right time to remove the previous release and become able to revert a deployment (you would need to track which one is the previous release, but that is solvable)
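That "keep several old releases" idea from the last point might look like this sketch (a temp directory stands in for the real /srv/bg layout):

```shell
# Sketch: keep only the three most recent release directories, delete the
# rest. A temp dir stands in for /srv/bg; dir names stand in for ports.
RELEASES_DIR="$(mktemp -d)"
for p in 3001 3002 3003 3004 3005; do
  mkdir -p "$RELEASES_DIR/$p"
done

# List newest-first, skip the first three, delete whatever is left.
ls -1t "$RELEASES_DIR" | tail -n +4 | while read -r old; do
  rm -rf "${RELEASES_DIR:?}/$old"
done
ls -1 "$RELEASES_DIR" | wc -l
```

Rolling back is then just pointing the NGINX conf at the previous release's port instead of cloning and building again.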
 

Thanks for taking the time to read this post and to share your recommendations for improvements. They're helpful.

 

Part of your solution could be tightened up with this simple command.

    setcap 'cap_net_bind_service=+ep' /usr/bin/node

Here's the full story. How to Configure Node.js to Use Port 443
"Granting server access to well-known, privileged Linux system ports."

 

Great article, it looks like you covered everything.
I do have a question for you: how much time did you spend putting all this together?

 

It took me 3-4 days to figure it out and code it up and then 1 day to write the blog post about it (because I was sick).