How blue/green deployments saved us from out of hours changes and downtime

#cloud #architecture #devops #sre

I was working on a project that provided a critical service to users who accessed the database (DB) throughout the day. To avoid any impact on the live service, the DBAs would often apply regular database updates out of hours. As changes were required more frequently, it became harder to coordinate them out of hours and to test them fully before the service went live the following day. This is where blue/green deployments came in.

What are blue/green deployments?

A blue/green deployment is a release management strategy that uses two identical production environments, referred to as “blue” (live) and “green” (idle, where the new version is deployed). Only one environment serves user traffic at a time while the other remains inactive. This allows teams to deploy updates to the environment that is not yet handling live traffic, which minimises downtime and reduces risk.
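To make the idea concrete, here is a minimal sketch in Python of the "one live, one idle" rule. It is a toy model rather than anything we ran in production: the `BlueGreenRouter` class and the endpoint addresses are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreenRouter:
    """Toy model: two environments, exactly one of which serves live traffic."""

    environments: dict[str, str]  # colour -> hypothetical endpoint
    live: str = "blue"
    history: list[str] = field(default_factory=list)

    @property
    def idle(self) -> str:
        # The idle environment is whichever colour is not currently live.
        return next(name for name in self.environments if name != self.live)

    def route(self) -> str:
        # All user traffic goes to the live environment's endpoint.
        return self.environments[self.live]

    def switch(self) -> str:
        # Promote the idle environment and remember the previous one.
        self.history.append(self.live)
        self.live = self.idle
        return self.live

    def rollback(self) -> str:
        # Return to the environment that was live before the last switch.
        if not self.history:
            raise RuntimeError("nothing to roll back to")
        self.live = self.history.pop()
        return self.live


# Hypothetical addresses, for illustration only.
router = BlueGreenRouter({"blue": "10.0.1.10:5432", "green": "10.0.2.10:5432"})
```

Calling `router.switch()` makes green the live environment, and `router.rollback()` returns traffic to blue. Updates are only ever applied to whatever `router.idle` reports.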

How was this implemented?

First, a new database was spun up as a replica of the current production (blue) environment. This would act as the green server.

A Network Load Balancer (NLB) was also provisioned with two target groups, each using the IP address target type: one pointing to the blue database server's IP and the other to the green server's IP. By default, the deregistration delay on a target group is 300 seconds. A deregistering target is given this long for existing connections to drain and in-flight requests to complete, after which its state changes to unused. We set this delay to 0 seconds because we wanted live traffic to switch over to the updated database immediately.
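As a rough sketch of that configuration, the Python below builds the two requests involved, in the shape accepted by the boto3 `elbv2` client's `modify_target_group_attributes` and `modify_listener` calls. Two things here are assumptions rather than a record of what we ran: the ARNs are placeholders, and the cutover is modelled as changing the listener's default forward action to the other target group, which is only one way of repointing traffic.

```python
from typing import Any


def zero_drain_attributes(target_group_arn: str) -> dict[str, Any]:
    """Build the request that sets the deregistration delay to 0 seconds."""
    return {
        "TargetGroupArn": target_group_arn,
        "Attributes": [
            # This attribute defaults to 300 seconds when left unset.
            {"Key": "deregistration_delay.timeout_seconds", "Value": "0"},
        ],
    }


def forward_to(listener_arn: str, target_group_arn: str) -> dict[str, Any]:
    """Build the request that points a listener at a single target group."""
    return {
        "ListenerArn": listener_arn,
        "DefaultActions": [
            {"Type": "forward", "TargetGroupArn": target_group_arn},
        ],
    }


def configure(elbv2: Any, target_group_arns: list[str]) -> None:
    """Apply the zero-second delay to every target group.

    `elbv2` is expected to behave like boto3.client("elbv2").
    """
    for arn in target_group_arns:
        elbv2.modify_target_group_attributes(**zero_drain_attributes(arn))


def cut_over(elbv2: Any, listener_arn: str, new_target_group_arn: str) -> None:
    """Assumed cutover: forward the listener to the other target group."""
    elbv2.modify_listener(**forward_to(listener_arn, new_target_group_arn))
```

Passing the client in as a parameter keeps the request-building logic testable without AWS credentials. Under the same assumption, rolling back is the same `cut_over` call with the blue target group's ARN.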

What was the impact?

A major advantage of this approach was the ability to switch traffic from the blue database to the green database instantly. With the NLB configured for immediate switching, we could deploy changes to the green database throughout the day, test them in a production-like environment without impacting live traffic, and switch back if necessary. This reduced the downtime we had previously faced during out-of-hours deployments.

Because the green database was tested in parallel with the live environment, we could validate all changes before pushing them into production. This created a safety net, as any issues identified during testing on the green server did not affect the live service. It also gave us more time for testing, as this could be done during the day whilst the blue server continued to serve live traffic.
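One way to express that safety net is as a gate: run a set of checks against green and switch only if every one passes. The sketch below is illustrative, not our actual test suite. The function names are hypothetical, and each check is simply a callable that returns `True` on success.

```python
from typing import Callable

Check = Callable[[], bool]  # a check returns True when it passes


def failed_checks(checks: dict[str, Check]) -> list[str]:
    """Run every check and return the names of those that failed."""
    failures = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception:
            # A check that raises is treated as a failure, not a crash.
            passed = False
        if not passed:
            failures.append(name)
    return failures


def promote_if_healthy(checks: dict[str, Check], switch: Callable[[], None]) -> bool:
    """Call `switch` only when all checks pass; report whether it was called."""
    if failed_checks(checks):
        return False  # live traffic stays on blue
    switch()
    return True
```

Because a failing check only blocks the switch, blue keeps serving traffic untouched while the problem on green is investigated.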

In the past, if a deployment failed, it was often a frantic scramble to fix issues before the service resumed the following day. Now, with the ability to switch back to the blue environment by simply repointing the target group, rolling back a deployment became straightforward and low risk.

The trade-off was that we had to ensure the green environment was fully in sync with the blue environment, because any discrepancies between the two could lead to unexpected issues when live traffic was switched over. Even so, the approach saved us a lot of time and improved the reliability of the overall deployment process.
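A simple way to catch such discrepancies before switching is to compare a snapshot of each database. The sketch below is an assumption-laden illustration: it uses Python's built-in `sqlite3` as a stand-in for the real database engine, the `orders` table is hypothetical, and it compares row counts only, which will not catch every kind of drift.

```python
import sqlite3
from typing import Optional


def snapshot(conn: sqlite3.Connection) -> dict[str, int]:
    """Map each user table name to its row count."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    # Table names come from the catalogue, not from user input.
    return {t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0] for t in tables}


def drift(
    blue: dict[str, int], green: dict[str, int]
) -> dict[str, tuple[Optional[int], Optional[int]]]:
    """Return tables whose counts differ, as (blue_count, green_count).

    A count of None means the table exists on one side only.
    """
    return {
        table: (blue.get(table), green.get(table))
        for table in sorted(blue.keys() | green.keys())
        if blue.get(table) != green.get(table)
    }


# In-memory stand-ins for the two environments (hypothetical schema).
blue_db = sqlite3.connect(":memory:")
green_db = sqlite3.connect(":memory:")
for conn in (blue_db, green_db):
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
blue_db.executemany("INSERT INTO orders (id) VALUES (?)", [(1,), (2,)])
green_db.execute("INSERT INTO orders (id) VALUES (1)")

report = drift(snapshot(blue_db), snapshot(green_db))
```

Here `report` flags `orders` because green is one row behind blue. Differences that are the intended result of the change being deployed to green would need to be excluded from such a check.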

DEV Community
