Imagine that you are the lead developer on a project and you are deploying the latest features and updates to production. No issues were found in testing, and the QA team reports that everything is working as expected, so you start the deployment process and put your feet up, thinking, "my job here is done!"
All of a sudden, the whole production environment goes down! Nothing is working:
- customers are sending emails about the outage of your application
- tensions are rising inside your team
- you are getting pressure from management, asking you how this happened
You fix the underlying issue and think to yourself, “What can I do to make sure this never happens again?”
One of the most crucial moments in software engineering is updating an existing application while making sure that your customers and users do not even notice that something has changed. That's where the canary release method comes in.
In this blog post, we are going to go over what a canary release is, and I will give you the AWS CDK Python code, which you can reuse in your projects right away! One note: this blog post will be a shorter one, as I'm preparing a big end-of-year project using many of the AWS services we've covered this year, and it will have canary releases implemented in it. I just thought canary releases deserved a short blog post of their own, since I believe they can help you in every project you do.
Prerequisites are:
- installed Docker and knowledge about containers
- Python and AWS CDK experience
- general knowledge of AWS services Elastic Container Registry (ECR), Elastic Container Service (ECS), CodeDeploy and load balancing terminology
Link to the GitHub project: https://github.com/mate329/aws-canary-deployment-with-cdk
Here we go!
What is a canary release deployment strategy?
The name of this deployment strategy comes from coal mining, where canaries served as an early alarm system: miners would take a canary into the mine, since birds are more sensitive to toxic gases. If the bird stayed healthy, more miners would enter the mine, but if the bird became sick or died, the miners would evacuate.
In software engineering, the steps for a canary release are as follows:
- deploy the latest changes to a subset of users (e.g., 10% of users)
- monitor the response times, HTTP error codes, and container crashes of the latest deployment
- now we have two scenarios:
  a. if everything works as expected, shift the remaining 90% of traffic to the new version
  b. if something is wrong, roll back the latest deployment and keep the old version serving traffic
This makes deploying the latest version of your backend so easy and safe.
How do we set this up on AWS? We are going to use CloudWatch Alarms, in combination with CodeDeploy (which controls the deployment process), to monitor the deployment progress and catch any issues that arise. The metrics that we are looking for are:
- HTTP 5xx errors
- response times bigger than 2 seconds
- instances not responding to the health check API call (a simple HTTP GET request to the /health endpoint)
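For reference, here is what such a health endpoint could look like in the backend. This is only a minimal sketch using Flask; the actual implementation in the repository may differ, and only the /health path and port 8080 are taken from the rest of this post.

# Minimal illustrative health endpoint (Flask); not the exact code from the repository
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/health", methods=["GET"])
def health():
    # Returning 200 OK tells the load balancer this task is healthy
    return jsonify(status="ok"), 200

if __name__ == "__main__":
    # Port 8080 matches the container port used by the target groups below
    app.run(host="0.0.0.0", port=8080)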
Let's take a look at the CDK code and define our resources.
Key components of our CDK stack
Defining the two target groups
In the following code, we define the two target groups used by our ECS service (which runs our application instances): one for the old version and one for the new version of the application. A sketch of the ECS service itself follows after the target group definitions.
# Target Group 1 (Blue deployment)
target_group_1 = elbv2.ApplicationTargetGroup(
    self,
    "TasksApiTG1",
    vpc=vpc,
    port=8080,
    protocol=elbv2.ApplicationProtocol.HTTP,
    target_type=elbv2.TargetType.IP,
    health_check=elbv2.HealthCheck(
        path="/health",
        interval=Duration.seconds(30),
        timeout=Duration.seconds(5),
        healthy_threshold_count=2,
        unhealthy_threshold_count=3,
    ),
    deregistration_delay=Duration.seconds(30),
)

# Target Group 2 (for Green)
target_group_2 = elbv2.ApplicationTargetGroup(
    self,
    "TasksApiTG2",
    vpc=vpc,
    port=8080,
    protocol=elbv2.ApplicationProtocol.HTTP,
    target_type=elbv2.TargetType.IP,
    health_check=elbv2.HealthCheck(
        path="/health",
        interval=Duration.seconds(30),
        timeout=Duration.seconds(5),
        healthy_threshold_count=2,
        unhealthy_threshold_count=3,
    ),
    deregistration_delay=Duration.seconds(30),
)
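The ECS service itself (referred to as ecs_service later in the stack) is defined in the repository; the important detail is that it must use the CODE_DEPLOY deployment controller so that CodeDeploy can manage the traffic switch. Here is a minimal sketch, assuming a cluster and a Fargate task definition already exist; the construct ID and desired count below are illustrative, not taken from the project.

# Sketch of the ECS service used later as `ecs_service`; assumes `cluster` and
# `task_definition` are defined elsewhere in the stack
ecs_service = ecs.FargateService(
    self,
    "TasksApiService",
    cluster=cluster,
    task_definition=task_definition,
    desired_count=2,
    # CodeDeploy must control deployments for blue/green traffic shifting to work
    deployment_controller=ecs.DeploymentController(
        type=ecs.DeploymentControllerType.CODE_DEPLOY
    ),
)

# Register the running tasks with the blue target group; CodeDeploy moves new
# tasks into the green target group during a deployment
ecs_service.attach_to_application_target_group(target_group_1)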
Defining the CloudWatch alarms
Let's define our CloudWatch alarms, which monitor both the new and the old version. Remember, the metrics we are looking for are:
- HTTP 5xx errors
- response times bigger than 2 seconds
- container not responding to health checks
# Alarm 1: Unhealthy host count on BLUE target group (production baseline)
unhealthy_host_alarm_tg1 = cloudwatch.Alarm(
    self,
    "UnhealthyHostAlarmTG1",
    metric=target_group_1.metrics.unhealthy_host_count(
        statistic="Maximum",  # Use Maximum to catch any unhealthy host immediately
        period=Duration.seconds(30),  # Check every 30 seconds (more frequent)
    ),
    evaluation_periods=1,  # Trigger immediately after 1 period
    threshold=0,  # Trigger if ANY host becomes unhealthy (>0)
    comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
    alarm_description="Triggers IMMEDIATELY if any unhealthy hosts in Blue TG",
    treat_missing_data=cloudwatch.TreatMissingData.NOT_BREACHING,
)

# Alarm 2: Unhealthy host count on GREEN target group (NEW VERSION - CRITICAL)
unhealthy_host_alarm_tg2 = cloudwatch.Alarm(
    self,
    "UnhealthyHostAlarmTG2",
    metric=target_group_2.metrics.unhealthy_host_count(
        statistic="Maximum",  # Use Maximum to catch any unhealthy host immediately
        period=Duration.seconds(30),  # Check every 30 seconds (more frequent)
    ),
    evaluation_periods=1,  # Trigger immediately after 1 period
    threshold=0,  # Trigger if ANY host becomes unhealthy (>0)
    comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
    alarm_description="Triggers IMMEDIATELY if any unhealthy hosts in Green TG (NEW VERSION)",
    treat_missing_data=cloudwatch.TreatMissingData.NOT_BREACHING,
)

# Alarm 3: HTTP 5xx errors on GREEN target group
http_5xx_alarm_tg2 = cloudwatch.Alarm(
    self,
    "Http5xxAlarmTG2",
    metric=target_group_2.metric_http_code_target(
        code=elbv2.HttpCodeTarget.TARGET_5XX_COUNT,
        statistic="Sum",
        period=Duration.seconds(30),  # Check every 30 seconds
    ),
    evaluation_periods=1,  # Trigger immediately after 1 period
    threshold=2,  # More than 2 errors in 30 seconds (very sensitive)
    comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
    alarm_description="Triggers if more than 2x 5xx errors in 30s on Green TG (NEW VERSION)",
    treat_missing_data=cloudwatch.TreatMissingData.NOT_BREACHING,
)

# Alarm 4: Response time on GREEN target group
response_time_alarm_tg2 = cloudwatch.Alarm(
    self,
    "ResponseTimeAlarmTG2",
    metric=target_group_2.metrics.target_response_time(
        statistic="Average",
        period=Duration.seconds(30),  # Check every 30 seconds
    ),
    evaluation_periods=1,  # Trigger immediately after 1 period
    threshold=1.5,  # 1.5 seconds (more sensitive than 2s)
    comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
    alarm_description="Triggers if response time exceeds 1.5s on Green TG (NEW VERSION)",
    treat_missing_data=cloudwatch.TreatMissingData.NOT_BREACHING,
)
Defining the Application Load Balancer
Now, we need to create our Application Load Balancer, which distributes the traffic between the instances:
# Application Load Balancer
alb = elbv2.ApplicationLoadBalancer(
    self,
    "TasksApiALB",
    vpc=vpc,
    internet_facing=True,
    deletion_protection=False,
)
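One piece the snippets above do not show is the pair of listeners on this load balancer, which the CodeDeploy deployment group below refers to as prod_listener and test_listener. Here is a minimal sketch of how they could be wired up; the ports and construct IDs are assumptions on my part, so check the repository for the exact configuration.

# Production listener: serves live traffic and initially points at the blue target group
prod_listener = alb.add_listener(
    "ProdListener",
    port=80,
    protocol=elbv2.ApplicationProtocol.HTTP,
    default_target_groups=[target_group_1],
)

# Test listener: lets CodeDeploy validate the green tasks on a separate port
# before any production traffic is shifted to them
test_listener = alb.add_listener(
    "TestListener",
    port=8080,
    protocol=elbv2.ApplicationProtocol.HTTP,
    default_target_groups=[target_group_2],
)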
Defining the CodeDeploy process
We’ll use CodeDeploy to create the new instances inside our ECS service when the deployment is initiated. In the following code, we will use our already defined resources (like alarms and target groups) to define which resources CodeDeploy should use to be aware of the deployment status.
The deployment strategy we are going to choose is CANARY_10PERCENT_5MINUTES, which deploys the new version of the application and gives it 10% of the overall traffic. After 5 minutes, if the CloudWatch alarms do not report anything wrong, CodeDeploy replaces the old application with the new one, so the new version then gets 100% of the traffic.
There are other deployment strategies, such as:
- LINEAR_10PERCENT_EVERY_3MINUTES - shifts 10 percent of traffic every three minutes until all traffic is shifted
- ALL_AT_ONCE - shifts all traffic to the new instances right away
- CANARY_10PERCENT_15MINUTES - similar to our chosen strategy: the new version gets 10% of the traffic, but there is a 15-minute window to confirm that no alarm goes off
Of course, the deployment strategy depends on your project and your team’s needs.
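If none of the predefined strategies fits, the CDK also lets you define a custom traffic-shifting configuration. The following is just an illustrative sketch (the values and construct ID are made up for the example), not something this project uses.

# Illustrative custom strategy: shift 20% of traffic every 10 minutes
custom_config = codedeploy.EcsDeploymentConfig(
    self,
    "CustomLinearConfig",
    traffic_routing=codedeploy.TimeBasedLinearTrafficRouting(
        interval=Duration.minutes(10),
        percentage=20,
    ),
)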
# CodeDeploy Application
codedeploy_app = codedeploy.EcsApplication(
    self,
    "TasksApiCodeDeployApp",
    application_name="tasks-api-app",
)

# CodeDeploy Deployment Group with Canary Configuration
deployment_group = codedeploy.EcsDeploymentGroup(
    self,
    "TasksApiDeploymentGroup",
    application=codedeploy_app,
    deployment_group_name="tasks-api-deployment-group",
    service=ecs_service,
    blue_green_deployment_config=codedeploy.EcsBlueGreenDeploymentConfig(
        blue_target_group=target_group_1,
        green_target_group=target_group_2,
        listener=prod_listener,
        test_listener=test_listener,
        termination_wait_time=Duration.minutes(5),  # Time before terminating old tasks
    ),
    deployment_config=codedeploy.EcsDeploymentConfig.CANARY_10PERCENT_5MINUTES,
    # This will:
    # 1. Shift 10% of traffic to new version
    # 2. Wait 5 minutes
    # 3. If no alarms triggered, shift remaining 90%
    alarms=[
        unhealthy_host_alarm_tg1,
        unhealthy_host_alarm_tg2,  # Critical: Monitor GREEN target group
        http_5xx_alarm_tg2,  # Monitor 5xx errors on new version
        response_time_alarm_tg2,  # Monitor response time on new version
    ],
    auto_rollback=codedeploy.AutoRollbackConfig(
        failed_deployment=True,
        stopped_deployment=True,
        deployment_in_alarm=True,  # Rollback if CloudWatch alarms trigger
    ),
    role=codedeploy_role,
)
How do we deploy our project?
First off, you'll need to create an Elastic Container Registry (ECR) repository to store the Docker image of your backend on AWS. I followed the official AWS tutorial for creating a repository, which you can find in the ECR documentation. Make sure that you name your repository tasks-api.
After your repository has been created, open a terminal window, go to the server folder, and follow the instructions shown when you click View push commands inside the AWS console. You should get instructions like these:
After your Docker image has been uploaded, move to the cdk directory and run cdk deploy to deploy your backend! The outputs of your deployment contain plenty of useful information, like the ECS service name, the CodeDeploy application name, and, most importantly, the URL to access your backend, which you can use to test it in API testing tools like Postman. Here is the expected output of the cdk deploy command deploying the infrastructure needed for this small project:
From now on, the canary deployment strategy will be available for every subsequent deployment. When you have changes you want to roll out via the canary strategy, you can use the deploy.sh script located inside the scripts folder. Here is an example of the script running and deploying the newest version of your backend:
If you take a look inside CodeDeploy in the AWS Console, you will find the deployment process; it is going to look like this when everything has gone well and no alarms have gone off:
But now let’s break our deployment!
While initiating another deployment, you can use the following commands inside your terminal window:
- curl -X POST <LOAD_BALANCER_URL>/simulate/unhealthy/true - this endpoint makes the backend stop responding with the HTTP 200 OK code to the health pings that our container and Load Balancer are expecting
- curl -X POST <LOAD_BALANCER_URL>/simulate/crash/true - enables the simulation of an error found in the code and enables the "feature" of slow responses
- curl -X POST <LOAD_BALANCER_URL>/simulate/error - endpoint to get slow responses on purpose or to get HTTP 500 codes, which should trigger our CloudWatch alarms
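For context, here is a rough sketch of how such simulation toggles might be implemented in the backend. Only the route paths come from the commands above; the actual implementation in the repository may look different.

# Rough sketch of failure-simulation toggles (Flask); not the exact repository code
import time
from flask import Flask, jsonify

app = Flask(__name__)
state = {"unhealthy": False, "crash": False}

@app.route("/health", methods=["GET"])
def health():
    if state["unhealthy"]:
        # Failing health checks trips the unhealthy-host CloudWatch alarm
        return jsonify(status="unhealthy"), 500
    return jsonify(status="ok"), 200

@app.route("/simulate/unhealthy/<flag>", methods=["POST"])
def simulate_unhealthy(flag):
    state["unhealthy"] = flag.lower() == "true"
    return jsonify(unhealthy=state["unhealthy"]), 200

@app.route("/simulate/crash/<flag>", methods=["POST"])
def simulate_crash(flag):
    state["crash"] = flag.lower() == "true"
    return jsonify(crash=state["crash"]), 200

@app.route("/simulate/error", methods=["POST"])
def simulate_error():
    if state["crash"]:
        # Slow response plus a 5xx status to trip the latency and 5xx alarms
        time.sleep(3)
        return jsonify(error="simulated failure"), 500
    return jsonify(status="ok"), 200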
An example of running these curl commands can be found here:
If you go to your ECS service, the Console should show all of the tasks in the service, and at least one of them should be in an unhealthy state, as shown in the following picture:
This situation will trigger our CloudWatch alarm, which will start the rollback inside CodeDeploy. If you navigate back to the CodeDeploy deployment process, it should look like this:
Here, the error message says:
One or more alarms have been activated according to the Amazon CloudWatch metrics you selected, and the affected deployments have been stopped. Activated alarms: <name_of_activated_alarm>
Conclusion
Remember that nightmare scenario from the beginning, the one where everything went down and your customers were really unhappy? Well, with this setup, those days are behind you.
By implementing canary deployments with AWS ECS and CodeDeploy, you have built a safety net for your project that catches problems before they reach most of your users: the new version first gets only 10% of the overall traffic, and only after it proves healthy is it promoted to serve all of your application's traffic.
This small project gives you:
- confidence to deploy more frequently without fear of your application being down
- automated rollbacks in case of a failure
- peace of mind that your production environment is protected
Happy deploying! See you in the next blog post!






