
Kazuya


AWS re:Invent 2025 - Build safe and resilient deployment pipelines for Amazon ECS (CNS353)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Build safe and resilient deployment pipelines for Amazon ECS (CNS353)

In this video, Islam Mahgoub, a solutions architect and containers subject matter expert at AWS, demonstrates deployment strategies for Amazon ECS to achieve safe, resilient rollouts without downtime. He covers rolling update deployment, which gradually replaces old tasks with new ones using minimum healthy percent and maximum percent parameters, and blue-green deployment, which maintains parallel environments enabling quick rollbacks. Using a sample retail application, he shows live demos of both strategies, including configuration of target groups, listeners, and bake time settings. He also explains lifecycle hooks—Lambda functions that enable custom workflows like manual approval steps at specific deployment stages. The session highlights the key trade-offs: rolling updates use less capacity but roll back more slowly, while blue-green uses more capacity but enables instant rollbacks.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Thumbnail 40

Introduction to Deployment Strategies and Rolling Update Implementation

Hi, and thank you for joining this session. Many organizations use Amazon ECS to host their container-based applications. One of their key requirements is being able to roll out new versions in a safe and resilient way: they want to deploy these versions without downtime, and they also want to be able to roll back to the older version if any issue is discovered. In this session, we are going to talk about the different deployment strategies that you can use with Amazon ECS to implement safe and resilient deployment pipelines, and we're going to see how each of these strategies addresses these resiliency requirements.

Thumbnail 60

My name is Islam Mahgoub. I work as a solutions architect at AWS and I'm also a containers subject matter expert. This is the agenda for today. We are going to start with a quick overview of deployments and then we'll talk about the rolling update deployment strategy. We'll talk about the blue-green deployment strategy, and then we'll end the session with a few key takeaways.

Thumbnail 80

So first, a quick overview of deployments. Services are defined through a combination of a task definition and service configurations. The task definition captures details about the container image that you want to run as part of your service, such as container image URL, networking configurations, and environment variables. When you create a service for the first time, the task definition along with service configurations are combined to create Service Revision 1. Then if you want to deploy a new version, you create a new task definition. This new task definition is combined with the service configuration and becomes your Service Revision 2.
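
As a rough illustration of that flow, here is a minimal boto3 sketch of registering a task definition and creating a service; the cluster name, image URI, subnets, and other identifiers are placeholders rather than values from the demo.

```python
import boto3

ecs = boto3.client("ecs")

# The task definition captures the container image, networking mode,
# environment variables, and resource sizing.
task_def = ecs.register_task_definition(
    family="retail-ui",
    networkMode="awsvpc",
    requiresCompatibilities=["FARGATE"],
    cpu="256",
    memory="512",
    containerDefinitions=[{
        "name": "ui",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/retail-ui:v1",
        "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
        "environment": [{"name": "THEME", "value": "default"}],
    }],
)

# The service configuration (desired count, networking, load balancer) is
# combined with the task definition to form Service Revision 1.
ecs.create_service(
    cluster="retail-store",
    serviceName="ui",
    taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
    desiredCount=3,
    launchType="FARGATE",
    networkConfiguration={"awsvpcConfiguration": {
        "subnets": ["subnet-aaa111", "subnet-bbb222"],
        "securityGroups": ["sg-0123456789abcdef0"],
        "assignPublicIp": "DISABLED",
    }},
    loadBalancers=[{
        "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/ui-blue/0123456789abcdef",
        "containerName": "ui",
        "containerPort": 8080,
    }],
)
```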

Thumbnail 150

The process that takes you from Service Revision 1 to Service Revision 2 is persisted and tracked in an object called service deployment. This object contains information like the progress of the deployment, source revision information, and target service revision information as well. Now let's get into the rolling update deployment strategy and see how it works. If you want to deploy your new version without any downtime, one of the approaches you can use is gradual replacement of your tasks that are running the old version with newer tasks that are running the new version. This is what we call the rolling update deployment strategy.
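
As a hedged sketch of how you might observe that tracking programmatically, the deployments list returned by describe_services reports the status and rollout state of each in-flight revision; the cluster and service names here are placeholders.

```python
import boto3

ecs = boto3.client("ecs")

service = ecs.describe_services(cluster="retail-store", services=["ui"])["services"][0]

for deployment in service["deployments"]:
    print(
        deployment["status"],            # PRIMARY (target revision) or ACTIVE (source revision)
        deployment.get("rolloutState"),  # IN_PROGRESS, COMPLETED, or FAILED
        deployment["taskDefinition"],
        f'{deployment["runningCount"]}/{deployment["desiredCount"]} tasks running',
    )
```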

Thumbnail 170

There are two key parameters that you need to configure for this to work properly. These are the minimum healthy percent and the maximum percent. The minimum healthy percent represents a lower bound on the number of tasks that you can have during this deployment. This is really important for your service availability because you don't want to be running with a lower number of tasks than what you need to serve your customers. The maximum percent represents the upper bound of the number of tasks that you can have under the service. These two parameters allow you to strike the right balance between availability, speed of the deployment, and cost during the deployment.
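
For a concrete example, a service with a desired count of 4, a minimum healthy percent of 50, and a maximum percent of 200 may drop to 2 running tasks and burst to 8 during a rollout. The hedged boto3 sketch below sets more conservative values on an existing service; the cluster and service names are placeholders.

```python
import boto3

ecs = boto3.client("ecs")

ecs.update_service(
    cluster="retail-store",
    service="ui",
    deploymentConfiguration={
        "minimumHealthyPercent": 100,  # never drop below the desired count during the rollout
        "maximumPercent": 200,         # allow up to twice the desired count while replacing tasks
    },
)
```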

Thumbnail 210

Thumbnail 220

Thumbnail 230

Thumbnail 240

The way it works is we start with all the tasks running the old version, and then we gradually add new tasks running the new version and terminate the old ones, until all the tasks under the service are running the new version. Let's see how this works in action. I'm going to use a sample retail application. It consists of one UI service and four back-end services: orders, checkout, cart, and catalog, all running on ECS. These are all microservices, backed by managed services for databases and managed queues as well.

Thumbnail 280

Thumbnail 300

So we'll take this scenario now. We want to deploy a new version of the UI service. We want to make sure that this happens without downtime. So we'll use the rolling update deployment strategy for that. Let's first check the UI service. We'll go to the cluster. Under the cluster, there are a bunch of services. We'll get to that shortly. We go to the service. And here you see that this service is exposed through a load balancer and all the tasks are registered under one target group. You see here's the name of the target group. If you go to the deployment section, you will note that the rolling update deployment strategy is being used.

Thumbnail 310

Thumbnail 320

Thumbnail 330

Thumbnail 340

Thumbnail 350

Thumbnail 360

Thumbnail 380

Now let's push a change. We're going to change the background. Let's push it and see how we progress through the pipeline. First, the source code is checked out, the container images are built and pushed, and then we get to the deployment stage where we create a new task definition and update the service to point to this new task definition. We'll jump to the console and see how this is progressing on ECS. You can see the task deployment status is in progress, and here you see new tasks being added that refer to the new task definition. More tasks are getting added, and on the right-hand side we have the browser connected to the listener of the service, which will start showing the new background. This means the traffic is now going to the new tasks. After some time, you will note that the old tasks are getting deregistered and terminated. This is how we do the rolling update deployment strategy, and you can see it is completed without downtime.
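
The deployment stage of such a pipeline typically boils down to a few API calls; the following is a hedged sketch of that step with boto3, with the image tag and names as placeholders rather than the pipeline's actual configuration.

```python
import boto3

ecs = boto3.client("ecs")

# Copy the current task definition and point it at the image the build just pushed.
current = ecs.describe_task_definition(taskDefinition="retail-ui")["taskDefinition"]
containers = current["containerDefinitions"]
containers[0]["image"] = "123456789012.dkr.ecr.us-east-1.amazonaws.com/retail-ui:v2"

new_revision_arn = ecs.register_task_definition(
    family=current["family"],
    networkMode=current["networkMode"],
    requiresCompatibilities=current["requiresCompatibilities"],
    cpu=current["cpu"],
    memory=current["memory"],
    containerDefinitions=containers,
)["taskDefinition"]["taskDefinitionArn"]

# Point the service at the new revision; the rolling update then replaces tasks
# gradually according to minimumHealthyPercent / maximumPercent.
ecs.update_service(cluster="retail-store", service="ui", taskDefinition=new_revision_arn)

# Block until the service reaches a steady state on the new revision.
ecs.get_waiter("services_stable").wait(cluster="retail-store", services=["ui"])
```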

Blue-Green Deployment Strategy: Quick Rollback with Parallel Environments

One thing to note here is that if you wanted to roll back to the previous version, it may take a little bit of time because you will have to reprovision the tasks that will run the old version. If you want to roll back quickly, then one of the strategies you may want to implement is the blue-green deployment strategy. This entails having two parallel environments: the blue environment and the green environment. The blue is the old version, and the green is the new version. You then switch from blue to green, and the blue is still kept alive. If things are not working as expected, you can roll back again to the blue environment.

Thumbnail 450

Let's see how this works. The first stage is starting with your original set of tasks running the old version. Then we get into the scaling stage. At the scaling stage, we provision the green environment with tasks running the new version. At this point, your traffic is still entirely served by the original set of tasks, the blue ones. Then we get into the test traffic shift stage. With blue-green, you have the ability to segregate production traffic from test traffic. If your service is exposed through an Application Load Balancer, you can create an additional listener that becomes your test listener, and this can be used for testing the new version before shifting the actual traffic, that is, before changing the production listener to point to the new version.
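
A minimal sketch of adding such a test listener with boto3, assuming the service sits behind an Application Load Balancer; the ARNs and the port are placeholders.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Production traffic stays on the existing listener (for example, port 80);
# this extra listener exposes the green tasks on a separate test port.
elbv2.create_listener(
    LoadBalancerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/retail/0123456789abcdef",
    Protocol="HTTP",
    Port=8080,
    DefaultActions=[{
        "Type": "forward",
        "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/ui-green/fedcba9876543210",
    }],
)
```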

Thumbnail 500

Thumbnail 510

Thumbnail 530

Thumbnail 540

At the test traffic shift stage, what happens under the hood by the deployment controller is the test listener is changed to route the traffic to the new version or the green environment. Then in the production traffic shift stage, the production listener is changed to point to this new green environment or the new tasks. Then we get to the bake time stage. At this stage, the two environments are kept. The traffic is served by the green or new environment, but the blue environment is still there. If any issue is discovered, you can roll back to this or reroute the traffic back to the blue environment. You have the ability to define how long this stage will last. Once the bake time lapses, we get into the cleanup stage, and at this stage, the blue environment is terminated, and you just have the green. The green again becomes a new blue the next time you deploy.
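
Conceptually, each traffic shift is a rewrite of a listener rule's forward action so it targets the green target group instead of the blue one. The sketch below shows that idea directly with boto3; in the demo the deployment controller performs the equivalent change using the IAM role you provide, and the ARNs here are placeholders.

```python
import boto3

elbv2 = boto3.client("elbv2")

def shift_rule_to(rule_arn: str, target_group_arn: str) -> None:
    """Point a listener rule at a single target group, sending it all of that rule's traffic."""
    elbv2.modify_rule(
        RuleArn=rule_arn,
        Actions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
    )

GREEN_TG = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/ui-green/fedcba9876543210"

# Test traffic shift: the test listener rule starts forwarding to the green tasks.
shift_rule_to("arn:aws:elasticloadbalancing:us-east-1:123456789012:listener-rule/app/retail/abc/def/test-rule", GREEN_TG)

# Production traffic shift: the production listener rule follows once testing looks good.
shift_rule_to("arn:aws:elasticloadbalancing:us-east-1:123456789012:listener-rule/app/retail/abc/def/prod-rule", GREEN_TG)
```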

Thumbnail 570

Let's see how this works. The scenario we'll take here is deploying a change in the service. This time we want to be able to roll back quickly if things didn't go well. At the beginning, we need to change the service to use the blue-green deployment strategy. We'll go to the service and click update service. Then we'll go to the deployment options section and select the blue-green strategy. You need to configure the bake time, which is how long you want the blue environment to be kept in case you want to roll back.

Thumbnail 580

Thumbnail 600

After that, in the load balancing section, we will provide a role. The way the deployment controller shifts the traffic between blue and green is by changing the listener rules, so it needs to have the IAM permissions to do that. We're giving it the IAM role with the required permissions. After you do that, you select the production listener, you select the production listener rule, and the same thing for the test listener and the test listener rule.
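
As a rough sketch of what that role needs, the inline policy below grants the Elastic Load Balancing actions involved in describing and modifying listeners and rules. The role name and resource scoping are illustrative, and the role's trust policy must also allow the ECS service to assume it.

```python
import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "elasticloadbalancing:DescribeListeners",
            "elasticloadbalancing:DescribeRules",
            "elasticloadbalancing:DescribeTargetGroups",
            "elasticloadbalancing:ModifyListener",
            "elasticloadbalancing:ModifyRule",
        ],
        "Resource": "*",  # scope this down to your load balancer, listener, and rule ARNs in practice
    }],
}

iam.put_role_policy(
    RoleName="ecs-blue-green-traffic-shift-role",  # placeholder role name
    PolicyName="allow-listener-rule-updates",
    PolicyDocument=json.dumps(policy),
)
```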

Thumbnail 610

Thumbnail 630

For blue-green deployment, we keep these two environments, the blue and the green, so we need two target groups, one for each environment, rather than just one. We need to provide the details of the additional target group here. Then we update the service. This will trigger a deployment, and once this deployment is completed, the blue-green deployment strategy is in effect.
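
Creating that additional target group is a single API call; here is a hedged boto3 sketch, with the VPC ID, name, and health check path as placeholders.

```python
import boto3

elbv2 = boto3.client("elbv2")

green_tg = elbv2.create_target_group(
    Name="ui-green",
    Protocol="HTTP",
    Port=8080,
    VpcId="vpc-0123456789abcdef0",
    TargetType="ip",              # awsvpc/Fargate tasks register by IP address
    HealthCheckPath="/health",    # placeholder path
)
print(green_tg["TargetGroups"][0]["TargetGroupArn"])
```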

Thumbnail 640

Thumbnail 650

We'll push another change to see how this will work with the blue-green strategy. The steps are very similar, but when we reach the deployment stage, we have two browser windows on the right-hand side. One is connected to the production listener at the top, and another is connected to the test listener at the bottom.

Thumbnail 660

Thumbnail 670

Now we are progressing through different stages. At this point, we are at the scale-up stage, where we are creating the green environment and the new tasks that will run the new version. You will notice that the number of tasks is doubling because we now have two environments. Then we get into the test traffic shift stage.

Thumbnail 680

Thumbnail 700

Here we are changing the rule for the test listener to route to the new tasks, and you see the new black background showing up in the lower right window, which is connected to the test listener. Then we get into the production traffic shift stage, where the production listener is being changed to route the traffic to the new tasks. Then we get into the bake time stage, where the two environments are kept. All the traffic is going to the new environment, the green, but the blue is still there. This allows us to roll back quickly if needed.

Thumbnail 710

Thumbnail 720

Thumbnail 730

Thumbnail 750

Thumbnail 760

Let's say we are unhappy with this version for any reason and would like to roll back. This should be quick because the blue environment is still there. We click on the rollback, and then the deployment status will reflect that the rollback is requested. Then we go again to the production traffic shift stage, where we are changing the production listener to reroute the traffic back to the blue environment or the old tasks. You see the yellowish background coming back again. The same thing will happen for the test listener, so we change it back to the old environment. Then we get into the cleanup stage, where we delete the new tasks that were created as part of this deployment. We're back to just three tasks, and the new environment is completely deleted.

Thumbnail 770

Thumbnail 800

Customizing Deployments with Lifecycle Hooks and Key Takeaways

Now, let's say you want to customize this deployment workflow a little bit. Imagine that you want to run some automated testing before you actually shift the production traffic. Or imagine that you would like to add a manual approval step or manual verification step—you want to hold the deployment after the test traffic shift so you can do this testing and then proceed. One thing you can use to achieve that is lifecycle hooks.

Thumbnail 830

Lifecycle hooks are Lambda functions that can be invoked at different points of the deployment workflow. A lifecycle hook is expected to return one of three results: in progress, success, or failure. If it returns in progress, the deployment is held at that stage and the hook is invoked again after a configured number of seconds, until it returns either success or failure. If it returns success, the deployment progresses to the next stage. If it returns failure, a rollback is requested.
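
The skeleton below sketches that contract in a Lambda handler. The exact request and response shape is an assumption here (a hookStatus field is used for illustration), so check the ECS documentation for the precise format, and run_custom_checks is a hypothetical placeholder for your own logic.

```python
def handler(event, context):
    # The event identifies the service deployment and the lifecycle stage that
    # triggered this hook; custom logic (tests, approvals, metric checks) goes here.
    result = run_custom_checks(event)

    if result is None:
        # Not finished yet: the deployment stays on hold and ECS re-invokes the hook later.
        return {"hookStatus": "IN_PROGRESS"}   # field name assumed for illustration
    if result:
        return {"hookStatus": "SUCCEEDED"}     # deployment proceeds to the next stage
    return {"hookStatus": "FAILED"}            # a rollback is requested


def run_custom_checks(event):
    """Hypothetical placeholder: return True/False when done, or None while still pending."""
    return None
```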

Thumbnail 880

Let's take one of the scenarios I mentioned, implementing a manual approval workflow, and see how it can be implemented with lifecycle hooks. We will create a Lambda function configured as a hook for the post-test traffic shift stage. This Lambda function will initially set the state of the deployment to pending and save this state in an S3 bucket, then send a notification through SNS to a reviewer. Because the hook is invoked at the post-test traffic shift stage, the new version is already available behind the test listener for someone to verify.

Once they look into it, verify it, and if they are happy with it, they can go to this state and change it from pending to approved. Then the next time the lifecycle hook is invoked, it will check the status, find it approved, and return success. That is one of the use cases for lifecycle hooks. Let's see how this would work in action. In this case, we'll make a change in the service, but this time we want to stop the deployment process at the post-test traffic shift so we can do some manual verifications.
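
A hedged sketch of that approval hook is shown below: it records a pending flag in S3 on the first invocation, notifies a reviewer through SNS, and keeps returning in progress until the flag is flipped. The bucket, key, topic ARN, and the hookStatus response field are all assumptions, not the demo's actual code.

```python
import json
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

BUCKET = "deployment-approvals"                                    # placeholder
KEY = "ui-service/approval-state.json"                             # placeholder
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:deploy-approvals"  # placeholder


def handler(event, context):
    try:
        body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
        state = json.loads(body)["status"]
    except s3.exceptions.NoSuchKey:
        # First invocation: record a pending state and tell the reviewer that the
        # new version is ready to verify behind the test listener.
        s3.put_object(Bucket=BUCKET, Key=KEY, Body=json.dumps({"status": "pending"}))
        sns.publish(TopicArn=TOPIC_ARN,
                    Message="UI deployment is awaiting approval on the test listener.")
        state = "pending"

    if state == "approved":
        return {"hookStatus": "SUCCEEDED"}   # proceed to the production traffic shift
    if state == "rejected":
        return {"hookStatus": "FAILED"}      # request a rollback
    return {"hookStatus": "IN_PROGRESS"}     # keep holding; ECS will invoke the hook again


# A reviewer (or a small approval script) flips the flag once verification passes:
#   s3.put_object(Bucket=BUCKET, Key=KEY, Body=json.dumps({"status": "approved"}))
```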

Thumbnail 900

Thumbnail 910

Thumbnail 930

OK, so first we'll add the lifecycle hook. We'll go again to the service, and here we are going to update the service. In the lifecycle hook section, we are going to select the Lambda functions that we created with the logic I just explained. After you select the Lambda function, you need to provide an IAM role to the deployment controller that can invoke this Lambda function. So here we select this role. And then we need to select the lifecycle stage that we need this hook to be invoked at. In this case, it will be the post-test traffic shift stage.

Thumbnail 950

Thumbnail 960

Now that is done, let's push the change in the UI service and see how this works. So again, we'll change the background one more time. It will go through the same CI/CD pipeline that we have seen before, and we'll go through the deployment stage now. We will jump to the ECS console. It will progress through the different stages that I talked about again. The same browser window setup: the top is connected to production, and the bottom is connected to the test listener. We are at the scale up stage, so the new version is being provisioned.

Thumbnail 970

Thumbnail 980

And now we are at the test traffic shift stage, so the test listener will be changed to route to the new version, and you see the background in the lower window change. Then we get to the post-test traffic shift stage. Now our lifecycle hook will be invoked, and a notification will be sent based on the logic that we have in the Lambda function.

Thumbnail 990

Thumbnail 1000

Thumbnail 1010

Thumbnail 1020

The user, or the reviewer, will test this new version on the test listener. Based on the results of this verification, they can either approve or reject. In this case, they will approve. What this does is set the approval state to approved in the S3 bucket. Then, when the hook is invoked again, it will return success, and the workflow will progress to the production traffic shift stage. You see the new background is now showing under the production listener.

Thumbnail 1030

Thumbnail 1040

Thumbnail 1050

Thumbnail 1060

We get to the bake time stage. We're not rolling back this time, so we'll just wait for it to lapse, and then we get to the cleanup stage, where the old version is deleted and the deployment succeeds. OK, now I would like to discuss a few key takeaways with you. The first is that there are several deployment strategies available to you: the rolling update and blue-green strategies we discussed, as well as additional strategies like linear and canary, which are variants of the blue-green strategy.

Thumbnail 1090

The main benefit you get with the blue-green deployment strategy is reduced risk. You have these two environments maintained during the deployment, allowing you to roll back if any issue is discovered. There are some trade-offs between the rolling update and the blue-green. With the rolling update, the total capacity utilized during the deployment can be lower, but the drawback you get with that is rollback can be a bit slower. However, with the blue-green deployment strategy, you are using more capacity because you have these two parallel environments, the blue and green, but you get this quick rollback.
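
As a small worked example of that capacity trade-off, assume a service with 4 desired tasks: a rolling update capped at a maximum percent of 150 peaks at 6 tasks, while blue-green briefly doubles to 8.

```python
desired = 4

rolling_peak = desired * 150 // 100  # rolling update with maximumPercent=150 -> 6 tasks at peak
blue_green_peak = desired * 2        # both environments alive during bake time -> 8 tasks at peak

print(rolling_peak, blue_green_peak)  # 6 8
```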

Thumbnail 1110

Thumbnail 1130

Now a call to action for you. These are some additional resources about blue-green. Feel free to explore these resources afterwards. I will just leave the slide for a couple of seconds for you to take a photo. OK, and we have two sessions about blue-green happening today. The first of these is a breakout session here in Mandalay Bay. This is starting at 1:30 PM. So if you wanted to know more about blue-green, please join this session.

Thumbnail 1150

Thumbnail 1160

The other one is a workshop. This is happening at 4 PM at MGM, and in this workshop you are actually going to build the demo that we have shown to you, and you're going to even add more features to what we just explained. OK, that's all that I wanted to cover. Thank you very much for your time.


This article is entirely auto-generated using Amazon Bedrock.
