
Kazuya

Posted on

AWS re:Invent 2025 - Deep Dive: ECS Managed Instances & Blue/Green for Resilient Services (CNS416)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Deep Dive: ECS Managed Instances & Blue/Green for Resilient Services (CNS416)

In this video, Maish Saidel-Keesing and Malcolm Featonby present Amazon ECS Managed Instances and deployment strategies for resilient services. Malcolm explains how ECS Managed Instances bridges the gap between EC2's flexibility and Fargate's simplicity, managing infrastructure tasks like auto-scaling, patching with Bottlerocket AMI within 30-day cycles, and host replacement while allowing instance type selection. The service uses spread placement by default, optimizes for cost through bin-packing on larger instances, and leverages image caching for faster task launches. Maish details three deployment strategies—blue-green for speed, linear for conservative gradual rollouts, and canary as a hybrid approach—all featuring deployment lifecycle hooks for custom validation at each phase. Both features embody AWS best practices, with ECS launching over 3 billion tasks weekly and serving 65% of new AWS customers.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction to ECS Managed Instances and Blue-Green Deployments

Good morning everybody. If you haven't got your earphones on, it would be a good idea to put them on so you can hear us properly and we won't have competition from everybody. Just give me a thumbs up if you can hear us. Perfect, thank you very much. Welcome to a deep dive session on ECS Managed Instances and Blue-Green Deployments for resilient services. My name is Maish Saidel-Keesing. I'm a Senior Developer Advocate with the Developer Relations team, and on stage with me I have Malcolm Featonby. Hi everyone, my name is Malcolm Featonby. I'm a Senior Principal Engineer with the Serverless and Containers org. Super excited to be here.

Thumbnail 60

This is a 400 level session, which means we are assuming you know what a container is and you know what ECS is. If not, I'm sorry, this is going to be very deep in the weeds. But even if you haven't, we'll give a slight interlude on what we're going to be doing today. Our agenda is we're going to have a short overview of what Amazon ECS is for those who are not familiar but do know what containers are and have been using other container orchestrators. Then I'm going to hand it over to Malcolm who will give us a good run through the details of what Managed Instances is and what we've built over a long period of time. Afterwards, I'll come back and we'll dive into how software deployments work and how you can use those properly to gain the confidence that you can deploy your services in a resilient and available way.

Thumbnail 100

Amazon ECS: Scale, Adoption, and Customer Trust

Amazon ECS is a big service. We are what we call a tier one service, which means we are a prerequisite for any new region built within AWS: a lot of the services built in each of these regions rely on ECS being there in order to run. We launch more than three billion tasks every single week across all of our regions globally. That is a significant scale to operate at.

More than 65 percent of our customers starting with AWS actually begin with using Amazon ECS. If you already know how to use Amazon APIs, what the concepts are, and what the constraints are, you probably really know how to use ECS as well, because it's just another compute option similar to EC2, just running containers. You don't have to know any specific systems or any kind of architecture different from how we build the cloud. Just as an anecdote, during Prime Day we had more than 18.4 million tasks launched using AWS Fargate in order to support all the traffic which ran through all the different services within Amazon Prime Day as well. Customers love it.

Thumbnail 190

Thumbnail 210

There are a couple of customers here, but it goes across the breadth and depth of all the different verticals and different industries today. Customers of all sizes are using ECS. These are just a couple of them on screen, but there are a lot more. The reason is because it's a very simple container service and allows them to focus on what we call the most valuable thing: providing value for their own customers. They don't have to worry about the infrastructure in the background.

But it's not only customers. A lot of the services internally in Amazon are built on ECS. Here are a couple of examples: we have Amazon SageMaker, Amazon Lex, Amazon Polly, AWS Batch, and also the Amazon.com recommendation engine for the store. All of these are built on top of ECS to provide you those services when you consume them, either in the Amazon.com store or through the AWS console. The reason is because it's a service which is built for security, for reliability, for availability, and of course for scale.

Removing Undifferentiated Work Through Service Innovation

We released something in September, which was Amazon ECS Managed Instances. I'm going to hand it over to Malcolm to start diving into a little bit of what we did. Thank you so much, Maish, and thank you all very much for taking the time to come and see us, especially given that you chose us over other sessions. We think you made a good choice and we really appreciate that.

Thumbnail 270

I have the distinct privilege of being an engineer working in the serverless container space and working with the ECS engineering team on an ongoing basis. One of the things that we spend a lot of time thinking about is how we can deliver features and capabilities that remove what we refer to as undifferentiated work from your development teams. Really what that means is we want to make sure that we're delivering capabilities which allow your teams to focus as much of your energy as possible on delivering the thing that makes your product, your service, your business special for your customers.

We want to take on as much of the responsibility and burden of work that is important but not differentiating for your business: things like managing the underlying compute, ensuring that your EC2 instance is healthy, and the infrastructure aspects that go along with that. We also help you with the complexities of deployment, software development, and software application lifecycle management.

The team spends a lot of time thinking about what features we can deliver to provide that behavior. This year, we've delivered a number of features, but the two that I'm really excited about are managed instances in the infrastructure space, and the set of features we've delivered in our deployment space, which Maish is going to talk to you about. When Maish and I were planning what we would speak to you about, we thought it would be useful to explore and give you insight into both how the service works and why we designed it the way we do.

The intent here is to make sure that the undifferentiated work we take on has the best practices we have learned over time baked into it. At this point, ECS has been around for 11 years, which we're excited about. Given the number of applications and services within AWS and Amazon, as well as the set of customers we serve, we've had an opportunity to learn a great deal. We try to bake a lot of what we see as best practice into the services and deliver that to you in a way that's easy for you to consume.

Thumbnail 420

Understanding ECS Compute Engines and the Shared Responsibility Model

We're going to dive into some of these things and work through how this works, then point out the areas where we're delivering best practice in these particular feature sets. In terms of infrastructure, ECS manages infrastructure using a concept that we call a compute engine. When it was first launched 11 years ago, ECS delivered the first compute engine, which is ECS on EC2. ECS on EC2 is about bringing your own EC2 instance. We orchestrate the application itself, but you bring your EC2 instance and you manage that.

In 2017, we delivered our serverless offering with ECS on Fargate, and that's become very popular with our customers. Fargate is on the other end of the spectrum. If ECS on EC2 is about bringing your own capabilities and your own EC2 instance, then you have to think about managing that particular instance. On the other side of the continuum, with ECS on Fargate, Fargate takes on the responsibility and burden of the underlying compute for you and delivers it entirely in a serverless way, which allows you to focus exclusively on your application.

You can think of these as two sides of a continuum. On one side, you have a high degree of control and flexibility, but as a side effect, you also have a high degree of burden. You have to think about how to scale, how to manage, how to patch, and so on. That's ECS on EC2. On the other side of the spectrum, with ECS on Fargate, a lot of that complexity is taken away from you. But as a trade-off, and this is always true with managed services, when you delegate responsibility to the managed service, you're also giving up a degree of control.

In order for us to build services that deliver an expected outcome based on the behaviors we need to apply to that service, we need to make assumptions about how that service works. We need to constrain, at least to some degree, the amount of flexibility you have. You can imagine these two things as a continuum, and the way we talk about that in AWS is the shared responsibility model. The shared responsibility model is really the partnership between AWS and yourselves, where we agree which pieces of the stack, going from the bare metal all the way up to your application, you're looking after and which pieces we're looking after.

Thumbnail 570

If you look at ECS on EC2, the shared responsibility model is about us looking after the underlying infrastructure and compute. We make sure that you have EC2 instances that are healthy and that you have those EC2 instances in the availability zones that are needed. We provide you storage, networking, monitoring capabilities, and the ECS control plane, where the control plane is the management layer that's actually doing the provisioning of your application. Above that, everything else becomes your responsibility: compute auto scaling, availability zone distribution, spread and placement, launch templates, instance types, and of course the application itself.

Thumbnail 630

Thumbnail 650

That can be incredibly powerful because it gives you flexibility. But it also means that, going back to our desire to give you options to hand off undifferentiated work, you are taking on a bunch of that undifferentiated work yourself. On the other hand, we have Fargate. On that continuum of shared responsibility, Fargate offloads a lot of that functionality to AWS. I think that's why many customers really enjoy the Fargate experience. You can focus exclusively on your application, and we take on the responsibility of making sure that you have the server. In fact, you don't even have to worry about the fact that there's a server, because you're really just focusing on the ECS cluster, the ECS service, and the ECS task.

Thumbnail 690

Introducing ECS Managed Instances: Bridging EC2 Flexibility and Fargate Simplicity

What our customers tell us, and we love getting that feedback from you, is that they love the degree of flexibility you can get on ECS on EC2, but they're not super excited about taking on that additional responsibility. On the other hand, they love the simplicity of management in Fargate, but they would really like an opportunity to get some of that richness, that experience, and that diversity of choice that ECS on EC2 offers. In order to respond to that, we've delivered ECS Managed Instances. We're really excited about this offering. It was delivered in September, so it's fairly new. We're very keen to get you to have a look at it, give it a go, and then give us that feedback.

Thumbnail 760

In terms of shared responsibility, what ECS Managed Instances looks to do is deliver the EC2 flexibility in terms of choice, but the Fargate experience. If you look at what we're doing, we're basically taking on a bunch of the responsibility that we would previously have offloaded to you in terms of still providing that compute, the availability zone, the network, the storage, and the ECS control plane that's managing your application provisioning. But on top of that, we're also managing things like compute auto scaling, distribution and placement. We create launch templates for you, and we're managing the patching of your EC2 instances. But we're giving you the flexibility to make choices about what compute you need.

Thumbnail 840

The intention here is really to deliver that sweet spot between the ECS on EC2 experience and the Fargate experience, and we believe that we've got that right. We're waiting for your feedback to make sure that we did. The way that ECS manages and provisions compute through these compute engines into your cluster is through capacity providers. A capacity provider is a configuration that you describe, which talks about the kind of underlying compute that you need, and then you associate that capacity provider either with your cluster or with your ECS service. Or, if you're taking control of the scheduling yourself, you can also provide that capacity provider through the RunTask API.

Thumbnail 880

Prior to the launch of ECS Managed Instances, ECS offered two capacity providers. There was the managed ASG capacity provider, which effectively manages an auto scaling group that you bring and leverages that auto scaling group to provision compute into your ECS cluster as EC2 instances. And then every cluster comes with a default capacity provider, which is the Fargate capacity provider, which you can select. That is the default experience, so if you're launching a task into a cluster without a capacity provider, typically you're going to land on Fargate. With Managed Instances, we've introduced a new capacity provider called the ECS Managed Instances capacity provider.

That sits between the ECS managed auto scaling and the ECS Fargate capacity provider. It's an additional capacity provider type. These all work together if that's your choice. One of the things that we really want to make sure that we do well at AWS is give you choice. We acknowledge that there is no one-size-fits-all solution. All of you bring value to your customers and to your business because of the diversity of the solutions that you have. So we want to give you the choices that you need in order to make sure that we meet you where you are. That's really why we deliver this set of choices.

Configuring ECS Managed Instances Capacity Providers

With ECS Managed Instances, we really wanted to optimize for that Fargate experience. But keep in mind that there is this continuum, and we're somewhere in the middle. So there's always going to be slightly more configuration than the pure Fargate experience where you're just configuring your application. The way that you go about configuring an ECS Managed Instances capacity provider is you'll provide to us, as a minimum, two IAM roles. The first role is the instance role.

This role is associated with the EC2 instances that we provision and gives those instances the permissions they need to communicate with the ECS control plane and management layer. It allows the ECS agent running on that instance to talk back, get the configuration it needs for your cluster, register with that cluster, and work with the scheduler to provision tasks.

The other role you give us is the infrastructure management role. This role gives us permission to take responsibility for managing these EC2 resources in your account. ECS Managed Instances is built on top of EC2 Managed Instances, which is a capability that EC2 delivered that allows a managed service like ECS to manage EC2 instances in your account. When ECS Managed Instances provisions these EC2 instances, you can see them when you call DescribeInstances, and you can see them in the EC2 console. We really like that transparency and want to make sure you can see what we're doing. We're not hiding anything here.

When you look at those instances, you'll see that they are tagged as managed by ECS, and you can exclude them from your own operational tooling because we're taking care of all of that. When you provision your capacity provider into your cluster, one of the things we've focused on, and we do this with Fargate as well, is that your capacity provider is going to be using spread placement by default. We believe spread placement is the right approach because it allows you to maximize your availability.
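To make that concrete, here is a minimal sketch of listing the instances that ECS Managed Instances is running on your behalf, filtering on the "managed by ECS" tagging mentioned above. The exact tag key is an assumption on my part (shown as a placeholder), so check the tags on your own instances first.

```python
# Minimal sketch: list instances ECS Managed Instances has provisioned in your account.
# The tag key below is a placeholder assumption; inspect your instances to confirm it.
import boto3

ec2 = boto3.client("ec2")

response = ec2.describe_instances(
    Filters=[
        {"Name": "tag-key", "Values": ["AmazonECSManaged"]},              # hypothetical tag key
        {"Name": "instance-state-name", "Values": ["pending", "running"]},
    ]
)

for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["InstanceType"], instance["Placement"]["AvailabilityZone"])
```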

Thumbnail 1040

The way you do this is the same way you would when provisioning your ECS service for Fargate and specifying the subnets you want. Each subnet specifies the underlying availability zone where the task will be launched. With the ECS Managed Instances capacity provider, you're specifying a set of subnets that we're going to use to provision EC2 instances into those availability zones. We will use that to do spread placement. Once you've created your capacity provider, given us those two roles, and associated it with your service or your cluster, at that point you're just talking services. It's just like Fargate. You create your ECS service, your task definition, and provision it. Based on what you need, we'll go to EC2 Managed Fleets, get the capacity you need, provision it into the cluster, and you're good to go. We'll start landing tasks on it.
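As a rough sketch of what that configuration can look like from code, the snippet below creates a Managed Instances capacity provider with the two roles and the subnets, then attaches it to a cluster. The create_capacity_provider and put_cluster_capacity_providers calls are real boto3 APIs, but the field names inside the Managed Instances provider block are assumptions made for illustration; the ECS documentation linked at the end of this post has the authoritative shape.

```python
# Hedged sketch of a Managed Instances capacity provider. Field names under
# managedInstancesProvider are assumptions for illustration only.
import boto3

ecs = boto3.client("ecs")

ecs.create_capacity_provider(
    name="my-managed-instances-cp",
    managedInstancesProvider={                      # assumed parameter name
        "infrastructureRoleArn": "arn:aws:iam::123456789012:role/EcsInfrastructureRole",
        "instanceLaunchTemplate": {
            "ec2InstanceProfileArn": "arn:aws:iam::123456789012:instance-profile/EcsInstanceRole",
            "networkConfiguration": {
                # One subnet per Availability Zone enables spread placement.
                "subnets": ["subnet-aaa", "subnet-bbb", "subnet-ccc"],
            },
        },
    },
)

# Associate the capacity provider with a cluster so services can use it.
ecs.put_cluster_capacity_providers(
    cluster="my-cluster",
    capacityProviders=["my-managed-instances-cp", "FARGATE"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "my-managed-instances-cp", "weight": 1},
    ],
)
```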

Thumbnail 1060

The way we make decisions about what EC2 instance to use is based on what the vast majority of customers end up using when they're running on EC2 instances, and also with consideration for what we're provisioning under the hood for Fargate. We've come up with a list of defaults. Unless you actually have a need for specialist compute or need to be more prescriptive, you can just leave us to take care of what instances are available. We're typically going to choose from the M or R families, the general purpose and memory optimized compute families. We're going to make sure we're making that choice based on your workload, and we'll get into exactly how we go about making that choice.

Thumbnail 1100

However, if you do need more control, and this goes back to the idea that we're sitting on a continuum and want to make sure we're giving you the flexibility you need, ECS Managed Instances allows you to capture that as a set of attributes using attribute-based selection. If you're familiar with EC2's auto scaling groups and fleet configurations, then you'll know that attribute-based selection is a mechanism you can use to describe the kind of EC2 instance you might want without actually specifically stating that EC2 instance. You can specify a specific instance type and size, say a 48xlarge in the C6 family, if that's what you want, but you can also more generally describe the kinds of compute you want.

The reason that's really powerful and why you should try to be as general as possible is because it gives us at ECS the opportunity to make sure we find the capacity you need through a broader, more diverse selection. The key here is really that you can be as prescriptive as you want with the compute you want, or you can be as general as you want, or you can just leave it up to us. That really goes to that shared responsibility continuum. The more prescriptive you become, the harder it can be for us to make sure we're getting you the lowest cost, most available instance.

In some workloads, that's absolutely required. It may be that you need a GPU and you specifically have to target that particular instance type. You can do that with ECS Managed Instances. At the same time, if you really just have a general requirement for something that's network optimized, you probably want to try to be as broad as possible because the benefit you get is that it offloads the decision making to us.

Then what we'll do is we can actually go to EC2 and make sure that we're getting you that compute. So really the rule of thumb here is, as you're using this, try and be as general as you can be, because that maximizes the availability posture of the service. It gives us a much better chance of getting you the compute that's available at any point in time.
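For a feel of what attribute-based selection lets you express, here is a hedged example of an instance requirements block. The field names come from EC2's attribute-based instance type selection (as used by EC2 Fleet and Auto Scaling); exactly how they nest inside a Managed Instances capacity provider is an assumption, so treat this as a sketch of the kinds of constraints you can state rather than the exact ECS API shape.

```python
# EC2-style attribute-based requirements; how these plug into the Managed Instances
# capacity provider is an assumption for illustration.
instance_requirements = {
    "VCpuCount": {"Min": 2, "Max": 16},
    "MemoryMiB": {"Min": 4096},
    "CpuManufacturers": ["intel", "amd"],     # could also include "amazon-web-services" for Graviton
    "InstanceGenerations": ["current"],
    "BurstablePerformance": "excluded",
}

# Rule of thumb from the talk: keep these ranges as broad as your workload allows,
# so ECS has a larger, more diverse pool of instance types to draw from.
```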

Thumbnail 1230

Task Definition Configuration and Instance Sizing Decisions

The way that we go about making decisions about how to size the instance that we need and how to get that compute for you is we use the task definition configuration. So unlike with Auto Scaling Groups and managed scaling groups where you're actually specifying the EC2 instances and then we're just placing the tasks, here we're actually using the task definition to be able to decide what EC2 instances to get for you.

The way that works, if you think back to how task definitions are configured, is that at a container level, keeping in mind that you can have multiple containers in your task, you can specify as a minimum the memory reservation for that container. The memory reservation is the minimum amount of memory that is reserved for that container. It doesn't prescribe an upper bound; it's just saying I need at least X.

You can then add to that the memory limit, which is the memory that you need for the container and represents the upper bound. Now you're saying this container needs to be bounded to this amount of memory. Also at the container level, you can specify the CPU you need. Container-level CPU is not a hard cap for that container; it becomes a ratio of the CPU available across all of the containers in that task.

That's useful if you have containers that are bursty and need to consume more or less. It also means that you can make sure that one of your containers, perhaps your primary container, is getting eighty percent of the compute, and maybe your logging sidecar, or whatever it is, is getting less of that compute. At the task level, you can specify memory and CPU, and this is how Fargate works: that specifies exactly how much memory and CPU will be allocated to that task. It's effectively a hard bound.

The way that we go about doing this is when you configure your ECS task for ECS Managed Instances, we walk through the set of configuration that you have and decide based on that what we can go and get for you. As a minimum, you have to specify memory reservation, because we will then go and find an EC2 instance that will fit at least that. But you can specify any one of these things, and it's valuable to try to move higher up the stack. The less prescriptive you are about what you're getting, the more elasticity you end up with in terms of the underlying compute.
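Here is a minimal task definition sketch that shows those sizing knobs together: container-level memoryReservation (the soft minimum), memory (the hard limit), and cpu (the share ratio across containers), plus optional task-level cpu and memory hard bounds. Names, images, and values are illustrative.

```python
# Minimal task definition sketch with illustrative names, images, and sizes.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="orders-api",
    networkMode="awsvpc",
    # Task-level values are a hard bound for the whole task (this is how Fargate sizes tasks).
    cpu="1024",          # 1 vCPU
    memory="2048",       # 2 GiB
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/orders-api:latest",
            "essential": True,
            "cpu": 819,                # ~80% of the task's CPU shares
            "memoryReservation": 1024, # soft minimum: at least 1 GiB
            "memory": 1536,            # hard limit for this container
        },
        {
            "name": "log-router",
            "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
            "essential": False,
            "cpu": 205,                # ~20% of the task's CPU shares
            "memoryReservation": 256,
        },
    ],
)
```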

The difference between Fargate, where you're getting that compute, and managed instances is that with managed instances, these workloads are multi-tenant. Multiple of your tasks will be running on the same EC2 instance. That's valuable because it allows you, depending on how you specify this, to allow those workloads to burst from one to the other.

I was chatting to a gentleman earlier, and he was telling me about Java workloads and the fact that they run a lot of Java. A lot of Java, especially when it's doing just-in-time compilation, can burst CPU quite high when it first starts out. What we found with customers is that with Fargate, the end result is you actually have to specify the CPU you need at P100, effectively the maximum CPU you need. With ECS Managed Instances, we can leverage the elasticity of the underlying compute, so you can size for your steady-state needs rather than the peak, and the task can burst into the instance's spare capacity to handle that JIT compilation.

Optimizing Task Placement for Speed and Cost Efficiency

Let's talk a little bit about how we go about making the decisions as to what instance to launch. One of the things that we really index heavily on is, in order to make sure that we're delivering you the most resilient service, we want to make sure that we can provision your workloads as fast as we possibly can. That's our goal. Effectively, as your service scales up, we want to be able to make sure that we're launching those tasks quickly because we want to get the workloads there to make sure that they're meeting your customer demand.

So the way we go about doing this is, keeping in mind that large clusters are very active: there's workload going on all the time, there are multiple services, and there's scaling up and scaling down. At any point in time, the ECS scheduler will be looking to see what work needs to be done. It takes a snapshot of all of the tasks that are currently in a pending state, basically tasks that need to be provisioned onto compute.

Thumbnail 1420

It's going to index on making sure that we first use the compute that's already there. The reason for that again goes to, we want to make sure that we're launching your tasks as quickly as we can. So if we can fit the tasks that you have on the instances that you have while meeting your placement constraints, making sure that we're spread across Availability Zones, making sure that we're applying whatever placement constraints you've applied to distinct instances, we will make sure that we're first landing on the compute that's there. What we do next is we then look at all of those tasks that we were unable to place.

We examine the placement constraints that have been applied to those tasks, including their need for availability zone spread. Based on your capacity and your managed instances capacity provider specifications, we will retrieve all of the EC2 instances that meet that requirement. We then perform what we call first fit decreasing, where we walk through the set of tasks that need to be placed and look for the largest EC2 instance that meets those requirements in order to place it. Our goal is spread placement first. The first phase is to place on what we have. The second phase is spread placement, and then the third phase is going to be bin pack. The reason for that is because we want to make sure that we're optimizing cost for you. We want to make sure that we're finding the largest instance at the best possible price and putting as much onto it as we can.
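As a toy illustration of the first fit decreasing idea, the sketch below sorts pending tasks from largest to smallest and places each onto the first candidate instance with room, opening a new instance when nothing fits. This is for intuition only and is not the ECS scheduler's actual logic, which also weighs availability zone spread, placement constraints, and price.

```python
# Toy illustration of first fit decreasing bin packing, not the real ECS scheduler.

def first_fit_decreasing(task_sizes, instance_capacity):
    """Return a list of bins (instances), each a list of task sizes."""
    bins = []
    for size in sorted(task_sizes, reverse=True):   # largest tasks first
        for b in bins:
            if sum(b) + size <= instance_capacity:  # first instance with room
                b.append(size)
                break
        else:
            bins.append([size])                     # open a new instance
    return bins

# Example: vCPU requirements packed onto 16-vCPU instances.
print(first_fit_decreasing([4, 2, 8, 1, 6, 2, 3], instance_capacity=16))
```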

Thumbnail 1570

Let's move on and talk a little bit about how that can be optimized. The benefit that we get in using these larger instances and in using ECS Managed Instances is that you're actually reusing that underlying compute. That's incredibly valuable because it goes to what we were saying earlier about the fact that we want to launch quickly and get you running quickly. There are two upsides there. The first is that you've got this elasticity that we talked about because you've got the larger instance, so you can burst into it if you need. The second is image caching: because you're landing all of your workloads on this instance, the first workload will pull the image, and subsequent workloads will use the image cache. If you have multiple copies of the same task landing on the same EC2 instance, those subsequent tasks are reusing the image that's already there. It saves you a bunch of time because you're not pulling that image every time. This significantly improves throughput, and that combination of cost and throughput is really what leads us to index quite heavily on choosing a larger instance type.

Thumbnail 1640

Host Replacement, Rebalancing, and Cost Optimization Workflows

One of the things that we will be doing in your cluster on an ongoing basis is host replacement. Remember, we talked about the fact that there's a shared responsibility model, and part of this offering is that you've offloaded to us the responsibility of making sure that your host is patched and that your host is compliant. We're going to be looking at opportunities to make sure that we're meeting that compliance requirement. At the same time, there are a bunch of other things that could be going on that actually drive the need for replacing the host. The first is that as an operator, you may decide that you don't particularly want this particular host. For some reason, you decide that you want to get rid of that one, and so you can actually drain the host, select it, deregister or force deregister, and then we would run it through what we call a drain cycle or a host replacement cycle.
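For the operator-initiated case, the drain itself is a standard ECS operation. A minimal sketch, with placeholder cluster and instance values:

```python
# Sketch: manually drain an instance you no longer want, which triggers the drain /
# host replacement cycle described above. Identifiers are placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.update_container_instances_state(
    cluster="my-cluster",
    containerInstances=["arn:aws:ecs:us-east-1:123456789012:container-instance/my-cluster/abc123"],
    status="DRAINING",   # the scheduler starts replacement tasks before stopping the old ones
)
```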

Another driver is maintenance events. ECS is continually monitoring to see if there are EC2 maintenance events applied to that particular EC2 instance that mean we need to replace it, or maintenance as a result of patching, where the operating system needs to be patched, and so we'll be applying those. There's auto scaling, so your service is continually scaling up and scaling down, and as you scale up and scale down, we need to make sure that we're taking corrective action. Then there's cost optimization. We're continually looking for opportunities to make sure that we're giving you the best possible value. We don't want idle instances sitting around.

Thumbnail 1720

The way that this works is that effectively, at the point that we get a notification that an instance needs to be replaced for whatever reason, we enter the draining lifecycle and the first thing that the scheduler is going to do is replace all of the tasks that are managed, running on that instance. We always make sure, going back to resilience, we always start a task before we stop a task. It's a golden rule that we have. We have this conversation with the scheduling team all the time. It is the golden rule. We start things before we stop things to make sure that we're always meeting your minimum healthy requirement for your service. So we're always meeting your resiliency goals.

Thumbnail 1790

The way that this will work is the scheduler is going to first launch a bunch of tasks. Now, let's assume in this particular case, there isn't enough compute in your cluster. Those tasks are going to go into pending, and we're going to go back through that workflow that we were talking about earlier where we're going to look for the capacity and the compute that we need, find the compute that meets the placement requirements for these particular tasks and provision that into the cluster. We're then going to provision those tasks onto this EC2 instance. Now, you'll notice here that we've provisioned the orange and the green tasks, the service tasks, the managed tasks. What we have not done is provisioned the unmanaged task. So if you launch a task onto ECS using run task or start task, then the way that we think about that is when a task is marked as unmanaged, that's a mechanism to tell us that your service is taking responsibility for scheduling. Rather than using our scheduler, you want to own the scheduling through some external process.

We don't manage those tasks, but we will make sure that they are the last to be deprovisioned, and we will always honor the task shutdown grace period and any protection mechanisms, including task scale-in protection, that you've enabled. If an unmanaged task is currently in its shutdown grace period or in a scale-in protection phase, we won't take corrective action until it exits that phase. Unmanaged tasks are also the last things to be deprovisioned off the instance.
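Task scale-in protection is set by the task (or by your tooling) through the ECS API. A small sketch, with placeholder identifiers:

```python
# Sketch: mark a task as protected (for example while it drains a work queue) so ECS
# will not stop it until the protection expires or is removed. Identifiers are placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.update_task_protection(
    cluster="my-cluster",
    tasks=["arn:aws:ecs:us-east-1:123456789012:task/my-cluster/0123456789abcdef"],
    protectionEnabled=True,
    expiresInMinutes=60,   # protection automatically lapses after an hour
)
```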

The first thing we will do is launch replacement tasks for you. The scheduler will place the managed tasks on the new capacity, then tear down the managed tasks on the draining instance, and finally, honoring the constraints above, tear down the unmanaged tasks. Once we've done that, we'll tear down the EC2 instance and get back to a point where your processes and services are up and running, and we've removed any idle compute that you no longer need.

One of the things you might notice in your services is that we're optimizing for throughput, so we really want to make sure we're launching quickly. As a side effect, the distribution of compute in your cluster may not align with the availability zone balance that you'd want for your ECS service, and your service can end up in an unbalanced state for a short period of time. ECS is continually looking for opportunities to rebalance. Just before re:Invent last year, we launched a feature for ECS services called availability zone rebalancing, where you can select for that service that you want it to be rebalanced. This is now a default behavior. We work to bake default best practices into the service, and this is a best practice. Any new service you create will automatically rebalance.

What happens there is that the scheduler is looking for these opportunities. It will identify that a particular service is unbalanced and then take corrective action to rebalance that service by launching a new task first into the appropriate availability zone and tearing down the previous task. That will kick off that workflow if you need new compute. In that case, we're going to get an EC2 instance and make sure we put it right where it needs to be in that availability zone so we can provision onto that instance for you and make sure you're continually balanced. As a result, you'll end up in a world where you're continually meeting that availability requirement. Your tasks are always appropriately spread across the availability zones that you've configured.

One of the things we watch for on an ongoing basis is a notion we call idle hosts. Idle hosts really just means making sure we don't end up in a position where we've deprovisioned a bunch of tasks and we've got a bunch of EC2 capacity sitting around, because that's wasteful and wastes your money. The way we do that is the ECS scheduler is continually monitoring for stop events. Anytime a task is terminated, we know there's an opportunity for us to go and apply some optimization. As soon as we see that event, we'll do a sweep of the entire cluster. We'll identify opportunities where we think we can do a better job and move tasks to make sure we're packed appropriately, honoring spread and placement constraints, then deprovision those instances, again honoring start before stop.

The other thing we will be doing is looking for opportunities to downscale. If you're running on a larger instance, many workloads may be diurnal, so you scale up during the day when you need larger instances and more capacity. But as you get to the end of the day, your service scales in and customer demand has dropped. In that case, those EC2 instances are no longer needed. There are two ways to handle that: we may deprovision them, or we may scale down, choosing a new, smaller instance type for you and moving you onto it in order to save you money. Let's talk a little bit about how we do patching.

Automated Patching with Bottlerocket and Choosing the Right Compute Engine

One of the things we talked about is that we want to take on the responsibility for patching. Our customers have told us they want to make sure they're compliant, but keeping your EC2 instance patched and your Amazon machine image patched is hard work. There's a lot that goes into that, and it generally doesn't tend to be differentiated work for many businesses. It's not the thing that's going to make you shine. It's just busy work, albeit incredibly important work. So we want to take on that responsibility.

One of the key features that ECS Managed Instances delivers is this patching workflow. The way this works is we want to make sure that within a 30-day period, we have patched all of your capacity. On day zero, at the point that you provision your task, we're going to give you a brand new EC2 instance. We always favor provisioning new instances. We never patch in place. We always provision a new instance because under the hood, EC2 has its fleet of available capacity against which it is continually running health checks. Getting a new EC2 instance means that health checks have recently been applied to it, so it's going to be a good, healthy instance. We're not going to provide one that is unhealthy.

We're going to give you a healthy instance and then give you the latest Amazon machine image that we're using. ECS Managed Instances is built on Bottlerocket. Going back to that shared responsibility model, with ECS Managed Instances, you don't get to bring your own Amazon machine image. We deliver the AMI because we need to make sure that we understand how that AMI works. We need to be able to patch it, deploy it, and run your workloads on it in a way that ensures you're going to get the outcome that you need. If you do need to manage your own Amazon machine image, then ECS on EC2 is probably a good fit for you. But in many cases, customers don't need that responsibility.

Thumbnail 2240

The Amazon machine image we're using is Bottlerocket, which is AWS's container-optimized operating system. We delivered it, and in fact, that team works very closely with our ECS and EKS teams to make sure we're continually delivering the most optimal configuration we can. Under the hood, we're going to be giving you the latest, up-to-date Bottlerocket image. We're going to deliver that on day zero, and then 14 days after that, we're going to start looking for opportunities to patch it. We want to fit into this 30-day patch cycle. We want to make sure we've always replaced all of your AMIs within a period of 30 days, because that really helps ensure we're meeting your compliance requirements for those of you in regulated industries that have that pressing requirement.

Thumbnail 2320

On day 14, we start looking for opportunities to optimize and replace these instances. One of the things we do is ECS Managed Instances honors the EC2 maintenance window. For those of you that have used EC2 extensively, you've probably encountered EC2 maintenance windows. It's a mechanism you can use to configure and tell us when is a good time for us to take corrective action for your underlying compute. As a minimum, you need to give us two windows within a one-week period where we can take that corrective action. ECS Managed Instances is leveraging that configuration. You can configure through EC2 maintenance windows when you want us to take this corrective action, and we're only going to take the corrective action during that period. You can optimize for weekends or downtimes or whatever works best for you, your low traffic points in the day.
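The session refers to this as the EC2 maintenance window. One plausible way to express such a window in EC2 is an instance event window; whether that is exactly the mechanism ECS Managed Instances consumes is an assumption on my part, so treat the sketch below as illustrative and confirm against the current documentation.

```python
# Hedged sketch: define two weekly low-traffic windows using EC2 instance event windows
# and associate them with instances by tag. Tag key/value and times are placeholders.
import boto3

ec2 = boto3.client("ec2")

window = ec2.create_instance_event_window(
    Name="ecs-mi-patching-window",
    TimeRanges=[
        {"StartWeekDay": "sunday", "StartHour": 2, "EndWeekDay": "sunday", "EndHour": 6},
        {"StartWeekDay": "wednesday", "StartHour": 2, "EndWeekDay": "wednesday", "EndHour": 6},
    ],
)

ec2.associate_instance_event_window(
    InstanceEventWindowId=window["InstanceEventWindow"]["InstanceEventWindowId"],
    AssociationTarget={"InstanceTags": [{"Key": "environment", "Value": "production"}]},
)
```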

At that 14-day point, we're going to give you a brand new instance and a brand new AMI. It's possible that over that period we haven't got to everything, although under most circumstances, we will have completed that patch cycle. It's also important to keep in mind that this patch cycle isn't a single event where everything in your cluster is replaced on day zero or day 14. It's an ongoing process. We're continually evolving and continually making sure that your cluster is okay. You'll see a fair degree of mutation in your cluster as we're taking this corrective action. It's not high mutation, but there's an ongoing background mutation as we manage that entropy for you.

Thumbnail 2400

We're taking on the responsibility of making sure we're keeping you compliant. At day 21, we start to get a little more aggressive, because we've made a commitment to you that we will keep you compliant within this 30-day window. At day 21, we're actually going to start taking corrective action where we effectively do eviction: we'll find the instance that needs to be patched, move workloads off it, and replace it. Typically, we will already have completed the replacement during the maintenance windows between day 14 and day 21, so it's very unlikely we get to this point. But at day 21, we need to start being a little more aggressive because we have to replace that AMI. Having completed that cycle, we then get to a point where you have a clean, healthy cluster. All of your instances are replaced.

Thumbnail 2420

And you're currently running the latest and greatest Bottlerocket. So one of the things you might be asking yourself is: that's great, there's this continuum and I've got all these decision points. Now I've got a bunch of things to decide. When would I use one versus the other?

Our recommendation is that more often than not, most customers are best served by Fargate. Typically, your workload, especially if it's a stateless workload that is CPU and memory bound, you can get everything that you need from Fargate. That's where we would recommend that you start. It's simple, it works, and customers love it.

If you find that you have a need either to have a larger instance or a larger task size, Fargate has upper bounds in terms of the amount of vCPU and memory that it will configure, and you need something bigger, then ECS Managed Instances is for you. Equally, if you need control over the underlying compute, if you really want to make sure that you are choosing that particular instance type or that family of instance types, or if you need accelerated instance types and GPUs, ECS Managed Instances is a really good fit.

It's the sweet spot between the Fargate experience and the ability to own and manage the configuration that you need in order to get that compute. For those customers that need to manage the actual operating system, where you want to be able to bring your own AMI or tune the kernel or whatever it is, ECS on EC2 is for you. That is probably the small minority. There are very few cases where that is needed, but if it is needed, you have that option.

We're thinking about this in the following way. We think a lot of customers today that needed additional functionality that Fargate wasn't quite able to deliver in terms of instance type selection, and are currently running on ECS on EC2, will find that ECS Managed Instances is probably a great fit, and that migration is fairly seamless. We'd encourage you to have a look at that.

At the same time, those customers that are running on Fargate and get to a point where you feel like you need more compute than Fargate is able to deliver, or you need something that's optimized for network or optimized for storage throughput, ECS Managed Instances is a great fit for you as well. So it's sort of the sweet spot between those two.

ECS Deployment Strategies: Blue-Green, Linear, and Canary Approaches

With that, I'm going to hand over to my friend and colleague, Maish. Thank you all. Thank you very much, Malcolm. Everybody hear me okay? Perfect. So we've been talking about infrastructure up until now, and I want to move over to the software part, which is again on that continuum of shared responsibility. This part is actually your responsibility. We still, as part of the shared responsibility model, provide you with the infrastructure and mechanisms underneath that allow you to deploy new versions of the software that makes up your application.

Thumbnail 2590

In November, we revamped the way we do Amazon ECS deployments. The reason, of course, is because of feedback from our customers. They had some challenges with the way we were doing it because it was dependent on an external service called CodeDeploy, and we wanted to give you a more streamlined experience. Just at the beginning of last month, we provided that experience which is called ECS Deployments with three different options for you to deploy your services and upgrade your applications running on ECS.

Thumbnail 2620

Thumbnail 2640

Blue-green deployments, linear deployments, and canary deployments. We're going to go through each one of these in some detail to explain exactly what it is. But first, an overview for those who don't usually do this in their day-to-day life. A blue-green deployment is when you have a blue service running in production across multiple availability zones and you would like to upgrade it. In that case, you provision a new version of that service across the availability zones, meeting your requirements, and when you're ready, you flip the load balancer over to the new green service, directing all your traffic to it, and then remove the previous blue version.

Thumbnail 2660

Thumbnail 2690

It's a simple process, but let's go into a little more detail on how this actually works. There are six different stages of a deployment in Amazon ECS: preparation, deployment, testing, traffic shift, monitoring (bake time), and cleanup. In the preparation stage, we go back to the fact that you have one version, blue, serving 100 percent of your traffic with 100 percent of the capacity in your ECS cluster. When you start a deployment, we begin the pre-scale-up. The pre-scale-up will not yet provision any resources. These are very distinct phases within the deployment process, and you can get status on each and every one of them.

We will provision the infrastructure underneath, which in this case means preparing the routing rules in the load balancer and the target groups, but we will not yet start to provision resources. This is the preparation phase.

Thumbnail 2720

The deployment phase has two parts. When we start the deployment phase, we will start the scale up of the green version, where we will provision 100% of the capacity. Again, it is always 100% of capacity that we are provisioning in the background, because we want you to always have enough capacity to move over when you are moving to a new version. After the scale up, we will verify that the green service has started and that 100% of the traffic is still being routed to the blue version, with no traffic being moved over to green.

Thumbnail 2760

During the testing phase, we will validate that the green environment is able to actually accept traffic by injecting some test traffic into the load balancer, which will start verifying that it is working and able to respond with all the health checks passing, and your application is behaving and performing the way you expected. You will see at the bottom there is a smaller Lambda icon because what we have tried to provide for you is the ability at every single stage of the deployment or phase of the deployment to run your own tests. These are done in what we call deployment lifecycle hooks.

Thumbnail 2810

A deployment lifecycle hook is essentially a Lambda function which you can configure or define as part of the deployment to specify what you want to test or verify at that point. It is essentially just code, like any other Lambda code, where you provide the criteria for what passes, what is still in progress, and what fails. Based on the result, the deployment either continues to the next phase or automatically rolls back. This can be, for example, verifying that the SHA of the container image matches the one you are actually supposed to be deploying, with no discrepancy, or that the target groups have been validated and actually exist on the load balancer so that you know everything is working. You define exactly what these tests should be.
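A minimal sketch of what such a hook could look like as a Lambda function is below. The talk describes hooks that report success, in progress, or failure at each phase; the exact event fields and the response contract shown here are assumptions for illustration, so check the ECS deployment documentation for the real payload shape.

```python
# Hedged sketch of a deployment lifecycle hook Lambda. Event field names and the
# "hookStatus" response values are assumptions based on the talk's description.
import json

def handler(event, context):
    print("Lifecycle hook invoked:", json.dumps(event))

    # Example check inspired by the talk: confirm the image digest being deployed
    # matches what we expect (digest below is a placeholder).
    expected_digest = "sha256:0000000000000000000000000000000000000000000000000000000000000000"
    deployed_digest = event.get("taskDefinitionImageDigest")   # assumed field name

    if deployed_digest is None:
        # Validation hasn't finished yet; the hook will be polled again.
        return {"hookStatus": "IN_PROGRESS"}

    if deployed_digest == expected_digest:
        return {"hookStatus": "SUCCEEDED"}   # deployment continues to the next phase

    return {"hookStatus": "FAILED"}          # deployment rolls back automatically
```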

Thumbnail 2850

For the traffic shift, we have verified that everything is working and now we start to move the traffic. In a blue-green deployment, we said 100% of traffic was going to blue, and we flip it all at once to green. At this point, zero traffic is going to the blue version, and all 100% of traffic is now going to the green version. Again, we have these deployment lifecycle hooks where we can validate that everything is actually working. We then go into a bake time, or monitoring phase.

Thumbnail 2880

In other words, we want to see that everything is performing the way it was. This can be, for example, some kind of performance or degradation test. Is my application performing exactly the way it should be, like it was before? If there is any degradation in performance, I want to automatically roll back to my previous version, because it is still sitting there with no traffic running to it. The switch back and forth is almost instantaneous, which lets you be confident that if there is a problem you can roll back very quickly without hurting any of your customers or hitting a performance issue.

Thumbnail 2920

Thumbnail 2940

The last one, of course, is the cleanup phase. The cleanup phase is after we have verified that everything is done. Specifically at cleanup, there is no deployment lifecycle hook because this is the last stage of the deployment. Once we are ready and know that everything is working, we can finally deprovision all the resources which we had on our previous version, and then green becomes blue and we continue the process to the next version going forward.
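Pulling the pieces together, here is a hedged sketch of what wiring a service up for this strategy might look like from code. The create_service call is a real API, but the strategy, bake time, lifecycle hook fields, and stage names below are assumptions made for illustration; the documentation linked at the end of this post has the authoritative shape.

```python
# Hedged sketch of a service using the new blue/green strategy. Fields marked as
# assumed are illustrative only; ARNs and names are placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="my-cluster",
    serviceName="orders-api",
    taskDefinition="orders-api:42",
    desiredCount=6,
    capacityProviderStrategy=[{"capacityProvider": "my-managed-instances-cp", "weight": 1}],
    loadBalancers=[{
        "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/blue/abc",
        "containerName": "app",
        "containerPort": 8080,
    }],
    deploymentConfiguration={
        "strategy": "BLUE_GREEN",        # assumed field; LINEAR and CANARY would be the alternatives
        "bakeTimeInMinutes": 15,         # assumed field; monitoring time before cleanup
        "lifecycleHooks": [{             # assumed field; Lambda hook run at chosen phases
            "hookTargetArn": "arn:aws:lambda:us-east-1:123456789012:function:deployment-validator",
            "roleArn": "arn:aws:iam::123456789012:role/EcsHookInvokeRole",
            "lifecycleStages": ["POST_TEST_TRAFFIC_SHIFT", "POST_PRODUCTION_TRAFFIC_SHIFT"],  # assumed names
        }],
    },
)
```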

Thumbnail 2950

For linear deployments, we again start with a blue version running 100% of traffic. However, this time you would like to start slowly and gradually move traffic to the new version. You provision the new version and shift traffic in steps. You can move 25% of the traffic at a time, or define however much of the traffic you would like to move per step. Over time, you move more and more traffic over to the new version until everything has moved, and then you deprovision the previous version and repeat this process for the next release.

Thumbnail 2980

Thumbnail 2990

Thumbnail 3000

So let us dive in again. The first part is exactly the same as we saw before: the preparation, deployment, and testing phases. However, it is slightly different in the traffic shift, so let us zoom in there for a second and have a look at what it looks like. In the traffic shift, we start with 100% of the traffic going to blue and 0% going to green. We then start moving traffic. As part of the deployment, we have a definition of a bake time.

Thumbnail 3030

Thumbnail 3040

The bake time is the amount of time where you want the application to sit on both versions to verify that everything is working. With deployment lifecycle hooks, you can validate after every single step of the traffic shift to ensure everything is working correctly. If anything fails, we fall back entirely to the original version, so you will not have any degradation of service. The process continues for each increment of traffic, with the load balancer moving more and more traffic to the new version until we finally complete and move all of the traffic over. This is the linear approach, where we go step by step.
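A toy timeline of that stepping behavior, purely for intuition (ECS drives this for you; the step size, bake time, and validation callback are placeholders):

```python
# Toy illustration of a linear traffic shift with bake time and validation between steps.

def linear_shift(step_percent=25, bake_minutes=10, validate=lambda pct: True):
    green = 0
    while green < 100:
        green = min(green + step_percent, 100)
        print(f"shift: green={green}%, blue={100 - green}%, bake {bake_minutes} min")
        if not validate(green):
            print("validation failed: rolling back to blue=100%")
            return False
    print("all traffic on green; blue can be deprovisioned")
    return True

linear_shift()
```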

Thumbnail 3050

Thumbnail 3060

Thumbnail 3090

The last deployment strategy I want to look at is canary. It is similar to the other two and is a mixture of both. We start with 100% running on our blue version, and we deploy a new service. However, we only move a bit of the traffic, in this case 10%, for example, and we let that sit to verify that everything is okay. Once we are confident enough, we flip everything at once, moving the remainder of the traffic. It is essentially a two or three step process, depending on how you look at it, where you do one migration of traffic or part of it, and once you are confident that everything is working, you flip all the other traffic. Afterwards, you remove the blue service.

Thumbnail 3110

Thumbnail 3130

Zooming in on the actual traffic shift, which is the only part that changes: in the beginning, we have 100% of traffic going to our blue version with no traffic going to green. We move some of the traffic, in this case 10%, over to the new version, and we have the bake time. This is where we run your tests and validations to see that everything is working correctly the way you expect it to. We let it sit and verify that performance is what it should be. Once you are confident enough and have the metrics and everything you need, you flip everything over and complete the migration of your service to the new version.

Thumbnail 3140

We like to describe ECS as best practice by design. Everything that Malcolm has been talking about, including start before stop, placement mechanisms, and how we actually update the software on the instances, represents the best practices which we internally use for our own fleets and our own services, and we want to give you, as a customer, those same primitives as well. Provisioning full capacity for each version is one example of best practice by design, and the lifecycle hooks allow you to gain that confidence. This goes back to the shared responsibility model: we are taking on toil, but some of it remains yours, and we are giving you the options.

For example, you have to validate that everything is working correctly and provide that code. It is that shared responsibility model, but we are giving you the options and the mechanisms in order for you to do that. The bake time is also a best practice internally. We do that for every single one of our deployments, and we are also giving you that opportunity and that primitive to allow you to make use of that mechanism as well to validate your applications are working correctly.

So, as with the compute options Malcolm described, there is a spectrum of what you can use. The question that a lot of customers come to us with is: when do I do blue-green, when do I do linear, when do I do canary? And as any good IT person will say, it depends. There are benefits of one approach over the other, and it will depend on your kind of application. We look at blue-green as mainly optimized for speed. If you want to move over to a new version very quickly, this will be the fastest way to do it. It will, of course, require you to have a robust deployment mechanism and validate that everything is working, because once you switch, you are switching all of the traffic, and that could cause a problem if there is some kind of degradation or bug within your system.

When we look at linear, this is a situation where you want to be a bit more conservative. You want to do it slowly and move your customers or your traffic over to actually gain the confidence. You are not 100% sure how this works, so it will take you more time to validate that the traffic is working and there is no degradation of performance. You want to be a little bit more conservative, but on the other hand, this will take more time. Your deployment time will be longer, and the rollover to a new version of a service will be longer. As a result, because we are provisioning 100% of capacity on both versions, you will be paying slightly more for that amount of time that both resources are sitting up waiting for the actual migration of your service from one version to another.

Canary is a mixture of the two. In other words, you want to do it quickly, but you do not want to wait so long.

So you shift a small portion of your traffic, validate that it works, and once you are confident enough, you move everything else. That is what our customers typically look for. But it is not only based on your own confidence and your own operational practices; it also depends on the application itself.

To give you an example, say your application includes an LLM, and you are updating the version of that LLM inside your application. If you migrate from one version of the service to the other using a linear deployment, there is a chance that customers hitting the two versions get a different experience, because the new LLM version behaves slightly differently than the previous one. That might be a reason to do something more like blue-green.

Thumbnail 3390

I want the customer experience to be as standard as possible and not have any deviations where customers could get different answers, for example, from an LLM. Then I would do a blue-green to move as fast as possible from one version to the other. If it is an application which is a simple web application, maybe upgrading one version to another, then maybe that won't make much of a difference. It would depend on the kind of software that you're actually using.

Before we leave, we would like to leave you with a couple of QR codes and resources where you can dive deeper into the topics we talked about today. The first one is the documentation, which is very robust and gives more detail than I did today on exactly what each of the stages and phases are, what you can do, and how you implement it. On the right-hand side is a blog post we published as a deep dive into ECS Managed Instances, with a lot of the information that Malcolm provided today, though not all of it.

The last two are from the Amazon Builders' Library, which is a website where we publish papers on how we build things at Amazon. Without all the internal details, but from a conceptual perspective, these are papers written by our principal engineers, senior principal engineers, and distinguished engineers, the people in Amazon who build systems that scale to things like 3 billion tasks a week. Both of them are about how we do deployments internally at Amazon, for example how you can ensure rollback safety with deployments.

These are the kind of things which we do on a regular basis at huge scale in Amazon, and we would like to provide that information to you as well, so you can start reading about it and implement that inside your own organizations. I'd like to thank you very much for your time. Our emails are here on the screen. If you would like to take a picture, send us an email. If you have any questions, we'd be more than happy to answer them.

Thumbnail 3530

Before you leave, or on your way out, please don't forget to give us feedback on the session in the AWS Events application. We would really appreciate hearing back from you about how the session was, whether it was useful, and any feedback you would like to give us directly. We'd be more than happy to hear from you. If there are any questions, we will be over there towards the entrance of the hall so that we don't disturb the next session, which is already lining up. Thank you very much for your time and enjoy the rest of your day and your event.


This article is entirely auto-generated using Amazon Bedrock.
