Kazuya
AWS re:Invent 2025 - From code to cloud: Accelerate application development with Amazon ECS (CNS341)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - From code to cloud: Accelerate application development with Amazon ECS (CNS341)

In this video, AWS introduces Amazon ECS Express Mode, a new feature that simplifies container deployment by requiring only a container image and two IAM roles. The session covers ECS's fully managed, versionless architecture and new offerings like ECS Managed Instances. Speakers discuss platform design principles including lifecycle management, economies of scale, and break glass procedures, comparing abstraction versus composition approaches. Express Mode automatically provisions load balancers, auto scaling, TLS certificates, and observability while sharing Application Load Balancers across up to 25 services. GoDaddy's Keith Bartholomew presents their Katana platform built on ECS Fargate, serving 2,000+ engineers with unified dashboards, generative AI support via Bedrock, and push-button resilience across multiple regions. The platform demonstrates practical implementation of escape hatches, CloudFormation hooks for governance, and integration with enterprise systems while maintaining flexibility for diverse developer skill levels.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction to CNS 341: Accelerating AWS Deployments with ECS

This is CNS 341. You're in the right place if you are a developer who is looking for faster ways to get your code deployed in AWS. You are also in the right place if you are on a platform team or an infrastructure engineer looking for patterns or guidance on how to accelerate your teams. My name is Jennifer. This is Tsahi. We are your ECS experts today. We're also going to be joined by Keith a little bit later. He's from GoDaddy, and he's built a platform on Amazon ECS that he's going to talk about. Thank you guys. With that, let's get started.

Thumbnail 50

Amazon ECS: A Fully Managed, Versionless Container Orchestration Service

First, to set the context, I wanted to lay the foundation of the service that we're going to be talking about today, and that is Amazon ECS. Amazon ECS is our fully managed container orchestration service that provides the easiest way for organizations to build, deploy, and manage containerized applications on AWS. Even as a compute layer, we offer a lot of flexibility. You can run containers on EC2 instances. You can even run them on your own hardware with ECS Anywhere. But the majority of our customers run on AWS Fargate. Fargate gives you the ultimate simplicity with serverless. We manage the compute completely. You pay for what you use, which means you're not bound by EC2 instance size ratios.

Thumbnail 100

Thumbnail 110

ECS Managed Instances is a new offering that we came out with in the last month: a fully managed compute option that eliminates infrastructure management while giving you access to a really broad set of EC2 instance sizes. This is great if you're looking for specific EC2 instance types like GPUs, network-optimized instances, or memory-optimized instances, while still taking advantage of some of the great aspects of Fargate, like the maintenance, patching, and scaling that we take care of. ECS Managed Instances is a great option when you need that specificity but still want ECS to take care of the operational burden.

Thumbnail 150

Continuing on that theme of removing the operational burden, what makes ECS really unique is that it is fully managed and versionless. There is no control plane for you to manage. There are no upgrades of the control plane to coordinate, no patching of the control plane to schedule. We handle all of that operational complexity. This is something that I've been hearing from customers all week that is part of why they really love ECS, is that we handle so much of that on their behalf.

When you create a cluster, it is essentially just a logical grouping. It is not really something that you have to treat as something precious. You can create as many clusters as you want and use them as you want to in order to group services together, because we're managing that control plane and you don't have to think about it. And if you're using Fargate or Managed Instances, you don't need to provision or scale servers. You don't need to manage or patch the operating system. We manage all of that on your behalf.

And specifically with Fargate, you get tenant isolation, which our financial services customers really love, because the security boundary becomes not the container but the EC2 instance. In Fargate, every task gets a unique EC2 instance. You can also achieve this with ECS Managed Instances, but you lose some of the value of Managed Instances and the optimizations we provide with the bin packing and scaling of the underlying instances.

Native Features and Wide Adoption of Amazon ECS

And then we come to some of the native things that we've built into ECS, the first being our native service discovery. This one, I feel, is an underrated part of ECS. We build in service discovery and service mesh through a feature called Service Connect. You don't have to install, patch, or maintain Service Connect. It is just there and available for you to use. You have a unified way of accessing all of your services. You don't have to maintain DNS or any complex management there. It's available for you to set up and access at any time.

Finally, native deployment mechanisms. We've always had a deployment mechanism in ECS. You've always been able to update your services using a rolling deployment mechanism, but we found that a lot of customers are going outside of ECS to use other external deployment strategies like CodeDeploy to do blue-green, for example.

So just this past summer we launched native blue-green strategies, as well as about a month ago we launched Canary and Linear strategies, all natively within ECS. This is really important because by making that native to ECS, we're removing that burden of having to go outside of ECS in order to set up that deployment strategy, but also all of the wiring of the target groups in order to set up the blue and the green side. All of that exists within ECS, and it just makes the deployment of each of those things a lot simpler.
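To make the difference between these strategies concrete, here is a small, self-contained sketch in plain Python (an illustration of the concepts, not the ECS API) of how each strategy shifts traffic onto the new version:

```python
# Illustrative sketch (not the ECS API): how linear, canary, and
# blue-green strategies shift traffic from the old version to the
# new one in different patterns.

def traffic_shift_steps(strategy: str, step_percent: int = 10,
                        canary_percent: int = 10) -> list[int]:
    """Return the percentage of traffic on the new version at each step."""
    if strategy == "linear":
        # Shift a fixed percentage at a time until 100% is reached.
        return list(range(step_percent, 100, step_percent)) + [100]
    if strategy == "canary":
        # Send a small slice first, then cut over all at once.
        return [canary_percent, 100]
    if strategy == "blue-green":
        # Stand up the new version fully, then switch in one step.
        return [100]
    raise ValueError(f"unknown strategy: {strategy}")

print(traffic_shift_steps("linear", step_percent=25))  # [25, 50, 75, 100]
print(traffic_shift_steps("canary"))                   # [10, 100]
```

The point of making these native is that ECS manages the target-group wiring behind each step, rather than leaving that orchestration to an external tool.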

Thumbnail 350

Thumbnail 360

Thumbnail 370

Now all of those things brought together are a lot of why customers adopt ECS today, and when you do adopt ECS, you're in very good company. Over 3 billion tasks are launched in ECS every week. Over 65% of all new AWS container customers use Amazon ECS, and it is very heavily used internally within Amazon. We actually call Amazon ECS one of our foundational services, and the reason we call it that is because every time we stand up a new region in AWS, ECS is one of the very first services that has to go in. The reason is because so many other AWS services build their infrastructure using ECS, so you're in good company when you're using ECS.

Thumbnail 420

The Complexity of Application Deployment: Developer and Platform Perspectives

So that's a little bit about the service that we're talking about today. I want to hand it off to Tsahi to talk a little bit about applications. Thank you, Jen. In the next part of the session, I'm going to cover what deploying apps to production actually looks like, but before we get there, we need to understand how the deployment mechanism works. I'm also going to shift between two perspectives: that of the development teams, those who are building apps and deploying them into production, and that of the platform builders, those who are building tools and automation to support the development teams and make their jobs easier.

Thumbnail 440

Thumbnail 460

Thumbnail 480

So let's look at the deployment process from the developer's point of view. From the moment we have our container image built and pushed to Amazon ECR, the process of deploying it into production can be quite complex and involve many different steps, which I'm going to cover in this slide. It starts with networking, an AWS VPC, and an Amazon ECS cluster to provision, which holds all of our applications. Then we need a way to tell ECS what the configuration of our app is, and we do that through an ECS task definition. That's the mechanism for it.

Thumbnail 500

Thumbnail 520

And then a common pattern for exposing an app is through a load balancer, an Application Load Balancer. We need to do that securely, so we need a certificate, and then we need to integrate it with the ALB itself alongside all of its resources: target group, listener rules, routing configuration, and so on. And only now have we gotten to the point where we can tie this all together: create an ECS service, which takes the task definition, creates multiple instances of that task, integrates them with the ALB, and exposes the app to the outside world.

Thumbnail 530

Thumbnail 540

But the thing is, our app is not a static entity. Of course we need to have auto scaling policies in place. We need to take care of that. And in order to operate and maintain our app over time, we need to be able to observe it, so we need to ship logs, traces, and metrics into the observability service. In our case, Amazon CloudWatch, and you could do it easily with Container Insights which is integrated with ECS.

Thumbnail 560

Now all of that complex process is just for a single app. If we flip the perspective around and look from the platform builders' side, we have other challenges. There are many different development teams that the platform team needs to support, each with its own application and different requirements. And even if we standardize on the compute layer, say on Amazon ECS, each app might need a different backing service. For example, some might need an S3 bucket and some might need a DynamoDB table, and we need to be able to scale this across the organization. So it's not just about how I deploy a single app; it's how I make sure the requirements of all the different teams are satisfied.

Thumbnail 600

So like any good engineers, we usually automate things.

Thumbnail 620

Three Design Principles for Building Developer Platforms

By automation, you probably understand that this means building a developer platform: a tool that helps developers get their tasks done more easily. Now, like any other system or app in the organization, it has to have design principles, and today I want to cover three important design principles for building developer platforms: lifecycle management, economies of scale, and break glass procedures. Let's dive deep into each of them separately.

Thumbnail 640

Lifecycle management means owning the entire lifecycle of the app's deployment, from creation through updates all the way to decommissioning. It's not just about how I deploy the app; it's about maintaining it over time with updates and upgrades. It also includes some sort of entry point for developers to interact with the system, and this can be anything from an API to a CLI to a collection of templates, which I will cover in the next part.

Thumbnail 670

Economies of scale is all about being smart with resources. Don't provision things that you don't need, and make sure you're reusing things over time. This means reusing resources, as in the previous example: if you already have an Application Load Balancer, you don't need to provision another ALB to expose a second app. You can reuse the same ALB with different routing configuration.

We also need to configure shared resources, so the cluster, the VPC configuration, and the monitoring dashboards all need to be shared across the different apps. In reality, it's about balancing between replicating things for each and every app and avoiding the complexity of having everything interconnected; otherwise you can end up with a mess of resources in your system.
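As a rough illustration of this balance, here is a minimal Python sketch (the resource names are hypothetical, and nothing here calls real AWS APIs) where shared resources are provisioned once and each app adds only its own routing configuration:

```python
# Minimal sketch of the "economies of scale" idea: shared resources
# (VPC, cluster, load balancer) are provisioned once and reused,
# while per-app resources (target group, listener rule) are created
# for each service. All names are illustrative.

class Platform:
    def __init__(self):
        self.shared = {}   # provisioned once, reused by every app
        self.per_app = {}  # provisioned for each app individually

    def _shared(self, name: str) -> str:
        # Provision a shared resource only on first use.
        return self.shared.setdefault(name, f"{name}-1")

    def deploy(self, app: str) -> None:
        self._shared("vpc")
        self._shared("cluster")
        alb = self._shared("alb")
        # Each app only adds its own routing rule to the shared ALB.
        self.per_app[app] = {
            "target_group": f"tg-{app}",
            "listener_rule": f"host-header: {app}.example.com -> tg-{app}",
            "alb": alb,
        }

p = Platform()
p.deploy("orders")
p.deploy("payments")
assert len(p.shared) == 3  # VPC, cluster, and ALB each exist only once
print(p.per_app["payments"]["listener_rule"])
```

Both apps end up behind the same load balancer, differentiated only by their routing rules, which is exactly the reuse the speaker describes.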

Thumbnail 730

And the last thing, which I think is the most important one, is break glass procedures. These are your escape hatches. No platform is going to be perfect for every use case from the start, and even as it matures and you introduce more features, you're not going to hit all the marks for all the deployments. So you need to give your users, the developers, an easy way to transition between platform offerings. If they use one deployment method, you want to enable them to migrate to other deployment methods within the platform in the future.

We also want to enable them to extend the platform's capabilities, so if the platform doesn't support something, they can extend it with their own customized tooling, scripts, or anything else. And the last thing is about self-managing resources in the future. There may come a point, and we've seen it with a lot of customers, where users have very specific requirements that the platform cannot provide, and they need to be able to self-manage those resources instead of redeploying and migrating out of the platform. You don't want your users to feel trapped by the platform's limitations.

Thumbnail 800

Thumbnail 810

Abstraction vs. Composition: Balancing Simplicity and Flexibility

And this brings us to a philosophical question, an ongoing one between platform teams and developer teams, and it's all about how you design the platform interface. Should we use abstraction, which hides the implementation of the underlying components? Should we use composition, which combines resources with defaults but exposes the underlying implementation to the users? Or maybe a mixture of both?

Thumbnail 830

Thumbnail 850

And whatever you choose, it basically impacts the way users understand what's going on in the system. In reality, it's a balance of understanding what's going on or how things work under the hood and achieving the user's goal, which is in our case, getting from code to cloud. So let's dive deep into each of them separately.

Thumbnail 870

With abstraction, users don't need to know what's happening under the hood. They can be free from any implementation decision and get started very quickly and easily getting the things they need. This brings some advantages like having a really low barrier for entry. We don't need our users to be an AWS expert. They can just spin up and use the platform in order to deploy their own resources. It also ensures consistency across the organization. Everyone is using the same patterns, everyone is using the same deployment mechanism, and it's easier to understand what's going on. Lifecycle management in this case is usually exposed through an API, which is an easier way to interact with platforms.

Thumbnail 900

However, it does come with some challenges.

We rely on platform teams to deliver new features. Whenever we need to extend the platform's capabilities, we have to wait for the platform team to build that and integrate it with the platform. It also creates challenges when things go wrong. When there's an error and we need to debug or investigate something, we end up having to understand what the underlying resources actually are and dig into the actual configuration, which defeats the purpose of having a low barrier to entry. This is another thing we need to take into account, and it also creates a higher maintenance effort. This is a system like any other app, and we need to maintain it over time, including updates, upgrades, and operational efficiency. So in reality, it's a simple way to get started, but evolution is bottlenecked by the platform teams.

Thumbnail 970

Composition, on the other hand, is different. It's all about automating the setup but keeping everything visible for the users. The good thing about that is that developers can use part of what the platform gives them and get the results they need. They can adapt it very quickly and slice and dice whatever the platform gives them and deploy it to production. Flexibility is also a positive thing here because they can adjust whatever template or whatever tooling they have in order to get the results they need, and what you see is what you get. There's no hiding, there's no hidden implementation under the hood, so when they need to see something or update something, they can just do it easily.

Thumbnail 1010

However, this brings a couple of other challenges. It does require a steeper learning curve. You don't need to be an expert, but you need domain knowledge in the domain you're operating in. For example, in our case, developers do need to understand Amazon ECS concepts like the ECS cluster, the Application Load Balancer, and so on. It also creates fragmentation around deployment. Because each team can customize its own deployment method, each one can end up with a different way of deploying things, which from the platform perspective can be hard to manage across the organization.

Thumbnail 1060

Thumbnail 1080

And the last thing: lifecycle management can become tricky, especially when we talk about templates, and hard to scale across the organization. So the trade-off here is requiring more domain knowledge from the users, but teams can evolve independently. Summarizing these two approaches, it would be nice if we could blend them together, providing a simpler way to get started but without hiding the implementation details. When you need to update something, you can still do it.

Thumbnail 1100

Thumbnail 1110

Thumbnail 1120

Blending Approaches: A Simple Interface with Visible Infrastructure

So let's look at the same entry point we started with at the beginning: an app packaged in Amazon ECR, ready to be deployed. What if we could provide a simple way to get started, a very light interface that requires only a couple of parameters, but one that comes with lifecycle management, so that every time we update something, this service takes control of the entire lifecycle operation? In turn, it provisions everything you've seen in the previous slides, so the cluster, the VPC configuration, the Application Load Balancer, and the integration with Amazon CloudWatch, but this is all kept visible to the users.

Thumbnail 1140

And whenever we have another app in the organization, it automatically uses the same deployment mechanism and integrates and reuses the same resources that we already provisioned. This approach balances between the level of knowledge needed to get started and the flexibility to change things in the future. In the next part of this session, Jen is going to cover how Amazon can help you achieve all of that.

Thumbnail 1170

Introducing Amazon ECS Express Mode: Simplified Container Deployment

Thank you, Tsahi. I am so honored to introduce to you today Amazon ECS Express Mode. We introduced this feature just last week, and we made it for developers to experience ECS in a whole new way, taking advantage of all the things that Tsahi just talked about and of all the years of platform experience that ECS has learned from. And we pass that knowledge on to our users so that you can stand up your applications faster.

Thumbnail 1210

We started the way we do all applications and products at Amazon. We worked backwards from the customer, and we found that most ECS customers were implementing a really repeatable pattern, the same one that Tsahi was showing earlier. However, a lot of that pattern existed outside of ECS. It included other AWS services like load balancers, auto scaling, domain names, certificates, networking, and observability. So we asked ourselves, could we do what ECS does best and relieve customers of that burden? Could we take more responsibility?

Thumbnail 1250

Thumbnail 1260

Thumbnail 1270

Thumbnail 1280

Let me show you what we've done. To create an app with Amazon ECS Express Mode, you only have to give us three things: your container image and two IAM roles. We take defaults for everything else, but there are additional configurations available. As soon as you complete the deployment, you get an application URL, and that URL allows you to test your application end to end. Those options do allow you to configure the application, and we'll go through what all those options are in just a minute. But they're greatly reduced compared to the hundreds of parameters that you would have had to go through on all of these resources that we're provisioning, which I'm showing you on the screen now.
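For context on those two IAM roles: both are ordinary IAM roles whose trust policies allow an AWS service to assume them. The sketch below shows the standard trust-policy shape. `ecs-tasks.amazonaws.com` is the well-known principal for ECS task execution roles; the principal shown for the infrastructure role is an assumption to verify against the Express Mode documentation.

```python
import json

# Sketch of the trust policies behind the two roles Express Mode asks
# for. "ecs-tasks.amazonaws.com" is the standard principal for task
# execution roles; the infrastructure-role principal shown here is an
# assumption -- check the Express Mode docs for the exact value.

def trust_policy(service_principal: str) -> dict:
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": service_principal},
            "Action": "sts:AssumeRole",
        }],
    }

execution_role_trust = trust_policy("ecs-tasks.amazonaws.com")
infrastructure_role_trust = trust_policy("ecs.amazonaws.com")  # assumption

print(json.dumps(execution_role_trust, indent=2))
```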

Thumbnail 1320

Thumbnail 1330

Express Mode provisions all of those resources you need to stand up a highly available, scalable containerized service using AWS best practices. Those include things like canary deployments, alarm-based rollbacks, TLS certificates, auto scaling policies, CloudWatch logging, availability zone rebalancing, and minimally permissive inbound security groups, all configured and wired together so you don't have to think about it just to get started. And at the end of this deployment, you have a live application. The command line view is similarly very simple: one container, two IAM roles.

The first IAM role is the task execution role, which if you're familiar with ECS, you'll know is what we use to get your container image from Amazon ECR and to set up your logging. The second one is new. It's an infrastructure role, and that's what we use to provision all of the resources on the right. And you might ask how did we come to this architecture and the defaults that we selected. We were trying to balance a couple of things.

One, at AWS we love data. So we looked at what our customers are configuring today, and this was a very common pattern. Customers were configuring services and not individual tasks. They were hooking up load balancers as opposed to other types of networking. Two, we took a look at the best practices and we talked to our principal engineers and solution architects. We had a lot of really difficult conversations, a lot of heated conversations, and we looked at how do we strike a balance between helping customers get started fast with dev workloads and test workloads, and also making sure that you're set up for the long term.

Because really we want to make sure this is a place that you're not overburdened by what you would need in order to run a production workload, but it's also a place that you can get started and know that you have everything in place to run that production workload. So we do things like set up all of the subnets that you would need and all of the availability zones and make sure that we turn on availability zone rebalancing so that when you're ready, you can just increase your desired counts to three, and now you are highly available in three availability zones. But we start with just one so that you're not burdened with so many tasks and how to deal with the cost of all of that. All of this is built around trying to make things really simple and easy to get started.
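The availability math behind that default is easy to sketch. Assuming a simple round-robin placement (a stand-in for the even spread that availability zone rebalancing maintains, not ECS's actual placement algorithm), a desired count of three puts one task in each availability zone:

```python
from collections import Counter

# Illustration of why "desired count 3 across three AZs" gives high
# availability: with tasks spread evenly, losing one AZ still leaves
# two-thirds of capacity running.

def spread_tasks(desired_count: int, azs: list[str]) -> list[str]:
    """Assign each task to an AZ round-robin (a stand-in for the even
    placement that AZ rebalancing maintains)."""
    return [azs[i % len(azs)] for i in range(desired_count)]

azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
placement = spread_tasks(3, azs)
assert Counter(placement) == Counter(azs)  # exactly one task per AZ

# If one AZ fails, two tasks keep serving:
surviving = [az for az in placement if az != "us-east-1a"]
print(len(surviving))  # 2
```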

Thumbnail 1460

Thumbnail 1480

Express Mode Features: Interactive Monitoring, Lifecycle Management, and Resource Sharing

And this is something we're really excited about, something you don't see often in the AWS CLI. We know that a lot of developers love working right in your IDE or terminal. If you add this monitor resources flag to your call, you'll get the following interactive experience. This is really similar to the console, so you can see what's happening in your ECS deployment in a really super visual way. And this is happening on the create, the update, and the delete of an Express Mode service.

Thumbnail 1500

We feel that observability, to go back to what Tsahi was talking about with composition, is a really important concept. So in both the console and the CLI experiences, you're seeing the ARN of the resource, the status, and any events or errors that we're receiving, all piped through to these views so that you have a really robust understanding of what's happening in the environment.

Thumbnail 1520

I just alluded to this, but back to what Tsahi was saying about lifecycle management, Express Mode is a complete lifecycle. There is a create, an update, and a delete of this experience, and if you are content with that experience, you never have to learn about the underlying resources. You don't have to go look at your Application Load Balancer or your target group or even understand what all the settings are in the task definition.

For novice users, this can be a really great way to experience launching containers. We also do a lot of things here that make things in ECS a lot simpler than they are today, things like updating the port or the health check. If you've done that today in ECS, you know that's a potentially disruptive and really difficult thing to coordinate. We handle all of that for you in Express Mode. Moving from a public service to a private service or vice versa is a matter of handing us different subnets and pushing update. There are things that we do in Express Mode that are very complex that become very simple.

Thumbnail 1590

Thumbnail 1600

Thumbnail 1610

Thumbnail 1620

Let me show you. For observability, we show you the normal CPU and memory, but because we have a load balancer, there's also target response time, 4XX, and 5XX errors. You also get your application logs. We added a new Resources tab that you've seen in the timeline view, but it will also have the list of deep links. Now for the update, this is going to show all of the create options. You have your container settings, port, health check, environment variables, secrets, commands, and the task role that you can add to access other AWS services via IAM.

Thumbnail 1630

Thumbnail 1640

Thumbnail 1650

For compute, you have CPU and memory, and auto scaling. We have CPU, memory, and requests, minimum and maximum. For networking, you have subnets and security groups, and you can also name your own log groups and log stream prefixes. When we delete, you can also see the process of deleting. We're going to delete any resources that are unique to that service.

Thumbnail 1660

You might be asking yourself about infrastructure as code. I would love to tell you about that. On the left, we have the full CloudFormation of the Express Mode architecture, and on the right, we have the Express Gateway Service resource with the required parameters. I'm really excited about that reduction. I don't know about you. Yeah, thank you. You can clap. It's a silent session, but you're still allowed to clap.

Thumbnail 1700

Thumbnail 1720

Here we have the optional parameters as well, but I've highlighted where you can bring in your own resources like a cluster or a subnet or a security group. This gives you some flexibility to bring your own definitions in. Now Tsahi also talked about economies of scale, and I'm super pleased to share that Express Mode services that are deployed to the same set of subnets in the same account will share Application Load Balancers. We do this using host header-based listener rules. And not only do we share them, we also scale them.

Thumbnail 1740

Thumbnail 1750

Thumbnail 1760

Thumbnail 1770

What do I mean by that? Let me show you. So we have a load balancer here that's been provisioned by Express Mode. And in the listener, you can see that we have 25 Express Mode services. We can share an Application Load Balancer with up to 25 Express Mode services. They also all have unique TLS certificates. Now I'm going to go back to ECS and to Express Mode and provision a twenty-sixth service so we can see what happens.

Thumbnail 1780

Thumbnail 1790

Thumbnail 1800

Just pulling everyone's favorite NGINX image here. And I'm not even going to wait for the deployment to finish because it's already going to kick off. We're going to go back to the EC2 console, look at the load balancers, and see that we have started provisioning a second load balancer. And so this is what I mean by scaling the load balancers.

When you provision that second service, we will provision a second load balancer. When you delete the second service, we will delete that second load balancer. Anything that is unique to the service will be deleted.
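The sharing behavior just demonstrated reduces to simple arithmetic: with up to 25 Express Mode services per Application Load Balancer, the number of ALBs is the service count divided by 25, rounded up. A quick sketch:

```python
import math

# The sharing rule described above: up to 25 Express Mode services per
# Application Load Balancer, with a new ALB provisioned for the 26th.

ALB_SERVICE_LIMIT = 25

def albs_needed(service_count: int) -> int:
    return math.ceil(service_count / ALB_SERVICE_LIMIT)

assert albs_needed(25) == 1  # all 25 services share one ALB
assert albs_needed(26) == 2  # the 26th service triggers a second ALB
print(albs_needed(60))       # 3
```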

Thumbnail 1820

Composition Within Abstraction: The Unique Power of Express Mode

Now, there are a lot of services out there that will take a container and give you a URL. There are even some in AWS today already, so I don't want to be shy about saying that. But I do want to be clear about what I think is really unique about Express Mode. And that is that you have access to all of these resources that are in your account and you have the ability to mutate them. And because of that, going back to what Tsahi was talking about with abstraction and composition, and if you disagree with me here, I would love to go and have a philosophical discussion with you afterwards, it might be kind of fun, but I believe that Express Mode is a composition within an abstraction.

You have the ability to stay within the abstraction if you want to. You never have to know what's underneath you. You get the simplicity of the abstraction and you get to have that. You stay within that world if you want to, but underneath that is a composition and you get access to all of the goodness of that composition and the flexibility of that if you want to. Now also, you don't have to graduate or migrate in order to get access to that. And in many services, in order to go from one state to another, you have to draw that hard line. You have to say, okay, well I have to forfeit that simplicity in order to get access to that feature that I need, and not with Express Mode.

With Express Mode, if you want to go and access that thing that you need, you can go and turn it on. You can go and configure that parameter, and then you can come back to Express Mode to continue updating your image or your auto scaling or whatever it is that you're doing on a daily basis. And I think that is very differentiating: it empowers you as a developer to keep a simple model while also having access to the full feature set of ECS, a service that has been around and operating very large workloads for more than 10 years, and of all the other services that we're provisioning as well.

Thumbnail 1980

Thumbnail 1990

Thumbnail 2000

Application Load Balancer is also a very robust service in itself. You're getting access to all of the features of these services in order to provision whatever you need to do with your application. You're starting simple, but you have access to whatever you need. Let me show you a little bit of what I mean. So we're going to take an Express Mode service, go to the resources tab and make a change to the task definition. So I'm going to go right into the JSON. In the ECS console you can actually edit the JSON. So I'm going to add a second container definition.

Thumbnail 2010

Thumbnail 2020

This is what we call a sidecar, if you're not already familiar. A lot of ECS customers add logging sidecars; a common one uses FireLens, so you can send your logs to another destination rather than CloudWatch. To update a service or your task definition in ECS, you need to create a new task definition revision and then update your service. So that's what we're doing here: we've just updated the service, and we're watching that deployment happen back in the Express Mode service.
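For readers less familiar with the mechanics, here is a rough sketch in Python of what that change amounts to in the task definition JSON. The container names are illustrative, and the real flow goes through the ECS console or the `RegisterTaskDefinition` API; this just shows the shape of the revision.

```python
import copy

def add_firelens_sidecar(task_def: dict, log_router_image: str) -> dict:
    """Return a new task definition revision with a FireLens log-router sidecar.

    `task_def` is the JSON body of an existing task definition. Each existing
    container's log driver is switched to `awsfirelens` so its stdout/stderr
    flows through the sidecar instead of straight to CloudWatch.
    """
    new_rev = copy.deepcopy(task_def)  # revisions are new objects, not in-place edits
    for container in new_rev["containerDefinitions"]:
        container["logConfiguration"] = {"logDriver": "awsfirelens"}
    # The sidecar itself: a Fluent Bit image marked as a FireLens log router.
    new_rev["containerDefinitions"].append({
        "name": "log-router",
        "image": log_router_image,
        "essential": True,
        "firelensConfiguration": {"type": "fluentbit"},
    })
    return new_rev

app_task_def = {"family": "web", "containerDefinitions": [{"name": "web", "image": "web:1"}]}
revised = add_firelens_sidecar(
    app_task_def, "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable")
```

You would then register `revised` as a new revision and update the service to point at it, which is exactly the out-of-band change the demo makes.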

Thumbnail 2040

Thumbnail 2050

Thumbnail 2060

Thumbnail 2070

Now that we've done that update out of band from Express Mode, what we want to see is whether we can go back to Express Mode, do another update, and confirm that the change persisted. So we're going to update the image in Express Mode; we just moved to the latest image in my Amazon ECR repo. Go back to the JSON and look: my image is the new one, that's the latest SHA, and my second container definition is still there.

Thumbnail 2080

Thumbnail 2090

So that's the power: you have the ability to make changes to these resources, we will persist those changes, and you can continue using Express Mode for the simple, everyday updates. Like ECS itself, Express Mode is available at no additional charge. You only pay for the resources that are provisioned, which in this context would be Fargate and the Application Load Balancer.

We're helping distribute the cost of these across all of the services. We showed the console; we showed the CLI, API, and SDK; we showed infrastructure as code. It's also in the CDK as an L1 construct. We launched Terraform support last week, and I'm super pleased to announce that just a day or two ago we launched a new GitHub Action. So you can take a GitHub repo, use the other GitHub Actions to build it into a container, push it to ECR, and then use our action to push it to an Express Mode service and get a URL right from your repo.

So with that, Express Mode was built for developers to help them get started fast in ECS. We've also been talking to a lot of platform teams who see this as a way to accelerate their developer experience, to give an experience to their application teams or to stand up small POCs for things that they're less concerned about what the architecture looks like. So even if you're a platform team, maybe consider checking it out.

But with that, I think we also want to take a look today at how platform teams are using Amazon ECS and at some best practices. GoDaddy has done an excellent job of that, Keith specifically, and we're really excited to show you what he's done. He did not have access to Express Mode or ECS Managed Instances or native blue-green, but I've certainly drawn a lot of inspiration from what he's built, so I'm excited for you to see it. Thank you, Jen.

GoDaddy's Journey: From Decentralized Chaos to Unified Platform

Hi everybody, my name is Keith Bartholomew. I am a Principal Engineer at GoDaddy. As Jen said, a lot of what I'm about to show you today is how we've used ECS to build a platform for our GoDaddy engineers, and we did all that before features like Express Mode, before their blue-green canary deployments, and before Managed Instances. So everything that I'm about to show, recognize that it's about ten times easier for you to do this yourself if you wanted to. We did it the hard way and now you get to do it the easy way.

Thumbnail 2250

So you may know of GoDaddy as a domains registrar. That is the oldest business, that's really what we're known for, but today we do a lot more than that. We call our customers everyday entrepreneurs. They're the kind of person who has a side hustle, a business that they run on the side. It's not their main source of income, but it's something that they're really passionate about, and we provide services that help these everyday entrepreneurs do everything they need to do to run a small business. And so that does include getting a domain name, getting a website, but also running an e-commerce storefront or taking point of sale payments at a farmer's market or something like that. And so we call all of those touch points the Entrepreneur's Wheel.

Thumbnail 2290

Now just like generative AI has changed the way that we as engineers and technical people work, it's also changed how our everyday entrepreneurs do the things they need to do. So GoDaddy Airo is our AI-powered experience that runs the gamut of every product at GoDaddy. From getting the idea for a domain name to generating a logo on a website and even doing social media marketing, we're taking these ideas, those back of the napkin business ideas that seem out of reach, and we're putting them into these everyday entrepreneurs' hands so that they can make every business idea more accessible than it ever has been.

Thumbnail 2330

Now behind all these innovations are thousands of builders working tirelessly to stand up all of the services that it takes to run something like this. And so that's where I'm excited to tell you today about GoDaddy's decentralized developer platform. So it all starts with CI/CD where our developers do kind of whatever they want to. So some of them are running self-managed Jenkins clusters. Some of them will stand up GitHub Actions workers when they need to, and a few people have hand-rolled their own CI/CD things because they think that's important.

Infrastructure, we really want to let developers choose, so there's a little bit of ECS, there's some EKS, and some of our teams are using OpenStack on-prem. And security, we really don't want to get in their way, and so we let them patch their vulnerabilities whenever they feel like it. It's none of our business, right? And observability, we've got people who like Prometheus. Some people really enjoy that and they want to use that and that's fine. Some people like Elastic and they can do that too.

Thumbnail 2390

This is a pretty cringe-worthy slide, right? Can we all agree that this is not what we want to see in a platform? Yes, okay. With some hyperbole, this is roughly what it looked like at GoDaddy before we started building this platform, so we took the good things that were happening across these approaches and brought them together into a centralized, unified platform for all of our engineers. Our engineers have access to push-button, hardened, private GitHub Actions runners that are automatically hooked up to our IAM system, so they can deploy with literally the push of a button.

Katana itself, that's the name of our infrastructure platform, is a managed compute platform that gives them push-button access to ECS Fargate services, ALBs, and all the other supporting pieces they need.

We've unified security. We work really closely with our security and governance teams to make sure that all the Lambda runtimes get updated when they need to be, that all of the ECR images are free of vulnerabilities, and so on, and we're able to do that through really close coordination with those teams. And then centralized observability: we have a grade-A, world-class observability team who I'm so lucky to work with, and they've done a lot of work to make sure that everyone using this platform, with zero config, gets all of their logs, traces, and metrics piped to our centralized observability platform, where they can create dashboards and alarms and understand how their applications are behaving.

Thumbnail 2450

Katana Platform: Single Pane of Glass and AI-Powered Support

So the way that most of our engineers interact with this platform is with this unified single pane of glass. I know that's kind of an overused term, but our developers really enjoy this. And to underscore the value of this, consider that there's over 2,000 engineers at GoDaddy, and for each of those teams they're following the best practice of having a dev account, a test account, a stage account, and a prod account, and we have some others in there as well. And then also consider that large organizations are messy. Most engineers don't work on a single project. They kind of share their time with three, four, or five projects.

So we have engineers who need access to 30 AWS accounts and need to deploy applications across 30 different AWS accounts. Going around and clicking through all the AWS consoles can be pretty tiring, so we've brought all that together: our engineers have a single place to look to see how a service is doing, how a rollback happened, what the logs from ECS were when something failed, and how it's being impacted by an outage in a different app. They get all of that in a single place. They also use GitHub Actions, which we manage a lot of, to do the same things from a more automated CI perspective.

Thumbnail 2510

This unified dashboard really comes into play, and is really valuable, when it comes to observability. I can't tell you how many times I've been paged, gotten on the call with the product team, and the first thing I see on the screen share is this screen. They're looking at it to see how the load balancer in US East compares to the load balancer in US West 2: are error rates spiking over here, are they idle over there? Our developers really enjoy using this as a way to understand, at a big-picture level, how everything is behaving across multiple accounts and multiple services.

Thumbnail 2540

We've also built in generative AI support. Even though we've gone through a lot of effort to make our platform as simple and as easy for any engineer to understand as we can, it's still complicated, right? There are times when an ECS failure happens that we can't really wrap or abstract, so we just have to show them the direct failure from ECS, and the engineer doesn't really understand what it means. So we've built an agent right into that dashboard you just saw. It has access to a Bedrock knowledge base with all of our documentation, as well as information about how we specifically tailor our infrastructure to their needs, and it's able to use tools and function calling to reach into their account and see live how the service is doing: let me read the deployment events, let me read the CloudFormation events and understand how that's going.

And then it provides our users with a very detailed, not generic, and actionable statement that they can then use to self-resolve their problem, or in some cases they'll even use this with another tool to come create a ticket in our support channel where they can bring an entirely pre-triaged incident to us and then we can start working on a support issue when we need to there. So it's been very, very powerful for us.

Thumbnail 2610

Secure Multi-Account Access with Low-Cost Agent Architecture

So at this scale, right, we've got hundreds and hundreds of AWS accounts. Accessing all those accounts is essential. Obviously we need to be able to see what's going on in them as a platform to understand the health of our many services that we're managing. Our users need to see that information, and we need to do that securely. And so the way that we've done this is by installing an agent, and this is not the AI kind of agent. We named it that before they took that word from us. The old-style agent. We run one of those in each account, and that agent comprises an API Gateway and several Lambdas, and we chose these because they're very low cost to run.

When they're idle, they cost practically nothing; I think we paid maybe a penny for the S3 storage of the Lambda code. And that's great, because it means that someone using the platform and its management capabilities isn't taking on an undue burden just to use it: the control plane of what we're doing is as low cost as possible. But security is also critical here. API Gateway allows us to use IAM authentication for that entry point, so we can say that only the IAM role in our dedicated management account is allowed to invoke this API Gateway, and all of that happens without me writing a single line of code. I can't tell you how much that helps me sleep at night, knowing that my code did not contribute to any problems; I trust the AWS IAM engineers much more than I trust myself.
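The pattern described here, an IAM-authenticated entry point in front of narrowly scoped handlers, can be sketched roughly like this. The action names are hypothetical, not GoDaddy's actual API; the point is that anything outside the allowlist simply has no handler wired up.

```python
# Hypothetical sketch of the "old-style" agent's restricted dispatch.
# API Gateway (with IAM auth) invokes this Lambda-style handler; only
# a small allowlist of mostly read-only operations exists at all.
ALLOWED_ACTIONS = {
    "DescribeEcsService",    # read-only: service health, deployment events
    "DescribeCfnStack",      # read-only: CloudFormation stack events
    "DescribeLoadBalancer",  # read-only: ALB / target group state
    "UpdateCfnStack",        # one of the few write operations
    "RestartEcsService",     # sometimes rebooting fixes the problem
}

def handle(event: dict) -> dict:
    action = event.get("action")
    if action not in ALLOWED_ACTIONS:
        # No handler exists for anything else, so the blast radius
        # is limited by construction, not by runtime policy alone.
        return {"statusCode": 403, "error": f"action not permitted: {action}"}
    # The real agent would call the corresponding AWS API here, under a
    # narrowly scoped execution role; this sketch just acknowledges it.
    return {"statusCode": 200, "action": action}
```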

But we also then limit the things that this agent can do. This is not carte blanche access to everything within the account. We need to get select information about ECS, about CloudFormation, about Application Load Balancers, but for things that are none of our concern, there are no Lambda handlers capable of doing them. So we're limiting our blast radius in that way, and really the only write operations we do are managing CloudFormation stacks or restarting ECS services from time to time, because sometimes rebooting fixes the problem.

Thumbnail 2720

Push-Button Resilience and Blue-Green Deployment Strategy

What our engineers get from all this, though, is push-button resilience. So 2,000+ engineers, they come from a wide range of backgrounds. Some of them are really, really good DevOps engineers. They're intimately familiar with every AWS product. They're certified. Some of them could not tell you the difference between an ALB and an NLB, and we need to make sure that all of those engineers get the same experience.

So it all starts with Route 53 records that ensure anyone running on our system gets latency-based routing between regions, and then failover between those regions whenever there's an outage. Some of our teams are operating in four or five different regions; some are just in two. It really depends on their needs, and they don't have to configure this failover; they get it out of the box. Those Route 53 records then point to the networking resources that take you into the application. In most cases that's an Application Load Balancer. We've also got teams using CloudFront to provide more of a point of presence at the edge, and we integrate all that directly with the Route 53 records.
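One plausible shape of those per-region records, sketched as plain data: each region gets a latency-based record with its own health check, so an unhealthy region drops out of rotation and the remaining regions absorb the traffic. A real setup would typically use alias A records pointing at the ALB rather than CNAMEs; names here are illustrative.

```python
def regional_records(domain: str, region_albs: dict) -> list:
    """Build latency-based Route 53 record sets, one per region.

    `region_albs` maps region -> (alb_dns_name, health_check_id).
    Latency-based routing sends each user to the closest region, and the
    attached health checks give cross-region failover for free.
    """
    records = []
    for region, (alb_dns, health_check_id) in sorted(region_albs.items()):
        records.append({
            "Name": domain,
            "Type": "CNAME",          # an alias A record in practice
            "SetIdentifier": region,  # distinguishes the latency siblings
            "Region": region,
            "HealthCheckId": health_check_id,
            "ResourceRecords": [{"Value": alb_dns}],
        })
    return records

rs = regional_records("app.example.com", {
    "us-east-1": ("alb-1.elb.amazonaws.com", "hc-1"),
    "us-west-2": ("alb-2.elb.amazonaws.com", "hc-2"),
})
```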

And then you get down finally to the ECS service. And so for each deployment we're running a new ECS service and then we allow some fairly complicated routing from the ALB to the service where engineers can either do 50/50 splits if they're doing an AB test or they can use a 5% split for a canary or something like that. So they're able to get a lot of these really kind of detailed and fairly complex deployment strategies again at the push of a button.

Thumbnail 2800

So, key to what I was just mentioning about these deployments: every deployment on Katana creates a new ECS service, and we do this so that we have a really reliable warm standby in case anything bad happens. Let's say that version one of my service is what's running in production, so all of my users are going to version one, and I want to touch that service as little as possible. It's like Indiana Jones and the Temple of Doom: if it's not broke, don't fix it. So we don't want to touch service one, but the development team is working on version two. They're making some changes and they need to deploy that.

And so when they finally make that deployment and deploy version two, it stands up a completely isolated ECS service and a completely isolated target group, and service one is not touched at all; it stays exactly as pristine as it was. Now, you do need to know that version two is working, so with a specific header that names version two, or a query string that names version two, or even a cookie, you could have your QA team test version two of the application, or you could run automated smoke tests against it to prove that it's working the way you want it to.
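The routing decision described here might look roughly like the following. The header name `x-version`, the candidate value, and the target group names are all illustrative, and in practice this logic lives in ALB listener-rule conditions rather than application code; the sketch just makes the decision explicit.

```python
def pick_target_group(request: dict, live_tg: str, candidate_tg: str,
                      version_key: str = "x-version",
                      candidate_value: str = "2") -> str:
    """Decide which target group serves a request during a blue-green test.

    A header, query-string parameter, or cookie naming the new version
    routes to the candidate service; everyone else stays on the untouched
    live service.
    """
    headers = request.get("headers", {})
    query = request.get("query", {})
    cookies = request.get("cookies", {})
    if (headers.get(version_key) == candidate_value
            or query.get(version_key) == candidate_value
            or cookies.get(version_key) == candidate_value):
        return candidate_tg
    return live_tg
```

A QA tester sets the header (or cookie) and lands on version two; regular production traffic never sees it.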

Thumbnail 2870

And then finally, when you're confident, you make a small change: the only thing that changes is the routing, which now points to version two's target group instead of version one's. Again, version one has not been touched; we're just moving traffic away from it. You can do this all at once if you want to go YOLO, or you can do it slowly with a percentage route to ramp traffic up to version two of the application. In this case we've got version one running as a warm standby, so if version two starts to degrade when it comes under load, version one is still there as a fallback.
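The gradual, percentage-based cutover maps naturally onto an ALB weighted forward action. Here is a minimal sketch using the ELBv2 action structure; `candidate_pct=100` is the all-at-once flip, while smaller values ramp traffic onto the new service with the old one idling as the warm standby.

```python
def weighted_forward(live_tg_arn: str, candidate_tg_arn: str,
                     candidate_pct: int) -> dict:
    """Build an ALB-style weighted forward action for a gradual cutover."""
    if not 0 <= candidate_pct <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return {
        "Type": "forward",
        "TargetGroups": [
            # The warm standby keeps whatever traffic is not shifted yet.
            {"TargetGroupArn": live_tg_arn, "Weight": 100 - candidate_pct},
            {"TargetGroupArn": candidate_tg_arn, "Weight": candidate_pct},
        ],
    }

# A 5% canary split, as mentioned earlier in the talk:
action = weighted_forward("tg-v1", "tg-v2", 5)
```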

Again, this is just built into the platform. This is a fairly complicated way to manage deployments, and our engineers don't think about this on a day to day basis, but they have used it in some really creative ways to have live preview environments. Sometimes they're not just running two versions at a time, they're running five, six, up to a dozen, I think I've seen, where they've got really busy teams who are constantly testing things and deploying these using these sort of dark release patterns.

Thumbnail 2920

Escape Hatches and Governance: Empowering Users Without Sacrificing Control

Now, as Tsahi and Jen both mentioned, escape hatches are really important for a platform. I think we've made a lot of good choices about the correct kinds of abstractions, and in most cases that works, but every now and then people need to graduate out of them, or they have something they really want to do in their own specific way. One example of how we handle this is WAF. Every load balancer we provision has a WAF ACL, and if you've ever configured a WAF ACL on your own, you know they're fairly complicated: the JSON language for expressing rules, bypasses, and rule ordering is a lot. It was a whole day of documentation reading for me.

And so we don't require our users to know all of that. We just say, you know what, most of you are going to need a rate limit. You're going to need to allow and block certain subnets. We're going to give you AWS managed rules to prevent common attacks. But if you find that your app really needs a large body size in a request, you can opt out of those things as it meets your needs. And so most of our users are very successful with this, and they never touch the JSON language of a WAF ACL.

But every now and then we get someone who has a really complicated IP set and needs to do something this doesn't support, and instead of trying to build that into the abstraction and slowing them down while we build those features, we just say, okay, go create your own WAF ACL. Tell us the ARN of the ACL you created, we'll associate it with the load balancer, and we move on. So they're able to take control, they inherit the responsibility for the WAF ACL, and they don't have to sacrifice any of the benefits they get from the rest of the platform; they continue using all of that as they did before.
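The escape-hatch decision itself is simple, which is part of why it works. Here is a hedged sketch of the idea, with a made-up default rule set standing in for the real platform-managed ACL; the rule names and limit are illustrative, not GoDaddy's actual configuration.

```python
from typing import Optional

# Stand-in for the platform's default ACL: a rate limit plus
# AWS managed rules for common attacks, as described in the talk.
PLATFORM_MANAGED_ACL = {
    "Name": "platform-default",
    "Rules": [
        {"Name": "rate-limit",
         "Statement": {"RateBasedStatement": {"Limit": 2000}}},
        {"Name": "common-attacks",
         "Statement": {"ManagedRuleGroupStatement": {
             "VendorName": "AWS", "Name": "AWSManagedRulesCommonRuleSet"}}},
    ],
}

def acl_for_load_balancer(user_acl_arn: Optional[str]) -> dict:
    """Escape-hatch logic: associate the platform-managed WAF ACL by
    default, but if a team supplies the ARN of their own ACL, associate
    that instead; responsibility transfers with it."""
    if user_acl_arn:
        return {"source": "user", "aclArn": user_acl_arn}
    return {"source": "platform", "acl": PLATFORM_MANAGED_ACL}
```

The same shape applies to the other escape hatches mentioned next: accept a security group ID or IAM policy ARN if provided, otherwise provision the managed default.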

We do this with a few other things too, so security groups, we'll automatically configure security groups to allow the things that we manage to talk to each other, but if they have a complicated security group, they can pipe that in. Or with IAM policies, we'll let them define that inline if it works, and then sometimes those policies get a little bit large to manage, and so we just say, okay, just tell us the ARN of your policy, and we'll attach it to the task role.

Thumbnail 3030

Everyone at GoDaddy uses CDK. We have centralized on CDK as our management tool for all of our AWS infrastructure, and it's very common to find developers who use Katana for what it's good at but then need to extend it: they need to associate that ECS task with a DynamoDB table, or they want to change the behavior of something. So they're able to use constructs to extend what Katana provisions within their own CDK code. You could say, I'm going to create a new Katana environment; this is actually just looking up and adding to context values. The listener gets a random string that they can't predict ahead of time, so we look that up live, get it into context, and then they can use that piece of context to add their own custom listener rule later, taking the infrastructure we've managed and augmenting it in a way that meets their needs.

Thumbnail 3080

So all these guardrails and off-ramps are very important, and it's all built around a governance system that I think one of my colleagues in the row over here is using. We use CloudFormation hooks to restrict what people are able to do, so that developers really can't put a foot wrong. You can't create an ECS service that has a public IP; it just doesn't happen. Katana builds on top of this governance system, which gives us a lot of confidence that we're not going to do something wrong as a platform, and anything you do within the platform is going to adhere to these rules.
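As an illustration of the kind of guardrail such a hook might enforce (a sketch, not GoDaddy's actual hook code), a pre-create validation on an `AWS::ECS::Service` can simply inspect the resource properties and fail any template that requests a public IP:

```python
def validate_ecs_service(resource_properties: dict) -> tuple:
    """Sketch of a CloudFormation-hook guardrail for AWS::ECS::Service:
    reject any awsvpc network configuration that assigns a public IP."""
    net = (resource_properties
           .get("NetworkConfiguration", {})
           .get("AwsvpcConfiguration", {}))
    if net.get("AssignPublicIp") == "ENABLED":
        # A real hook would return a FAILED progress event with this message,
        # blocking the stack operation before the service is ever created.
        return (False, "public IPs on ECS services are not permitted")
    return (True, "ok")
```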

Like I said, we're all using a common CDK tool set, and that gives us a lingua franca between teams: CDK is the way we communicate. The Katana-managed infrastructure itself is actually a very, very large CDK application, so in some cases users are able to just learn from our code, and I'm fine with that. They can steal the source code and say, oh, I see what you did there, and apply it to their own things. And there's no magic: there's nothing our platform does that users could not do themselves. We don't have magic permissions or elevated abilities where we can do something that you can't because only we're trusted to do it. We're doing things users could do themselves, and we're just accelerating their path, which gives them a really powerful off-ramp. If they ever needed to take one of our CloudFormation stacks and manage it themselves separately, they could, and they'd have no issues doing that within our governance framework.

Thumbnail 3160

Enterprise Integration and Why GoDaddy Chose ECS Fargate

Now at a company of GoDaddy's size, managing infrastructure is important, but really where our engineers start to see the value is in how we glue that infrastructure to the rest of the business. So like I said, we have a very good observability team, and out of the box, every one of the ECS services that Katana manages has a Fluent Bit sidecar that uses FireLens to send their logs to our observability system. We're working on a shared OpenTelemetry Collector that collects those traces and does the same thing there, and this is huge for our engineers. They don't have to do anything to get this benefit. It's just there for them.

Compliance and security: like I mentioned, we work really closely with those teams to make sure that everything is always patched, and because of our oversight of the entire platform, we're able to do this at a scale that individual teams can't match on their own. GoDaddy is a big certificate issuer and DNS provider, so we work with our internal certificate API to let the engineer who provisioned a certificate get it imported into the right AWS account so that it works with their load balancer, and get it renewed automatically. And then experimentation. Experimentation is a huge part of the DNA of GoDaddy, and you might think of experimentation as just a front-end concern, just A/B testing in a browser, but we believe that even infrastructure changes are experiments. Version two of my application might solve a performance bottleneck that version one had, and that might improve the click-through rate on a checkout funnel. So every time a new version of an application is deployed on Katana, we send that data to our experimentation platform so that our data scientists can correlate it with their hypotheses and other data sources.

And then enterprise networking is tricky. It's messy. We've got a mix of AWS accounts and on-prem. If you need on-prem access, you have to attach yourself to a certain subnet group, something like that; for our users, it's a checkbox. We work with those teams to make networking as easy as we can make it.

Thumbnail 3280

Now why ECS Fargate? There are a lot of compute platforms at AWS, and many of them support containers, which is good for us. We chose ECS Fargate because it really hit the happy medium of cost, management, and flexibility for us. Management is very important: we have a small team of about 7 or 8 engineers running our platform, and 2,100 engineers that we have to serve. So if we're patching EKS instances every day, we can't do it; it's not manageable at that scale. ECS Fargate gives us the ability to not have to manage or patch those instances.

And like I said, this was invented before ECS Managed Instances, which changes that a little bit. We also don't have to deal with scheduling or anything like that, and as someone who has previously run Istio exactly once, I can tell you it is so nice not to have to do that. But we don't sacrifice anything in flexibility: teams still have full access to all their container settings, to VPC connections, to things like Service Connect. And it's a good cost fit for the kinds of services we run, because we have a lot of always-on, customer-facing web apps that work really well with the way ECS is priced.

Thumbnail 3340

Lessons Learned: Building Platforms for a Spectrum of Users

So finally, our lessons learned. Our users really value the single pane of glass; I underestimated how much they would enjoy that. Operating via CloudFormation is kind of difficult. CloudFormation has a very static way of expressing infrastructure, and ECS services are very dynamic: you make a deployment and that creates hundreds of events, and it's checking health checks, so there's quite a bit going on. Getting the impedance match between those two worlds right is something we've had to put real work into.

But it does have its perks. If we ever back ourselves into a corner, we can move traffic away from a region, delete all the stacks, recreate them exactly the way they were before, and the app runs the way we want. And we have to partner with other teams. We call ourselves a platform of platforms: we're all thinking with a platform mindset, but I can't fix everything myself, so working with our observability, security, and governance teams has been essential to making a platform that meets the needs our developers have.

Thumbnail 3400

So I'll leave you with one more philosophical tension. I think it would be a big mistake to think of your developers as one kind of person, to assume that on the spectrum from AWS expert to AWS newbie there is only one kind of engineer at your company. You're obviously going to find people all across that spectrum. What we found is that to meet their needs you have to recognize that wide spectrum and build a platform that serves all of it: one that allows the AWS experts to thrive and have the control they want, and that allows the less AWS-centric people to get the job done without having to know everything or get certified.

So as you leave, if you're going to consider building platforms or if you're working on a platform of your own, please consider making your platform meet as wide of a spectrum of your engineers as you possibly can so that you can get success for a wide range of people, and I'll hand it back to Jen. Thank you.

Conclusion and Resources for Getting Started

I'll say it again: I feel like I've drawn so much inspiration from what Keith has built at GoDaddy. I know we're already thinking about things we want to add to our roadmap from some of what he's done. So thank you again, Keith, for sharing what you've built. I hope you gained a lot from this session. Please try out ECS Express Mode. I hope you've learned a little about accelerating application development and about building platforms on ECS.

Thumbnail 3500

Check out this QR code. It is specific to this session. You can download a PDF of the slides. We also have a link to the ECS immersion day workshop which now includes ECS Express Mode, and we also have a link to the GitHub action that we just launched. Thank you again and please fill out the session survey. Thank you.


; This article is entirely auto-generated using Amazon Bedrock.
