AWS re:Invent 2025 - The future of Kubernetes on AWS (CNS205)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - The future of Kubernetes on AWS (CNS205)

In this video, AWS presents major EKS enhancements including EKS Ultra clusters supporting up to 100,000 nodes and 800,000 GPUs, Provisioned Control Plane for predictable performance, and EKS capabilities with managed Argo CD and ACK/KRO. Netflix's Niall Mullen shares their migration journey to EKS, moving hundreds of thousands of containers across four regions in just over a quarter. Key updates include enhanced container network observability, ECR archival and managed image signing, cluster deletion protection, AWS Backup integration, and the hosted EKS MCP server. The team details architectural innovations like etcd improvements using AWS's journal system, multi-network interface support for 100 Gbps bandwidth, and concurrent image downloads. Future priorities focus on eliminating platform building complexity, enabling workloads across multiple clusters, and deeper AWS service integrations to make Kubernetes truly managed.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 10

Thumbnail 40

Welcome to the Future of Kubernetes: Netflix's Journey to EKS

Good afternoon and welcome to re:Invent and welcome to the Future of Kubernetes. I admit that's a bit of a mysterious sounding title, but I assure you we're going to cover a real concrete overview of recent enhancements to EKS as well as a sneak preview of what's coming in the next couple of years. About six to nine months ago, Netflix came to us with a vision that I imagine some of you in the audience have had. They said we want to use Kubernetes, but we want to make scale and operations somebody else's problem. I'm lucky enough today to be joined by Niall Mullen, who is the Senior Director of Cloud Infrastructure at Netflix, and he's going to talk about how Netflix migrated to EKS in a matter of months, not years. Then I'm going to cover how we're making that philosophy available to all customers no matter their size.

Thumbnail 70

We're also joined by Eswar Bala, who is the Director of Engineering for Containers at AWS, and he's going to talk about some of the scaling innovation we've done over the last year to make sure that we support the very large scale that Netflix runs at. I assume everybody in this room is familiar with Kubernetes. This data is from the latest CNCF survey in 2024, which shows that now 80 percent of enterprises are using Kubernetes in production. I think that's up from 66 percent when it was done in 2023, so over 90 percent of companies out there are at least evaluating, if not using Kubernetes at this point.

Thumbnail 100

Why has Kubernetes become so popular? Simplicity is the biggest reason. There was a recent tweet that said something like Kubernetes is the wrapper of 15 years of bash scripts, runbooks, and cookbooks that SREs have made, all wrapped up behind a simple to use declarative reconciling set of APIs. That really resonated with me and what I've heard from customers as well. Kubernetes is a very simple way to manage your cloud infrastructure. The other two reasons that come up are consistency and extensibility.

Thumbnail 130

With simplicity first, you're never starting from scratch when you're using Kubernetes. If you need to run a Spark job, you use the Spark operator. If you need to run an ML training job, you go to a project like Kubeflow. Very rarely are you starting from scratch, and that's simple; you don't have to write a lot of custom stuff. Consistency is the other reason we hear a lot from customers. You can run Kubernetes on AWS, on premises, on other clouds, on the edge, even on fighter jets, which we've seen with EKS. You can run it anywhere. Then there's extensibility. While Kubernetes is simple, it doesn't cover every single use case you might need. With the CRD model and its extensibility, you can customize it to do anything you want. We see more and more customers today starting to use EKS and Kubernetes to operate their entire business, not just orchestrate containers.

Thumbnail 190

We announced EKS back in 2017, I believe as a preview at re:Invent. It's now been eight years since we launched EKS, and we've advanced from the early days when it was just about managing the Kubernetes control plane. We've launched open source projects that have become new standards in the Kubernetes ecosystem. We've expanded beyond the control plane to add-ons and the data plane, and today we'll cover our latest expansion, EKS capabilities, which we announced just last night.

Thumbnail 240

Understanding Kubernetes in Context: Infrastructure to Developer Tooling

Kubernetes in context means your goal is to deliver business value, not necessarily operate infrastructure. Our goal is to deliver the fundamental components that you need to build a production-ready Kubernetes environment. What does that mean at the infrastructure layer? Compute, networking, storage. None of that's ever going to change. One of our jobs on the EKS team is to make sure you get those fundamental building blocks that AWS gives in a Kubernetes native way, things like our CNI that integrates with the VPC, or Karpenter that integrates with EC2, or our storage drivers integrating with storage services. That foundation is super important, and we're always investing there.

Then there's the control plane, which is the hardest part of doing Kubernetes, not for the faint of heart. Nearly everybody who's running Kubernetes on AWS these days, including Netflix now, is using EKS. They're not doing it themselves. Management tooling, governance, compliance, and security are the next layer you have on top. We've expanded to start to help you there. Then there's developer tooling. I mentioned in that earlier slide that I think 80 percent of customers are evaluating Kubernetes now. Some recent data has actually shown that developer familiarity with Kubernetes is decreasing. You might say that's a harbinger that Kubernetes is losing popularity, but we actually see it's the opposite. Kubernetes is just becoming a layer in the stack like Linux has become, where developers don't really have to think about Kubernetes anymore.

Thumbnail 320

They just use their familiar tooling, deployments, IDPs, jobs, and workflows, and that runs on Kubernetes underneath, but it's not something they have to be terribly familiar with.

Thumbnail 340

Thumbnail 350

ECR Enhancements: The Unsung Hero of AWS Container Services

I want to talk briefly about container registry and ECR. I think it's the unsung hero of the AWS container world, and it's something Eswar and I focus on in addition to Kubernetes. I'll quickly go through some of the advancements we've made in the container registry space. ECR does over 2 billion image pulls a day. Every application you're running in a container orchestrator, if you're using ECR, starts with an image pulled from there.

Some recent enhancements we've made include enhanced scanning. We did an integration with Inspector where we provide advanced scanning for images. One of the features we're most excited about—and Eswar and I especially talk a lot about how we can do better together across container services, how EKS, ECS, and ECR can work better—is an example of asking how we can make security folks' lives easier. We run a scan, and you get a report of a vulnerable image. Well, now where is that running? You might have dozens, hundreds, or thousands of clusters depending on your environment. This feature we launched a few months ago makes it easier to see a live inventory of vulnerable images that are running.

Thumbnail 400

Another enhancement is authenticated pull-through cache, including ECR-to-ECR pull-through cache. ECR lets you pull images from upstream registries like Docker Hub and ECR Public, but you can also pull through from your own ECR repositories. That works across regions and across accounts, and it's a feature we launched earlier this year.
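As a rough sketch of how a rule like this is configured with the AWS CLI (create-pull-through-cache-rule is the documented ECR operation, but the account IDs, prefixes, and region here are placeholders, and an ECR upstream may need additional parameters):

```bash
# Cache images from another ECR registry (cross-account, cross-region).
aws ecr create-pull-through-cache-rule \
  --ecr-repository-prefix shared-base \
  --upstream-registry-url 111122223333.dkr.ecr.us-east-1.amazonaws.com \
  --region us-west-2

# First pull populates the cache; later pulls are served locally.
docker pull 444455556666.dkr.ecr.us-west-2.amazonaws.com/shared-base/team/app:v1
```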

Thumbnail 420

Some other notable 2025 launches: the one I want to highlight, which I believe was the highest upvoted issue on the entire containers roadmap, is tag immutability. Certain images you want to upload and never have change, for security and compliance reasons. But the latest tag specifically, often in development workflows, is one you are okay with changing. Within a repo, we now give you the flexibility to enable tag immutability while excluding certain tags, for example latest, so they can remain mutable.
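A hedged sketch of what that configuration might look like with the AWS CLI; put-image-tag-mutability is the long-standing ECR operation, but the exclusion-related values shown are assumptions based on the launch description, so verify them against the current docs:

```bash
# Assumed shape: make tags immutable except those matching the filter,
# so "latest" can keep moving in development workflows.
aws ecr put-image-tag-mutability \
  --repository-name my-app \
  --image-tag-mutability IMMUTABLE_WITH_EXCLUSION \
  --image-tag-mutability-exclusion-filters filterType=WILDCARD,filter=latest
```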

Thumbnail 450

Our two new launches for re:Invent: one is archival. We heard from a lot of ECR customers that when you get to a certain scale, for compliance reasons they still need to keep images around. However, you can rack up terabytes of storage that you don't necessarily need. You may need it someday because an auditor comes back and says you need it, but you don't need it in your primary registry, ready to be pulled for actual usage. With ECR archival we now give you an easy way: you can look at the time an image was last pulled to decide whether to archive it into a separate storage class that comes at a lower price. If you do eventually need that image back to run in workloads, you can bring it back from the archive storage class, very similar to services like S3 that have different tiers of storage classes. We now have that in ECR.

Thumbnail 500

The last one in ECR is managed image signing. You're always going to hear about security in an AWS talk. It's always one of the top things we think about. We now provide managed signing in ECR so you don't have to run separate infrastructure. It's fully automated and then it's integrated with AWS Signer. It's integrated with all of your tools in AWS like CloudTrail. Now with a simple API call you can sign an image and have that automatically done in ECR.

Thumbnail 530

EKS Upgrades: Making the Inevitable Less Painful

Okay, back to Kubernetes and EKS. We run lots of clusters—tens of millions of them. This chart tracks the number of create calls we see to our API over the course of the year. I think we all agree Kubernetes is awesome, but Kubernetes is hard to run. We make it a point in EKS that we are always going to run vanilla upstream Kubernetes. Conformance is super important to us. We're never going to run a version of Kubernetes that differs from what you get upstream. If your workload runs on Kubernetes somewhere else, it's going to work on EKS, and that's a super important tenet for us.

Thumbnail 580

I'm going to now cover quickly a lot of the enhancements we've made over the course of the year for EKS upgrades. It's a word that probably makes a lot of you in this audience shudder when you hear about Kubernetes and EKS. I think death, taxes, and Kubernetes upgrades—when you make the choice to adopt Kubernetes, upgrades are just something you have to stay up with. We recognize that. We know it's painful, but when you're adopting any managed open source service in AWS generally, upgrades are part of that. A few things we've done this year include cluster insights.

Cluster Insights scans your clusters and looks for certain things that may impact your ability to upgrade to the next version of Kubernetes. Are you using deprecated APIs? Are you running a kubelet version that's more than two versions behind? There's a whole host of things we look for, and we surface that to you through Upgrade Insights. You can now refresh that on demand. One of the pieces of feedback we heard was that we surface this once a day, but if you go fix that, it takes too long for you to figure out if it was addressed or not.

The other thing I'll mention is Kubernetes version support acceleration. We've put a lot of effort in over the last couple of years to make sure we stay up to date on Kubernetes versions. They come out every four months now. If you look over the last two years, I believe every single release of a version in EKS has been within 45 days. If you had talked to us a couple of years ago, we were not there. We've put in a lot of work to make sure that's the case, and you can be assured that moving forward, within that 45-day window, new versions will be available in EKS.

Thumbnail 670

Upgrade Insights is a hidden feature. I talk to customers and they're like, "Oh, we didn't even know you surfaced that." It's available in the console if you go to the Monitor Cluster dashboard. These run every day by default, and you can now refresh them yourselves. But every time we release a new version, we're checking to see if there's something new we should look for. In version 1.33, we started checking whether you're using Amazon Linux 2, because that's no longer supported in 1.33. This is something we're always improving over time, adding new things to look for as new versions come out.
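If you prefer the CLI to the console, the insights surface through the documented EKS operations (the cluster name is a placeholder; the on-demand refresh mentioned above is newer, so its exact CLI shape may differ):

```bash
# Summarize upgrade-readiness findings for a cluster.
aws eks list-insights --cluster-name my-cluster

# Drill into one finding, e.g. deprecated API usage detail.
aws eks describe-insight --cluster-name my-cluster --id <insight-id>
```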

Thumbnail 700

EKS Global Dashboard: Cross-Account, Cross-Region Visibility

One of the pieces of feedback I got from customers is that they need to upgrade but don't even know which clusters they need to go upgrade. They have clusters across dozens of accounts and dozens of regions. Account managers are reaching out saying, "Hey, can you go figure out where all my customers' clusters are? They know they need to upgrade, they just can't find them." As customers run more clusters across these various boundaries, we knew we needed an easy way to make it viewable where those clusters are. We did a lot of research and found that various services in AWS had regional or multi-region dashboards. Some had global consoles like EC2, and others had cross-account dashboards. But nobody, as far as we knew, had yet done a true global cross-account, cross-region dashboard. We knew if we had to solve this problem of figuring out what's going on across your clusters, we had to solve both of those things.

Earlier this year, we launched the EKS Global Dashboard, which gives you a centralized inventory of all of your clusters. It's not necessarily meant as a high-severity troubleshooting tool. It's more of an executive-level dashboard. You log in Monday morning with your cup of coffee and see what versions your clusters are on and who you have to go chase for upgrades. It's that type of workflow where you just need to understand what the state of all your Kubernetes clusters looks like.

Thumbnail 790

Observability and Troubleshooting: Enhanced Container Network Observability and MCP Server

We've continued to work on observability, making it easier to understand what's going on in your cluster and troubleshoot issues. The Monitor Cluster button that's now in the EKS console has a whole bunch of sections. You can see node health. One of the ones that I think is also kind of hidden but is super helpful is that we surface through CloudWatch Log Insights the top talkers to the API server. We see a lot of cases and tickets opened where it turns out some new deployment or policy agent got released in the cluster and started making thousands of list calls, which took down the API server. You run this query, and it makes it immediately obvious that it was some deployment that did that.
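A CloudWatch Logs Insights query along these lines surfaces those top talkers; it assumes audit logging is enabled and runs against the cluster's control plane log group (/aws/eks/<cluster-name>/cluster by default):

```
fields @timestamp, userAgent, verb, requestURI
| filter @logStream like /kube-apiserver-audit/
| stats count(*) as requestCount by userAgent
| sort requestCount desc
| limit 10
```

A runaway controller or policy agent shows up immediately as a user agent with an outsized request count.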

Networking—I'm sure Eswar and his team will agree with me—we've seen every possible way that Kubernetes can fail over the seven years we've been doing EKS. Most of the time, it's networking. The feedback we got from customers was that they'd rather not have to open support tickets. They'd be much better off if you can give them the visibility so they can troubleshoot themselves. It's often something going on in the environment where they need more visibility into what's happening. We spent a lot of time understanding those needs, and last week we launched Enhanced Container Network Observability.

Thumbnail 850

This is a single agent you run in your cluster that exposes metrics and ships them to CloudWatch. It's useful for proactive network monitoring, so you can understand whether you're nearing DNS packet limits or experiencing retransmission timeouts. Kubernetes networking, I think, got a lot of things right: it's a single flat networking namespace with an IP per pod. When it works, it's great. When it doesn't, it can be hard to troubleshoot. We're really excited about this one. It's easy to turn on, and you get out-of-the-box visualizations. For many years I've wanted to be able to show pod networking flows.

Thumbnail 910

So once you turn this on, you get a native service map so you can understand which pods are talking to each other. Oftentimes that's the first point when it comes to figuring out what networking issue is going on. The flow view lets you see within the cluster who's talking to who, and you can see both cross-AZ traffic and pod to service communication.

Today we support S3 and DynamoDB, and in a lot of cases ML training does a lot of communication with S3, so we make that easy to understand. You can also see pod traffic that's external to the cluster. Check this one out—it really helps with understanding what's going on in the networking observability space.

Thumbnail 960

Another recent launch comes down to troubleshooting. Troubleshooting is hard, but it's easier if you have the right context. We launched an MCP server earlier this year in preview, and a lot of the feedback was that this is good, but until you manage it for us we're not actually going to use it in production. So we launched last week a hosted version of the EKS MCP server, which is available in public preview.

Troubleshooting and getting started are the first two big use cases we have. For troubleshooting especially, we have seven years of EKS experience, and we've seen all the ways Kubernetes can fail. We have runbooks and troubleshooting guides built into the MCP server. So if you ask it about something that's going on, it can look up the same knowledge that a support team would when you open a case. You can now do that yourself through this MCP server, and it's integrated with the tools you're used to—the Q console and, for security of course, CloudTrail. The hosted version has all of the enterprise-grade features you need to actually use it in production.

Thumbnail 1020

It's integrated into Q, so if you go into the EKS console now as of last week and you see something that's failed, whether it's a pod that's in a crash loop or some networking issue with the new feature I just talked about, you can now click and ask Q what's going on. It automatically integrates with the MCP server behind the scenes. You don't have to do that yourself, and this can be helpful to understand more quickly what's going on in your environment.

Thumbnail 1050

Production-Ready Features: Cost Allocation, Pod Identity, and Auto Mode

For an even deeper level of observability, CloudWatch Container Insights has come a long way in the last two years. We recently launched EBS metrics, more detailed GPU metrics, and application signals support. There are lots of observability tools out there that you might be using, but CloudWatch Container Insights is our native version, the easy button. If you don't want to think about which metrics to scrape or which alarms to set up and just want an opinionated Kubernetes monitoring stack, this is going to be your answer.
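Enabling it is a one-liner through the managed add-on (the add-on name is the documented one; the cluster name is a placeholder, and the node role still needs the CloudWatch agent permissions):

```bash
# Install the CloudWatch observability agent as an EKS add-on.
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability
```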

When you move to Kubernetes, you get a lot of efficiency benefits. You're not running a single app per VM. You're getting bin packing where you can have multiple applications running on the same instance, and that has a lot of benefits. But it comes at the expense of cost visibility. It's hard to figure out with multiple applications running on the same instance who's actually contributing most to the cost.

Thumbnail 1090

We have several options there. We have a partnership with Kubecost, which is an open source tool, and we have an EKS version you can run. That's still an agent you have to run in your cluster. For the fully managed version where you can check a box with no additional cost, we now integrate with Split Cost Allocation Data. We announced this last year, but two of the biggest feature requests we heard following that were support for Kubernetes labels—so not just understanding at a namespace level what the cost is, but at an individual label level on deployments—as well as support for GPUs. These were two of the big feature requests that are now supported in EKS as of several weeks ago.

Thumbnail 1150

We launched Pod Identity last year to make it easier to connect or authenticate from your pods and deployments to AWS services like RDS, S3, or DynamoDB. This was honestly a lesson learned for us. Almost all of you are running multiple accounts. When you're running real world production workloads, you might have your applications in a separate account and your compute in another. Making Pod Identity work for cross-account was a big request once we launched that feature, and we made that work this year.

There are enhancements like session tag support, so you can do things with Pod Identity like allowing a pod in a particular namespace to only read from an S3 bucket prefix that matches the namespace. You don't need ten different policies to do that; you can do it with a single policy using session tags.
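A minimal sketch of that pattern, using the documented Pod Identity association API and the kubernetes-namespace session tag that Pod Identity attaches to the role session (the cluster, namespace, role, and bucket names are examples):

```bash
# Map a Kubernetes service account to an IAM role via Pod Identity.
aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace team-a \
  --service-account app-sa \
  --role-arn arn:aws:iam::111122223333:role/app-role

# One policy covers every namespace: reads are scoped to the S3 prefix
# matching the pod's namespace via the session tag.
cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-bucket/${aws:PrincipalTag/kubernetes-namespace}/*"
  }]
}
EOF
```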

Thumbnail 1210

Cluster deletion protection is an important feature for production environments. There are many mission-critical workloads running on EKS. A simple thing like an infrastructure-as-code tool bug—where you click plan and then click apply and it does something you didn't want—is a real risk that many customers face. We heard from customers that they have really critical workloads running in EKS, so we added cluster deletion protection, which is similar to what other services have.
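A hedged sketch of what toggling this might look like; the flag name here is an assumption based on how similar AWS services expose deletion protection, so check the EKS CLI reference for the real shape:

```bash
# Assumed flag name: guard the cluster against accidental deletes.
aws eks update-cluster-config \
  --name my-cluster \
  --deletion-protection

# While enabled, delete-cluster calls fail until protection is
# explicitly turned off again.
```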

Thumbnail 1240

Thumbnail 1260

EKS add-ons is a feature we launched a while back, but we've continued to expand it. This reflects the theme of wanting to run Kubernetes while having AWS take on more of the heavy lifting. We launched community add-ons this year, making some of the most common add-ons, like Metrics Server and ExternalDNS, available through AWS. Another feature we just launched is backup support for EKS. For mission-critical and stateful workloads, we wanted to make it easier for customers to back up their data and show compliance to their auditors. AWS Backup support for EKS launched a couple of weeks ago at re:Invent. It's agentless and fully managed, and it works alongside all your other AWS services that are already integrated with AWS Backup. It works cross-account and cross-region, so out of the box, we believe it will cover most of the backup use cases you need.
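Community add-ons install through the same add-on API as the AWS-built ones (the cluster name is a placeholder):

```bash
# Install Metrics Server as an EKS community add-on.
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name metrics-server
```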

Thumbnail 1300

When restoring, you can restore just specific namespaces, restore to existing clusters, or restore to new clusters, providing a pretty flexible approach. This is definitely one of the jobs of the product team: to make sure the rest of AWS is building Kubernetes-native integrations that work well. EBS is a good example of a team we've worked really closely with. They've launched recent features like enhanced data protection, faster initialization, and volume cloning, and as soon as these EBS features launched, they were available in the EBS CSI driver. It's a good example of how we work closely with another service team so their innovations and features come to EKS as soon as they're launched.

Thumbnail 1340

Thumbnail 1350

Thumbnail 1360

EKS runs everywhere you need to be. We are a launch-blocking service in every region and every availability zone, and that will continue to be the case. We run anywhere you need to be, whether that's fully in the cloud or on premises, and we have a whole spectrum of offerings available to help you there. Last year at re:Invent, we launched Hybrid Nodes, our newest approach to supporting on-premises infrastructure. You can have the control plane in the cloud with worker nodes on premises, which is a better approach as long as you have connectivity to the cloud. This is different from EKS Anywhere, which we launched several years back as a fully on-premises, air-gapped approach to running clusters. Hybrid Nodes is an easier approach to running on premises if you have that connectivity back to the region.

Thumbnail 1390

Thumbnail 1400

Thumbnail 1410

Some of the features we launched this year include Bottlerocket support and expanded support for Cilium. Configuring networking on premises can be difficult, so we surface a lot of insights during that setup. Auto Mode was our other big launch at re:Invent last year. Auto Mode is our fully managed data plane. We've launched various features over the years—managed node groups, Karpenter, and EKS Fargate—and we've learned a lot. With Auto Mode, we believe we have the most Kubernetes-native, conformant, fully managed data plane that takes away the heavy lifting you need to do.

Thumbnail 1430

Thumbnail 1440

Thumbnail 1460

The way we did this is by working with EC2. Previously, without Auto Mode, you run EC2 instances and all the controllers in your cluster. With Auto Mode, we take on the responsibility for running the controllers—things like the EBS driver, Karpenter, and the load balancer controller. We run those on our side, while EC2-managed instances run in your account but are managed by us. This was an innovation we worked with EC2 to make happen. We've continued to iterate on Auto Mode this year. We added static capacity and advanced networking options, which was a big ask to support running pods in separate subnets. We also added region expansion and faster image pull. We'll continue to iterate on Auto Mode, and there's a link to the changelog where we're launching new features almost every week.
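For reference, enabling Auto Mode on an existing cluster looks roughly like this; it mirrors the documented update-cluster-config shape, where compute, load balancing, and block storage are toggled together (the names and role ARN are examples):

```bash
aws eks update-cluster-config \
  --name my-cluster \
  --compute-config '{"enabled":true,"nodePools":["general-purpose","system"],"nodeRoleArn":"arn:aws:iam::111122223333:role/AmazonEKSAutoNodeRole"}' \
  --kubernetes-network-config '{"elasticLoadBalancing":{"enabled":true}}' \
  --storage-config '{"blockStorage":{"enabled":true}}'
```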

Thumbnail 1480

EKS Capabilities: Managing Beyond the Cluster with Argo CD and ACK

My last update before I hand it off is probably our biggest update to EKS since we launched seven years ago. You get a cluster that's production-ready, but that's not enough to actually run your applications. How do you deploy applications to those clusters? Those applications generally need other AWS services. Maybe you need an S3 bucket or ElastiCache. How do you provision those and make it easy to connect them to the workloads? We announced EKS capabilities last night.

Thumbnail 1510

This is our take on expanding beyond the cluster to manage the heavy lifting of the platform everybody's building around Kubernetes. We started with deployments: we now have a managed version of Argo CD. Generally, our take with capabilities is that when there's an existing community standard that the majority of our customers are using, we're going to take it and manage it for you. That's what we did with Argo. It's very loosely opinionated. If it works in Argo, it's going to work in this, but where possible we're also going to add AWS integrations that make sense and make things better.

Thumbnail 1540

Secrets is a pain point with GitOps. We added a native integration with AWS Secrets Manager to make that easier. Setting up credentials for GitOps can be challenging, so we have an integration with AWS CodeCommit. I think one of the most underrated innovations that we have with our version of managed Argo that you can't do when you manage it yourself is that we manage the networking and sync traffic behind the scenes. This works across accounts and across regions. You don't have to worry about network connectivity across those boundaries—that's handled by us.
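Because the capability stays loosely opinionated, a plain Argo CD Application spec like the one below should carry over unchanged; the repo URL, paths, and namespaces are placeholders, and the exact onboarding flow for the managed capability will differ from self-hosted Argo:

```bash
kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/my-service.git
    targetRevision: main
    path: deploy/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF
```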

Thumbnail 1570

The other one we launched is managed capabilities for ACK and KRO. These are two open source projects. ACK we launched several years back. This is a Kubernetes-native way to manage AWS resources, and then KRO is a way to build abstractions around those resources and publish those as APIs. We see this as really useful for a common pattern we see now. Rather than using traditional infrastructure as code with CloudFormation or Terraform, where a developer needs an S3 bucket, they go to a team, they open a ticket, they get their bucket, they come back, they hook it up. Now they define their infrastructure alongside their Kubernetes applications all in a Kubernetes-native way.
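As a small illustration of the pattern, this is what provisioning an S3 bucket looks like with the ACK S3 controller's Bucket resource (a real ACK CRD; the names are examples), sitting right next to your app manifests:

```bash
kubectl apply -f - <<'EOF'
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: my-app-bucket
  namespace: team-a
spec:
  name: my-app-bucket-111122223333   # S3 bucket names are globally unique
EOF
```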

Thumbnail 1610

Thumbnail 1620

The customer experience works like this: as an EKS administrator, you create the capability, and as a developer, you use the familiar tools you already work with, just as if you were using the open source versions. Putting it all together, a self-managed platform with EKS looks like this. When you combine it with capabilities and Auto Mode, we're managing a lot more of the pieces for you.

Thumbnail 1670

Taking EKS to Unprecedented Scales for AI and ML Workloads

With that, I'm going to hand it over to Eswar, who's going to talk about some of the large-scale innovations we've made this year to support running really large workloads on EKS. Thank you, Mike. As Mike highlighted, Kubernetes' simplicity and extensibility made it the go-to platform for containerized workloads. What I want to share now is how we are taking EKS to unprecedented scales to meet the demands of AI and ML workloads. The innovations I'm about to show enable capabilities that seemed out of reach just a few years back.

Thumbnail 1680

Let's dive in. We observed three key developments that are shaping the Kubernetes and AI landscape. First, Kubernetes has become absolutely central to AI and ML operations. This is not by accident. Its declarative and extensible model, the robust orchestration capabilities, and the rich community tooling are exactly what makes complex AI workloads easier. The second development is about the scaling laws that we observe right now. Increasing model sizes correlate with model capabilities. We've moved from models with millions of parameters to hundreds of billions and even trillions these days. Every step in that evolution in model size demands a lot more from the infrastructure.

Thumbnail 1770

Which brings us to the third point: scale requirements have exploded. Modern AI training is not just about managing GPUs; it's also about getting the best out of all the other infrastructure, like compute and storage. We also observe that customers never just run training inside a single cluster. They run a variety of diverse workloads in the same cluster to share capacity. They run their inference workloads there, and we also see startups running domain-specific trained models behind high-throughput inference services. This projection from Gartner is absolutely stunning: 95 percent of new AI workloads are going to run on Kubernetes, a dramatic increase from 30 percent today. What this means is that Kubernetes is evolving beyond a general-purpose container orchestrator to become the de facto substrate for AI and ML workloads.

Thumbnail 1800

We launched EKS Ultra clusters in July, and we worked really closely with Anthropic, one of our customers, to address the challenges that they were facing in their critical AI/ML workloads.

Thumbnail 1820

While we initially developed this capability with AI in mind, we're observing that customers like to use this ultra-cluster setup for other workloads as well. These clusters can scale to close to 100,000 nodes and can harness up to 800,000 GPUs in a single cluster, or 1.6 million Trainium accelerators. We are proud to have achieved this scale while maintaining full community conformance. This means you can use your existing tools, infrastructure, and workflows without any modification, just at a larger scale.

Thumbnail 1860

This graph illustrates what running AI workloads means in practice. We observe our customers running multiple distinct workloads, and in our testing we orchestrated three distinct workload types simultaneously: training jobs, fine-tuning jobs, and inference worker sets. We have pushed the system to handle up to 100,000 concurrent pods, scaling up in a really short duration. You see the climb in just a few minutes, with rapid scaling events where thousands of resources are provisioned quickly across all three workload types.

Thumbnail 1920

The key point here is that managing this scale reliably is about maintaining consistent performance throughout the lifecycle of the workloads, and we have reimagined our control plane architecture to achieve exactly this. Let's look at the EKS control plane. Today, the control plane stores and manages cluster state in etcd, which sits at the heart of the cluster. We actually run three etcd nodes per cluster, and etcd serves as the backbone for the entire cluster, storing all configuration data and the state of your Kubernetes objects.

It uses the Raft consensus protocol to ensure data consistency across all nodes. There is also an MVCC layer, multi-version concurrency control, that manages concurrent access to the data and provides the key isolation property when there are multiple writes or reads happening at the same time. The system is anchored by BoltDB, a memory-mapped key-value store backed by persistent storage. We also have a write-ahead log, which guarantees the durability of your cluster state. We have carefully engineered this setup to right-size as demand requires, while also taking regular, rapid backups so we can restore the cluster state in the unlikely event the cluster goes down.

This architecture has served us well, but as you'll see on the next slide, we have made some foundational changes to etcd to support ultra-scale operations. There are three innovations I want to highlight. First, the database, the BoltDB I mentioned: we moved it from network-attached storage to an in-memory, tmpfs-based solution. This shift delivers order-of-magnitude performance improvements in both read and write operations.

Second, we have partitioned the key spaces stored inside the cluster, which allows resource types to be split into separate etcd clusters. Our testing shows this delivers up to five times the write throughput while preserving durability and the rich API semantics. But the most critical advancement is how we have offloaded consensus management. We have moved from traditional Raft-based consensus to leveraging AWS's journal system, a technology we have been perfecting for over a decade and one that underpins many of the AWS services you know.

This allows us to scale etcd replicas without being bounded by quorum requirements; there is no need for etcd peer-to-peer communication anymore. It also delivers ultra-fast data replication with multi-AZ durability. Together, these three innovations form the foundation that enables EKS to support ultra-scale clusters.

Thumbnail 2110

Reimagining the Control Plane Architecture with AWS Journal System

Here you see the new architecture in EKS. The key transformation, as I mentioned, is how we handle consensus. On the left is the storage layer, the MVCC layer backed by BoltDB, and on the right is the replication layer, which is the consensus layer. We've integrated that with AWS's battle-tested multi-AZ transaction journal, while maintaining the familiar gRPC interface, which means we kept the same interface between the existing Raft-based consensus and the new journal-based consensus. You get massive improvements without sacrificing compatibility or durability.

Thumbnail 2160

While we've enhanced the cluster control plane, we've also made significant improvements to boost your applications' performance and reliability, with four key advancements. First, we've introduced multi-network interface support for pods, enabling network bandwidth up to 100 gigabits per second. This is critical for AI workloads moving massive datasets. Second, we've implemented concurrent download and unpacking of images using our new container runtime based on SOCI (Seekable OCI). This allows us to cut container image pull times roughly in half.

For network efficiency, we've moved from individual IP assignments for your pods to prefix delegation, which assigns a CIDR range to an instance up front. This dramatically improves node launch rates, up to threefold, while optimizing your VPC address range. Finally, we've launched auto-repair capabilities for all of your compute, including accelerated compute. We automatically detect and replace unhealthy nodes to maintain consistent performance.
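At ultra scale EKS applies this for you, but for comparison, the equivalent toggle on the self-managed VPC CNI is a documented setting:

```bash
# Switch the VPC CNI from per-pod IP assignment to /28 prefix delegation.
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
```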

Thumbnail 2240

Let's talk about the real-world impact here. The adoption of GPU workloads in EKS has been remarkable. Since 2024, we've seen GPU instance usage with EKS more than double. While AI and ML remain significant drivers, we're seeing GPU adoption across various domains. We see scientific simulations, video processing at scale, real-time rendering, and high-performance computing.

While these are groundbreaking capabilities for massive workloads, we asked ourselves a fundamental question: how can we bring these performance improvements to all of our customers? I'm sure not everyone needs a 100,000-node cluster, but I think every customer deserves predictable, high-performance control plane operations, whether you're running your microservices platform or managing enterprise-critical applications. We are bringing the performance benefits of our ultra-scale architecture to all clusters.

Thumbnail 2290

Provisioned Control Plane: Bringing Ultra-Scale Performance to All Customers

I'm really excited to introduce the Provisioned Control Plane. It's a first-of-its-kind offering that brings the performance benefits of our ultra-scale architecture to all EKS clusters. At its core, the Provisioned Control Plane gives you the ability to select high-performance control plane scaling tiers with pre-allocated capacity. Think of it as shifting from on-demand capacity with auto scaling behind the scenes to a provisioned, reserved-capacity model for your control plane. It's particularly valuable for any workload that demands predictable performance, whether you're running ML training, need consistent pod scheduling rates, or operate a multi-tenant platform. You expect the control plane to keep up as your workload scales.

Thumbnail 2350

There are three key benefits to the Provisioned Control Plane. First, you can provision your control plane capacity, eliminating the uncertainty of dynamic scaling, which is what happens today. The cluster control plane scales up and down based on the workloads you're putting on the cluster. This means you'll have consistent, predictable performance for your critical workloads. No more worrying about control plane scaling delays during critical operations.

Second, you gain access to increased compute capacity. We're talking about multiples of standard performance levels. You get capabilities like processing up to 6,800 concurrent API requests in our highest tier. Third, you can set up the control plane tiers to handle unexpected demand spikes, and this is particularly valuable when you're planning for high-traffic events. We have designed these options with flexibility in mind.

Thumbnail 2400

As you can see, EKS now offers two distinct control plane modes: on the left, what we call standard mode, and on the right, the Provisioned Control Plane with multiple tiers available.

Thumbnail 2440

Thumbnail 2450

You can switch between them based on your needs. In standard mode, which remains our default option, the control plane scales up and down as you apply load to it. There is some delay when scaling across the levels we have today, and this is perfect for general-purpose workloads. With provisioned mode, you pre-allocate specific tiers for guaranteed capacity. As your workload requirements evolve, you can adjust your control plane configuration accordingly. It's very easy to create a new cluster in provisioned mode or update an existing cluster to it: you pick the capacity tier during creation, or you can do it during an update.
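As a purely hypothetical sketch (the session doesn't give the exact API, so the flag and tier names below are assumptions, not the real CLI):

```bash
# Assumed shape: move an existing cluster onto a provisioned tier.
aws eks update-cluster-config \
  --name my-cluster \
  --control-plane-scaling-config '{"tier":"tier-2"}'
```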

I want to dive into the specific capabilities each tier offers for clusters. Each tier is engineered around three critical aspects that directly impact your workload's responsiveness and stability. First is API request concurrency, which determines how many operations your cluster can handle simultaneously. Think of this as your cluster's ability to multitask, whether you are rolling out deployments, scaling applications, or handling health checks. Higher concurrency means faster, more responsive operations.

Pod scheduling rate is crucial for your workload agility. It's about how quickly your cluster can respond to scaling events or recover from disruptions. This becomes particularly vital in machine learning workflows where you need to rapidly orchestrate training jobs or scale inference services. The cluster database size is the third dimension. It ensures you have sufficient headroom for your application metadata, which is stored in etcd; this typically includes your config maps, secrets, and resources like pods and namespaces. Each tier maintains a 16 GB cluster database size, which we have found optimal for most workload patterns. These tiers represent carefully engineered performance levels that match real-world operational needs.

Thumbnail 2540

We made it really easy to consume these capabilities. With just a simple CLI command or a click in the console, you can configure your desired control plane capacity. I encourage you all to try it out on a new cluster or an existing one. Now that I've shared how we are pushing the boundaries of scale and performance with EKS, I'm excited to introduce Niall Mullen, Senior Director of Cloud Infrastructure at Netflix. Netflix operates some of the world's largest container platforms, serving hundreds of millions of customers globally. They have been at the forefront of cloud native innovation, and their journey to EKS is a story of evolution and scale. Niall, over to you.

Netflix's Migration Story: Making Scale Somebody Else's Problem

Thank you. Hi everyone. My name is Niall Mullen. I lead what we call cloud infrastructure engineering at Netflix. This is a story about how moving existing large infrastructure to managed services, having someone else do some of the heavy lifting, can be accomplished. I'm going to talk a bit about our background at Netflix, what makes us unique, different, and challenging to do this with, and our journey to EKS over the past two years.

Thumbnail 2620

Thumbnail 2630

Thumbnail 2640

Thumbnail 2650

Thumbnail 2670

We have a lot of compute at Netflix. We are super dense compared to most AWS customers. We have 300 million paying customers that we have to run a website for, which is more like a billion user profiles or users. That takes a lot of compute. We have large-scale personalization to make all of those predictions about what you want to watch appear before you even know what you want to watch. That takes a lot of compute. But even that is dwarfed by the thousands of hours of video we are shooting every week in 8K, which we are converting to run on everything from your 10-year-old Roku device to the latest 4K TVs. And it's not just this week's video; we are constantly re-encoding the entire back catalog of Netflix to use the latest codecs and to be rendered in all those resolutions. In addition, we have a rapidly growing ads business and a burgeoning gaming business, each of which is a large source of compute. So we have a lot of compute at Netflix. From talking to peer customers and other large customers at AWS, the proportion of our spend that is compute-based is 50 to 100 percent higher than many other large customers'.

Thumbnail 2690

Let us talk about the setup at Netflix. We came to the cloud 15 years ago. A majority of compute at Netflix still runs on a very mature, very well-run EC2 workflow that enables developers to build Java and Node services primarily and run them directly on EC2.

Thumbnail 2710

A large minority of compute, however, runs on a system called Titus that we delivered about 8 years ago. It hosts a majority of the actual services at Netflix, so most new code at Netflix gets written by default on this platform. Titus is a large-scale, multi-tenant container platform that effectively does container as a service. We originally built it on Mesos, and we shifted it to Kubernetes seamlessly under the hood about 5 years ago. That's the scope of what we're going to talk about today in terms of what we brought to EKS.

Thumbnail 2740

Thumbnail 2750

However, there are a lot of key differences as to why we're even more complex than that. We run 4 core streaming regions, plus a couple of additional thin regions for some of our encoding and gaming use cases. We run a regional availability model, which means when we have problems, we flip an entire region in about 5 minutes. In the middle of October, AWS had a bad day, which happens about once a decade or so. We had a bad 15 minutes and we were out of US East 1.

Thumbnail 2770

Thumbnail 2810

Secondly, we have the concept of the trough. There's a diurnal cycle in almost everybody's services, which can be as much as a 45% delta from peak to trough in many services. That's basically hundreds of thousands of CPUs sitting idle at different points of the day. You have two choices: you can judge the sweet spot for how much you purchase on demand versus how much you purchase as reserved instances, or you buy the whole lot as reserved instances and pack every cycle with all of that time-insensitive work I described earlier. We keep about 97% of our reserved instances utilized doing something. I'm not saying we run them all well, that's a different argument, but we do. The main point is we are moving a lot of compute around a lot of the time, more so than perhaps anybody else.

Thumbnail 2820

Let's talk about some of our scale. We have 4 primary regions with hundreds of thousands of containers in each of those regions. But when we talk about what we have to plan for in adopting a service like EKS, it's not just the steady-state launch rate and it's not just even the region evacuation launch rate when we're getting out of US East 1. It's when 100 million people are watching Mike Tyson get his lights punched out and we have to get out of US East 1, and that's what we have to plan for.

When we first talked to EKS about this, three years ago I think, the answer was no when we sketched out the kind of launch rates we need. I have to give kudos to Pinterest, who brought EKS on a journey through 2023 that got things to the point where, two years ago, it was, well, we don't need to double what you guys can do. So we started on our journey.

Thumbnail 2900

We don't do cluster as a service. We're different on that front as well. We run fewer than 20 production clusters across our four regions. We believe the high-percentile scale and availability story of running very large, multi-tenant clusters on very large boxes works better. But that makes us weird and different for a service like EKS, which has to cater to hundreds of thousands of different customers and all of their oddities. We have up to 10,000 machines per region, big ones, 24XLs and 48XLs. We have about 80,000 containers, or pods, terms we use pretty interchangeably, in each of those clusters.

Thumbnail 2910

etcd is still running north of 5 gigabytes, though it was more than twice that at one point. We've had to do a lot of work to bring that down; that is one of the sets of changes we had to make. We run launch rates of 70,000 containers in a 5-minute period when we flip regions.

Thumbnail 2940

So why EKS if we have all this already? One of our core engineering principles at Netflix is build only when necessary. We built all of this stuff that I've described over the last 15 years because it was necessary, because there was nothing there to do things at the scale we wanted to do it. We still want to build lots of things today, but we don't need to build what others can provide. We want to offload the undifferentiated work and make scale somebody else's problem, in this case AWS's problem.

Thumbnail 2950

Let's have a quick look at what we did after that decision. I'm not going to go into the details of this slide, but I'll talk through what we had to do. The green pieces were the integrations we had to do with EKS. The red pieces were all of the services and the existing code of ours that we got to delete as we moved the control plane to EKS.

Thumbnail 2970

Here's the one text-heavy slide we have to read through. This is the list of what we spent our time doing. We worked for about a 9-month period with EKS to get their scale to where we needed it and to do these integrations. We had to adopt EKS itself. We had a lot of changes to make around our regional control planes, which are consolidated in an account, to give EKS its own space, its own control plane, and its own VPC so we could isolate it. We had changes to our identity model, which is our own and works both inside and outside AWS; we had to integrate with IAM for the EKS control plane. And we had some CloudWatch integrations to do, since the Kubernetes logs go into CloudWatch, as well as Prometheus integrations to support that.

Thumbnail 3020

Overall, we were able to work through this with a fairly small engineering team. We took an aggressive goal: we would end up with a system where, as a container gets recreated, it launches on the new platform. We decided to migrate the entire fleet in one quarter.

Thumbnail 3030

Thumbnail 3050

Getting long-term metrics at Netflix is another challenge of mine; it's a pain. This is the tail end of that migration. We ran a little over a quarter, about 11 days over, and we migrated every last container from our existing system to launching on the new EKS control plane. There's not a lot of pain in this story, but nothing ever goes smoothly. We did find the limits. EKS doesn't like 120,000 pods inside a single cluster. At least it didn't back in March. I'm excited to see the hyperscale announcements that have been talked about today. I'm going to go and test out some of those limits.

What we find at our mutation rates is that the number of pods EKS supports is a little lower when we're churning pods at the rate we are. It was so successful that we also moved the federation layer we run above those clusters, where we choose which pods go into which cluster. Some of them need to be in the same cluster; some want to be in different clusters for availability reasons. We moved that federation layer to EKS ahead of schedule because of how smooth this experience was.

Thumbnail 3090

Thumbnail 3110

We spent this year evolving the Titus data plane to be less bespoke. It consisted of a virtual kubelet and about 60,000 lines of custom code. That was a bit too much to bring to EKS. So now it's running on the stock kubelet, and we're going through another one of those migrations, which we're racing to finish by year end, and we'll probably be about 11 days over again. But that opens the door to an EKS data plane migration in the future.

EKS Hybrid Nodes open the door for use cases we have, as we're seeing more and more edge deployments for our gaming and increasingly heterogeneous encoding workloads. EKS Auto Mode potentially lets us get rid of having to look after the OS for many of these use cases. These are all things we're looking at in the coming year. We're also experimenting with thinner containers. If any of you do the math on some of those earlier slides, you'll figure out that our containers are big, spanning many CPUs in many cases, but we're seeing more and more thin container use cases.

Thumbnail 3180

This is what we're going to work on in 2026, and hopefully we'll come back and tell a story about how we migrated the data plane too to EKS in future years. Thank you for the time and the opportunity to talk here. Looking forward to you continuing to push the boundaries of what we can support at scale. So finally, with 7 minutes to go, what's coming next? We've done a lot of work this year, but we love unsatisfied customers and we know we have more to do.

Thumbnail 3190

The Road Ahead: Using Kubernetes Without Operating It

This isn't an exact quote, but it's one that I've put together from an aggregation of talking to hundreds of customers over the last year or two. Whether it's newer Kubernetes adopters or even existing customers who are early adopters of Kubernetes, they're realizing that it's not really useful to their business to tweak every last add-on, every last setting, own every last thing. It's the same refrain: I want to use Kubernetes, it's become the standard. I want to use the tooling. I don't want to think about clusters, upgrades, or any of the hard parts.

Thumbnail 3220

Thumbnail 3230

There are pure technology companies out there whose goal is to solve these hard problems at scale, and we have customers who also need to solve these hard problems but they have different focuses. You might be in healthcare, gaming, pharmaceutical, airlines—there are all kinds of different industries out there. You don't necessarily have the time to focus on running and managing large open source projects yourself. How do you use technology without becoming a pure technology company? I think every company these days wants to say they're a tech company. But that's not really what the end core business focus is of a lot of these companies out there.

Thumbnail 3260

Thumbnail 3300

You come to a conference like this and you're seeing a lot of projects, innovations, and you wonder how do you go back and make use of that yourselves. Open source software, contributing to several small projects, running at small scale with open source, doing it yourself—that's doable. As soon as you start getting to 20 open source projects, putting that together, running at scale, you need a lot of time and expertise. What we think about at EKS is taking open standards, combining that with AWS, and accelerating your time to value. It's using all of EKS without having to manage these open source projects yourselves.

Thumbnail 3320

We're lowering the cost of entry to run these projects. Every type of workload runs on EKS these days. We just talked a lot about AI/ML, and you heard Niall talk about streaming and encoding, gaming, web applications, data processing. You name the workload, it's running on EKS, so we have a very broad and diverse customer base with different types of workloads. We're going to make sure our service works for all of those, of course. AI/ML is the hot topic these days, but there are lots of other workloads, stateful workloads in particular. Spark, Flink—we want to make those easier. How do you upgrade stateful workloads?

Thumbnail 3350

Thumbnail 3370

I think we've shown this slide for a few years now. We started with just the managed control plane. That's the really hard part of Kubernetes—running at scale, scaling and patching API servers. It's not easy to do, and that's where we started back in 2018. Over time, we've moved beyond just clusters into hybrid, running compute outside of AWS, and into managing more of the platform components with EKS capabilities, which I talked about this year. Our launches this year will continue to take on the heavy lifting of some of those components you're running in addition to the cluster, and then the developer experience.

What if there was a world where you could just give EKS your application manifest? You didn't even have to create a cluster in the first place. You just gave us the application you want to run. Let us do the heavy lifting, like Neil just talked about—a federated layer above all of their clusters that they've built to figure out where pods go. We'd love to do something like that so you can get Kubernetes without even having to think about clusters.

So what are our priorities for the next three years? One is critical workload patterns at any scale. That's really what we've talked about today. We just talked about all the different workload patterns. The scale requirements just keep getting larger. We're going to have to look at splitting out across multiple clusters because at a certain scale you just run into the limit of how large a single cluster can get. How can we make running workloads across multiple clusters easier?

AWS integrations—this is a large part of my and Eswar's job: working with the other teams within AWS and making sure they're building the right things. Kubernetes is really the front door to the cloud for a lot of the customers we talk to. They're not necessarily going through 50 AWS services; they're going through Kubernetes. The EBS driver provisions EBS. The S3 driver provisions S3. They're using AWS through Kubernetes, and it's really important for us to make sure these other services across AWS work well for Kubernetes-native customers. That's a big part of it. You'll see us working with other teams, with more integrations coming.

Meeting your workloads where they are—that slide I showed earlier shows everything from EKS Distro, which you can take and run yourself anywhere possible, on a jet, on a cruise ship, EKS in the cloud, and anywhere in between. We're working on improving our story of managing clusters on Outposts to supporting the new SKU and server types of Outposts that are coming. We will continue to make sure we meet your workloads where they are.

I wrote simplify platform building here. It may be a little controversial to say eliminate platform building. We want to just launch more and more managed capabilities so you don't have to run or need huge platform teams in order to use EKS. We really don't think you should need massive teams to run Kubernetes. Use Kubernetes without having to operate it.

Thumbnail 3540

And then lastly, accelerating the flywheel of innovation. We continue to work in the community, open source projects. Generally, I'd say our philosophy is if there's an existing standard in the open source community, we'll take it and we'll run with it. We did that with the launch of Managed Argo that you saw this year. Something like Karpenter was one where we went out into the community. There was an existing standard cluster autoscaler. We thought we could do better, and it's a big decision to say there's an existing standard and we're going to build an entirely new project. There's a meme that says there are 16 standards, we're going to go do the new one, now there's 17 standards. But in this case we actually built the new standard. Nearly every cloud provider supports Karpenter. And then there are other cases like ACK, AWS Controllers for Kubernetes, which is more of a just open source AWS-only project, but there are reasons to open source that. So we'll continue to work in the community.

Thumbnail 3590

Thumbnail 3600

Our public roadmap: I get an email every time somebody comments on it, so I check it every day. It's a way to get feedback to us. There are more sessions coming if you want to learn more about EKS capabilities or container network observability, and if you want to go even deeper than what Eswar covered today on ultra-scale performance, I believe that one's a 500-level session, one of the few 500 levels. So if you want to get really deep on some of the architecture changes, go check that one out.

Thumbnail 3630

And yeah, some resources: best practices guide, blueprints, and that's it. Thank you very much.


This article is entirely auto-generated using Amazon Bedrock.
