Kazuya

AWS re:Invent 2025 - Generative and Agentic AI on Amazon EKS (CNS344)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Generative and Agentic AI on Amazon EKS (CNS344)

In this video, Christina Andonov and Chris Splinter from AWS explain how to run AI agents and inference workloads on Amazon EKS. They cover deploying agents using frameworks like Strands with MCP servers, selecting appropriate GPU instances (G5, G6, GB200, P6) based on model size calculations, and optimizing infrastructure with Karpenter autoscaling. Key topics include EKS Auto Mode, fast container pulls with seekable OCI, node health monitoring, ALB Target Optimizer for inference routing, and the EKS MCP server for AI-assisted cluster management. They emphasize that running AI workloads on Kubernetes uses familiar tools and architectures, with specific optimizations for GPU utilization and cost performance.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Thumbnail 30

The AI Revolution: Navigating Unprecedented Change and Anxiety

You have probably heard of generative AI lately. Most likely you hear about it all the time because we do live in the AI technology revolution. For those of us old enough, we have actually lived through many technology revolutions in the past. When I was a kid there was no internet, phones were not smart, and clouds were certainly not made out of Linux servers. And yet here we are, where all these technology revolutions have changed the world forever as we know it to the point where every single organization today has a website, a smartphone app, and some cloud presence.

Here we are in the AI technology revolution. Like all the other ones, this one will change our world forever as we know it. Yet this one feels so much different than anything we have experienced in the past, doesn't it? Why is that? Well, for one, it is because everything is happening so fast. We have been really feeling that fast pace of innovation since 2023, and if you think about it, it is still 2025. We are still here.

Thumbnail 60

Thumbnail 100

Thumbnail 110

Thumbnail 120

Not only is everything happening so fast, there is not just one thing happening. If we were to zoom in, there is a new and better and faster foundational model coming up roughly every two days. There is model inference and agentic AI and MCP, and I do not know what next week is going to hold. The human brain is not designed to ingest that much change in that short period of time. If what you are feeling right now is anxiety, well, congratulations, you are a human and the rest of us are right there with you.

My name is Christina Andonov. I am a Senior Specialist Solutions Architect at AWS, and in today's session I will give you enough directional clarity to navigate this overwhelming technology revolution. If you were to step out of this room with your AI anxiety down by one notch, I will call that a success. Let us get started.

Thumbnail 170

Choosing Kubernetes for AI: Control, Portability, and Unified Platform Benefits

When you come to AWS and you have a business use case in mind to bring to life, you encounter our portfolio of agentic AI and inference services. The first question that you have is which one do I pick? And the answer to that is always the same. It depends. It depends on your preferences and it depends on your business requirements. But here is how you choose.

Thumbnail 220

The more to the left you go on this chart, the higher up the stack we manage for you. The more to the right you go on this chart, the more control you have over the underlying infrastructure. We see customers choosing Kubernetes for both their agentic and inference workloads for three primary reasons. The first reason is exactly that control over the underlying infrastructure. Because with that control comes the ability to tune and optimize your workload so that you get the best cost performance out of them.

The second reason is because it is Kubernetes and it is portable. It runs on multiple clouds. It runs on-premises, and we have heard loud and clear that portability is top of mind, more so with agentic AI workloads than it was in the past with business applications. The third reason is that Kubernetes is your one-stop shop platform for all your workloads: your business applications, your agentic applications, your inference, and fine-tuning workloads.

Thumbnail 280

When customers approach this, they usually come from one of two directions. Option one is they start with the model, they want to inference and fine-tune it, and then they progress to agentic AI. Or they start building their agents and later on decide whether to run and fine-tune the model on Kubernetes.

Thumbnail 310

Thumbnail 320

Thumbnail 350

We're going to take the latter in this talk, but you can choose whichever one you'd like. In the first section, we're going to cover how to run your AI agents on EKS. Gartner predicts that by 2028, 33% of all enterprise software will have agentic behavior and that 15% of all day-to-day work decisions will be made by agentic applications. In other words, agents are the new business application with one key difference. Whereas traditional software can solve problems where deterministic behavior is required, AI agents can solve problems where reasoning is required.

Thumbnail 370

Thumbnail 400

Building AI Agents on EKS: From Weather Tools to Production Architecture

For this example, we are going to get a travel agent up and running on Kubernetes. We're going to start with the weather agent. We want to get to a point where this agent gives us an itinerary of activities we can do based on the weather. Alice here would ask what's the weather in San Francisco because she wants to go to San Francisco. The first thing that any agent has to do is talk to an LLM to figure out whether it should answer the question straight up or if it needs more information to answer that question.

Thumbnail 440

Thumbnail 450

How agents talk to LLMs or access additional information is through agentic frameworks. Agentic frameworks are simply Python libraries. There are many open source Python libraries out there. We're not going to cover them in this talk, as there are plenty of talks that cover them. For this talk, we're going to use the Strands agentic library that AWS open sourced back in May of this year and has used internally for a few years before that.

Thumbnail 480

Thumbnail 490

With Strands agents, these frameworks are just libraries. It takes just 5 lines of code. You import the Strands library, you instantiate your agent, and you ask the question. As you can see, this is just Python code. Like any code, you can containerize it and push it to your container registry and deploy it on EKS. Now that your agent is deployed, Alice can ask what is the weather in San Francisco. Maybe you need a chat interface for her to ask that question, and the agent will say in December the usual weather in San Francisco is highs of 60s and lows of 50s. However, it doesn't know what the weather in San Francisco is next week because it needs access to the weather forecast.
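
A minimal sketch of those five lines might look like the following. This assumes the strands-agents package is installed and that default model access (for example, Amazon Bedrock credentials) is configured in your environment; treat it as illustrative rather than the exact code from the slide.

```python
# Minimal Strands agent, assuming the strands-agents package and
# default model access (e.g. Amazon Bedrock) are already configured.
from strands import Agent

agent = Agent()
result = agent("What is the weather in San Francisco?")
print(result)
```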

Thumbnail 520

Thumbnail 530

In order for this agent to give you the weather next week, it has to talk to a weather API. How agents talk to external APIs and databases is through tools. Tools are just capabilities of these agentic libraries. For example, in the Strands library, you will import the tool capability. On the left-hand side here, I have a traditional application that has a function that calls a weather API. To turn that traditional program into an agentic one, you simply put one line of code, the @tool decorator, in front of that function. The agent will then decide whether or not to use the tool.
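
A rough sketch of that pattern is below, assuming the strands-agents package; the weather API URL is a hypothetical placeholder standing in for whatever forecast service you actually call.

```python
# Turning a plain weather function into an agent tool with the @tool decorator.
# The weather API endpoint is a hypothetical placeholder.
import requests
from strands import Agent, tool

@tool
def get_weather(city: str) -> str:
    """Return the forecast for a city from an external weather API."""
    resp = requests.get("https://api.example.com/forecast", params={"city": city})
    resp.raise_for_status()
    return resp.text

agent = Agent(tools=[get_weather])
agent("What's the weather in San Francisco next week?")
```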

Thumbnail 570

Thumbnail 590

So now if Alice asks what's the weather in San Francisco, the agent can determine whether we need the weather now or next week, so it's going to go and call the weather API. It is a good idea to go one step further here and wrap that tool in an MCP server. The reason you would use an MCP server is that once you start building more than one agent, all these agents have to talk to the same API, and it's a good idea to consolidate API and database calls in their own MCP servers.

Thumbnail 620

Thumbnail 640

How you do that is you import the MCP library, which integrates with pretty much any agentic framework out there. You split out the tool, wrap it in an MCP server, and then on the other side you put in the code for the client to call that server.
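
As a sketch of the server side, here is the weather tool split out into its own MCP server using the official MCP Python SDK's FastMCP helper. The transport choice and the weather API URL are assumptions for illustration; on the agent side, an MCP client would simply connect to this server's URL.

```python
# weather_mcp_server.py -- weather tool wrapped in a standalone MCP server
# using the MCP Python SDK's FastMCP helper (transport and API URL are assumptions).
import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather")

@mcp.tool()
def get_weather(city: str) -> str:
    """Return the forecast for a city."""
    resp = requests.get("https://api.example.com/forecast", params={"city": city})
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    # Serve over HTTP so agent pods in the cluster can reach it as a Kubernetes Service.
    mcp.run(transport="streamable-http")
```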

Thumbnail 650

Thumbnail 660

To build a fully fledged agent, you just need a couple more things, and you can use auxiliary AWS services for them. You can use Amazon Cognito to build your authentication layer. You can use Amazon S3, which integrates with the session manager, and Amazon DynamoDB to store your agent's long-term and short-term memory.

Thumbnail 680

Thumbnail 710

The last thing before you put this agent in production and actually while you're developing is to put some observability. The three pillars of observability that you're very familiar with from your business applications apply here as well: logs, metrics, and traces. There's one difference though. Traces here become more important than usual because of that non-deterministic behavior. Traces are your go-to place to see what path that agent took.

There are two open source libraries that have become very popular for instrumenting your agents: Ragas and Langfuse. With Ragas, you can evaluate the quality of the agent's responses. With Langfuse, you can feed those logs and metrics into traces, and it will give you the round trip and a few other useful metrics.
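
As a rough illustration of the tracing side, here is what wrapping an agent call with Langfuse's observe decorator could look like. This assumes the langfuse package is installed and the LANGFUSE_* environment variables (keys and host) are set; the import path differs slightly between SDK versions.

```python
# Tracing an agent invocation with Langfuse's observe decorator (sketch).
# Assumes langfuse is installed and LANGFUSE_* environment variables are set.
from langfuse import observe  # older SDK versions: from langfuse.decorators import observe
from strands import Agent

@observe()
def answer(question: str) -> str:
    agent = Agent()
    return str(agent(question))

print(answer("What is the weather in San Francisco?"))
```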

Thumbnail 770

We covered the libraries, we covered the protocols, we covered the observability, and I hope by now you notice that there is a trend. You don't have to learn a whole new universe. It is very familiar to what we already know. In fact, if we were to take that agent and put it in EKS, we can have a very familiar architecture diagram for the agent. This here looks like pretty much any other service that you would run on Kubernetes, meaning you're going to use the same pipelines, same tools, same everything that you've already built.

Thumbnail 800

Running Models on Kubernetes: Selecting the Right GPU and Capacity Strategy

As I mentioned earlier, you start asking the question: well, when should I start running the model on Kubernetes as well? And the answer, of course, as with all the previous answers, it depends. But we see customers fine-tuning and training their models on Kubernetes for three primary reasons. The first reason is that the model for your use case was trained on the generic knowledge of humankind, very common knowledge. You in your organization have some very business-specific deep knowledge. If you were to take an open source model and augment it with that custom knowledge that you have, chances are you're going to get much better responses out of that model.

Thumbnail 880

The second reason for running the model on EKS is the law of physics. When you put the model next to the service, you get lower latency between the service and the model. The last, the third reason is that at scale, as I mentioned, you can tune and optimize your underlying infrastructure so you can get the best cost performance out of it.

We'll cover both inference and fine-tuning workloads. Actually, we're going to cover mostly inference workloads and I'm going to pinpoint the differences of fine-tuning. The first difference is that the workload pattern of these two is different. Inference workloads have a variable pattern with your traffic, whereas fine-tuning workloads are a job and you need a steady capacity.

Thumbnail 930

Thumbnail 940

Thumbnail 950

One thing that these two workloads have in common, unlike agents, is that they run on GPUs. When I think about GPUs, I think of tomato plants and what tomato plants need to thrive, and what you need to build a lush tomato garden. You need to pick the right tomato seed. You need to plant it in the right soil. You need to give it proper care, and you need to figure out how to do all that at scale.

Thumbnail 970

Thumbnail 980

Thumbnail 1000

Starting with picking the right seed, in other words, how do you choose the right GPU for the job? You start with the model of your choice that you want to either run inference on or fine-tune. This is a model on Hugging Face. You go and check the size of the model. In this particular case, this model is 40 gigabytes. Then we're going to do some back-of-the-napkin math. We're going to add a few gigabytes for KV cache and token generation memory, and we're going to pad it with another one or two gigabytes. This is a rule of thumb for inferencing, and that will account to roughly 45 gigabytes of GPU memory that you need to run this model.

Thumbnail 1060

Our G6 family instances start at 48 gigabytes of GPU memory, so you can fit this model easily on a G6. If you would like to fit it on a smaller G instance or another generation, you can quantize it, which will cut the 45 gigabytes down. Let's say quantization reduces it by half: that would be about 22 to 23 gigabytes, which fits on a G5 instance with 24 gigabytes of GPU memory. For fine-tuning, you're going to do similar math, and using quantization techniques such as QLoRA, you can also reduce the GPU memory you need and fit the job on a G instance.
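
That back-of-the-napkin math can be written down as a tiny helper like the one below. The KV cache and padding figures are illustrative assumptions mirroring the rule of thumb from the talk, not exact sizing guidance.

```python
# Back-of-the-napkin GPU memory estimate; KV cache and padding values are
# illustrative assumptions, not precise sizing guidance.
def estimate_gpu_memory_gb(model_size_gb: float,
                           kv_cache_gb: float = 3.0,
                           padding_gb: float = 2.0,
                           quantize_factor: float = 1.0) -> float:
    """Rough GPU memory needed to serve a model of the given on-disk size."""
    return (model_size_gb + kv_cache_gb + padding_gb) * quantize_factor

print(estimate_gpu_memory_gb(40))                        # ~45 GB -> fits a G6 (48 GB GPU memory)
print(estimate_gpu_memory_gb(40, quantize_factor=0.5))   # ~22.5 GB -> fits a G5 (24 GB GPU memory)
```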

Thumbnail 1080

In other words, from our portfolio of services, we can choose the G5 and G6 instances to inference or fine-tune this model. Once we have selected the instance or the instance families that we want to run, the next question is how do we purchase them? How we purchase them is somewhat similar to how we purchase any other instance. On-demand, savings plans, and spot all apply here. We have heard that capacity has been tricky in the past. Because of that, if you have production workloads, you might want to make sure you have that capacity. You can do on-demand capacity reservations. On-demand capacity reservations integrate with your savings plans and on-demand, so you just have reserved capacity and you can cancel those reservations whenever you're finished with the workload. Then you can do capacity blocks. Those are prepaid capacity blocks from 24 hours to 28 days, and you can reserve that capacity.

Thumbnail 1150

Thumbnail 1180

Thumbnail 1200

Planting GPUs in EKS: Karpenter, Storage Optimization, and Two-Minute Targets

After picking the right seed, we said we're going to plant that seed in the right soil, meaning putting it in the EKS cluster and the networking. There are multiple ways that you can hook up a GPU to an EKS cluster, and we recommend two of them, and I'm going to cover both. The first one is Amazon EKS Auto Mode. Auto Mode comes with our open source Karpenter autoscaler already managed out of the box for you. We run it on AWS infrastructure where we run the control plane for you, so you can just spin up a cluster. It comes with the Karpenter API available. The second option is for you to manage open source Karpenter yourself and install it in the cluster.

Thumbnail 1250

Once you have Karpenter installed in the cluster, the next step is to create a Karpenter node pool. With the Karpenter node pool, you can specify whether you want to use spot or on-demand instances for inference. What Karpenter does at this step is check all the availability zones at its disposal and knows the spot pricing in each availability zone. If you specify that you want a GPU that is G series and greater than generation 4, it will check the spot prices for all the G5s and G6 instances and bring the most cost-efficient instance up and ready for you.

Thumbnail 1270

Thumbnail 1300

To do this with open source Karpenter, you need a couple of things to be different. Auto Mode's default node class supports GPUs out of the box, so there is no node class you need to install on Auto Mode. With open source Karpenter, you should bring up an EC2 node class that is a GPU node class, and we highly recommend you use one of our EKS accelerated AMIs. Auto Mode uses the Bottlerocket version of these AMIs by default. They come with everything you need for the GPU to be hooked up to the cluster. For open source Karpenter, you have a choice: you can use the Bottlerocket version or the AL2023 version. For AL2023, you need to install the device plugin at this time.

Thumbnail 1350

Thumbnail 1370

Thumbnail 1380

Thumbnail 1390

Once you have the cluster with the node pool ready to go and you know which model you want to inference or fine-tune, you want to get this model as close as possible to that cluster. You can put it in one of our storage solutions, such as Amazon S3. A good recommendation here is to use an S3 VPC endpoint so the model gets downloaded through the internal network. The same applies to the vLLM container image. You want to make a copy of that image as close as possible to your cluster, which is Amazon ECR, the Amazon Elastic Container Registry, and you want to use an ECR VPC endpoint.

Thumbnail 1410

Thumbnail 1420

Thumbnail 1430

Thumbnail 1440

To run that model, you need to create a deployment pointing to the image and specifying where the model is located. When you apply this with kubectl, Karpenter will spin up a GPU instance. The image will download via the internal network, and we have some optimizations that allow us to get that image downloaded rather quickly. Then the model will load into the GPU. This process is what our customers always want to optimize. One more optimization you can do is use an open source project from Run:ai (the Run:ai Model Streamer) to cache the weights and stream them into the GPU.
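
Once the serving pod is up, clients inside the cluster can call the OpenAI-compatible API that vLLM exposes. A minimal sketch follows; the in-cluster Service URL and the model name are hypothetical placeholders that depend on how you deployed the server.

```python
# Calling a vLLM pod through its OpenAI-compatible endpoint (sketch).
# The Service URL and model name are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://vllm.default.svc.cluster.local:8000/v1",  # hypothetical in-cluster Service address
    api_key="unused",  # vLLM does not require a real key unless you configure one
)

response = client.chat.completions.create(
    model="my-model",  # whatever model name the vLLM server was started with
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
)
print(response.choices[0].message.content)
```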

Thumbnail 1460

Thumbnail 1470

For inference, you want the cycle from when you have an instance up, downloading the image, and loading the model to fit in under two minutes. Why two minutes? Because of the variability of traffic, you want to be able to utilize our spot capacity. Our spot interruption notice is two minutes, which means you want to have the next instance up and running within two minutes. For fine-tuning, because you have the capacity already there and instances are up, you still want to optimize how fast the container loads and how fast the GPU loads. If you optimize that, chances are you are also optimizing the checkpointing because you are optimizing the storage layer.

Thumbnail 1530

Thumbnail 1550

Caring for GPUs: Monitoring, Auto-Repair, and Scaling Inference Workloads

We now have GPUs planted in the cluster, and the next question is how we care for them. First, we need to make sure we observe what they're doing and how fast we can get a token out of the model. We observe our throughput, meaning how many tokens per second that GPU can handle. We should also monitor the health of those GPUs, including temperature and power. GPUs are hardware and, like other hardware, maybe even more so considering the temperature requirements, sometimes they fail. When they fail, you need to take remediation steps.

Thumbnail 1580

Earlier this year, we announced node health monitoring and auto-repair for GPUs. Auto Mode comes with node health monitoring and auto-repair ready for you and configured so that if there is a hardware failure on the GPU side, it will automatically detect it and take a remediation step. Ten minutes after that, it will either restart your GPU or replace it. These are open source projects. If you're not using Auto Mode, I highly recommend you install them, tune them, and manage them in your clusters so that your GPUs can be well taken care of.

Thumbnail 1630

Thumbnail 1640

Once you have all that down, the question is how do I scale my inference workloads. Let me take a step back here. Out of the box, Kubernetes scales CPU workloads through the Horizontal Pod Autoscaler, which integrates with CPU and memory metrics. This means I can tell the HPA to scale up when a certain CPU threshold is reached. There is no out-of-the-box metric yet to scale on GPUs. Instead, you can expose a custom metric through an open source metrics adapter and pass that metric into the HPA to scale up and down.

Thumbnail 1690

Thumbnail 1720

Thumbnail 1730

Thumbnail 1740

Thumbnail 1750

If you're scaling, you're probably already using inference frameworks to distribute the load over multiple GPUs. The most common inference frameworks we see are vLLM, Ray, and Dynamo, and all of these frameworks already come with metrics. You can take a metric, pass it to the HPA, and scale your GPUs up and down. You can expose the model endpoint with some of our services. A few other things you should monitor here are exactly that: how well scaling up and down behaves, and GPU utilization. Now that you've seen how to choose the right GPU, add it to the cluster, care for it, and scale it, I think the takeaway is that EKS and the AWS ecosystem give your GPUs the best growing conditions. With that, I'll hand it off to Chris to cover recent launches.

Thumbnail 1790

EKS AI Workload Trends: Millions of GPUs and New Instance Support

Thanks, Christina. Everybody hear me? Thumbs up. All right, cool. Hey everyone, I'm Chris Splinter. I'm on the EKS product team and my focus on the EKS product team is helping customers run AI workloads on EKS. I have to start by saying that the thing I love most about my job is getting to work with folks like you and customers on all of the really cool things that you're building on EKS.

Thumbnail 1820

In these coming slides, I'm going to cover some of the trends that we see across various customers and I'll also touch on the recent launches that we hope can make your work lives a little bit easier. To start with a level set, while EKS itself is not classified as an AI service, we see an incredible amount of AI workloads running on EKS. The stat that you see on this slide here is that every week we have millions of GPU powered EC2 instances running in EKS clusters, and that metric has more than doubled since 2024. This really shows the affinity between Kubernetes and this whole AI adoption trend that we're seeing in the industry at large.

Thumbnail 1890

The fact that Kubernetes is at the center of this AI trend really isn't that surprising. If we look back over the last 10 to 20 years, a lot of the innovation has happened in open source. Many folks at AWS, myself included, believe that Kubernetes is very well positioned to be that foundation for AI workloads going forward because of its open source roots, because of its vibrant community, and of course also because it's a great technology with its extensibility and the fact that it's massively scalable. Gartner seems to agree, and they predicted that by 2028, 95% of new AI workloads will run on Kubernetes, a metric that's up from less than 30% today. That's a lot of AI workloads running on Kubernetes, and I'm excited to see how customers use this going forward.

Thumbnail 1920

I wanted to share just a few of the customers that are running AI workloads, so inference, training, fine tuning, as well as agents on EKS. You'll probably see a few familiar names up here. If there's one thing that I've certainly noticed, it's that this AI adoption trend is really affecting all industry verticals, all customer sizes, and all use cases, and I think that's only going to increase as we go into the future. From the EKS side, it's been both a privilege and also really challenging to run some of the world's largest training and inference workloads.

Thumbnail 1960

Running AI on Kubernetes is not all roses, and it certainly does come with challenges. The top challenge that we hear frequently from customers is just the ability to get their hands on the right GPU in the right region with the right size at the right time. We have folks working every day at AWS to solve this problem, and it is something that we very much look forward to improving as we go forward in the future. There are also unique cost optimization and GPU utilization challenges that come with running AI workloads on Kubernetes, and I think these two challenges go hand in hand. As you drive utilization up, you're optimizing your cost because you're getting the most out of the GPU instances that you're running in Kubernetes.

This really underscores the importance of making sure that the entire Kubernetes stack, including the tools that you use to manage your clusters, are GPU aware and GPU native so that you can do things like efficient auto scaling both up and down. You can monitor your GPU utilization when it's spiking and which teams are driving that GPU utilization. As I'll touch on in a few slides here, we're very much focused on making sure that you have the right primitives to be able to manage your costs and also optimize your utilization. When you layer in this complex landscape of both hardware variants as well as software, we've seen even teams that are really good at running Kubernetes struggle to bring AI on top of Kubernetes because this space is just moving so fast and there are unique things for running models and serving models on EKS and also Kubernetes.

Thumbnail 2080

All of these things are what we're laser focused on solving, myself, my team, as well as across the board at AWS. I'll go through a few of the ways that we've been chipping away at some of these challenges. One of the ways that addresses that first challenge I mentioned, which was GPU capacity, is that we've launched several new GPU instances this year and also increased the volume of GPU instances that are available. Going through the ones that are listed on this slide here, these serve both the largest, most demanding AI workloads and also the smaller scale, single business unit use case type of workloads.

At the top of the list here, earlier this year we launched support for GB200, and at re:Invent this morning, we announced support for GB300. These are really meant for your largest, most demanding AI training and inference workloads. These are powered by NVIDIA Grace Blackwell GPUs, and they also enable multi-node GPU-to-GPU communication via EFA and NVLink.

Earlier this year we also launched support for the P6-B200 and P6-B300. These are also NVIDIA Blackwell GPUs suited for medium to large scale training and inference, with up to 2x the performance of the previous generation of P instances. On the lower-scale side we launched support for the single-GPU p5.4xlarge, which is an NVIDIA H100 GPU for small to medium inference workloads, as well as fractional GPU instances, the G6f instance types. All of these come with EKS support at launch.

Thumbnail 2200

It is an ongoing treadmill for us on the EKS side, making sure that when EC2 launches these new instance types, you can rest assured that those are going to be supported by EKS from day one. What does it mean to have EKS support for these GPU instances? It means that we are pre-qualifying and validating the full stack of GPU drivers, kernel modules, software packages, and bundling that up in AMIs that you can use off the shelf. These can give you confidence that when you use these AMIs, they are going to work in your EKS clusters instead of having to piece together all of the various versions of the different components that you need to run these instances in EKS clusters.

One of the benefits of these AMIs that I do not think gets talked about enough is that by building all of this stuff into the host, into the OS image itself, you do not have to install those components at runtime. Coming back to the GPU utilization and cost optimization challenge, this is going to cut down your scaling time and cut down your time to workload because all those components are already on the node when the instance boots up. They are all baked into the EKS AMI.

Thumbnail 2280

Recent EKS Launches: DRA, Seekable OCI, Karpenter Enhancements, and ALB Target Optimizer

This year we have continued to update these AMIs with the latest version of the kernel, Kernel 6.12. We also bumped the version of containerd to 2.1, as well as the NVIDIA driver version, which is now on the latest 580 NVIDIA LTS driver version in those AMIs as well. Now I want to talk about one of the upstream Kubernetes features that I am really excited about. It is called dynamic resource allocation, or DRA.

If you have been running AI workloads on Kubernetes or on EKS, you are probably familiar with device plug-ins. Device plug-ins have traditionally been the way to allocate extended resources or devices to workloads that are running on your cluster. DRA is a new take on that. What DRA brings is a much more flexible and expressive language so that platform teams can define the types of devices that are available in the cluster and application teams can request those devices via workload definitions.

One example that this new API and resource model enables: when you are running GPU workloads, you often want the GPU to be on the same PCIe root complex as your network device to maximize network throughput. You can do that with DRA through the CEL (Common Expression Language) selectors it exposes. I am very excited to see what the device vendors do with this. We enabled this in EKS as of Kubernetes version 1.33, and it reached GA upstream in 1.34. It is still fairly early days for DRA. The various vendors, NVIDIA, AMD, and us on the EFA side, are very much working right now to build DRA support at the driver level so that you can transition to this as the new way to expose GPUs to your applications when you run them on Kubernetes.

Thumbnail 2380

This next one here is fast container pulls, and specifically Seekable OCI, or SOCI for short. If you have run inference or are running models on EKS, you have probably experienced the pain point of waiting a few minutes for that container to pull down from the registry and start up on the node. The reason for this is that many inference frameworks are very large: PyTorch, vLLM, SGLang, whatever you are using. If you are storing your model in the container as well, making it even larger (like the 40 gigabyte example shown earlier), it could take a really long time to pull from a container registry onto the node.

What we introduced with SOCI is a new snapshotter mode called parallel pull and unpack. This makes the container pull from the registry a concurrent process, as well as the unpack on disk a concurrent process. It's really just utilizing the underlying network and disk infrastructure available to get that container down from the registry onto the node and started up faster.

Thumbnail 2470

One of the nice things about SOCI is that you don't have to change anything. It works with OCI-format container images and should work with your existing build processes. In fact, we enabled it by default in EKS Auto Mode when you use GPU or Graviton instance families. If you're using these instance families with Auto Mode, you should now be seeing faster container image pulls.

On the compute side, we also had to take another look at Karpenter and EKS Auto Mode to make sure that the nature of these AI workloads works well with the provisioning and auto scaling. The first thing we had to add was support for EC2 capacity reservations in Karpenter and Auto Mode. You can use these by specifying your capacity reservation ID, which can be an On-Demand Capacity Reservation or an ML Capacity Block, and use those with Karpenter and Auto Mode.

We also had to add static capacity provisioning. Both for inference and fine-tuning, there's often a baseline of traffic and compute that you know you need to serve the workload, and then inference in particular can be bursty beyond that. The way Karpenter works, by looking at pending pods and spinning up instances based on those pending pods, didn't really fit well with the inference, fine-tuning, and training patterns. So we added static capacity provisioning, which allows you to define a node pool with a set of nodes as your baseline. Karpenter will spin that up without even looking at pending pods. Customers used to deploy dummy pods or balloon pods that Karpenter would provision instances to host, but you no longer have to do that. You can use static capacity provisioning now, and it's enabled in both Karpenter and Auto Mode.

The last thing I want to talk about is node overlay. As Karpenter selects instance types, it uses the EC2 instances API to learn about what those instances have available from a resource perspective and how much they cost using the fleet API. With node overlay, you can pass additional information to Karpenter that it will use in its instance selection for a given pod. This is really helpful for AI workloads because customers often have custom pricing agreements or need to select nodes that have certain huge pages or other resources available on them. Node overlay is an additional thing you can pass to Karpenter that it will use in its instance selection. This is currently only available in Karpenter, although we do plan to bring it to Auto Mode in the future.

Thumbnail 2650

Now I want to talk about the control plane. One of the key launches we had here at re:Invent was EKS Provisioned Control Plane, which allows you to select from a new set of higher tier EKS control plane sizes in a self-service way via API. This is important for AI workloads because if you're running large scale—we're talking hundreds of nodes and high traffic use cases—you can pre-provision your EKS control plane so that as you scale up your nodes and traffic, you can have confidence that the EKS cluster will be able to handle that load. Another really nice use case is if you have a launch coming up or a peak event like Black Friday, when you know you're going to have a surge in traffic, you can use Provisioned Control Plane to scale up the EKS control plane before your event and have confidence that the EKS cluster will be able to handle the load when it comes in during your launch or during your peak event.

Thumbnail 2720

One of the things that we've been focused on for a while in EKS is making sure that customers can use EKS and run Kubernetes in an easy way no matter where they need to. Last year at re:Invent we launched EKS hybrid nodes. This allows you to use your on-premises or edge capacity as nodes in EKS clusters. We've seen a really strong pickup and affinity for customers who said, "I bought these GPUs last year. I just want to run them in my same EKS cluster that I run all my GPUs in AWS." They use hybrid nodes to do that now.

Thumbnail 2790

Another interesting use case that we've seen along the lines of GPU utilization is bursting to AWS from on-premises, and also vice versa, bursting to on-premises from AWS when your primary capacity pool gets completely consumed. We've had customers like Flawless AI, highlighted on this slide, really improve training times while also reducing their operational overhead by using EKS hybrid nodes. Now to touch on a feature that was implemented outside of EKS but is very relevant for customers running inference on EKS in particular: this feature is called ALB Target Optimizer.

At its core, ALB Target Optimizer changes the way ALB has balanced load in the past. Traditionally ALB is very much a push model, where the load balancer pushes requests down to the targets with either a round-robin or a least-outstanding-requests algorithm. With Target Optimizer, it flips to a pull model. There's an agent running on each node that lets the load balancer know when it's available for work, based on a max concurrent requests setting. Why is this important for inference? Inference workloads typically have a much lower concurrent request rate than your normal web service workloads.

You can configure, let's say a max concurrent request of one or two or ten. ALB is going to use that information to route the requests. This is another way that we've seen customers drive up that GPU utilization while also reducing the error rates based on the load that's incoming for the inference workloads. Another use case where this is really interesting is we've seen customers that are running CPU-based workloads alongside their GPU-based workloads. With Target Optimizer for the CPU-based workloads, you can say, "OK, these have a much higher concurrent request." For the GPU workloads they have a lower concurrent request. Those can be running on the same EC2 instance, and ALB is going to use that information that it's getting from the agent to route those requests efficiently.

Thumbnail 2900

The Future of AI on EKS: MCP Servers, Intelligent Automation, and Next Steps

Now I want to touch on something that we launched for folks that are building AI agents. I've talked to several customers who have platforms they built on EKS and have run on EKS for a number of years. They're looking to use AI as a way to make their troubleshooting, observability, and SRE processes more efficient. We launched the hosted EKS MCP server at re:Invent this year. It's currently in preview, but this is really something that you can use with those AI agents to get up-to-date contextual information about your EKS cluster fleets.

You can do things like look up pod logs, Kubernetes events, and CloudWatch metrics. We also enabled write operations in the MCP server. I generally recommend customers approach with caution. Definitely start with the read side, start with the observability side, and then transition into those mutating operations as you go. This is available in all of your AI-assisted IDEs. It's also available in Q Console and Kiro. The one thing that I want to call out here about the security model with this is there is a local MCP SigV4 proxy.

What that is doing is it's taking the IAM credentials that are configured on the local client. It's passing those through to the hosted EKS MCP server. Those are getting passed through down to Kubernetes RBAC. So it's very much integrated in how you use any other AWS service. The hosted MCP server looks and feels like any other AWS API. We host it, we scale it, we keep it.

Thumbnail 3020

You don't have to worry about that if you're going to use this with your AI agents. All of those things that I just talked about really add up to EKS being a trusted and reliable way to run your AI workloads. One thing that was announced recently at KubeCon North America in November was the Kubernetes AI conformance program. On the EKS team, we're proud to be one of the providers that was included in that first set of validated Kubernetes providers. Features like DRA that I talked about earlier, gang scheduling, auto scaling, and observability—all of that which you need to run your AI workloads is there in EKS.

Thumbnail 3060

I've talked about a lot of the stuff that we launched recently. Now I want to connect that to how we're thinking about AI workloads on EKS broadly and also looking a little bit into the future. A lot of the things that I touched on here, and to Christina's analogy earlier about planting the right seeds, we're at the foundational level of making sure that you have the right things that you need to reliably run AI workloads on EKS. This includes the work that we're doing to support new EC2 instance types, the work that we're doing in the EKS AMI, and making sure that all of the things that are happening upstream get into EKS very quickly for you so that you can have that reliable foundation and reliable infrastructure to build upon.

Thumbnail 3110

We're also very much focused on making sure that all of the tools and automation that you're using today extends to and is adapted for GPUs and AI workloads. There are a lot of different pieces of this stack, from observability with the node health monitoring agent and EKS CloudWatch observability metrics, to making sure that Karpenter is adaptable to the types of workload patterns that we see with these AI workloads. We have been making a lot of progress on making sure that all of those things that you're using today for your normal workloads, you can also use them for these AI workloads. We're going to continue to make sure that across the board with all of these tools and all of these features in EKS, they also work for your AI workloads that you run on EKS.

Thumbnail 3160

With those things in place, we also see opportunity to start moving up the stack and providing things out of the box so that you don't have to sift through a jungle of open source projects just to run gang scheduling with topology awareness on EKS, so that you don't have to write your own homegrown load balancer just to run disaggregated or distributed inference, and so that you don't have to be concerned about giving your agents that are running on EKS code execution privileges. These are all the things that we just want to be part of EKS so that as you all transition to running more and more AI workloads, you have these things out of the box. They're fully supported by us, and they're fairly easy to use so that you don't have to piece all the puzzle pieces together yourself.

Thumbnail 3210

And then lastly, the holy grail: what I consider to be intelligently automated. This is about EKS increasingly handling the things that cause you pain today. Upgrades are really painful, and we need to make those smarter in EKS. Proactive cluster alerts, so that the cluster is telling you what's going wrong and ideally how to fix it. This of course needs to come with the right levers so that you can tune how much you want EKS to intervene and do things on your behalf, but this is very much where I see us going on the EKS side: using AI within the EKS service to make your lives easier overall. That's a little bit different than running AI workloads on EKS, but it's something that across the board on the EKS service team we're very excited about.

Thumbnail 3270

With that, I will pass it off to Christina to bring us home. In summary, we covered the delta between running your regular business applications and your AI applications on Kubernetes, how to use agentic AI to manage your Kubernetes clusters, how to run your inference workloads, and best practices for fine-tuning those workloads as well.

Thumbnail 3320

Thumbnail 3350

Chris covered some of the recent investments we're making in Kubernetes. It will sustainably take your business to the next level. It is flexible enough that you can run CPU workloads side by side with GPU workloads. With Kubernetes these workloads are portable, so you can move them where you need them. It will scale with your business needs, and if you don't need it to scale and you just need static capacity, it does that too. By tuning and optimizing it, you can get the best cost performance out of your GPUs, so that your organization, alongside your website, your smartphone app, and your cloud workloads, can solve problems for your customers that were unsolvable before using AI.

Thumbnail 3380

Thumbnail 3400

Where to go next and learn more, we have our workshop series for inference and agentic AI. We run those virtually every single month. We update them every single month. You can come back as many times as you'd like. If you'd like to get started on your own, you can use our AI/ML EKS user guide or our Terraform blueprints that are set and ready for you to create your cluster and optimizations.

If you're sticking around for a couple more days here, I highly recommend these three related talks. Check them out tomorrow and Thursday. With that, if you do walk out with your AI anxiety lowered by one notch, I call the session a success. Thank you very much.


This article is entirely auto-generated using Amazon Bedrock.
