🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Optimize for AWS with intelligent automation (AIM235)
In this video, the session explores the performance and cost challenges of agentic AI workloads in AWS and hybrid cloud environments. The speaker explains how agentic AI's unpredictable, non-linear resource consumption leads to overprovisioning, with some customers seeing only 30% GPU utilization and 10-20x excess GPU hours. Turbonomic is presented as a solution providing real-time optimization through continuous analysis of GPU instance, vCPU, and memory metrics. A case study demonstrates $13,800 in monthly savings from downsizing a single p3dn.24xlarge GPU instance to a p3.8xlarge. The BAM team achieved a 5.3x reduction in idle GPU resources and freed 13 GPUs for reallocation. The session emphasizes that visibility alone is insufficient: automated action based on insights is essential for balancing performance and cost efficiency.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
The Hidden Costs of Agentic AI: Overprovisioning and Unpredictable Resource Demands
Hello everyone, thanks for joining the session. Today we're going to talk about one of the growing performance challenges for AWS customers and hybrid cloud customers in general, which has to do with agentic AI and AI applications more broadly. Although they're very powerful, they tend to be unpredictable in their resource usage. We're going to spend some time talking about how this problem manifests, dig into the problem itself, and then discuss some potential solutions. Ultimately, you want to maintain performance for the end users of agentic AI, but also stay as efficient as possible on cost allocation so you can avoid that cost bloat.
Before I get started, I would love to get some feedback from the audience with a quick poll on where you are with adoption of agentic AI. Raise your hands if you're just exploring, just getting started with learning about agentic AI and where it might apply in your environment. How many of you are running proofs of concept, where you actually have some workloads in development? How many of you are in early production, with real users and a number of work streams? And then last, business critical, where you've got real users, you're using this at scale, and it's a big part of your ROI. It looks like most of you aren't quite at that maturity level yet, but that makes sense given what we see from our customers too.
So, expanding on the hidden cost of agentic AI, what you often see is overprovisioning and resource bloat. Primarily, this comes down to the fact that people don't make performance or resource decisions based on averages. Usually, they'll take the worst-case scenario because they're worried that performance might suffer. This leads to really conservative capacity decisions: afraid of performance degradation, teams oversize GPUs, overallocate RAM, and overprovision storage, which leaves a lot of idle cost in these expensive resources. What you'll see from a lot of teams is that they create these performance buffers and overallocate expensive resources, and then, because they're afraid of a peak that might happen, static scaling policies are put in place.
However, one of the problems with these scaling policies is that they tend to be reactive. If a peak happens, it's already too late for the scaling policies to take effect. With agentic AI and AI apps, usage is often unpredictable. It also requires a fair amount of human intervention to stay on top of this, so you're constantly reviewing those scaling policies or taking remediation steps to deal with it.
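To make that concrete, here is a minimal sketch, not from the session, of what such a static, reactive policy typically looks like with boto3; the Auto Scaling group name and target value are hypothetical. Because the policy only acts after average CPU has already crossed the threshold, a sudden agentic burst has done its damage by the time new capacity arrives.

```python
# Hedged sketch of a static, reactive scaling policy: a fixed CPU target on an
# EC2 Auto Scaling group. The group name and threshold are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="agentic-ai-asg",  # hypothetical ASG name
    PolicyName="static-cpu-target",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        # The policy only reacts after average CPU crosses this fixed value,
        # which is exactly the reactive behavior described above.
        "TargetValue": 50.0,
    },
)
```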
Talking about how agentic AI is changing how workloads behave, a lot of this comes down to the fact that traditional legacy apps typically follow a linear process. A request comes in, it gets processed, and it consumes resources, following a straight line: as more users hit the site, resources get consumed in a linear fashion. With agentic AI and AI apps, it doesn't necessarily work that way. It's very unpredictable. An agent can plan, branch, and then spawn many, sometimes hundreds, of micro workloads. You could have generative AI talking to an agent, that agent talking to another agent, agents talking to a number of APIs, all happening in the background, creating resource spikes that can sometimes bring applications to their knees.
One user request, for example, can come in and create hundreds of downstream tasks, and this leads to overprovisioning of GPUs. You see issues with throttling and spiking, and as evidenced here, we've seen customers dealing with overprovisioning, sometimes with utilization at only 30 percent of what it should be and 10 to 20 times the GPU hours that are really needed.
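As an illustration, and not something shown in the session, the sketch below mimics that fan-out pattern: a single request plans a data-dependent number of sub-tasks and runs them concurrently, so resource demand is non-linear in the number of user requests.

```python
# Illustrative sketch of agentic fan-out: one request spawns an unpredictable
# number of concurrent micro-tasks, unlike a linear request pipeline.
import asyncio
import random

async def call_tool(task_id: int) -> str:
    # Stand-in for a model or API call made by the agent.
    await asyncio.sleep(random.uniform(0.1, 0.5))
    return f"task-{task_id} done"

async def handle_request(request_id: int) -> list[str]:
    # The agent decides how many sub-tasks to spawn; the fan-out factor is
    # data-dependent, so resource demand does not track request count.
    fan_out = random.randint(5, 100)
    return await asyncio.gather(*(call_tool(i) for i in range(fan_out)))

results = asyncio.run(handle_request(1))
print(f"one request produced {len(results)} downstream tasks")
```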
Ultimately, you end up with a situation where these idle resources just sit there because you're afraid of performance risk, and they become very wasteful, a hidden tax on the organization. You really need to consider how to make your resource usage as dynamic as agentic AI and AI apps are by nature.
Talking about the gap between insight and action, observability solutions do a great job of giving you information such as whether CPU throttling is happening, whether there is GPU contention, or whether there are latency spikes. You get the observation, but it doesn't necessarily tell you what to actually do when that happens. Maybe you can do a root cause analysis and fix it after the fact, but it doesn't take action.
On the FinOps side, when you're dealing with these types of agentic AI applications, FinOps tooling does a great job of showing how much you're spending and supporting reporting for showback or chargeback. You can see the spend and allocate it appropriately for these applications, but again, it's primarily reporting. It's not meant to resize GPUs or handle these dynamic scenarios. It's really about cost allocation rather than actually solving the problem.
Even with some of these other solutions, oversizing becomes the default. Many of our customers who come to talk to us about this problem say that because they're worried about performance and maintaining their SLOs, they'll just oversize what they have in the public cloud, or in certain cases the private cloud. Static scaling is in place, but because these apps are so dynamic in the way they do their work, those policies can't keep up with the demand. So it's critically important to look at the problem from a resource perspective and have dynamic resources on the platforms that support these apps, to meet that ever-changing demand.
Turbonomic's Real-Time Optimization: Dynamic GPU Right-Sizing and Intelligent Automation
What Turbonomic does to deal with these problems is deliver real-time optimization for agentic AI workloads. We provide actions driven by continuous optimization: we analyze data coming from the GPU instances themselves, vCPU and virtual memory saturation, and storage and network throughput, and we look at these metrics over time. We then make decisions across your environment to right-size based on what we see from those metrics.
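For a sense of the raw inputs involved, here is a hedged sketch of sampling GPU utilization history. It assumes the metrics are already being published to CloudWatch (for example by the CloudWatch agent's NVIDIA integration); the namespace, metric name, and instance ID are assumptions, and this is not Turbonomic's API.

```python
# Hedged sketch: pull a week of GPU utilization datapoints from CloudWatch.
# Namespace/metric name assume the CloudWatch agent's nvidia_smi metrics;
# the instance ID is hypothetical.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace="CWAgent",                      # assumed agent namespace
    MetricName="nvidia_smi_utilization_gpu",  # assumed agent GPU metric
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=now - timedelta(days=7),
    EndTime=now,
    Period=3600,                              # hourly datapoints
    Statistics=["Average", "Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```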
We not only analyze that information but look at the entire supply chain, from the agentic AI app down to the resources supporting it, scaling up and down in real time. We look across EC2, the GPUs themselves, EKS, and hybrid environments: AWS, of course, but also on-premises workloads or workloads in other public clouds. As we do this, we make sure we maintain performance and meet your SLOs, but also keep things efficient so cost stays under control.
GPU optimization is one of the most impactful use cases for agentic AI, given how much of the resource cost comes from GPUs. As peaks come in and more resources are needed, we right-size and bring in additional GPU instances if required. As demand goes down, we right-size those instances back down in real time, along with other resources like memory, to reduce overprovisioning and deal with the challenges our customers face. Doing this continuously takes a lot of the human error out of it and lets you know these agentic AI apps will run performantly but also as efficiently as possible.
I'm going to take you through an example. This is a screenshot from our product, and what we're looking at here is a pending action, which is essentially an action the user can take, so this is an example of a manual action. Turbonomic has been analyzing the GPU usage, the GPU memory, and the instance sizing and utilization in general, and it's looking at this particular GPU instance, a p3dn.24xlarge, one of the more expensive GPU instances on AWS. Based on the utilization we're seeing, you can scale this down to a p3.8xlarge.
We're able to do that because we're looking at the utilization data and matching it against the SLOs, so we know this is a safe action to take. As you can see, from just one single GPU instance you get savings of $13,800 per month, while still maintaining performance within the SLOs.
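For context, executing a right-size like this with plain boto3 boils down to a stop, a type change, and a start. The sketch below is a minimal illustration under that assumption, not Turbonomic's implementation, and the instance ID is hypothetical; a real executor would check utilization history and SLOs first, as described above.

```python
# Minimal sketch of executing the p3dn.24xlarge -> p3.8xlarge resize with
# boto3. Not Turbonomic's implementation; the instance ID is hypothetical.
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"  # hypothetical

# An EC2 instance must be stopped before its type can be changed.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "p3.8xlarge"},  # down from p3dn.24xlarge
)

ec2.start_instances(InstanceIds=[instance_id])
```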
Getting into a bit of the detail: it's a great recommendation, but our customers and operators want to know how they can trust that the action will maintain performance, so they need to see an additional level of detail. The charts on the left-hand side show the GPU count percentile and utilization. Utilization over this observation period is around 13%, so the GPU hasn't really been used that much, even though it's tied to this EC2 instance. GPU memory utilization is hovering around 22%, again not a lot of usage. vCPU hovers around 3 to 4%. So not a ton of resources are being used for this app.
If you look at the resource impact, the red box on the right, you can see the current values and the post-recommendation projections for this action. Turbonomic is recommending taking the current GPU count from 8 down to 4 and projecting that utilization will go from 13% to 26%. On memory, it's going from 32 to 16 gigabytes, with utilization projected to rise to 44% after taking this action. This is very safe, and you're not going to see performance degradation. vCPU lands around 13%, and storage throughput and network throughput are in a similar situation.
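Those projections follow from simple proportionality: if demand stays constant while capacity halves, utilization doubles. A quick check:

```python
# Sanity-checking the projected utilizations: constant demand over half the
# capacity means roughly double the utilization.
gpu_util, current_gpus, projected_gpus = 0.13, 8, 4
print(gpu_util * current_gpus / projected_gpus)      # 0.26 -> the 26% projection

mem_util, current_mem_gb, projected_mem_gb = 0.22, 32, 16
print(mem_util * current_mem_gb / projected_mem_gb)  # 0.44 -> the 44% projection
```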
The nice thing here is that as an operator, you can look at this action and feel good about it, because you know the performance is there and we're still going to maintain our SLOs. It's not just telling you that you'll save cost and should take the action. It gives the user confidence that if they take this action, and potentially even automate it, things will move forward in a positive manner.
To wrap up this potential action, you can see the overall summary. The on-demand rate goes from $31.21 an hour down to around $12. It takes into account your RIs and Savings Plans, and the on-demand cost rolls up to around $13,800 in savings per month. And this is just from one single GPU instance, so if you have a big GPU investment, you're going to see even more savings.
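The arithmetic roughly checks out, assuming about 730 on-demand hours in a month and taking the p3.8xlarge rate as roughly $12.24/hour (an assumption; exact rates vary by region):

```python
# Reproducing the quoted savings under assumed rates and a 730-hour month.
current_rate = 31.21   # p3dn.24xlarge on-demand, $/hour (quoted in the talk)
new_rate = 12.24       # p3.8xlarge on-demand, $/hour (assumed, "down to $12")
hours_per_month = 730

print((current_rate - new_rate) * hours_per_month)  # ~13,848, close to the quoted $13,800/month
```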
Let's talk about some more detail on how intelligent automation can really transform your AWS environment. We've talked a lot about smart GPU optimization: we automatically tune those GPUs, right-sizing up in real time when demand spikes and bringing capacity back down so you're as cost-efficient as possible and can eliminate that idle capacity.
We also have real-time visibility: our application looks at all of the resources running in real time. We have a supply chain that maps the business apps themselves all the way down to the resources, and we look across those resources and how they work together, ranging from EC2 and EKS to the GPU environments. We're able to correlate those metrics and make sure that no harm is done and that those applications get the resources they need. We'll also detect when there's a potential issue or bottleneck in that supply chain; it may not be hitting one specific resource, but we'll identify it and take action proactively.
We also have orchestrator integration that understands things like pod placement, affinity, resource quotas, and scheduling, so we optimize the container infrastructure and the traditional infrastructure layers together for those business apps. We also have proactive automation: the previous example was a single manual action, but you can completely automate the actions we take, so no human intervention is required. If we identify a potential issue, we'll go ahead and take care of it, so you avoid the incidents that require an RCA and problem-solving after the fact.
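As a hedged illustration of the orchestrator-level data such an integration reads, the sketch below lists per-pod GPU requests via the official Kubernetes Python client; this is not Turbonomic's own integration, and the namespace is hypothetical.

```python
# Hedged sketch: enumerate pods and their GPU requests with the official
# Kubernetes Python client. The namespace is hypothetical.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("ai-workloads").items:  # hypothetical namespace
    for container in pod.spec.containers:
        requests = container.resources.requests or {} if container.resources else {}
        gpus = requests.get("nvidia.com/gpu")
        if gpus:
            print(pod.metadata.name, container.name, "requests", gpus, "GPU(s)")
```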
At the end of the day, part of what we do is around efficiency and cost savings, and cost allocation is a big part of that. To help you show the value to the business, we sum up individual actions and the overall ROI of all the automation that's happening and present it back to our customers, and that's available as well.
Proven Results and Next Steps: Case Studies and Implementation Strategies
I want to go through a couple of case studies that apply directly to this work. One of our internal customers is the Big AI Models team, or BAM, which supports the LLMs behind watsonx. What they needed was to improve their environment so that less manual tuning was required to look after it. They have hundreds of containers running on Kubernetes and around 100 NVIDIA A100 GPUs, and they also wanted to figure out how to pack these GPUs more densely and get more ROI out of them.
After using Turbonomic they saw some great results. They achieved a 5.3x reduction in idle GPU resources, which took the headroom from 3 GPUs to 16, so those 16 GPUs were available for other workloads. They saw a 2x throughput improvement without impacting latency, which was a big benefit for them. They needed 13 fewer GPUs and were able to allocate them to other workloads: huge savings, just from packing the GPUs they had more densely and reallocating the freed GPUs, plus the spare capacity on the remaining ones, to new AI workloads. You can scan the QR code if you want to read more about it, but it's a great example of us working together.
So there are three things I'd love to leave you with today. First, agentic AI is by its nature very unpredictable. It requires a lot of resources and it doesn't scale in a linear fashion. So when you're thinking about projects along your maturity curve of adopting agentic AI, keep this in mind: you're going to have to be sensitive to the fact that these workloads are bursty and unpredictable.
Second, visibility alone is not enough. Just having visibility into the resources behind these apps and seeing some of the issues still leaves you with a lot of manual work to support them. It's incredibly important to have a solution that takes those insights and turns them into continuous action to right-size your environment and make sure those applications have the resources they need.
Third, you can really start with one high-value workload. Look for an application where you want to prove that you can still meet performance while making GPUs more efficient; target that, scale it out, and see the results. As far as working with us goes, we'd love to look at a solution together and identify a pilot workload, whether it's agentic AI or a GPU service where performance or cost matters most, and balance both.
We also have a broad set of integrations across the hybrid cloud. We optimize a ton of AWS services, and we also cover private cloud with VMware, Nutanix, and others. We support OpenShift, hybrid environments, and other CSPs. There's also an opportunity to combine observability and automation: our sister product Instana provides even richer metrics to power our automation and make our right-sizing actions even better.
I'll just close by thanking you for your time today. I really appreciate all the feedback and your responses to the poll. Our booth is over there if you want to come get a demo and see us in action, so please stop by. Again, thank you for your time, and have a great rest of your show.
This article is entirely auto-generated using Amazon Bedrock.