AWS re:Invent 2025 - Ops in the AI age: Innovating together for faster, more efficient operations

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Ops in the AI age: Innovating together for faster, more efficient operations

In this video, AWS Vice President Nandini Ramani presents innovations in cloud operations and AI observability at re:Invent 2025. The session features customer stories from PGA Tour and Warner Bros. Discovery, demonstrating real-world AI implementations. Key announcements include CloudWatch generative AI observability with zero-code OpenTelemetry instrumentation, CloudWatch Application Map for automatic service discovery, AI-powered CloudWatch Investigations for root cause analysis, and MCP servers for IDE integration. The presentation showcases how AWS helps customers monitor agentic AI applications, troubleshoot with AI assistance, and simplify operations through features like cross-account log centralization, CloudTrail aggregated events, and mobile Real User Monitoring for iOS and Android.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Welcome to re:Invent 2025: Cloud Operations in the Age of Agentic AI

Please welcome to the stage the Vice President of Search, Observability, and Cloud Ops at AWS, Nandini Ramani. Hello everyone, and welcome to re:Invent 2025. I'm excited to kick off the very first innovation talk of 2025. Every year I look forward to this session, along with my team, because it motivates us to look back on the year we've spent innovating and adding new features to address cloud operations for all of you.

We have been innovating with you not just this past year, but we have been doing this since the birth of the cloud. In the last two decades, we've seen some major paradigm shifts in technology ranging from virtualization to containers to serverless and so on. We've been on this journey with you, and now we are on to the next tectonic shift, agentic AI. We are living through the next big transformation, and just like compute, databases, and storage, AI will be a foundational part of everything you do.

Thumbnail 110

AI has disrupted customer experiences to the point where our end customers now expect all of their experiences to be faster and smarter, leveraging AI. I know many of our customers, many of you here, have already deployed AI, or you are at least using some of it to meet this demand. Having said that, we do know that generative AI adoption varies by industry, by vertical, and by other factors depending on what you are working on. According to one report from McKinsey, 88% of organizations are already using AI in at least one business function.

Thumbnail 140

More than a third of you have either already fully scaled or are scaling and deploying AI. Another third of you are running pilots, and the rest of you are still experimenting. Many of you, especially those in highly regulated industries, might be moving with more caution, and understandably so. This is why compliance and security are super critical when it comes to generative AI, and this is where AWS can really help you.

We have done this for you when it comes to infrastructure for decades, and we will do the same for you when it comes to AI. So no matter where you are on this adoption curve, we want to meet you exactly where you are, whether it comes to cloud or AI. As many of you know, we at AWS love telling our stories through the voice of our customers, and this time around, I want to kick off this talk by introducing one of our customers who uses AI to enhance their fan experiences.

Thumbnail 200

Thumbnail 210

Thumbnail 220

Thumbnail 230

PGA Tour: Delivering AI-Powered Shot Commentary to Golf Fans

What is the PGA Tour? It's a traveling technological masterpiece spread across 200 acres with over 120 cameras and 36 radar trackers collecting data on more than 32,000 shots, every player, every swing, every week. Using the power of AI to give fans what they want, anytime, anywhere. Amazing.

So please welcome David Provan, VP of Digital Architecture at PGA Tour, to the stage. David, welcome. Thank you. Good morning, guys. Thanks so much for coming out here today. I'm excited to talk to you about what we do at the PGA Tour. If you haven't heard of the PGA Tour, the video has hopefully fixed that for you. Please go and download the app immediately and rate it five stars. We'd really appreciate that.

Golf is unlike any other sport you're going to consume on a week-by-week basis. We don't play inside a fixed set of white lines. We don't play in the same stadiums. We play in these amazing natural venues throughout the world, venues that change and that deal with weather. Instead of one field, we have 150 acres of entirely playable area.

Each week, 156 players all play asynchronously against each other, which leads to up to 14 golf balls at any one point in the air being tracked in real time. Like we said in the video, over 31,000 shots are captured each week on the PGA Tour.

Thumbnail 310

So what do we do with all of this? I'll talk about the AI solutions we've built, but first I want to talk about how we think about them operationally, because at its core, the PGA Tour is an operation that runs golf tournaments. That's what we do. Week in and week out, we are running golf tournaments throughout North America and some other parts of the world, and we are building, installing, running, tearing down, and putting back up again. We are an operations company at our core.

So we approach AI and we think operationally first. How do we score and validate things that happen in real time? If we're not first, we're last in golf. How do we keep a human in the loop at the right point in time? Everyone talks about shift left, DevOps shift left. We took operational thinking and shifted that left. From the beginning, how can we approach operational thinking in what we do and what we built?

Thumbnail 360

What we built was this, our AI shot commentary system. For those of you who don't know, we provide a product each week on our digital platforms called TOURCAST. TOURCAST is a full 3D rendering of a shot-tracked golf course. People spend between 30 and 60 minutes in this experience, which makes it one of our most engaged platforms to build upon. We really wanted to build a commentary system, for me in particular, that didn't just tell you what was happening but told you why it was relevant and why it mattered. It's all very well to say a player scored a goal, but without any additional context you lose the interesting fact that goes with it, right?

So our commentary is generated in two elements. We generate a fact. Scottie Scheffler hit the ball 175 yards, he is 4 feet from the pin, and then we add context into that system. He has a 98% chance of making this putt for a birdie and will end up in third position on the leaderboard, which will move him above the FedEx cut line. And that is really taking a point in time and making it relevant to our fans.

Building Operational Excellence: Real-Time Data Processing and Validation

All of this is powered by our cloud operation and how we build it. We're going to walk you through how that system works. Firstly, we talked about at the beginning, there's those radars. There's a golf course collection system where we're literally gathering tons of data, like terabytes of data each week off the golf course. All of that folds into our cloud-based scoring system, which is doing coordinate realization. It's working out which player hit it, where did it go, it's updating statistics in real time, what's their new strokes gained value, what's their new statistical values.

Thumbnail 450

Thumbnail 460

We then built on top of that our AI shot commentary system. What this does is every shot on the golf course is ingested in real time. We analyze 72 million statistical records to find the most relevant one and then present that back over to the shot commentary system in TOURCAST. While that left to right flow is super exciting and we're really proud that it works, what really underpins it is the operational dashboard underneath. How do we know it works? How are we sure that it's operating fine each week on the PGA Tour?

So we're using synthetics, we're using insights, we're using alarms, log analysis to give our operations team, which are really the beating heart of what we do at the tour, confidence that it works. There is not a better feeling in the world than when the golfer hits that first tee shot at some ungodly time in the morning because we start golf at first light to see everything flow through the system and work. It's a really powerful feeling when it works and a mildly terrifying feeling when it doesn't work.

Thumbnail 510

So let's talk about the CloudWatch dashboard that powers the AI shot commentary system, and we're going to zoom into some elements here to help you understand how we're taking these AI problems, scoring validation, human in the loop, and production first thinking and actually realizing those. First one I'll show you is our validation score. So for those of you who grew up in England, which is me, we have this concept in driving tests of minor faults and major faults. Our scoring is analyzed against this when the commentary's generated. Is this a minor issue or a major issue?

Thumbnail 540

If it's just a minor issue, we'll push it through, or maybe we'll regenerate it. The blue line here shows commentaries with zero errors; they went straight through to the UI with no human involvement. The orange shows one error, usually a stylistic language issue, so it still goes through. Anything more than that goes into review; we don't put it in the UI, we hold it. About 98% of our commentaries go through without further involvement. When this doesn't work, we can then see what those failures look like.
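To make that routing concrete, here is a minimal Python sketch of such a minor/major-fault gate. The field names, limits, and routing labels are illustrative assumptions, not the PGA Tour's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ValidationResult:
    minor_faults: list[str] = field(default_factory=list)
    major_faults: list[str] = field(default_factory=list)

def validate(commentary: dict, scorecard: dict) -> ValidationResult:
    """Check generated commentary against the authoritative scoring data."""
    result = ValidationResult()
    # Factual mismatches (score, par, yardage) count as major faults.
    for key in ("score", "par", "yardage"):
        if commentary.get(key) != scorecard.get(key):
            result.major_faults.append(f"{key} mismatch")
    # Stylistic issues count as minor faults and can still ship.
    if len(commentary.get("text", "")) > 280:  # length limit is an assumption
        result.minor_faults.append("text too long")
    return result

def route(result: ValidationResult) -> str:
    """Mirror the gate described in the talk: zero errors publish straight
    through, a single minor error still ships (or is regenerated),
    anything more is held for human review."""
    if result.major_faults or len(result.minor_faults) > 1:
        return "hold_for_review"
    if result.minor_faults:
        return "publish_or_regenerate"
    return "publish"
```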

Thumbnail 550

In the dashboard view again, the top panel shows commentaries that failed validation for some reason: the score was wrong, the par was wrong, the yardage was wrong, something like that. Underneath is my favorite one, because it sits there and ticks through the commentaries that are working, the ones you can see going through on the platform.

Thumbnail 570

Troubleshooting in Production: Learning from Failures and Improving Response Times

Now I want to show you what happens when it doesn't go well, which does happen sometimes too. The first chart on the left is really just standard LLM timing; we like flat lines, because things should look consistent in timing. This one is from the Bank of Utah Championship a month ago. We had not had an event on the PGA Tour for about a month; we had the Ryder Cup, and no real PGA Tour events after the Procore Championship. Scoring starts up, and then comes that moment in the pit of your stomach when something doesn't work: no shot commentary. And this is a sponsored feature for us; it's high profile.

At that point in the traditional model, you go into debug logging: how do I analyze this, how do I pull this out, how do I get a developer to come look at this? Because we'd shifted that thinking far left, we were able to pull up our Amazon SQS queue and realize very quickly that we were receiving commentaries but not generating them; our queue was not being processed. As opposed to what might traditionally have taken a full day, digging into it, finding the right person, getting them online, and getting it fixed, the first-line support team just restarted the cluster. My favorite line in CloudWatch is the one that goes straight up, because that means you've fixed it. You see everything sync back together and run in parallel.

We then learn from this. We put in new metrics, we put in synthetics, and we're now sending control messages through Amazon SQS so we can keep things operating. We learn and we get better. We're not perfect, but it's one of those things where real life happens. How do you respond to it? In our sport, if we don't fix it immediately, the backup just builds. Those golf shots keep coming, so the quicker we can resolve it, the better for us.

Thumbnail 650

As a result of this, we talked about 31,000 shots, and each of those shots costs us a penny to generate for shot commentary. That's the Amazon Bedrock fees, that's all the stuff that goes with it, the CloudWatch fees, one penny per shot. It's really important for us that we can do this thing cost effectively. We've also added zero extra humans to monitor this system. It's folded in with our traditional operational support team. No extra bodies, we just added dashboards, alerts, and reporting into our existing setup. It's all in CloudWatch, which we use already.

Apart from that issue a month ago, the system has been highly available. My favorite thing on a Thursday morning is to watch everything just work. It's the greatest feeling in the world, and when it doesn't work, it sucks. But fortunately, it's been pretty good. With problem resolution, we get a much quicker response time when we build dashboards constructively and thoughtfully and involve the developers in that process right at the beginning, moving that decision tree down.

But we're not done. We are relentless in trying to do better, and we're currently working with AWS on piloting how we can use AI investigations to really help us change how we do log analysis. For those of you doing this today, probably a traditional thing is you're logging out JSON objects, logging out error codes. Quite often that has to go to a developer, like hey Steve, we've got this error code, what does this mean, or it's a giant knowledge base document I've got to fill in and understand.

We're now experimenting with logging the error message as a human-readable string: hey, I failed the ingest because the scorecard isn't up to date. We write it out at the developer level. The developer knows what the issue is because they wrote the code, so they log the actual English description of the transaction as the log statement, because it turns out LLMs are really, really good at summarizing that and telling you what the issue is.
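Here is a small sketch of that idea; the function, fields, and message are hypothetical, but they show the shape of the technique: log the plain-English reason rather than an error code.

```python
import logging

logger = logging.getLogger("ingest")

def ingest_shot(shot: dict, scorecard: dict) -> None:
    # Traditional style: an opaque code the first-line team must look up.
    #   logger.error("ERR_4102", extra={"shot_id": shot["id"]})
    # Human-readable style: the developer, who knows why this branch
    # exists, writes the failure out as plain English. An LLM (or the
    # on-call engineer) can read it directly, no knowledge base needed.
    if shot["hole"] not in scorecard.get("holes", {}):
        logger.error(
            "Failed to ingest shot %s: the scorecard for player %s is not "
            "up to date yet, so hole %s has no entry to attach the shot to.",
            shot["id"], shot["player"], shot["hole"],
        )
        return
    # ... normal ingest path ...
```

The point is that the string itself carries the diagnosis, so an LLM summarizing the log stream, or a first-line responder reading it, doesn't need a lookup table of error codes.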

What that's allowing us to do, as opposed to when an issue comes up and we've got to bubble up through level one, level two, level three, and you get to the guy at level three that fixes it in five minutes but took you an hour to get there, is to push that decision making way down. We're able to help those guys in the first line, the guys that have really taken the bullets on the front line, and let them move it quickly. As excited as I am with the stuff we've done to date with AWS, I'm actually more excited for where we're going to go next in this operational journey that we're doing on the PGA Tour. With that, I will hand you back to Nandini.

The Operational Challenges of Agentic AI: Trust, Complexity, and Data Explosion

Thank you David. It's truly inspiring to see PGA Tour, just like many of you, approach AI with the same customer obsession that we have here at AWS. Now, 2025 is the year when agentic AI moves from experimentation into production. These agents no longer just respond, they think, they act, and they even make decisions on your behalf.

Thumbnail 840

Now, while agents certainly help you create efficiency and make things a lot simpler, they have also created certain operational challenges. Let's think about it. First is trust and safety. Visibility into your AI agent's behavior is super critical. Think about, let's say, a customer service agent that you've created, and a customer reaches out regarding some billing issue. Now the agent, in a matter of seconds, makes several decisions. Does it handle the matter on its own? Does it escalate to a human specialist?

The question when it comes to observability is no longer, did the agent work. We all know the agent works great; the question becomes how and why the agent chose to handle the situation the way it did. Second, operational complexity is only increasing. You are now managing microservices and, in addition, distributed agents. You have event-driven architectures, and on top of that you are managing all of this across dozens or hundreds of accounts and multiple regions. You want AI to help you not just with applications but also with making operations more efficient. Third, data explosion.

Thumbnail 920

We have seen this over the last several years, and now your AI agents work 24/7 and they handle thousands of customer requests every hour. Each of these requests is generating an immense amount of telemetry throughout the day. That's not just more data. It's exponentially more surface area that you now have to both monitor and secure. We have been working to make these things simpler for you so that we handle the heavy lifting of cloud operations.

Thumbnail 950

Now as complexity grows, our mission at AWS stays the same. We simplify cloud operations for each and every one of you. At AWS, we pioneered cloud computing, but we are not stopping there. Through every shift in technology, we have always been there for you, removing the undifferentiated heavy lifting and building new capabilities that work well together. We will continue to do that whether you are all in on AWS or in hybrid environments or even multi-cloud. Whether you're using autonomous agents or still deploying traditional microservices, we handle operations for every one of you. This is our commitment.

Today I want to show you exactly how we are delivering across these three pillars. First, monitoring and trusting your agentic AI applications. When an AI acts, it does so autonomously. You need visibility into what the agent is doing and the why behind it. Second, operating smarter with AI. As complexity grows, we want AI to help you with your operations. Third, simplifying overall operations, and I'm talking about traditional observability for databases, containers, and so on and so forth, not just as it relates to AI.

Thumbnail 1050

Monitoring AI Agents: CloudWatch Generative AI Observability and Application Map

Let's dive into the first pillar. As I mentioned earlier, you now have agents that are smarter and taking actions on their own, on your behalf. Picture this: let's go back to the same chatbot that's handling customer support issues. Your agent is now making decisions in a matter of seconds, decisions you may or may not agree with. Agents are right a lot. We know this. However, they are not always right. What if an agent makes the wrong call, the wrong decision, and turns away your customer? That will impact not just your end user experience but also your brand reputation. This is why observability is now the control plane for your trust, safety, and accountability.

Traditional monitoring tools can tell you that the chatbot responded to your customer request in, let's say, 2 seconds, and it did so with a status code of 200 maybe. That's great, so you know the agent is working, it's running, but it doesn't tell you the why behind what decisions the agent is taking. AI agents can create hidden issues that legacy tools cannot see. Don't get me wrong, performance metrics are always important. You saw them in David's dashboards. You're going to see them every day, and they're super critical. However, they do not give you the entire picture when it comes to AI.

In October, we announced the general availability of generative AI observability in CloudWatch. This feature helps you monitor, troubleshoot, and optimize your AI applications and workloads. It simplifies a few things for you. First, you need to make zero code changes. This is because we leverage OpenTelemetry, which enables automatic instrumentation and makes everything super simple if you haven't tried it yet. It also allows you to use any agent development framework of your choice: Amazon Bedrock Agents, LangChain, LangGraph, CrewAI, and many more. Or you can choose to instrument manually yourself using the OpenTelemetry SDK.

Second, this works no matter where you choose to host your agents. You could choose, for example, Amazon Bedrock AgentCore, self-managed Kubernetes, hybrid environments, or multi-cloud. It just works. You get complete visibility out of the box into latency, token usage, and performance across all of your AI workloads.

Third, you can pinpoint issues anywhere in your AI stack. So let's go back to that billing issue we were talking about. With end-to-end prompt tracing, you can now trace that single customer from the initial complaint they made, through every model invocation, every tool execution, every knowledge base query, all the way to the final response from the agent to your end customer. Think about it: that's a series of events you can track no matter where you choose to host your agent.

Now you can see exactly how the agent made the decisions it did, so you have entire visibility into the decision tree that the agent is taking. However, we know that agents do not operate in a silo. They integrate with other services, applications, and other dependencies across your entire estate, which would be across multiple accounts and regions. So when it comes to troubleshooting, it's challenging to understand where all of these resources are and how they're interacting with each other.

Thumbnail 1290

That's why we launched CloudWatch Application Map. This requires no manual configuration. Application Map will automatically discover and organize all of your services, including uninstrumented services. I think this is key. And the map organizes it based on exactly how you operate, based on your application, your teams, and even your business units. It correlates and analyzes metrics, logs, and traces and offers audit findings that simplify your root cause analysis.

Demo: End-to-End Visibility Across AI Agents and Infrastructure

Now, when that customer service agent we talked about is responding slowly, you will not only know which service is degrading, but also why. So with that, we are now at an exciting part of this presentation where you actually get to see this in action. I'd like to invite to the stage someone you have all known for over 20 years. Please put your hands together for Jeff Barr, VP at AWS and Chief Evangelist. Jeff, come on stage and show us how it works. Thanks.

Thumbnail 1360

Alright, great to be here. I almost broke my leg coming in, but I'm alive and well, happy to report. It's been super fun to work with this team to prepare this content. As I think about it, we've gone from agents as a cool idea we could think about, to building them, to them being an essential part of our business. And now they're so important that we actually need to monitor them, maintain them, and make sure they're working the way we expect.

So we see these agents going from proof of concept to production. They connect with your customers 24/7, so you have to monitor them and maintain them, and make sure you're maintaining your customers' trust. What I want to do today is show you a bit about how that actually works. Let's take the example of a pet clinic. The pet clinic is growing and expanding its digital footprint; they've got medical appointments, nutrition recommendations, and more.

Thumbnail 1420

And these customers are pretty demanding. They want a near real-time experience. They want instant confirmation when they're booking an appointment at 2:00 a.m. for their sick pet. So to deliver on these expectations, they've got three agents. They put all these agents into use this year, and they run on completely different hosts serving different functions.

They've got a customer-facing orchestration agent. It's running on Bedrock AgentCore using the CrewAI SDK. This is the front door: it handles the customer inquiries, routes requests, and coordinates with the other agents. Second, the appointment scheduling agent. This one is built with LangChain and runs on EKS, and it manages real-time availability. It handles cancellations, sends reminders, and integrates with the clinic's existing scheduling systems.

Thumbnail 1470

Then there's a third one. This one's a nutrition recommendation agent. It analyzes pet health data and it recommends supplements, and it's hosted on Bedrock AgentCore using the Strands SDK. This one reviews medical history, it looks at and considers breed-specific needs, and it makes personalized product recommendations. So we've got our three separate agents, and they're here to help this pet clinic scale.

They handle thousands of simultaneous customer interactions. They make real-time adjustments based on appointment availability and health data. But there's a real challenge: the operations and AI teams need visibility across all three agents, and we saw how they're built on different environments and platforms. They need to know not just that each agent is running, but that it is actually making good decisions and delivering a consistent customer experience.

So we want to give you observability across your agents, across any framework, and anywhere you might want to host them. And of course we want to make it easy for you to start, so we set this up so that the implementation takes just two steps. First, you auto-instrument your application: you install the AWS Distro for OpenTelemetry SDK, which we call ADOT for short. Second, you enable Transaction Search in the CloudWatch console, just one click, and this lets you receive telemetry from the ADOT SDK, which gets sent to the CloudWatch OTLP endpoints. Real easy.
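For a Python agent, that first step can look roughly like the sketch below. The environment variables follow the AWS Distro for OpenTelemetry documentation, but treat the exact values, the model ID, and the service name as assumptions to verify for your own setup:

```python
# Step 1: install the ADOT SDK and run under zero-code auto-instrumentation:
#   pip install aws-opentelemetry-distro
#   export OTEL_PYTHON_DISTRO=aws_distro
#   export OTEL_PYTHON_CONFIGURATOR=aws_configurator
#   export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
#   export OTEL_RESOURCE_ATTRIBUTES=service.name=pet-clinic-orchestrator
#   opentelemetry-instrument python agent_app.py
#
# Step 2 is the one-click part: enable Transaction Search in the CloudWatch
# console so the OTLP endpoints accept the telemetry this process emits.

# agent_app.py: note that nothing below references OpenTelemetry at all;
# spans for the AWS SDK calls are created by the auto-instrumentation.
import boto3  # Bedrock calls made here are traced automatically

def handle_inquiry(text: str) -> str:
    client = boto3.client("bedrock-runtime")
    response = client.converse(
        modelId="us.amazon.nova-lite-v1:0",  # example model ID, an assumption
        messages=[{"role": "user", "content": [{"text": text}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```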

Thumbnail 1570

Thumbnail 1590

So once instrumented, you get unified visibility that goes across your complete AI fleet, no matter where you build and no matter where you host them. You can see your entire fleet at a glance. You can see runtime sessions, runtime invocations, CPU consumption, what we can think of as the curated set of golden signals to know that your generative AI is working as you expect. But we don't stop at just the high-level dashboards. You can dive deep into the individual agents to see sessions, see traces, and metrics. So for each of these agents, you get again these golden curated metrics that are specifically designed for your generative AI workloads. For example, things like throttling rates and error rates, and the idea is these golden metrics will help you to understand if the agent is performing as you expect.

Thumbnail 1620

Thumbnail 1630

Thumbnail 1640

Now beyond the metrics in the dashboard, you can also drill down to view all sessions for any agent. Lots of great detail here. Here's an example. You click into a session, you see all the traces, you double-click on any trace to look at the full trajectory map. Your nutrition service calling medical history processing, checking pet types and inventory, so you see exactly how the requests flow through your system. So let's say you see an error in the trajectory map. Well, you just click on it to see exactly what went wrong. So in this case, the nutrition service can't find the pet type, and the agent expected a rabbit, but the data for the rabbit wasn't actually there. And so we can actually see the failure point right away, right there in situ in the code.

Thumbnail 1670

Thumbnail 1680

That's not all. You can also drill into the chat span and you can see the entire exchange. You can see the system prompts, the user prompts, and the tool messages, all the details you really want. You see exactly how the customer experienced your agent, and you can validate the error message, so you get full visibility into how your AI agent is actually doing the reasoning. Now one really important aspect of this, if your users share some sensitive information like names and emails, you can mask that and you can protect that PII by simply turning on the CloudWatch data protection policies, and you'll see that that's masked out right there in the log info.
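As a hedged sketch of what turning that masking on can look like programmatically (the console does the same in a few clicks), here is the CloudWatch Logs data protection API via boto3. The log group name is hypothetical, and you should verify the managed data identifiers you need against the CloudWatch documentation:

```python
import json
import boto3

logs = boto3.client("logs")

# Mask emails and names in a log group's events at display time.
identifiers = [
    "arn:aws:dataprotection::aws:data-identifier/EmailAddress",
    "arn:aws:dataprotection::aws:data-identifier/Name",
]
policy = {
    "Name": "agent-pii-policy",
    "Version": "2021-06-01",
    "Statement": [
        # Data protection policies pair an audit statement with a
        # de-identify (mask) statement over the same identifiers.
        {"Sid": "audit", "DataIdentifier": identifiers,
         "Operation": {"Audit": {"FindingsDestination": {}}}},
        {"Sid": "redact", "DataIdentifier": identifiers,
         "Operation": {"Deidentify": {"MaskConfig": {}}}},
    ],
}
logs.put_data_protection_policy(
    logGroupIdentifier="/aws/bedrock-agentcore/pet-clinic-orchestrator",  # hypothetical
    policyDocument=json.dumps(policy),
)
```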

Thumbnail 1700

Thumbnail 1710

Alright, so maybe you're thinking: this is great for my agents, but these agents don't work in isolation. They're part of a bigger picture. They're calling microservices, querying databases, maybe pulling real-time data from across all of your infrastructure. For example, your appointment agent needs payment processing, your nutrition agent needs inventory systems, and your orchestration agent needs all of them working together. So you need visibility into the entire operational health of your application, because when one of those things fails, your whole agent fails, and when one of those agents fails, your customer experience fails. There's a cascading reaction there.

Thumbnail 1740

So this is where the CloudWatch Application Performance Monitoring, or APM, comes in. We help you troubleshoot the operational health across your entire application stack. So a lot of cool stuff happens here. We automatically discover your application topology, both the instrumented and the uninstrumented applications, and you can also just progressively, as time permits, add instrumentation where it matters the most, all with just a few clicks.

Thumbnail 1780

Thumbnail 1790

Thumbnail 1800

So let's go back to our pet clinic example. Let's imagine customers are experiencing issues with the nutrition agent, where the agent fails to recommend supplements for a pet rabbit. The agent appears to be running, but something in the underlying infrastructure is broken, and you are on call. You need to find it fast. So you kickstart your troubleshooting directly from the online pet care business unit, or you can go with any other grouping based on how you organize your services, into production and development environments and so forth. You dive in, drill down to all the applications within that business unit, and filter by the failing services to highlight exactly what's broken. You select the failing Pet Clinic front end, and you see what we call the complete blast radius. Sounds kind of dangerous, but that's the phrase we often use, the blast radius. We trace the failure from the front-end service to the visit service and finally down to payment processing. Each hop, and these are hops not related to the rabbits we talked about earlier, shows you exactly where the request failed and why. So you get all of that flow within your topology.

Thumbnail 1820

Thumbnail 1840

Thumbnail 1860

The next step, fixing issues, is even easier with automatically generated operational audits for your applications. This built-in view correlates monitoring data across your infrastructure, so you get a head start on root cause analysis before you even begin your manual investigation. But think about this: your application needs more than just operational health monitoring. Beyond what I've shown you so far, you need to understand the actual end user impact. That's where real user monitoring and synthetics come in.

Thumbnail 1870

Thumbnail 1880

Thumbnail 1910

We connect real user monitoring, which we call RUM, and synthetics to your application performance data, and you get a complete view across all the layers, from infrastructure health to agent performance to the actual customer experience. As you innovate with these agents, and as I talk to many of you I realize you're making them totally essential to the way you innovate and move your business forward, we're going to innovate right there alongside you. What we've shown you today is just the beginning; we've got a lot more queued up for the rest of this hour and throughout re:Invent, and we're going to continue to work with you and build the tools you're going to need to succeed. Thanks so much, and I'll be back in a bit.

Thumbnail 1940

Operating Smarter with AI: CloudWatch Investigations and Incident Report Generation

Thank you Jeff, here we go. OK. So that was incredible. I love watching demos, and I cannot wait for all of you to try these features. They launched today; I saw them go live this morning, so I'm waiting for you all to try them. Now whether you are building agentic applications or traditional microservices, how can AI help you operate smarter? Here's the reality. Our environments are only growing more complex. As a result, you are managing more alerts and more incidents than ever before. How do you discern which alerts are important and urgent? How do you discern which incidents are sev-2, sev-1, and so on?

Troubleshooting is always like finding a needle in the haystack. We use this phrase every year, but now this haystack is getting bigger and more complex every single day. So you need AI not just to help you observe better but also operate smarter. When AI does the heavy lifting of correlation analysis, root cause investigations, it frees you up so you can innovate on behalf of your customers.

Thumbnail 2010

CloudWatch Investigations launched in GA earlier this year. It is a generative AI-powered tool that automates root cause analysis and gives you a hypothesis of what went wrong. It also guides you through troubleshooting as well as remediation. But what would be even better than solving a problem quickly is not having the problem in the first place. You want to move to a preventative mode instead of always being reactive. Now how do you do that?

So one of the things that we have launched recently to help you with that is interactive incident report generation in CloudWatch. So I want to explain why this is so critical. In my conversations with many of you, I hear all the time, how does AWS handle post-incident analysis and how do you use that to improve your own operational excellence internally? How does AWS do it? So we've had this process we call Correction of Error, and if you Google COE it shows you Center of Excellence, but this is Correction of Error, and we've been using this for over two decades.

Thumbnail 2100

So what we do: we document everything that happens during an incident, identify root causes, and take actions to prevent recurrence. All of this knowledge is then captured in a database. And with just one click, using the CloudWatch Investigations report generation feature, you can now apply the same process we have used for two decades at AWS to generate your own incident summaries, and we have also added an AI-powered Five Whys workflow.

It's the process we use internally for answering the five whys behind every incident, and the same process is now available to you with this feature. It automatically gathers and correlates all of your telemetry, your deployments, and every note and action you take during your own incidents and investigations, and it produces a detailed, actionable incident report. Here's why this is so powerful: this is not a generic template. It is tailored to your change history, your environment, and your architecture. It's customized for you so that you can build a similar knowledge base from which you can draw insights and take AI-powered preventative actions.

Thumbnail 2160

We also know that not all of your teams are in the AWS console. Developers usually are in their favorite IDEs or in AI productivity tools, so we are bringing observability exactly to where the developers are, to their workflows. We're doing this with CloudWatch MCP servers. Developers stay in their preferred tool of choice, they connect their AI agents to CloudWatch, and they have access immediately to application performance and other health information so that you can troubleshoot operational issues way faster.
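To give a feel for the plumbing, here is a minimal sketch using the reference MCP Python SDK to launch and inspect a CloudWatch MCP server over stdio. The server package name follows the awslabs MCP repository, but verify it, along with how AWS credentials reach the spawned process, before relying on this:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the CloudWatch MCP server locally over stdio. The package name
# (awslabs.cloudwatch-mcp-server) is taken from the awslabs MCP repo;
# confirm it for your setup. AWS credentials must also be available to
# the spawned process (e.g., via a profile or environment variables).
server = StdioServerParameters(
    command="uvx",
    args=["awslabs.cloudwatch-mcp-server@latest"],
    env={"AWS_REGION": "us-east-1"},
)

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            # An IDE assistant would expose these tools to its model;
            # here we just print what the agent would be able to call.
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

asyncio.run(main())
```

An IDE or AI productivity tool does essentially this handshake for you once the server is added to its MCP configuration.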

Thumbnail 2200

Demo: AI-Powered Troubleshooting with MCP Servers and GitHub Actions

But what about other workflows like pull requests, code reviews, or deployments? We are introducing application observability for AWS GitHub Actions. This meets developers exactly where they are, right in GitHub. Now you can bring your production telemetry to your code, diagnose production issues using live telemetry, auto-generate pull requests with the exact code changes you will use in production, and instrument applications directly in GitHub repos. So now that we've seen these two features, I'd like to bring Jeff back on stage to show us how they work.

Thumbnail 2250

Thumbnail 2290

All right, so the beauty of modern operations is having some choice while you can actually maintain the speed of troubleshooting. There are a lot of different ways you might be doing that. You might be troubleshooting natively in AWS, you might be building your custom agents, or you're working in GitHub. If you take away one important thing from my talk today, we really want to meet you where you are and how you work. So CloudWatch investigations is going to complement what you're already doing. It's going to leverage AI to answer questions that you might ask, like why did this alarm fire, and it's going to hopefully do it for you faster than manually connecting the dots.

Thumbnail 2300

Thumbnail 2330

Thumbnail 2340

What you do is you actually configure an alarm to automatically trigger this investigation at no additional cost, so every alert turns into an instant AI-powered analysis. By the time you actually get that alert and you log on, the investigation is ready there for you to review, so you get to the root cause in minutes instead of hours. Here's what happens when an investigation gets triggered. The AI builds an internal topology of your services and your resources, and then it analyzes and correlates the metrics, the logs, the traces, and the deployments, all those things that you might have to go to several different places and look here, there, and elsewhere to get all the right information. It pulls it all together within seconds, it surfaces some hypotheses, and even suggests some actions to accelerate your troubleshooting.
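As a rough sketch, the alarm side of this can be created with the standard CloudWatch API. Everything below (names, namespace, threshold) is an example, and the investigation trigger is shown only as a placeholder action, since in practice you would wire it up through the console or your IaC tooling:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Example alarm on an application latency metric; every name, dimension,
# and threshold below is illustrative.
cloudwatch.put_metric_alarm(
    AlarmName="pet-clinic-nutrition-latency",
    Namespace="PetClinic/Agents",            # hypothetical custom namespace
    MetricName="Latency",
    Dimensions=[{"Name": "Service", "Value": "nutrition-agent"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=2000.0,                        # ms; pick your own SLO
    ComparisonOperator="GreaterThanThreshold",
    # Starting a CloudWatch investigation when the alarm fires is a
    # checkbox in the console; expressed as an alarm action it would be
    # an ARN like the placeholder below (not a documented format).
    AlarmActions=["arn:aws:aiops:us-east-1:111122223333:investigation-group/EXAMPLE"],
)
```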

Thumbnail 2350

Thumbnail 2360

Thumbnail 2370

What you do next is you review these key insights, and they're going to show you the actual reasoning behind each of the hypotheses, and you get full transparency into how the AI actually arrived at the conclusions that it did. All right, so from here, inside Amazon we do this process you heard we call the Five Whys. We literally dive deep to make sure we fully understand the deepest of the root causes for any problem. So we've built that same thing into this feature.

Thumbnail 2400

Automated incident report generation builds on our own processes that have served us incredibly well for over two decades of running AWS, and all of Amazon actually. This is not just a generic template. It pulls from your telemetry, your deployments, and your health events, all the things that tell the actual data-driven story of what happened in your environment. The timeline events are automatically pulled from the logs and the metrics, speeding up the process of documenting the investigation, so you can focus on the learnings and then on improving your own team's operations. These reports also include the root cause analysis generated from the answers to those five whys. Again, it's the same framework we use inside Amazon; we call these the COEs, or Corrections of Error, and these COEs become somewhat legendary over time because they're effectively the repository of all the ways things have gone wrong over years and decades. We refer to them often to make sure that as we design and architect new systems, we fully understand all the lessons learned from the now historic COEs.

Thumbnail 2410

Thumbnail 2420

Thumbnail 2450

Thumbnail 2480

Thumbnail 2490

Thumbnail 2500

But we also extend these troubleshooting capabilities beyond the AWS environments. We're here for developers, SREs, and DevOps engineers, making sure that we meet you where and how you like to work. We see that teams are building custom agents using their own or popular LLMs, and they're also using IDEs such as Kiro. So let's take on a role: we're an on-call SRE and we need to fix some code, but we're not an operations expert on this particular application. The CloudWatch MCP server connects the AI models directly to all of the tools inside CloudWatch, so the SRE can ask questions in natural language. I think this is a pretty cool and amazing breakthrough: you don't need any specialized operations knowledge. So I go ahead and literally just ask Kiro a question: can you investigate why payments are taking too long in the billing service?

Thumbnail 2510

Thumbnail 2530

Thumbnail 2540

So Kiro does a couple of things. It analyzes the alarms that are firing, it gets the business SLOs, and it takes a look at some historical patterns. Then it investigates the log anomalies and correlates the metrics with the deployments. It points to the exact method call, the one that actually caused the SLI breach, and then it goes to the next step, which I think is super amazing: it suggests actual fixes and optimizations. All of this happens through natural language interaction, so there's no dashboard navigation, no query languages, no switching between pages and tools. Everybody on your team is now empowered to troubleshoot like an operational expert.

Thumbnail 2570

Thumbnail 2580

This year I've spoken to developers in, I just counted yesterday, 14 different countries, and I've been able to get a really interesting sense of what some of their bigger challenges are. One of the things they tell me, in a lot of different ways, is that they want the operational metrics and the production telemetry side by side with the source code, so they can see in a very live sense how their code is affecting operational metrics. So we're bringing this AI-powered intelligence directly into GitHub, because we know that as developers, we often spend a good part of our day in GitHub.

Thumbnail 2620

Thumbnail 2630

The new AWS Observability GitHub Action plugs seamlessly into your developer workflows, and as I'll show you, it triggers analysis with CloudWatch and the Application Signals MCP servers. Here's how it works. As a developer, I open an issue, and all I have to do is mention the AWS APM GitHub user and ask, again in natural language: help me investigate availability issues in my appointment service. The AI analyzes the live production telemetry, it has access to the source code, and it jumps in and identifies the actual root cause. Here it says to remove a line from the function3-different-version Lambda function: go to line 27, and you can see that it's raising an exception on a particular ID. So it gives you a very specific, very detailed fix.

Thumbnail 2660

Thumbnail 2670

From there, if you look at that fix and agree, and of course as the developer you have to make sure this is the right fix, you then ask the AI to implement the solution. It will automatically submit a pull request with precise fixes based on both the telemetry and the code analysis. At this point, you're in an intelligent developer partnership where observability lives directly inside your native environment. Again, with the modern operations we're talking about today, we need to meet you as a developer and operator where you work. We want to support you with your tools of choice and make troubleshooting fundamentally easier, so you can focus on what matters: building exceptional applications for your customers. Thanks again for watching.

Thumbnail 2730

Simplifying Cloud Operations: Enhanced Log Analytics and Centralized Observability

Thank you, Jeff. All right, so Jeff showed us how AI can help accelerate troubleshooting for you, but let's think about what happens when you're overwhelmed with data. AI helps you scale, but it also generates exponentially more data. This creates new operational challenges: even as AI helps you solve some of the old ones, it is also creating new ones for all of us.

Thumbnail 2770

So think about a time-sensitive event: let's say you're live streaming, or you're in a product launch, or, for some of us, getting ready in the run-up to re:Invent, and you need to scale your capacity in near real time, but your data is fragmented and you don't have one single place where you can look at your entire estate. That's exactly why we are focused on simplifying operations in the cloud. It's not all just about generative AI. You need to continue to improve how you operate, how you manage your infrastructure, and so on.

Thumbnail 2790

Thumbnail 2850

So first we want to give you faster and easier insights, and this begins with log analytics. We are enhancing Piped Processing Language, or PPL, to make log analytics faster and more accessible throughout Amazon OpenSearch Service. Now with natural language support and more than 35 analytical commands, you can pull insights from any log format using simple, intelligent filter, extract, and parse commands. For example, with just one query, you can parse your CloudTrail logs, filter by specific APIs, and correlate that with VPC flow logs. You don't need any custom parser or regex patterns, and even if you're not familiar with PPL syntax, you can start analyzing immediately by simply asking in natural language: show me the number of Lambda errors over the past hour. OpenSearch will do the work of producing the right query and the right insights.
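To illustrate, here is a hedged sketch of running one such PPL query against an OpenSearch domain from Python. The endpoint, index, and field names are assumptions about how the logs were ingested, and authentication (for example SigV4) is omitted for brevity:

```python
from opensearchpy import OpenSearch

# The domain endpoint below is a placeholder; real domains also need auth.
client = OpenSearch(
    hosts=[{"host": "search-my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

# One PPL query: read CloudTrail logs, keep only Lambda API calls that
# returned an error, and count them by error code.
ppl = """
source = `cloudtrail-logs`
| where eventSource = 'lambda.amazonaws.com' and isnotnull(errorCode)
| stats count() as errors by errorCode
"""

response = client.transport.perform_request(
    "POST", "/_plugins/_ppl", body={"query": ppl}
)
for row in response["datarows"]:
    print(row)
```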

Next, we have also expanded CloudWatch Real User Monitoring to both iOS and Android to help you troubleshoot mobile end-user applications. Now you can understand screen loads, crashes, and API latency, and most importantly, you can see how they impact your end users while they're viewing content on their phones. So you have visibility into both web and mobile applications. You can also correlate user behavior with logs and traces in CloudWatch Application Signals, which we launched last year, to give you end-to-end visibility. And with mobile support, you can narrow down the root cause to particular cohorts of users, by device type, operating system version, or even location.

Thumbnail 2940

But insights alone are not enough. For example, during a security event, you are correlating CloudTrail API calls with VPC flow logs, and IAM policy changes with resource configurations, and you're doing this across dozens of AWS accounts and possibly across regions. This is no longer just a log analytics problem; it is a security investigation problem. That is why we have simplified security monitoring by launching CloudTrail aggregated events. For example, let's say an S3 bucket gets 10,000 GetObject requests in five minutes, and you need to understand whether this is normal behavior or data exfiltration.

With aggregated events, you no longer need to scroll through thousands of JSON events. Aggregated events let you easily spot the pattern and answer critical security questions right away. You don't have to write custom queries or spin up Lambda functions to parse the high volume of logs. It will tell you that 10,000 GetObject calls were made to your S3 bucket with, let's say, an error rate of 2%, and more importantly, that there was a 400% spike during this access.

In seconds, you now know who accessed your pipelines and which resources were accessed. CloudTrail now does the heavy lifting for you by pre-aggregating this information, so you don't have to create all of that yourself.

Thumbnail 3020

Next, a lot of your data lives in silos. For example, you might have application data in one account, mobile data in another, and database metrics in yet another. When you're investigating an incident, you need to connect the dots across all of these sources. This is where centralized observability really matters. Let's say you have logs that are across development, QA, production, and even across business units. It can be a challenge for you to monitor all of this across accounts and regions, so many of you have come up with your own DIY solutions to handle this with ETL pipelines.

Thumbnail 3060

To help remove this heavy lifting for you, we have launched cross-account, cross-region log centralization in CloudWatch. You can now consolidate log data from all of those environments I was telling you about, whether it's QA, production, development, and so on, across all of your AWS accounts and regions into one single destination account. We're also expanding this cross-account, cross-region capability to support database insights. With these enhancements, DevOps engineers and DBAs can now monitor and troubleshoot across both RDS as well as Aurora databases.

Warner Bros. Discovery: Delivering Personalized Ads at Scale with Real-Time Feedback

Now, many of you are operating at global scale, serving millions of viewers across multiple regions, where centralized observability is super critical, as we said earlier. That's exactly the situation at Warner Bros. Discovery. They do this every single day, running a massive ad platform across HBO Max, Discovery Plus, and other properties. To show us how they use AWS to simplify their operations and drive real-time performance for their ad personalization, please welcome Anand Natrajan, VP of Engineering at Warner Bros. Discovery, to the stage. Welcome, Anand.

Thumbnail 3150

Thank you, Nandini. Good morning, everyone. My name is Anand Natrajan. I'm an engineer at Warner Bros. Discovery. If you're like me, you probably turned on CNN today and checked up on the news, or maybe you caught a rerun of The Big Bang Theory yesterday, or maybe Friends. Admit it, Friends again. Maybe over the Thanksgiving weekend you binge-watched The White Lotus, or you caught up with Harry Potter with the kids. If you did any of that, or if you're going to catch The Wizard of Oz at the Sphere, you're watching content from my company.

Thumbnail 3200

We are Warner Brothers Discovery. We are one of the largest media and entertainment companies in the world, and we work with AWS because they are a global partner. They too have a vast global footprint just like us. I lead AdTech at WBD, and today I want to talk about some of the exciting things that we are doing with AWS.

Thumbnail 3220

We delight more than 128 million customers in over 100 countries on our flagship streaming services, HBO Max and Discovery Plus. You can watch the Tour de France or the Olympics in Europe, cricket in the UK, or the NHL in the US. You can also watch premium scripted content like Game of Thrones or the Superman movie. Have you watched the Superman movie? If you haven't, I'm not going to spoil it for you. You'll have to find out whether Lois Lane picks Superman or Clark Kent this time. You'll have to find out whether Superman gets defeated by Lex Luthor or an AI.

Thumbnail 3300

I can't talk to you about any of that, but what I can talk to you about is how we do streaming on AWS. You see, we take some content, like the Superman movie, and we break it into thousands of little segments. Each of these segments is about three to five seconds long. We construct a list of those segments that's called a manifest. When a user, say Nandini, presses play on the Superman movie in the HBO Max app, we deliver that manifest in real time within a few seconds. The app on her device parses that manifest, pulls in those segments, and plays them in sequence, and that in a nutshell is how streaming works. What happens if Nandini is an ad-supported user?

Well, AdTech is unique in the streaming world because AdTech has two sets of customers. One is the subscriber users, people like Nandini, and the second is the advertiser. Advertisers care about these ads being effective, whereas users care about their viewing experience not being overly disrupted by ads.

What that means is that these ads are personalized. If Nandini or Jeff or David watches a Superman movie, they're going to see different ads. This time around, when Nandini presses play, we will fetch those ads for her in real time, splice those into the manifest, and then deliver to her again within a few seconds. Delivering the right ad to the right user on the right content at the right time is the heart of what AdTech does.

Thumbnail 3370

This is how we do it. When the user hits play, her request will wend its way through our backends, which includes my AdTech stack, and all these backends are built on top of a bunch of AWS services. That request will wend its way eventually to a third-party ad decisioning server on the far right there. This ad decisioning server is the one that actually picks those ads, and it's going to log its decisions, billions of decisions in a month.

Those decisions, those logs, will then wend their way back to us as a feedback loop. That's the pink arrow labeled one there. Now, that feedback loop takes on the order of hours, which is way too slow for us. So we worked with Amazon and built our own feedback loop. This time around, we take the same signals that fire when Nandini sees the ad, feed them into Amazon Kinesis, Flink, and OpenSearch, and construct our own feedback loop. That's the blue arrow marked two there. This feedback loop takes on the order of a few minutes.
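The producer side of a fast feedback loop like that can be as small as the sketch below; the stream name and event fields are assumptions, with Flink and OpenSearch doing the aggregation downstream:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def on_ad_impression(event: dict) -> None:
    """Fire-and-forget producer side of the fast feedback loop: the same
    beacon that tells the ad server an ad was seen also lands on Kinesis,
    where a Flink job aggregates delivery counts into OpenSearch within
    minutes. Stream and field names here are illustrative only."""
    kinesis.put_record(
        StreamName="ad-delivery-signals",
        PartitionKey=event["session_id"],  # keeps one viewer's events ordered
        Data=json.dumps({
            "ad_id": event["ad_id"],
            "content_id": event["content_id"],
            "status": event["status"],  # delivered | misdelivered | empty
            "ts": event["ts"],
        }).encode("utf-8"),
    )
```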

Thumbnail 3430

Thumbnail 3450

To see why this is important, let me show you an example. This is one of our OpenSearch dashboards. This is our eyes on glass. This is what we look at when we want to check whether we're delivering ads correctly. Let me just drill down into one of those charts there. What this chart is showing you is how we're delivering those ads. Those green bars represent proper ad delivery. As you can see, as the event progresses, as more and more viewers watch this content, we will deliver more and more ads, and as the event ends, ad delivery drops, which is expected.

The red and blue tips at the top of these green bars represent problems. Those represent cases where we are either misdelivering ads or not delivering ads at all. Either way, that's a problem for us because if we don't deliver ads or we deliver the wrong ads, it's lost revenue for us. You can see now why that feedback loop matters. If it takes on the order of hours for that feedback to come back to us, the event or program may have ended by the time we even know that there's a problem. A feedback loop on the order of a few minutes is crucial for us to be able to course correct and collect revenue.

Thumbnail 3510

Scaling for Live Events: Custom Metrics and Predictive Autoscaling

All of this holds doubly true for live events. With live events, there's all the challenge of showing personalized ads that I talked about earlier, but in addition, with live events we are building that content manifest as the event is progressing, as users are watching it. And often we don't even know well ahead of time when the next ad break will appear. Delivering personalized ads in live events is one of the hardest problems in streaming.

Thumbnail 3540

And as if this is not enough, when you have 100,000 viewers watching this event, they're all going to hit the ad break at the exact same time. Recall, we're showing personalized ads, different ads to different viewers. So this means that we end up having to do hundreds of thousands of these decisions at the exact same time. That imposes a huge burden on our backends, and they have to be scaled to absorb that kind of load.

Thumbnail 3570

If we don't scale or if we underscale our infrastructure, we won't deliver ads and we won't collect revenue. And if we overscale, we'll incur a huge cost. Our first attempt at dealing with this situation was to autoscale on traditional metrics like CPU and memory, but those didn't work well for us. In this chart, the green line represents our traffic increasing and dropping, and the yellow line is our infrastructure scaling, struggling to keep up.

Thumbnail 3600

So we worked with Amazon on an alternative approach. In our alternative approach, we developed a custom metric we call permits, and permits bake in much more of those business factors.

They take into account content duration, DVR window, and number of ads. We took those parameters, fed them into Amazon Managed Service for Prometheus and Grafana, and we autoscale based on that. Additionally, my teams built a predictive approach whereby we look at the rate of change of those parameters over the past few minutes and use it to predict the rate of change for the next few minutes.
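Here is a toy Python version of that rate-of-change idea; the permits formula and weights are illustrative assumptions, not WBD's actual metric:

```python
from collections import deque

class PermitPredictor:
    """Track a business-aware load metric over the last few minutes and
    extrapolate its trend to pre-scale for the next few."""

    def __init__(self, window: int = 5):
        self.samples: deque[float] = deque(maxlen=window)  # one per minute

    @staticmethod
    def permits(viewers: int, ads_per_break: int, dvr_minutes: int) -> float:
        # Bake business factors into one unit of work: concurrent viewers,
        # how many personalized decisions each break triggers, and how much
        # DVR window must be manifest-stitched. Weights are assumptions.
        return viewers * ads_per_break * (1 + dvr_minutes / 60)

    def observe(self, value: float) -> None:
        self.samples.append(value)

    def predicted(self, minutes_ahead: int = 3) -> float:
        if len(self.samples) < 2:
            return self.samples[-1] if self.samples else 0.0
        # Average per-minute rate of change over the window, extrapolated.
        deltas = [b - a for a, b in zip(self.samples, list(self.samples)[1:])]
        trend = sum(deltas) / len(deltas)
        return max(0.0, self.samples[-1] + trend * minutes_ahead)

# The predicted value would then be exported to Amazon Managed Service for
# Prometheus and used as the autoscaler's target metric.
```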

Thumbnail 3650

Now you can see our outcomes are much better. The yellow line stays above the green line. Our infrastructure is scaling. We are confident that our infrastructure is going to scale as the traffic changes. There's still work to be done here. Ideally, what we want is that the yellow line should closely hug the green line without ever going under. So there's a little bit more work for us to do here. But all in all, I'm very proud of what we have done here.

We now have a 90% fill rate, which means that when there's an opportunity to show you an ad, 90% of the time we will show you an ad. And we do this at very low latency. Trust me, if there's one thing streaming users hate, it's when you interrupt their content with ads that buffer. So we are now able to meet the requirements of high concurrency events with personalized ads at low latency with barely a dent in our uptime. And our costs are going down and utilization is going up.

Thumbnail 3710

This is great news for us. Our successes depend a lot on our partnership with Amazon. We've used several Amazon managed services: Prometheus, OpenSearch, and Grafana. I'm very excited to see where we take this further, with our infrastructure scaling and with traceability in OpenSearch, and in general, I'm excited about our partnership with Amazon. Thank you.

Wrap-Up: AWS Commitment to Cloud Operations Innovation

All right, where did the time go? I wanted to do a long wrap-up, but my clock's telling me we're already at time, so I'm going to rush through this. Our commitment to you is that we are improving our cloud operations support for all of you, especially along your agentic journey, all while supporting hybrid and multi-cloud environments. We have a lot more coming, so watch this space, and don't wait until re:Invent next year. Watch for us doing more regular launches, especially in this age of AI.

Thumbnail 3770

Thumbnail 3780

So let's quickly recap everything we covered today. We showed you CloudWatch generative AI observability and the Application Map. Jeff showed you how to operate with MCP servers in CloudWatch and GitHub Actions. And finally, we showed you how to simplify operations, including some of the features in OpenSearch as well as Mobile RUM, which Anand touched on; please try them for your end users. I want to especially thank David Provan from PGA Tour, Anand Natrajan from Warner Bros. Discovery, and Jeff Barr for joining me on stage and showing us that innovation can take many forms. We are here to support you whether you are testing agentic AI or running battle-tested, mission-critical applications. We are here to support your journey.

Thumbnail 3810

Take a picture of this QR code. There are some talks listed here, but there's a lot more to see, and so much opportunity. I want to give you one sneak peek: please attend Matt Garman's keynote tomorrow, where he has several other launches from my teams on cloud operations in particular. I guarantee you're going to love them among all the other announcements he has.

Thumbnail 3860

So with that, I hope to run into many of you during the chalk talks, breakouts, and so on. Most importantly, connect with the brilliant minds here; it's a rare opportunity that we all get together. Visit our partners, our customers, and the demo booths, check out the Grafana lab, and most importantly, have fun. I hope to run into some of you out there this week. Thanks again for spending the hour with us, both here and online. Bye.


This article is entirely auto-generated using Amazon Bedrock.
