🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Observability for AI Agents and Traditional Workloads (COP335)
In this video, AWS presents CloudWatch Application Signals, demonstrating how it addresses observability challenges in AI-powered applications through automatic service discovery, OpenTelemetry-based instrumentation, and unified monitoring. Key features include uninstrumented service support, cross-account application maps, adaptive sampling, and MCP servers for AI operations. Live demos showcase troubleshooting GenAI agents, mobile Real User Monitoring for iOS/Android, one-click instrumentation enablement, and GitHub action integration for automated code fixes. CCC Intelligent Solutions shares their implementation experience, achieving 50% MTTR reduction and 40% cost savings while processing 5 billion daily transactions across 17 million annual auto insurance claims using Application Signals' transaction search, application maps, and AI operations investigations.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
AI as a Transformative Tool: Introduction to Application Observability Challenges
Good morning everyone. Hope you're having a good morning, and welcome. We'll try to make this morning go even better for you. To start, I want to share an example Jeff Bezos gave in 2005 during a lecture to business leaders. He described how, in 1857, commercially produced toilet paper was introduced, and ever since, it's been really difficult to imagine going back to not having it. It was that much of a transformation.
So I think AI is like that. It's a tool that, once here, is going to be really hard for us to imagine going back and not having. I certainly can't imagine doing things without it now. So today we'll talk a lot about it. My name is Igor. I run Application Observability at AWS, and co-presenting with me today are Siva, Senior Solutions Architect at AWS, and Muthu, Senior Director at CCC Intelligent Solutions. We'll be diving into the topic of having AI in your applications and being able to run your business when AI is in them, as well as using AI for your own operational productivity. So we'll cover those topics.
Here are the common challenges we continue to hear. We've heard them a lot before, and we continue to hear them now from customers like you. Applications are increasing in complexity. There are more microservices, and these days with agents, you're going to have agents crafting SQL queries against databases you never imagined existed, right? So it's ever more important to understand the end user and customer impact when AI introduces new things into your applications without your developers even designing or programming them.
Number two is standardization. We've heard it's really difficult to operate multiple services and organizations when everything is different. Everybody invents their own metrics, their own log lines, their own way of representing a transaction. Not only does that reduce your productivity, it also affects the mobility of your engineers among teams. When they join a new team and everything is new and different, productivity is lower when they go on call and try to understand what's going on.
The third thing is the ability to prioritize. Not every anomaly affects your end users, so it's really important to understand what actually matters and what doesn't when anomalies happen. AI helps quite a bit, but usually you need to give a little guidance, to yourself or even to the AI, about what is really important: which incidents affect customers and which don't.
The fourth thing is just disjointed experiences. Multiple tools, one tool for logs, one tool for traces, another one for metrics, and everything is disjointed. And so when there is a spike of latency, it's really hard to understand what caused it, and you spend a lot of time going into a different tool or understanding how to share context and correlate things.
Those are the common challenges, and they don't go away when we start to introduce AI into applications. Moreover, today the entropy in your system is usually limited by humans' ability to make mistakes, maybe a bug checked in now and then, or by humans' ability to generate traffic against your website, your digital business. Now, with AI, the velocity of that entropy is just going to increase. As I said, AI can simply decide to use new tools or create dependencies you didn't have, based on some question an end user asked. So it's ever more important to deal with these challenges.
CloudWatch Application Signals: Standardizing Observability with OpenTelemetry
A couple of years ago we introduced Application Signals in CloudWatch to address these challenges and let you manage applications and your business transactions. What it does is provide discovery of your services, your applications, and their dependencies.
It standardizes operational practices around how to look at daily audits of your services, what's important, and what isn't important. These are things we actually do ourselves at AWS for our own services. Service level objectives allow you to prioritize and declare digital journeys, like mobile users logging in or making payments, indicating that's important to you. There's also an ability to correlate between metrics, logs, and traces in one unified experience so you can do root cause analysis immediately without jumping between tools.
How does it work? We collect telemetry using OpenTelemetry. For managed services like Lambda and EKS, we embed that into our runtimes and into our managed services. Otherwise, you can simply take the agents and collect the telemetry needed for this to work. We collect telemetry the same way from open source agent frameworks such as LangChain or Strands. We vend OpenTelemetry for you so that it's security scanned and packaged with everything you need to authenticate against AWS services very easily.
Telemetry then gets collected either through the CloudWatch Agent, which is an OpenTelemetry agent into which we've put a lot of improvements to compress telemetry, make adaptive sampling work, and generally help you with the cost of telemetry, or you can send it directly to the OpenTelemetry endpoint that we host. Either way, the telemetry gets processed and collected in CloudWatch and X-Ray. Think of it as one product where we derive topology, compute the metrics for you, and so on. Application Signals is then the experience that lets you look at your transactions, your application map, and your anomalies, now including agents.
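Because the collection path is plain OpenTelemetry, any OTLP-capable application can feed this pipeline. Below is a minimal, hedged sketch in Python showing manual span export over OTLP; the service name and local endpoint are assumptions, so point the exporter at whatever port your CloudWatch agent or collector actually listens on (auto-instrumentation via the vended agents needs no code at all).

```python
# Minimal sketch: emit OpenTelemetry spans over OTLP/HTTP so a local collector
# (for example, the CloudWatch agent's OTLP listener) can forward them.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "pet-clinic-frontend"})  # hypothetical service name
)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")  # assumed agent/collector endpoint
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("list-owners"):
    pass  # real application work would happen here
```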
Four Key Capabilities: From Automatic Discovery to AI Integration
So what are the four things that are important about Application Signals this year? First, you can see a complete picture of your application, in many cases even without instrumenting it. I will show you that in a demo. We've reduced a lot of the manual work required just to understand dependencies. Even if traffic isn't flowing, we know them based on metadata and configuration.
Number two, you can organize observability the way you work. What we mean by that is we automatically detect applications, we group things by connectivity, we group things by similarity of names and similarity of attributes. But we also allow you to tell us how you want to see it, like your business unit or your teams. We can organize applications by the way you are working. We'll show you that in a demo.
Number three, you can resolve issues in minutes using automatic operational audits. The things that developers do daily, we do for you on the backend. We audit every dependency you have, every latency, every API, and we derive the necessary insights for you to quickly address what is important without digging through and figuring it out yourself. We also do change audits. We detect changes in an application, such as new deployments, and understand which part is affected and, for example, that you have a new dependency because of that deployment. We'll show that in the demo as well.
And number four, very important, you can connect this tool to your own AI productivity tools. Whichever AI or agents you choose, you can connect these operational tools to them. Our MCPs and related tooling are designed to help AI be productive. You can save tokens by not scanning the telemetry fifty times for the same question. We provide tools like the operational audits and the change audits to the agents so they can be productive.
What's New This Year: AI Agent Instrumentation, Cross-Account Maps, and Enhanced Monitoring
So here's what's new this year. Number one, in Application Signals, we instrument AI agents with OpenTelemetry. We allow that to flow into the system and for you to see agents in your apps, how they interact, what they do, what databases they use, what users they respond to.
Number two, we do automatic app discovery. We used to show just a flat list of services. Now we automatically group them into meaningful units that we understand you may have. We also allow you, as I mentioned before, to tell us how you want to see it. Declare an attribute and say that attribute is my business unit name, so please always show me services by business unit. We'll allow you to do that.
We support uninstrumented services for ECS and things like Lambda, API Gateway, and many other services that comprise your application, even without any instrumentation or any agent. We draw the maps, we understand dependencies. We provide operational health audits and change impact audits. We now have cross-account application maps, so you can see applications, or parts of your application, that are shared across accounts or call across multiple accounts. We support automatic application discovery and instrumentation in EKS clusters without you actually having to install anything: in a safe manner, we go and put code in the right places so we can instrument services running in your EKS clusters for you.
We now support adaptive sampling. We used to do just fixed sampling; adaptive sampling lets you do cost-effective observability, because the sampling rate increases when there is an anomaly and you need to know about it. We also support exemplars on things like dependencies, so even when you sample, we take the important things and set them aside for you to inspect later.
For AI Ops, we released MCP servers that, as I said, all the tooling that we provide for humans, we provide them for agents, so agents can be productive with operation and observability of your applications. We also launched the GitHub action that allows your developers to leverage all the production telemetry to improve code. We'll do a demo of that.
Real User Monitoring, we expanded it to cover mobile, iOS, and Android. We now support source maps which allow you to understand the exact code line, even if your web JavaScript code was minified. And then we support OpenTelemetry ingestion for all the mobile telemetry.
In Synthetics, we support multi-browser tests, so you can have one canary exercise all the browsers for you at once. Synthetics already supported multiple languages along with Playwright, Selenium, and Puppeteer; now we've added support for Java. We allow you to do dry runs, so you can dry-run your canary before you put it in production and see what it would do and how it works. We now support small, lightweight API checks or simple ping checks in an efficient manner. And we now have region parity across all the commercial regions, and we support GovCloud.
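To make the Synthetics piece concrete, here is a minimal sketch of a heartbeat canary using the existing Python/Selenium runtime (the new addition mentioned in the talk is Java support, which is not shown here). The imports follow the documented Python/Selenium blueprint, but treat them as assumptions and check the blueprint for your runtime version; the target URL is a placeholder.

```python
# Minimal heartbeat canary sketch for the CloudWatch Synthetics Python/Selenium runtime.
from aws_synthetics.selenium import synthetics_webdriver as syn_webdriver
from aws_synthetics.common import synthetics_logger as logger


def main():
    url = "https://example.com"  # placeholder endpoint to check
    browser = syn_webdriver.Chrome()
    browser.get(url)
    logger.info("Page loaded: %s" % url)


def handler(event, context):
    # Entry point invoked by the Synthetics runtime on the canary's schedule.
    logger.info("Running heartbeat canary")
    return main()
```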
So these are the new things that we've introduced this year, based on the feedback and the requests that you guys made, basically. So now, to see it in action, I'm going to invite Siva on stage to show us all of this working. Thank you, Siva.
Demo: Troubleshooting GenAI Agents with Operational Audits and LLM Event Tracking
Thank you, Igor. Hello everybody. My name is Siva Guruvareddiar. Welcome again to re:Invent. I will be showing you a couple of demos of the latest and greatest features in Application Signals, but let's take a step back. Customers want their applications to be performant and available for their end users. For that, they want their observability to be centered around the way their teams work, to make maximum impact for their end users. We listened to all this feedback, and that's what I'm going to show you: what we have added to Application Signals over the last year.
We automatically run operational audits on your application APIs as well as your dependencies in real time, pinpointing root causes, so it's going to be a time saver for your on-call engineers. You might be wondering: a trace is more of a technical thing, so how do you connect that to business insights? Every time you send a trace, you want to be able to convert it into a business insight. Let's take the example of a GenAI agent.
Generative AI agents have their own minds, and you want to understand how the minds of these agents are working. In this case, we're using the Application Map, a new feature we have added. You can have your own custom grouping. Currently, I have application-level grouping, and when I search for my breaching SLIs, it comes up with a bunch of tiles. Continuing with the same demo, we are going with the Pet Clinic application.
In this case, I'm running some Pet Clinic agents, and currently you can see one is breaching an SLI. Instead of going into the weeds, digging through logs, metrics, and traces, I just click on View More, and using the Operational Audit feature I can easily see there is a dependency availability issue. Then I can see there is a fault rate increase on this one. But I want to understand what's going on here. There is a sample trace link over there; I can click on it and it takes me to an X-Ray page where every bit of information is shown.
The customer comes into my Pet Clinic front end, which then calls the Pet Clinic agent. In the spans timeline, I see a couple of exceptions. This is the very first exception, and it's the one being shown in the UI. But let's take a step back again, because we are talking about generative AI agents. I want to see and understand what is going on with the Pet Clinic agent. I click on View Exception, and this tells me exactly what the exception is, where it is coming from, which line number, which file.
This is valuable information I can feed to my development team to fix it, but again, generative AI agents are different. I want to send the context as well. My developer's life will be easier if I can send the context in addition to the exception message I'm seeing here. That's the feedback we listened to, and now we have this LLM Events tab. It's exclusively available for generative AI agents, and from here you can see all the context we are sending to the agents. Not only that, we also capture the exact question the customer asked when that exception happened. Using all this information, I can send it to my generative AI team and ask them to fix it.
Mobile Real User Monitoring: Identifying and Fixing Android App Crashes
Let's move on. The Pet Clinic application is so popular that we have launched mobile applications, both iOS and Android. But I want to understand how my users are experiencing my mobile application. We are happy to announce that we added the mobile Real User Monitoring feature recently, just a couple of weeks back. If I go to my Pet Clinic Java application, I see three different app monitors: one for web, one for Android, and one for iOS.
From the looks of it, my Android app is not doing well. When I look into the details, the right drawer comes in, and I see more than 300 crashes. I want to understand what's going on here. I can see from my Pet Clinic Android app monitor that many of these crashes are coming from just one screen, called Owner Details. I want to fix this ASAP because I don't want my mobile users to be disappointed.
I go to that particular screen, and from here I take a random session, and the exception is clearly visible, including the line-level details. I just copy this exception; we'll use it later. But before getting into the technical side of the house, I want to see the user impact, so I go to the impact analysis. From here I can see that in the last three hours alone, 240-plus users are impacted. That's a pretty serious issue. I just want to fix it.
Let me open my Android Studio. I feed it the thread dump I got from my Application Signals page, and that's pretty much it. It pinpoints exactly where the app is failing: it looks like we are receiving only one attribute but processing three attributes, and that's what the exception is saying. For now, I just delete the two attributes I don't need, because I want to confirm this is exactly the root cause, and then I run it in my local environment.
In my emulator, my application loads up, and I go to that exact same Owner Details screen which was crashing earlier. Now it is no longer crashing. So if you're running both mobile and web applications, with Application Signals and the Real User Monitoring feature it's easy for you to figure out what's going on.
One-Click Instrumentation: Enabling Application Signals for Uninstrumented Services
Then when there is an issue, it's easy for you to fix it, right? So let's move on with this one. Show me your hands if you're running applications where some of them could be uninstrumented, some of them could be instrumented, and you want an easy button to enable your instrumentation flow. That would be awesome, right? So how many of you are thinking that way? Yeah.
We are happy to announce that, in addition to instrumented services, we now show you both instrumented and uninstrumented services in our application map. To prove the point, I have two different but similar applications. One is called Ticketing Lite, and another is called Ticketing. Ticketing Lite is not instrumented; that's what you see shown with a dotted line. I want to understand why there is an SLI breach. Not only do we show you uninstrumented applications, you can also create SLIs on top of them. That's why you are seeing an SLI breach here.
From the looks of it, it is calling a bunch of services that I have no visibility into at all, and I don't know what's going on. There are a lot of availability issues, a lot of faults happening here. If I go into one of the services, the SubmitTicketLite Lambda, I see somebody has deployed something, so there is a lot of chaos. I want to understand what is going on. But unfortunately, I have not instrumented any of this, right? If I wanted to understand it the traditional way, I would need to ask my developer to instrument it, deploy it again, and then figure out the root cause.
That's why we listened to the feedback and made it easy to enable Application Signals. Now, with just one click of a button, I go to my grouping for this application, click on the three-dot button, and say enable Application Signals. It takes me to the enablement flow. From there, all the services grouped into this application, based on our understanding of it, are shown. It's up to you to pick and choose which services you want to enable.
Once I enable it, maybe after five minutes or so, the application starts showing how it is working under the hood. Now it makes much more sense, right? Because I'm seeing there is a ListTickets Lambda and a CreateTickets Lambda that were not in the picture before; we had not seen these. It is also doing a bunch of database operations, SQS messages, and so on. Now, if I click on View More and look at the Operational Audit to understand what's going on, it clearly says we have been sending messages to SQS that are pretty lengthy, but SQS can only accept so many bytes per message.
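The demo stops at the finding, but the fix it points toward is a size guard on the producer side; here is a hedged sketch of that idea. The 256 KB figure is the SQS per-message maximum, while the helper name, queue URL handling, and event shape are assumptions for illustration.

```python
import json
import boto3

sqs = boto3.client("sqs")
SQS_MAX_BYTES = 262_144  # SQS accepts at most 256 KB per message


def send_ticket_event(queue_url: str, event: dict) -> None:
    """Hypothetical producer helper: refuse (or offload) payloads that SQS would reject."""
    body = json.dumps(event)
    if len(body.encode("utf-8")) > SQS_MAX_BYTES:
        # A real fix might store the full payload in S3 and enqueue a pointer,
        # or strip verbose fields before sending.
        raise ValueError("Payload exceeds the 256 KB SQS message limit")
    sqs.send_message(QueueUrl=queue_url, MessageBody=body)
```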
With this feature, we have not opened an IDE, we have not asked the developer to instrument the code. It's easy. Just one click of a button, you can enable it. Even though I've shown the demo with Lambda, we are supporting other services as well, including EKS. So let's move on, right?
Change Impact Analysis: Detecting Deployment Issues and Rolling Back with Confidence
Again, show me your hands. You are a DevOps operator, things were working fine in production, then suddenly somebody deploys some changes, things are broken, and you are running from pillar to post to figure out what's going on, right? How many of you have faced this kind of thing? It's all in a day's work for a DevOps operator, right? I just want to show you how you can fix your applications before they impact your customers.
For this, I have an audit service, and it is having an SLA breach. I can also see that the last deployment happened 25 minutes ago. As a DevOps operator, I don't know the details, because somebody else deployed it, and I want to understand why they deployed it and what kind of changes they made. When I click on View Details, it takes me to a CloudTrail page. From there, all the information is shown: who deployed it, what time they deployed it, what configuration changes they made. Now I have a heuristic, but I don't know yet whether this is the root cause, and I want to confirm it.
Now I go to my RED metrics, and from here I can see that right after the deployment there is an availability issue and there is a fault. Now I'm convinced that this is the culprit, and I just want to roll it back. I go there and delete my latest deployment, which is causing all the issues, and once I delete it, it rolls back. Maybe five minutes or so later, when I come back to my screen, the SLA has recovered. But I want to confirm this indeed solved the problem, right? Again, I click on View More. This time it shows there was a deployment 32 minutes ago, which is our rollback. And now I can see from the RED metrics that after my second deployment, which is right after my rollback, availability has gone up to 100% and my 500 errors have gone to 0. That's exactly what we expected. You can easily see these kinds of things. We are not only tracking your deployments but also capturing any config changes. If somebody changes any of your configuration, a timeout or a parameter or anything for that matter, we capture all of that information so that everything you are looking for is available in your dashboard itself.
Cross-Account Application Maps: Unified Visibility Across Multiple AWS Accounts
Let's move on and talk about cross-account. Show me your hands if you are running multiple applications across multiple accounts in your environment. How many of you are already doing it? This is a classic problem. When I talk to customers, every team runs its own accounts. Sometimes the front-end team has its own accounts, and the back-end teams have their own accounts. When I talk to financial services customers, they may have lending in one account and loans in another, and so on. But all the applications are talking to each other all the time. Wouldn't it be useful if, as a service operator, I could see everything from one place? That's what I want to show you.
Let's see this in a quick demo. I have an appointment service gateway which, to prove the point, runs across accounts. When I go into the tile and click on the View More option, I can see from the Accounts tab that it is running in two different accounts. When I get into the details, I can see it is calling a bunch of services. From here, I can group by account ID. I say show me all the services in account ID 2341, and the corresponding services are shown. I can rinse and repeat for my other account, account ID 4084, and the corresponding services come up. Not only that, everything we talked about, like the operational audit and the change indicators, is available even though it spans accounts.
You might be asking the question, what is the connecting tissue here? The connecting tissue here is a database. I can do that as well. I can say the dependency type is DynamoDB. That's the database we are using here. From here, I can see with my DynamoDB both services from different accounts are showing up here. Like this, even if you're running hundreds of applications with hundreds of services or hundreds of accounts, it's easy for you because everything is available from one place.
AI-Powered Support: Using MCP Integration and GitHub Actions for Faster Issue Resolution
So far we have seen the traditional life of an operator, but you might say this is an era of AI. What are the things that you are having in store? Let's see a story, a simple story. Again, this is our Pet Clinic application. We have introduced a chatbot called PetPal AI. With PetPal AI you can ask any questions. Users can come in, they can schedule appointments, they can ask any questions for their nutrition information, and so on. Now I'm a customer, and I'm coming to the ReInvent Pet Care, our pet clinic application, and I wanted nutrition information for my pet rabbit. When I ask the question, the chatbot is giving me three different options. I am a customer, so I can choose whichever option is okay for me. I'm choosing the third option. I would like to place an order with this particular one. I'm saying okay, go ahead and place an order for this particular supplement that I'm looking for, for my pet rabbit.
I expect that to succeed, but unfortunately the chatbot says it's not available. As a user I'm kind of dejected, but I have other options. I go with the second option and say, okay, you said the third option is not available, go with the second option and place an order. But unfortunately, it comes back with the same thing: it recommends products, and when I try to place an order, it says they're not available. I get a little frustrated, and I do what a human would do. I say, tell me exactly the pet food that you have in stock, so that I can be sure I can place an order. Again, the chatbot comes up with three other options for me. I go with the first one and say, this one, you said it is available in stock, go ahead and place an order. But again, unfortunately, the same problem. It says there is a technical issue and it cannot place an order. I get frustrated.
I go to my pet clinic support system and create a feedback ticket. I'm saying, hey, your chatbot is recommending me products, and when I try to place an order, it's not working. What's going on? You might be having a lot of users or customers who are not happy with the agent and the way the agent is working.
Now I'm a support engineer in the pet clinic application. I come to my dashboard, and many of the customers are saying the same thing. Everybody's saying the chatbot is recommending products, but when they try to place an order, it's not working. In a traditional manner, if I'm a support engineer, I should have gone to logs, metrics, and traces. But this is an era of generative AI, so I open up my Kiro IDE, which has MCP integration with Application Signals.
From here, I ask the question. As you can see, much like a human would, it performs a lot of operations. It talks to MCP tooling, multiple MCP tools for that matter, and as it gets responses, it knows what questions it needs to ask next. Finally, it comes up with a hypothesis. It lists the key findings and the next steps. As a support engineer, I get useful information, and this is easy for me compared to going through logs, metrics, and traces and diving into the code.
Now I want to take this information and go to my development team. In this case, I'm just using a Slack channel. It doesn't mean you have to use this as the only medium; it's just an example. Here I'm saying, hey, all our customers are complaining about this one. Looks like our chatbot is broken. Can somebody go ahead and fix it? To sum it up, as a support engineer, instead of doing things the manual way in a traditional manner, they can leverage the help from MCP and other generative AI tooling.
Developer Workflow: Automated Code Fixes with AWS GitHub Action Integration
With that, let's go to the last one. We have recently introduced GitHub action integration. Let's continue the story. To sum it up, we have our support engineer complaining that users are saying they are not able to place an order even though the chatbot is recommending products. Now it's coming to me, and I'm a developer. Again, instead of doing things the traditional manner, I'm using the AWS GitHub action. I say AWS APM, this is my scenario, fix it.
Now the GitHub action fetches the code from GitHub, and it knows where my applications are running. It does basically the same correlation that MCP has done, and then it comes up with its own hypothesis. You might be asking how this is different. As you can see here, this is a little bit different because we are seeing file-level details and line-level details about what is going on. In this case, the AWS APM agent is saying, hey, these are all my recommendations, these are all the issues. What do you want me to do?
I'm a developer, I'm in the driver's seat, so now it's up to me to figure out how to fix the issue. I say, okay, AWS APM, go to this particular file, go to this particular line, and fix this. All I'm saying is: go to this nutrition information Python file, go to lines 76 to 88, and change the system prompt. Also, even though the agent says the validation layer is a low-priority item, as the developer I know it's high priority for me, so I say fix the validation layer as well. Not only that, I put some guardrails around it and say do not do any kind of database modification, because I don't want my agent to do some crazy database operation.
Once I do that, I'm asking the AWS APM to create a pull request with all the changes incorporated. That's what the AWS APM GitHub agent is doing. After some time, it comes back with this particular result. From here, I'm going into the pull request. I understand what the problem statement is that it understood, what the solution is that it's doing, and what kind of changes and testing it has done. I go to the file changes because I want to review the changes that have been made. Here I'm saying, okay, the validation layer, I asked the agent to change it, and that's what it did. I'm convinced that whatever I asked the agent to do, that's what it did, so I want to approve it.
I go to my tab and say LGTM, which is looks good to me, and then I say approve. Once I approve it, somebody merges the code, and then the CI/CD flow kicks in. After that, my changes are going to staging, and I want to test the same flow again. Now I'm coming back to my same flow, and here I'm asking the exact same question my customer is asking.
This time I ask, hey, what are the pet food options available for my pet rabbit? But this time the answer is different. It says it doesn't have those details because it doesn't know anything about my pet, but it can schedule an appointment for me. So I say I'm available on Saturday, book an appointment for me, and it books an appointment for 11:30 on Saturday.
So, to sum it up: everything you have seen, the operational audits, the uninstrumented service support, the MCP integration, the GitHub flow, and all the other latest and greatest features we are offering, will make your observability journey easier with Application Signals. But don't take my word for it. With me I have Muthu from CCC Intelligent Solutions, who is coming up to talk about how Application Signals and CloudWatch as a whole are helping their observability journey. Without further ado, I'll invite Muthu to the stage. Thank you.
CCC Intelligent Solutions: Managing 5 Billion Daily Transactions in Auto Claims Industry
Thank you, Siva and Igor, for the informative and power-packed demonstration. That's a lot of new features. Well, good morning, everyone. My goal here is to share my experience and help you walk out with a few things that you can apply in your environment. I have 25 minutes, so let's dive in. Can anyone guess how many cars are involved in accidents per year in the US? Can anyone make a guess? Please raise your hand if you think it's 5 million. 10 million. 20 million.
In the US we have approximately 300 million registered cars, and 7% of them are involved in accidents every year. That means 20-plus million cars are involved in accidents every year. You must be wondering why I'm talking about car accidents in a CloudWatch session. Well, I'm Muthu, Muthukumaran Ardhanary from CCC Intelligent Solutions. CCC is the leader in the auto claims and collision repair industry. When life brings the unexpected, CCC gets to work. We unite the entire industry: auto insurers, repair facilities, automakers, parts suppliers. We bring them together to help people get back on their journey, get back to their life, get back on track.
To achieve this, CCC makes billions of decisions every single day. We work with 300-plus auto insurers, 30,000-plus collision repairers, and 5,000-plus parts suppliers, and we handle 17 million claims annually. To support this, we have thousands of APIs and several portals running on thousands of servers and hundreds of databases. Every auto claim goes through hundreds, if not thousands, of transactions before the car is fully repaired and handed back to the customer, the car owner. Our systems handle 5 billion transactions on a typical business day, and we have a sophisticated setup to handle those billions of transactions and ensure smooth operation. We have a team of SREs, and I manage the SRE team at CCC.
Well, so we live in a world where accidents happen, even if you are in a self-driving car. There is a possibility you may get into accidents. Maybe not your fault, maybe not the self-driving car's fault, but you may still get into accidents. There is a possibility. Similarly, if you have even best in class systems, the systems may run into failures or issues. At times you may have to deal with some of the issues in the systems.
So my theme today is all about accidents, I mean system failures or system issues, and how we deal with that using the CloudWatch Application Signals. Today, I will show you how CCC is using CloudWatch Application Signals to troubleshoot issues and what are the things that we learned working with Application Signals for the last few years. And what are the key benefits, the business metrics that we see, how we measure, and what we see from the benefits standpoint.
Okay, let's start with the APM journey. CCC started its APM journey with Application Signals around late 2023 or early 2024, while we were moving all our workloads to AWS. At the same time, we were modernizing as well: we moved our workloads from traditional EC2 to the EKS platform. So we were looking for a cloud-native solution that could support that observability, and that's when AWS released Application Signals at re:Invent, around late 2023.
We started doing a proof of concept, and four things stood out for us during the POC: automatic instrumentation, the unified workflow, the pre-built dashboards, and cross-account observability. Cross-account observability is something that I love. It's basically a single pane of glass. If you have applications running in multiple accounts, you don't want your SREs to log into multiple accounts to troubleshoot. You just need one account set up as an observability account, and you go to one place to manage everything in your infrastructure, workloads, and environment. That's what cross-account observability is all about.
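Under the hood, CloudWatch cross-account observability is wired up through Observability Access Manager: a sink in the monitoring account and a link from each source account. The sketch below assumes that OAM-based setup (the console offers the same flow) and omits the sink policy and the cross-account credentials for brevity.

```python
import boto3

# In the central monitoring (observability) account: create a sink.
monitoring = boto3.client("oam", region_name="us-east-1")
sink = monitoring.create_sink(Name="central-observability-sink")
sink_arn = sink["Arn"]
# A sink policy allowing the source accounts (or the whole organization) must also
# be attached with put_sink_policy() before links can be created.

# In each source (workload) account, using that account's credentials: create a link.
source = boto3.client("oam", region_name="us-east-1")
source.create_link(
    LabelTemplate="$AccountName",  # how the source account is labeled in the console
    ResourceTypes=[
        "AWS::CloudWatch::Metric",
        "AWS::Logs::LogGroup",
        "AWS::XRay::Trace",
    ],
    SinkIdentifier=sink_arn,
)
```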
Troubleshooting with Application Signals: From Transaction Search to Interactive Application Maps
Now let's look at a few scenarios and how we troubleshoot issues. This screenshot shows a latency spike. You see the baseline is around 1,000 milliseconds or so, and the spike goes up to 2,000-plus milliseconds. When you don't have a proper APM tool or a proper process in place to troubleshoot such issues, what you do when you get an alert is look at the alarms, or the metrics, or synthetic monitors, logs, or dashboards. All of that will tell you something about the issue, but it will not give you the root cause.
What we do when we have such a latency spike alert is go to transaction search. In transaction search, on the left-hand side, we search for the spans with higher latency, and it gives us the corresponding traces with that higher latency. If I click on a trace, it shows the span timeline. In this case, it shows that the POST call, the SQL query I'm running, is actually taking longer: 1.23 minutes. If you click on that line, it shows the actual SELECT query that was executed. Obviously you cannot see it here; I have masked it for security reasons. Once you have the query, if your database runs on AWS with Performance Insights and CloudWatch, you can go there and understand why that query was taking time, maybe a missing index or some other reason. This is one way to quickly troubleshoot an issue.
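The walkthrough uses the Transaction Search console, but the same hunt for slow traces can be approximated programmatically through the X-Ray API. The sketch below is only an illustration, not CCC's tooling; the one-hour window and 5-second threshold are arbitrary assumptions.

```python
from datetime import datetime, timedelta, timezone
import boto3

xray = boto3.client("xray")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# Find traces from the last hour whose total duration exceeds 5 seconds.
summaries = xray.get_trace_summaries(
    StartTime=start,
    EndTime=end,
    FilterExpression="duration > 5",
)
trace_ids = [s["Id"] for s in summaries["TraceSummaries"]][:5]

# Fetch the full traces to inspect individual segments, e.g. the slow SQL call.
if trace_ids:
    traces = xray.batch_get_traces(TraceIds=trace_ids)
    for t in traces["Traces"]:
        print(t["Id"], t["Duration"])
```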
Let's look at other ways to troubleshoot issues. The application map is something that was recently launched, and we were a beta customer working closely with the product team, sharing all our feedback. This is something that I love most because it gives you the visual representation of all your workloads in your environment. Basically, what you see is multiple tiles here, and each tile represents the services in my environment. Each tile is a workload, and each tile actually understands the interaction of that service with all other services in my environment. When you click on each one of them, it will give you the interactive map of that system in your environment. You will see that in the next slide.
This also shows the real-time health of your system, which you can see at the top. Some circles are red, some in the middle are yellow, and most are blue, which means they're healthy. When you want to troubleshoot something, you can simply filter based on what you are interested in. In this case, I'm looking for server faults. I filter by server faults, and it shows four services or workloads with server faults, and then I can click on one of them to zoom in and understand it.
If I click on one of them, it opens the interactive map, and that helps me understand my service is talking to a database and there are other multiple remote services I'm talking to. With that information, I'll be able to get the complete picture of my service. If I want to understand the root cause, when I click on the service again, it actually opens that contextual troubleshooting drawer with the relevant metrics and insights. Here it shows all the health information of that and why it is even slow. From here, I can even further go to that correlated span that actually shows the trace for the corresponding time period, why that was actually slow.
You'll be able to go to the correlated span to get to the trace; you don't have to go to transaction search. From here itself you can get to the trace, and when you click on it, you can identify the root cause. If you want to see the dashboard, you can click on the View Dashboard button, which takes you to the Application Signals dashboard. If you are a visual person, or if you have an L1 resource who doesn't have the complete picture and needs a visual representation to navigate the troubleshooting process, the application map is really helpful. Also, if you want the complete picture of all your workloads in your environment, you can use it, and we've started using it for all our troubleshooting.
Hurricane Response and AI Operations: Handling Volume Spikes with Automated Investigation
Now let's look at another scenario and how the insurance industry is impacted: unexpected events like hurricanes and how they affect the insurance industry and our systems.
Before we go there, can anyone guess how many cars were damaged in last year's hurricane in 2024? I'll give you the number. Last year we had two major hurricanes. One is Hurricane Helene, and the other one is Hurricane Milton. Helene impacted 140,000 cars in southeastern states, and Milton impacted around 120,000 cars just in Florida state itself. Overall last year, we had around 350,000 cars damaged just by the hurricanes.
What happens is when a hurricane occurs, people stay indoors. They want to stay in a safe place, but the cars are on the road. Even if they're in the driveway, you know what happens with the hurricane. It damages the cars. When the rain stops, people want to get back to their life, and the first thing they need is their car. All of them, the moment the rain stops, everybody picks up the phone and calls the insurance company to get the car repaired.
Imagine a million people calling the insurance companies around the same time. The volume of claims goes up by literally five to seven times. On a typical day we get 50,000 to 60,000 claims; that's 60,000 accidents a day we deal with. But in a short span of time, the volume goes up five to six times. Fortunately, we have a state-of-the-art system that automatically scales and handles the volume. However, there are times we may run into small performance issues or minor things that we have to deal with. I'll show you how we handle such a situation using AI operations investigations.
Let's look at this trace map for a second. This trace map shows multiple services are involved in a single transaction. You see there are multiple services and databases. The transaction is going back and forth. This is one of the complex transactions that we have in our environment for our claims processing. When you have such complex transactions, you cannot just look at one data point to come up with a conclusion or a root cause for the issue. You have to look at multiple data points.
When you have to look at multiple data points, it means you are taking more time. Your MTTR is going up, and if your MTTR is going up, you are impacting the customer, and the customer is not happy. What we do is use the AI operations investigation. It is agent-based and part of CloudWatch itself: it runs an agent behind the scenes that looks at multiple data points within a short span of time.
It significantly reduces your MTTR, by 95 to 98 percent; it takes only three or four minutes for all the data analysis to complete. I'll show you on the next screen. You can launch the AI operations investigation from the Application Signals page itself, or alternatively from the application map. If you launch it from the application map, it automatically understands the context: it takes the time window and uses all the metrics from the time frame at which you started the investigation.
Once you start the investigation, the agent runs behind the scenes, and you can access it from AI operations investigations. The agent shows which telemetry data points it is looking at, and all of that analysis is available here. In this case, it's running 42 tasks. It analyzes the database and all the component services that the transaction involves, and it gives you a summary of hypotheses.
Anytime you have an issue, you need three things. One, you need the data points. What are the data points that you have to analyze? Then you need to come up with a hypothesis based on the data points, and then you come up with the action plan. The agent is already looking at all the data points for you, and then it comes up with the hypothesis and findings. As you can see here, it has a hypothesis. There are some database-related hypotheses here, and then it gives you all the findings.
Based on this information, you'll be able to come up with a conclusion. This is what you have to focus on. Your action item is very clear at this point. All of this is done within a few minutes. But if you have to have a human SRE do this, it may take one hour or maybe two hours. You may still be looking at the data point. You may not conclude anything, but everything is done through that AI operations investigation by the agent. This is going to augment your SRE and is really helpful when you have complex troubleshooting.
Gen AI Observability: Monitoring Bedrock Models for Token Usage and Invocation Latency
Well, now let's look at Gen AI observability. So CCC offers AI-powered tools and technologies to our customers, and we use models. We have our own models running on SageMaker, and we use models from Bedrock. So for one of the use cases, we were trying to use some models from Bedrock. That was one of the first use cases for Bedrock. And we sat with our product development team to understand the observability.
The product development team was trying to create a summary, as you can see on the screen. On our website, we have collision repair shop reviews, hundreds of reviews per shop, and they wanted to summarize them using AI models, so that you don't have to read hundreds of reviews; you can just look at one summary. For this, we wanted to do a proof of concept and understand everything from the model standpoint. But the key questions were: how do I understand the invocation latency? If the application is invoking a model, how do I understand the number of tokens it's using, the input tokens and the output tokens?
We were looking for that observability at the time, and we reached out to AWS. AWS was already working on Gen AI observability, and everything came out of the box, so we didn't have to do anything. Gen AI observability is another new capability that was released recently, and it has all the information we were looking for: invocation latency, token count by model, and input and output tokens. You can also look at it per model; if you are using multiple models, you can click on each one and see its information. This was really helpful for supporting what we were using from Bedrock.
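For reference, the per-call numbers that the Gen AI observability dashboard aggregates, token counts and invocation latency, are also returned directly by the Bedrock Converse API; the model ID and prompt below are placeholders.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder; use a model you have access to
    messages=[{
        "role": "user",
        "content": [{"text": "Summarize these collision repair shop reviews: ..."}],
    }],
)

# Per-invocation usage and latency, the same signals surfaced per model in the dashboard.
print("Input tokens: ", response["usage"]["inputTokens"])
print("Output tokens:", response["usage"]["outputTokens"])
print("Latency (ms): ", response["metrics"]["latencyMs"])
```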
Lessons Learned and Business Impact: 50% MTTR Reduction and 40% Cost Savings at CCC
Okay, now let's look at what we learned working with Application Signals for the last two-plus years. These are the things we learned, and I hope they will help you in your environment as well. Maybe you already know some of them; maybe you will learn something new today. First, transaction search: you saw how I used it to troubleshoot, but a lot of times people are concerned about the cost. I suggest using transaction search when you want 100% span visibility at a lower cost.
Let's say you have a lower sampling rate in your non-production environment or for less critical applications, but you want 100% visibility at the time you have an alert, errors, or high latency. You can use adaptive sampling for that. Adaptive sampling increases the sampling rate to a higher number that you define, possibly up to 100%. The moment an alarm triggers, your sampling rate goes up; the moment the alarm returns to an OK state, the sampling rate comes back down. This is really helpful if you want 100% visibility in a non-production or less critical application: you can simply boost the sampling rate with it.
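Conceptually, the behavior described here looks like the toy function below: a low steady-state rate that is boosted while a related alarm is breaching. This is only an illustration of the idea, not the Application Signals implementation or its configuration API.

```python
def effective_sampling_rate(alarm_state: str,
                            baseline_rate: float = 0.05,
                            boosted_rate: float = 1.0) -> float:
    """Toy model of adaptive sampling: sample a small fraction of requests normally,
    and boost toward 100% while a related alarm is in the ALARM state."""
    return boosted_rate if alarm_state == "ALARM" else baseline_rate


# Steady state: 5% of traces. During an alarm: every trace is kept.
print(effective_sampling_rate("OK"))     # 0.05
print(effective_sampling_rate("ALARM"))  # 1.0
```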
As for AI operations, I suggest using it to augment your SRE team. It's going to be really helpful: they don't have to spend as much time troubleshooting, and it will reduce your MTTR significantly. And the application map is for visual people, whether you are an L1 resource or a senior engineer. If you want the full picture, the application map is really helpful for getting workload visibility in a visual way.
For the pre-built dashboards, I suggest using them for all the RED metrics and even for setting up alarms. You can set up alarms from the dashboard itself; with a few clicks you'll have the alarms you need. Then there are the runtime metrics, something we worked on closely with the Application Signals team from the beginning. We really wanted runtime metrics to understand memory, the heap, and garbage collection. They're helpful if you have memory-intensive applications and want to understand how your application is performing at runtime.
For SLO alarms, if you have SLIs defined and you are looking at SLO alarm setup, I suggest defining the SLO alarms at the operation level rather than only at the service level. If you have multiple operations, defining alarms at the operation level, and at the dependency level, gives you more granular information about your application. And Container Insights comes out of the box the moment you install the agent; it helps you get the complete picture and all the metrics you need from the container standpoint.
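As a rough illustration of an operation-level alarm, the sketch below creates a p99 latency alarm scoped to a single service and operation. The namespace, metric, and dimension names are assumptions about what Application Signals emits; check the metrics visible in your own account (or create the alarm from the pre-built dashboard, as suggested above) before relying on them.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="checkout-service-post-orders-p99-latency",
    Namespace="ApplicationSignals",        # assumed namespace used by Application Signals
    MetricName="Latency",                  # assumed metric name
    Dimensions=[
        {"Name": "Service", "Value": "checkout-service"},  # hypothetical service
        {"Name": "Operation", "Value": "POST /orders"},    # hypothetical operation
    ],
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=5,
    Threshold=2000.0,                      # milliseconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```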
Now, let's look at what are the business metrics that we measure and what are the key benefits that we see with Application Signals. So overall, we see 50% reduction in MTTR.
These are the key benefits that we see with CloudWatch Application Signals. We're able to reduce the cost by 40%, and we improve productivity not just for the developers but also for our SRE team and testing team. Overall, we see significant improvement in system performance. On one side, we see that productivity goes up, and on the other side, we see system performance also goes up because we are able to proactively identify issues before even the customer knows something is wrong. We are able to address it quickly because we reduce our MTTR, and that has significantly helped our systems improve performance.
CloudWatch Application Signals can help you in many ways to reduce the MTTR and also the cost and other things. I suggest you give it a try today. It takes only a few clicks. If you haven't tried it, just go and try it today with a few clicks to install the agent, and then you will see that efficiency improving. You will see the observability stack visibility going up overnight.
With that, I would like to wrap. I want to leave you all with one message. CloudWatch can help you in many ways from the observability standpoint, and CCC can help speed up the repair process and improve the efficiency, but only you can actually prevent the accidents. Whether it's a car accident or a system accident, only you can prevent it. So always remember to drive like your grandma is watching.
With that, I'd like to wrap up. Again, thank you for listening. Your attention span is stronger than my mobile signal in this building. Thank you, and I'll call Igor and Siva back. Thank you, Muthu. Thank you, Siva. Now, you know, if you have a claim, Muthu will take care of it. AWS and AI will help, but drive safe. That's the first thing.
So thank you for sharing your time with us. If you like the session, give us a thumbs up review. We'll take questions here off stage, and again, thank you for being here.
This article is entirely auto-generated using Amazon Bedrock.