🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - Comprehensive observability with Mobile RUM and Application Map (COP364)
In this video, Siva Guruvareddiar and Alex Nazaroff from AWS demonstrate comprehensive observability using Application Signals, focusing on the new Application Map feature and Mobile Real User Monitoring. They explain how Application Signals provides APM capabilities based on OpenTelemetry standards and RED metrics (requests, errors, duration). The session includes live demonstrations of Application Map's dynamic grouping, automatic topology discovery, and operational audit features that proactively identify service issues. They showcase Mobile RUM for iOS and Android, recently launched to track crashes and performance metrics in real time. The demo illustrates end-to-end tracing from mobile apps to backend services using correlation IDs, troubleshooting Lambda functions through the CloudWatch Application Signals MCP server in IDEs, and using Transaction Search to find specific traces stored as logs. They also demonstrate custom grouping using AWS tags and OpenTelemetry resource attributes to organize services by application, tier, or team ownership across multiple accounts.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction to Application Signals: AWS's Approach to Application Performance Monitoring
Hello everybody. Thank you for coming in. Good afternoon. My name is Siva Guruvareddiar. I'm a Senior Specialist Architect here at AWS. Welcome to this session on comprehensive observability with Mobile Real User Monitoring and Application Map. With me I have Alex, a Principal Engineer at AWS, and we'd like to run this as a code talk: we want to dive into the code and walk through the various features. But before getting into the details, how many of you are using Application Signals in your product or your company? Maybe that's bad marketing on our side. How many of you have heard of Application Signals? Good. Application Signals is our way of doing APM, Application Performance Monitoring.
I'd like to level-set first. We'll start with a very short, somewhat theoretical look at what this is all about, then dive into the code. Here's our agenda: a high-level overview, then the latest features, including Application Map and Mobile Real User Monitoring, which we launched just a couple of weeks ago, and finally the code itself.
So Application Signals, at a high level, is our version of Application Performance Monitoring. You have an inventory of services running in your application or account. You want to understand, monitor, and observe them, and when things fail, you want to fix them as soon as possible. That's what it's all about. We'll show you, from the console, all the services available in your account, whether or not they're instrumented, and we'll walk through a demo.
It's all based on golden metrics. In the APM world, the golden signals are what we call RED metrics: R for requests, E for errors and faults, and D for duration, or latency. Everything you see in the Application Signals console is based on these RED metrics, plus dependency tracking. The beauty here is that you don't send us any metadata saying, hey, my Service A is calling Service B. We derive all of that from the distributed tracing information and figure out your dependencies for you.
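To make the mapping concrete, here is a minimal, hedged OpenTelemetry sketch in Python (service and function names are invented): each span is one request (R), its status records errors (E), and its start and end timestamps give duration (D). Application Signals derives the same signals from your real spans; this is only an illustration of the idea.

```python
# Minimal sketch: one OpenTelemetry span per request carries all three
# RED signals. Names below are invented for illustration.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.trace import StatusCode

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("payment-service")  # hypothetical service


def process(order_id: str) -> None:
    """Hypothetical business logic; fails on an empty order ID."""
    if not order_id:
        raise ValueError("missing order id")


def handle_request(order_id: str) -> None:
    # R: each span is one request. D: the span's timestamps give duration.
    with tracer.start_as_current_span("POST /orders") as span:
        try:
            process(order_id)
        except Exception as exc:
            # E: marking the span as an error feeds the errors/faults signal.
            span.set_status(StatusCode.ERROR, str(exc))
            raise


handle_request("order-123")
```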
It's all based on open standards; we use OpenTelemetry as the underlying mechanism. At the next level, you want to connect your technical signals to your business metrics, and that's where you create SLOs, Service Level Objectives, for your services. The topology map also shows how your services relate to one another as they send metrics to us.
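As an illustration of tying a service to a business objective, here is a hedged boto3 sketch of creating a latency SLO with the Application Signals CreateServiceLevelObjective API. The parameter shapes are written from memory, and the service name, threshold, and environment are invented, so verify the exact fields against the current boto3 documentation.

```python
# Hedged sketch: a latency SLO for a discovered service. Field names are
# from memory of the CreateServiceLevelObjective API; verify before use.
import boto3

signals = boto3.client("application-signals", region_name="us-east-1")

signals.create_service_level_objective(
    Name="payment-p99-latency",
    Description="99% of payment requests complete within 500 ms",
    SliConfig={
        "SliMetricConfig": {
            # KeyAttributes identify the service Application Signals discovered.
            "KeyAttributes": {
                "Type": "Service",
                "Name": "payment-service",          # invented
                "Environment": "eks:prod-cluster",  # invented
            },
            "OperationName": "POST /payments",
            "MetricType": "LATENCY",
            "PeriodSeconds": 60,
            "Statistic": "p99",
        },
        "MetricThreshold": 500.0,  # milliseconds
        "ComparisonOperator": "LessThanOrEqualTo",
    },
    Goal={
        "Interval": {"RollingInterval": {"DurationUnit": "DAY", "Duration": 7}},
        "AttainmentGoal": 99.0,
    },
)
```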
As we discussed, all of this is based on distributed tracing. Last year we introduced Transaction Search: we take your traces and store them as logs, so you can search for any transaction in your business applications the same way you typically would with CloudWatch Logs. And last but not least, we support Real User Monitoring as well as canaries. We've supported Real User Monitoring for the web for a long time, and recently we added support for mobile applications, both iOS and Android. Synthetic canaries are supported too, if you want to run synthetic tests against your APIs and endpoints.
Application Map: Visualizing Service Dependencies and Proactive Operational Insights
So let's talk about the newer feature called Application Map. It has evolved in stages: we started with Trace Map, then Service Map, and it is now Application Map. As you can see in the picture at the upper right, you get a set of tiles; we'll show you a demo of this. Each tile represents an application, with related services grouped together; that's the default grouping. But you can create your own custom grouping: a front-end grouping and a backend grouping, for example, or groupings by business unit. If I run a financial application, my loan business unit could be one tile and my lending business unit another. You can segregate your applications however you want.
If you click into a tile, it expands into each service with its dependencies. The lower part of the picture shows that you can do all of this from one place, and every tile can be clicked recursively: you keep drilling down until you reach the level you want. So how do you navigate the map? As I said, you can group it the way you want; the default grouping is related services based on how you send metrics, but you can regroup, and because it's a map, you can zoom in and out to focus wherever attention is needed.
In the right-hand drawer there are plenty of details, starting with the health metrics of your services. In this case I have a Pet Clinic application, showing how healthy my services are, how many faults I'm getting, and, for any SLOs I've created, whether they are healthy or unhealthy. In addition, you see all your RED metrics: the number of requests, errors, faults, and latency.
One more capability we recently added is the operational audit. With operational audit, we proactively tell you that a particular service is having, say, latency issues and suggest you take a look. The purely reactive days are gone; people want to work proactively now. The audit doesn't just tell you something is happening: if you want to dive deep, you click one of the traces it shows you, and it explains exactly how it concluded that the service needs attention.
Real User Monitoring for Mobile: Extending Observability from Backend to End Users
So let's switch gears and talk about Real User Monitoring. If you want to record your end users' interactions, say from their browsers, and draw insights from them, Real User Monitoring is the tool. We've supported web for a long time, and we recently added support for both Android and iOS. With it you get a lot of information: how healthy your application is on both mobile and web; for web, details such as how many JavaScript errors are occurring; and in the mobile world, the number of crashes and so on. These charts give you a wealth of information.
Extending to mobile, these are some of the charts you get. The first is a comprehensive dashboard showing your performance metrics, crash analytics, and related information. All of it arrives in real time: when a customer hits a crash in your mobile application, it's recorded in Application Signals and Real User Monitoring with very little latency. The moment they experience it, you see it.
Talking to mobile app teams, they want to see the important things: how long the app takes to load, the Application Performance Index, and so on. The mobile world is split between iOS and Android. In the Android world we talk about ANR, Application Not Responding; in the iOS world, crash analytics. Depending on the kind of mobile application, we present the information in the language mobile developers already use: crash analytics for iOS, ANR details for Android.
You also get contextual information such as which iOS or Android versions users are running and what types of devices they're using; all of that is captured and made available to you. So we've talked about two different things: the services running in your backend, and Real User Monitoring on web or mobile. How are they interconnected? For that, we use correlation IDs.
When your customer performs an operation, we create a correlation ID that is tracked over multiple hops. This is based on OpenTelemetry open standards. End-to-end transaction tracing is available for you, so once an error occurs, you can easily track it all the way from your mobile application to your backend service.
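Mechanically, that correlation rides on W3C trace context. Here is a minimal, hedged Python sketch of what happens on any hop (the URL and names are invented): OpenTelemetry's propagator injects a traceparent header carrying the trace ID, and downstream services extract it so their spans join the same end-to-end trace.

```python
# Minimal sketch: propagate the trace context (the "correlation ID") to
# the next hop over HTTP. The endpoint is invented.
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("mobile-backend")  # hypothetical

with tracer.start_as_current_span("submit-order"):
    headers: dict = {}
    inject(headers)  # adds the W3C 'traceparent' header for the current span
    # The downstream service extracts the same context, so its spans join
    # this trace and an error can be followed end to end.
    requests.post("https://api.example.com/orders",
                  json={"orderId": 123}, headers=headers)
```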
With the newer Application Map feature, you get automatic topology discovery and visualization, so from one place you can see how everything is interconnected. It also shows service dependencies, so if you're wondering how Service A calls Service B, that information is right there. Performance metrics are overlaid at each tier as well. You may have noticed that every tile has a donut chart; it's color-coded, so when something has issues it shows red, indicating it needs attention.
At a high level, from a thousand-foot overview of your application, you can easily see which tiles are red and which are green, which makes bottleneck identification easy through visual indicators. Here is how the architecture works. Your mobile applications, whether iOS or Android, send telemetry to CloudWatch Real User Monitoring, which takes those metrics and forwards them to Application Signals. On the bottom side are your backend services, which send their telemetry as well, all of it via OpenTelemetry open standards.
As a user, viewing from the right-hand side through Application Signals and Application Map, you can not only monitor your applications but also, when something breaks, see how to fix it. That's what Alex's code demonstration will show. So with that, let's go to the code.
Navigating Complex Distributed Systems with Dynamic Grouping and Auto-Discovery
Hi, everyone. As software engineers and system architects, how many times have you looked at an architecture diagram like this and thought to yourself, well, that is one really ugly spaghetti monster? And how many of you have actually built a system like that? I did. My name is Alex Nazaroff. I'm a Principal Engineer in CloudWatch.
Many projects start with a simple, beautiful architecture. But then the business grows: you add features, redundancy, security, a search index, analytics, and the list goes on. Eventually the system becomes so difficult to comprehend that even the people who built it will, six months down the road, have a hard time understanding what's happening under the hood. The problem is not that distributed systems are over-engineered. The reality is that real-life scenarios require building complex systems, and these systems stretch human abilities to understand and keep track of what's going on.
When we approached this challenge, we talked to our customers. And they said, you know what, instead of this, I want a way to look at everything I have across all my accounts on one page, kind of like this. Show me at a high level what I have and help me understand the health status of each application. When I'm interested, I want to dive deep, go inside and see its guts, see what's happening under the hood. But then I should be able to go back, zoom out, and look at it at a high level again.
Let me introduce the new Application Map. What we introduced here is dynamic grouping. Application Signals makes a best effort to discover resources like API gateways, load balancers, EKS and ECS clusters, EC2 Auto Scaling groups, and others. It uses them as a backbone, the tip of the iceberg, to present a structure that forms the applications you're hosting on AWS.
Previously, Application Signals only worked with instrumented services; you had to instrument your code. But we recently added functionality, highlighted on this map with dotted lines, that shows applications we believe are entry points into something much bigger. They're not instrumented, yet you still get the benefits of a minimalistic APM for these services.
For example, I have this Ticketing Lite service that essentially allows customers to submit tickets. When I explore the details for this application, I see first of all that it discovered four services under the hood. It tells me it is breaching some SLOs. It shows the most recent deployments for this application, taken from CloudTrail, and the golden signal metrics: availability, latency, server faults, and user errors.
I can dive inside it, and again, everything is dotted lines here, meaning it's not instrumented, but it's pretty straightforward: an API gateway with three Lambda functions implementing the APIs behind it for list tickets, read tickets, and submit tickets. Even from this map I can already see an issue my customers are facing: the traffic flows from the API gateway into submit ticket, and submit ticket has issues.
From this overall description, I can only see that it was deployed recently; deployments are marked as vertical lines on the graphs, but I don't see any correlation with deployments. So at this point, the best move is to instrument this application. I could instrument each individual Lambda separately, or, at the higher level, I can use this context menu to enable Application Signals for all the components inside this application. So here I just click enable on all the Lambdas.
Instrumentation Benefits: From Operational Audit to IDE-Integrated Troubleshooting
What happens is that this enablement adds an OpenTelemetry layer to these Lambda functions. It does not add the agent, only the SDK, the client side that talks to the OTLP endpoint. The benefit is that you're not spending resources on an agent, and there's no agent to increase the Lambda cold start. The enablement takes roughly five minutes, so I already have a carbon copy of that ticketing service over here that's already instrumented.
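Under the hood, the enablement amounts to attaching a layer and a wrapper environment variable to each function. Here is a hedged boto3 sketch of roughly what that looks like; the layer ARN and function name are placeholders (real Application Signals layer ARNs are region- and runtime-specific), and in practice the console's enable button does all of this for you.

```python
# Hedged sketch of what enablement roughly does to one function: attach an
# ADOT layer (SDK only, no agent) and set the standard wrapper variable so
# the runtime is auto-instrumented. ARNs below are placeholders.
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

fn = lambda_client.get_function_configuration(FunctionName="submit-ticket")
layers = [layer["Arn"] for layer in fn.get("Layers", [])]
env = fn.get("Environment", {}).get("Variables", {})

lambda_client.update_function_configuration(
    FunctionName="submit-ticket",
    Layers=layers + [
        "arn:aws:lambda:us-east-1:123456789012:layer:AWSOpenTelemetryDistroPython:1"
    ],
    Environment={"Variables": {
        **env,
        # The wrapper script shipped in the layer bootstraps OpenTelemetry
        # before the handler runs (the usual ADOT layer convention).
        "AWS_LAMBDA_EXEC_WRAPPER": "/opt/otel-instrument",
    }},
)
```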
Now when I look at the instrumented version of the application, I see more details. I can see that this is the problematic submit ticket Lambda, but the map also brought in another Lambda and this SQS queue, showing how they communicate asynchronously, along with the tickets DynamoDB table these Lambdas use. So I get a much better picture of what is actually happening in my system.
After instrumentation, the operational audit also becomes available. At a high level, the operational audit automates what you would otherwise do manually: finding traces and exceptions.
The audit identifies exceptions and surfaces them, so you can see exactly what's happening without diving into logs or doing any complicated research. You see precisely: this Lambda called SQS, and here is the exception message saying you're trying to submit too big a message to the queue and it won't accept it. That's how we got this exception.
Now I also want to show that this audit is available from your dev environment, your IDE. In this case I'm showing Kiro, but VS Code with an MCP client would do exactly the same thing. I have the CloudWatch Application Signals MCP server pre-installed here; you can get it from AWS Labs, under the MCP folder, where the CloudWatch Application Signals MCP server lives. The IDE now talks to the telemetry that captures the state of my production environment, so I can ask questions like: which services are breaching SLO?
It goes and scans that map, the list of services, figures out which ones are breaching SLO, and reports back to me. I can also work with individual services: I can ask, what's wrong with that SubmitTicket service? For SubmitTicket, it executes a public API, one you can call yourself as well, that runs the operational audit. And from that audit it reaches the same conclusion we just saw on screen: you're trying to submit messages bigger than the queue can accept.
The benefit is that you can do your root cause analysis and troubleshooting from the IDE and instantly connect your code with what's happening to your services in production. Because you've loaded the code base into the IDE, it connects the stack trace it found in telemetry to your code, tells you which line is causing the problem, and suggests remediations or fixes to mitigate it. Okay, so now, let's go back.
Custom Grouping Strategies: Organizing Services by Application, Tier, and Team Ownership
And talk about how to customize this map. What you're seeing right now is a best-effort grouping of all services into what we call related services. Many applications have an entry point such as a load balancer, a front-end service, or an API Gateway. That's what we discover and show: for example, this appointment service with its microservices and how they relate to each other. Some components aren't connected to anything and stand on their own.
But we also let you customize this map. By default, you can also see all services grouped by environment, using the OpenTelemetry deployment environment attribute, so you can differentiate production from test, performance, and so on. That way you can regroup all your services by environment. In addition, we have custom grouping.
It's a little bit involved, so bear with me for a moment. Let's say I want different groupings of all the resources and services I have across multiple accounts. I create those groupings over here.
I want to control what composes my application, so I've created a grouping by application. Another dimension that matters to me is tiering: tier one versus tier two, how important a service is to my business. And then team ownership: I can assign services to the teams that operate them. That's the left side.
On the right side, I'm specifying AWS tags. These are the tag names. So if I have a tag named Application, or whatever tag name I configured, on an API Gateway, the value that tag carries becomes the name of the application. The same goes for tier and team. These are just AWS tags that I'm saying should be used to create groupings.
In addition to AWS tags, we also support OpenTelemetry resource attributes; I've brought them in here. Resource attributes are extra key-value pairs you can define for your services when you instrument your code with OpenTelemetry. You can use them as an alternative to tags and do essentially the same thing: specify that this microservice belongs to, say, the Pet Clinic application.
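Here is a hedged sketch of both annotation routes just described: tagging a resource with AWS tags, and baking resource attributes into the telemetry. The tag names mirror the demo (Application, Tier, Team); the attribute keys on the OpenTelemetry side are invented for illustration, so use whatever keys your grouping configuration expects.

```python
# Hedged sketch of both grouping inputs. Tag names mirror the demo; the
# OpenTelemetry attribute keys are invented for illustration.
import boto3
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Route 1: AWS tags on a resource (here, a placeholder Lambda function ARN).
boto3.client("lambda", region_name="us-east-1").tag_resource(
    Resource="arn:aws:lambda:us-east-1:123456789012:function:submit-ticket",
    Tags={"Application": "TicketingSystem", "Tier": "tier-1", "Team": "platform"},
)

# Route 2: OpenTelemetry resource attributes baked into the telemetry itself.
resource = Resource.create({
    "service.name": "visits-service",
    "deployment.environment": "production",  # drives the environment grouping
    "application": "PetClinic",              # custom grouping keys (invented)
    "tier": "tier-1",
    "team": "clinic-backend",
})
trace.set_tracer_provider(TracerProvider(resource=resource))
```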
Once that is configured, give it some time, and as telemetry is ingested into CloudWatch, the Application Map forms these custom groupings. For example, this is the grouping by application, because I annotated my resources with AWS tags and my telemetry with resource attributes carrying these values. And here's an example of tiers.
I can also chain these groupings. Say tier one services matter most to me: I see them all, some in red, and I want to work on tier one first. I go inside and it's quite complicated, so for the second level I group by application. It turns out there are three applications in tier one: Payment, Pet Clinic, and Ticketing System.
Then I can go inside and inspect one of them, say Pet Clinic, or go all the way back up and regroup by application to see what's available in this account. So let's dive into Pet Clinic. And by the way, if you have any questions, please feel free to ask; let's make this interactive.
What we're trying to achieve here reflects what customers told us: I want the map to guide me when I'm paged, for example. The map highlights, say, four services in red that are breaching their service level objectives. Besides that, every component shows a health circle with a breakdown of server faults, user errors, and successful requests, so you can gauge where the problems are.
First, look at what's highlighted in red. Second, if you didn't set up any service level objectives, you can still gauge from the circle: more red means more problems, essentially. You can instantly inspect each component and get the operational audit. In this case I get a whole bunch of audit cards telling me about a problematic dependency, an unhandled edge case, and so on. And this one is interesting: one of the auditors tries to find an outlier, meaning a resource such as a pod, a node, or an EC2 instance that contributes the most to the errors or high latency in your application. This audit card finds exactly that scenario and suggests recycling the resource.
Complete Observability: Integrating Synthetic Monitoring, Mobile RUM, and Transaction Search
Knowing what's happening on the server side is all well and good, but we still don't have full visibility into what the client experiences; server-side instrumentation only takes us so far. For that, CloudWatch suggests two techniques. One is synthetic canaries. Here's an example of a set of synthetic canaries that run continuously, executing APIs and validating that the responses match the expected results. And you can see they're not always succeeding; there's only a 96% success rate. But there's a reason they're called synthetics: they don't fully represent the actual customer experience.
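For reference, here is a hedged boto3 sketch of registering such a canary. The bucket, key, role, handler, and runtime version are placeholders; in particular, pick a runtime version currently supported in your region.

```python
# Hedged sketch: a canary that exercises an API every five minutes.
# Bucket, key, role, and runtime version are placeholders.
import boto3

synthetics = boto3.client("synthetics", region_name="us-east-1")

synthetics.create_canary(
    Name="orders-api-check",
    # The zip is expected to contain the canary script whose handler calls
    # the API and asserts on the response (placeholder bucket/key).
    Code={
        "S3Bucket": "my-canary-code",
        "S3Key": "orders_api_check.zip",
        "Handler": "orders_api_check.handler",
    },
    ArtifactS3Location="s3://my-canary-artifacts/orders-api-check",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/CanaryExecutionRole",
    Schedule={"Expression": "rate(5 minutes)"},
    RuntimeVersion="syn-python-selenium-4.0",  # use a currently supported version
)
```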
To measure the real experience, CloudWatch has Real User Monitoring. Real User Monitoring is essentially a set of APIs plus client-side SDKs you can use to instrument a website. This example is a monitor, essentially JavaScript included with the website, that continuously measures and reports back home on how pages load, how fast they are, whether there are any JavaScript errors, and so on. And what we just recently launched is similar client-side instrumentation for Android and iOS applications.
That way, when you look at this Application Map, you see not only how your services are performing but also what your customers are experiencing. From this picture, the health status circle tells me the Android customers are having a really bad day: too many crashes, plus server faults. These graphs support that, showing how those issues have occurred historically.
At this point I'd like to troubleshoot further and go to the Real User Monitoring part of the AWS console. Here, Real User Monitors in Application Signals are connected to the service these client applications hit first, in this case the Pet Clinic frontend. I have three Real User Monitors for this microservice: one for the website, one for iOS, and one for Android. From the details I can tell which pages or screens are having issues in the Android application. This is the number of crashes, and the Owner Details Activity has the most problems.
To troubleshoot even further, I can first use Android Studio, which I have preloaded, to validate whether this is actually happening. I run a simulation of the mobile app showing the list of all the owners, and when I click on an individual owner, the application essentially resets, so I can see it's crashing on that particular screen. To root-cause the issue, I can click any data point on the crashes graph and get a list of the sessions that ended in application crashes; this data is transferred from the mobile app to the AWS RUM service. From here I can spot-check: I pick a session and get taken to the RUM console, where I can look at the exception and its stack trace. Then I take that stack trace back to Android Studio, analyze it, and go straight to the line causing the issue so I know where to implement the fix.
Enabling AWS RUM for mobile, and I'll show this for Android, is very straightforward. There's one dependency to include in your build file, the AWS Distro for OpenTelemetry packaged by Amazon; we picked out all the pieces of OpenTelemetry needed to support this functionality. You also need to create a monitor and specify in the configuration file that this application will be recorded through that RUM monitor, along with the region the telemetry should flow to. That's the whole setup.
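The monitor-creation step can also be scripted. Here is a hedged boto3 sketch using the RUM CreateAppMonitor API; note that this API predates the mobile launch and the mobile-specific options are newer, so the exact parameters for an Android or iOS monitor may differ from this web-style example.

```python
# Hedged sketch: create the monitor the mobile app's config file points at.
# Parameters reflect the web-era CreateAppMonitor API; mobile-specific
# options may differ, so check the current documentation.
import boto3

rum = boto3.client("rum", region_name="us-east-1")

response = rum.create_app_monitor(
    Name="petclinic-android",
    Domain="petclinic.example.com",  # required for web monitors; placeholder
    AppMonitorConfiguration={
        "AllowCookies": False,
        "EnableXRay": True,        # correlate RUM sessions with backend traces
        "SessionSampleRate": 1.0,  # record 100% of sessions
    },
)
# The returned monitor ID and the region go into the app's RUM configuration.
print(response["Id"])
```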
Going back, one thing I want to summarize is that the Application Map acts as a guiding tool. Think of it as a dashboard, but an interactive one. From this dashboard you're not just looking at metrics and graphs; you're looking at your live system, how it performs right now or historically. You can check different times: if you're investigating an event from last night or two weeks ago, change the time frame and the map will show how your system behaved back then. You see not only how your services and resources are responding to requests but, with mobile RUM, the end-user experience as well, so you can oversee your entire distributed system from start to end.
Once instrumented, Application Signals connects metrics, logs, and traces together. In Lambda it's pretty straightforward: there are Lambda logs, traces, and metrics. For Amazon EKS, say, it's a little more involved, but still relatively straightforward; in fact, traces in CloudWatch report and link which log group holds the corresponding logs. Let me show an example. When I'm troubleshooting a service, say this microservice, I have a set of operations, and this operation is obviously failing. This is where the sidebar connects all the other telemetry: on the left you see the metrics, here the spans corresponding to the metric, and this is the link to the application logs.
The same linking works for EKS application logs. And if Application Signals can't connect the dots, you can customize your trace attributes and specify which logs to look at when that trace is emitted. We also have the Transaction Search feature I'd like to show you: we take your transactions, in the form of traces, convert them, and store them as logs, so you can search for anything you need.
An ideal use case: you're a support person and a customer calls saying their order isn't going through. All you need to ask for is their order number. Go to the Transaction Search page, enter the order ID, something like 123, and the moment you hit enter it retrieves all the traces, all the transactions, from, say, the last three hours wherever that order ID 123 was used. It turns a needle-in-a-haystack problem into a simple search.
Because Transaction Search takes your traces and stores them as logs, you can perform all the operations you'd typically perform on logs. That's what the feature offers, and Alex can walk you through it. Transaction Search also benefits from discovered services: it suggests which microservices and APIs you can search through and helps you identify which fields you can specify, perhaps your request ID or your order ID, to narrow the search. The point is that traces in Application Signals are stored as logs, which means you can use log analytics to reason about your application's behavior.
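To make the order-ID example concrete: because spans land in a log group, an ordinary CloudWatch Logs Insights query can find them. In this hedged sketch, the aws/spans log group name and the attributes.order.id field are assumptions; match them to where your Transaction Search spans are actually stored and what your instrumentation annotates.

```python
# Hedged sketch: find every transaction touching order ID 123 in the last
# three hours. Log group name and field path are assumptions.
import time

import boto3

logs = boto3.client("logs", region_name="us-east-1")

query = logs.start_query(
    logGroupNames=["aws/spans"],            # assumed Transaction Search log group
    startTime=int(time.time()) - 3 * 3600,  # last three hours
    endTime=int(time.time()),
    queryString=(
        "fields @timestamp, traceId, name, durationNano "
        '| filter attributes.order.id = "123" '
        "| sort @timestamp desc"
    ),
)

# Poll until the query completes, then print the matching transactions.
while True:
    result = logs.get_query_results(queryId=query["queryId"])
    if result["status"] not in ("Scheduled", "Running"):
        break
    time.sleep(1)

for row in result["results"]:
    print({f["field"]: f["value"] for f in row})
```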
A follow-up question: if I've already instrumented for X-Ray, do I need to do it again for OpenTelemetry? No, with one caveat. When X-Ray originally launched it had its own SDK, which we put on a deprecation path quite a while back. Some people still use that X-Ray SDK, but a few years ago we announced that going forward, the OpenTelemetry SDK is the one X-Ray will support, and we call our build of it the AWS Distro for OpenTelemetry, or ADOT. You can assemble it yourself as well, since it's all in public repositories; we simply prebuild it so all the necessary pieces are in one place.
This QR code has a bunch of links. Everything Alex talked about is available in the One Observability Workshop, and we have an Application Signals demo as well, so you can try these features interactively without creating any accounts. We also have a YouTube show, the Cloud Operations Show, that runs every two weeks on Thursdays; the recorded videos are available there. And we maintain an Observability Best Practices guide, built from talking to hundreds of customers, identifying patterns, and documenting that information, so please take a look. If you have any interest or follow-up questions, please reach out; these are our LinkedIn handles, so feel free to connect. Thanks, everyone.
; This article is entirely auto-generated using Amazon Bedrock.