AWS re:Invent 2025 - AI-Powered Observability & Observability for AI: The Two Sides of the Coin - AIM206

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - AI-Powered Observability & Observability for AI: The Two Sides of the Coin - AIM206

In this video, Anand from ManageEngine explores two critical perspectives of AI in observability: AI-powered observability that helps manage complex digital infrastructures, and observability for AI itself to ensure AI systems remain reliable. He illustrates how AI correlates millions of events across networks, applications, and infrastructure to predict failures, using examples like Black Friday e-commerce scenarios and AI-driven auto-scaling incidents that can lead to costly cloud bills. Anand emphasizes measuring key metrics including drift, prediction rate, confidence levels, and cost alongside traditional monitoring metrics. He concludes that the future requires humans and AI working together, with ManageEngine's 60+ product suite following this holistic approach to enterprise IT management.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part


AI-Powered Observability: Managing the Modern Digital Jungle

Hey, folks. Hope you're all having a great day. It's been a very buzzing morning; I think there are a lot of sessions going around. Just to give a quick introduction, I'm Anand, and I'm part of the leadership at ManageEngine, which is one of the major divisions of Zoho Corporation. Today, I'm here to talk about one of the most important and engaging aspects that impact the day-to-day operations of most modern digital businesses, and that's nothing but AI. Here's the interesting part: it has two sides, two perspectives, and only when you bring them together and introspect on them will you get the real value out of it.


Let me not begin with a definition to bore you, but with a question. If AI is the watcher, who's really watching AI? This has stayed with me for quite some time. If you look at the systems here, with so many booths running and so much networking going on, they are no longer small or simple. That's pretty clear. What did we do? We handed over some of the jobs to AI, to detect the anomalies that are happening, maybe a Wi-Fi connection problem at some booth, and even fix them to a certain extent, automating them in an autonomous manner. But on the flip side, if you look at this: what happens if the AI itself goes wrong?

That is something far more interesting. And when it does, in most situations, it doesn't give you a big, alarming red flag. It fails very quietly. This is the reason we always think about this at ManageEngine, across a lot of products in the ecosystem, and we keep introspecting. I wanted to share my thought process on both sides of the story. On one side, it'll be about how observability is powered by AI and how that helps us manage the chaos. On the other side, it's about how observability for AI itself is required, and is even more crucial, so that we can keep AI itself in check. Along the way, I will try to share some exciting stories, or usual stories that you may have already heard but with the AI angle, and I'll present my thought process and keep this as engaging as possible.


Let me just rewind for a moment. I think we have a lot of veterans over here. If you take yourself maybe a decade or two back, monitoring was pretty simple. We had a set of monolithic applications deployed in cluster or grid setups in co-location data centers, networking ran across all the applications, bytes were being captured, and you used to record a lot of things so that whenever there was a problem, you got the root cause analysis from the monitoring solutions. You went after it, tried to figure out what the problem was, sometimes took the logs, debugged it, fixed it, and life moved on.


But if you look at the current scenario, it is no longer this simple. The reason is that, apart from AWS, we have multiple cloud providers with multiple regions, and you are given the option to deploy with just the click of a button. On top of this, with containerization, you have thousands of containers spinning up and down at random, which sometimes goes out of control. Moving on from here, you have containers calling APIs, and those APIs calling other APIs in endless chains in the agentic era. Finally, you have a set of users whose behavior is to leave the app the moment it is spotted to be slow.

If you consider this entire scenario, it's no longer a system anymore. It is more of a jungle that we are actually dealing with. Full-stack observability is the kind of solution we are still trying to figure out to map this jungle. If you ask me whether we can navigate this jungle easily by creating more dashboards or more alerts or more workflows, the answer is going to be a big no. It is going to be about how efficiently we use AI to solve and automate this entire workflow as much as possible.


You can see that there are a lot of ecosystem-related problems in this scenario, and this is where AI comes in. Most people fail to realize that it acts more like a co-pilot to the humans already in the observability ecosystem. It scans millions of events flowing across networks, end-user experience, servers, and applications; it scans almost all the events, correlates them together, and clusters those signals.

Those signals help us connect the dots across every layer of the application ecosystem, and sometimes they even help us predict what might break next. That's the most important aspect we actually rely on AI for. If you check out this particular visualization, it acts like an air traffic control system for the modern digital world. There could be thousands of planes flying across the sky; you're facing a lot of flight delays right now, and a lot of cancellations are happening. But AI helps us spot exactly which flights we have to look into right now to make sure we give the customer a convenient experience at the end of the day.
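To make that correlation-and-clustering idea concrete, here is a minimal sketch in Python. It assumes events arrive as simple timestamped records and groups anything within a short time window into one cross-layer incident; the window, the event shape, and the layer names are all illustrative assumptions, and real AIOps engines use far richer, topology-aware correlation.

```python
from dataclasses import dataclass

@dataclass
class Event:
    ts: float      # epoch seconds
    layer: str     # "network", "server", "application", "end_user"
    message: str

def cluster_events(events, window=120.0):
    """Naive temporal correlation: events landing within `window` seconds
    of each other are grouped into one candidate cross-layer incident."""
    incidents, current = [], []
    for ev in sorted(events, key=lambda e: e.ts):
        if current and ev.ts - current[-1].ts > window:
            incidents.append(current)
            current = []
        current.append(ev)
    if current:
        incidents.append(current)
    return incidents

events = [
    Event(1000.0, "network", "packet loss on core switch"),
    Event(1030.0, "server", "db-1 CPU saturation"),
    Event(1055.0, "application", "checkout API p99 latency spike"),
    Event(5000.0, "end_user", "one user reports a slow page"),
]
for incident in cluster_events(events):
    print(sorted({e.layer for e in incident}), "->", len(incident), "events")
# The first three events collapse into a single cross-layer incident;
# the last one stands alone as unrelated noise.
```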


This could be a little abstract, so I'll give you an example that happens on a routine basis. I've been traveling here for a decade, so every time I come, I see Black Friday happening the week before. Just imagine you're running an e-commerce app and a marketing campaign for a discounted product. Say a Nike Pegasus Premium or an Omega Premium is going out at 50% off. Who doesn't like that? A lot of users come and bombard your app, and they all hit a single checkout problem at the point when the load becomes unimaginably huge.

With traditional monitoring systems, the situation could turn into big chaos. What you'll get is a single pane of glass, which is a common term, showing CPU, memory, and API response time. Everything will be spiking, and your operations team's tension will also be flying very high. They'll come to a single desk and collate all this data, but by the time they arrive at a solution, you will have lost all those users who were in the checkout process, and millions of dollars will be out of your bank balance at the end of the day.

How can we do better with AI-powered observability in this scenario? A solution that actually has AI-powered observability seamlessly brings together all the events, correlates them with each other, and gives you an efficient root cause analysis in a smaller time frame. That's the most important aspect, because doing it all manually takes far longer to uncover a problem than doing it with AI. That's the best part you get from this. Once you get the root cause analysis, maybe the root cause is a rogue query that runs routinely, or it can surface capacity-related issues and suggest how much you ought to provision to get out of this situation.
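As a toy illustration of that root-cause step, the sketch below ranks a few hypothetical per-component metrics by how strongly they move with the checkout latency spike. The metric names and numbers are invented, and plain Pearson correlation is a crude stand-in for what commercial correlation engines actually do:

```python
import numpy as np

# One sample per minute during the incident window (all values invented)
checkout_latency = np.array([210, 230, 520, 940, 1210, 1480])  # ms, the symptom

candidates = {                        # hypothetical per-component metrics
    "db_slow_query_time_ms": np.array([12, 15, 280, 610, 820, 990]),
    "payment_api_error_count": np.array([0, 1, 0, 2, 1, 0]),
    "cache_hit_ratio": np.array([0.96, 0.95, 0.94, 0.95, 0.96, 0.95]),
}

def rank_root_causes(symptom, candidates):
    """Rank candidate metrics by absolute Pearson correlation with the symptom."""
    scores = {
        name: abs(float(np.corrcoef(symptom, series)[0, 1]))
        for name, series in candidates.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for name, score in rank_root_causes(checkout_latency, candidates):
    print(f"{name}: {score:.2f}")  # the rogue DB query should rank first
```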


When AI Fails Quietly: The Critical Need for Observability for AI

But wait: AI is doing all these jobs for you, but what happens? The root cause analysis itself can drag you into a different problem statement. We used to face this quite often; we have seen these problems happen in our own production ecosystem inside Zoho as well as ManageEngine. It will give us an anomaly or a false prediction that takes the entire operations team in a completely different direction. It's more like a ghost hunt, if you look at the situation. They will chase one particular problem statement, and the scariest part here is that when AI fails, it fails with utmost confidence. It makes you believe that it's always right.


I think most of you have used ChatGPT and believed its responses. In hindsight, you may have had a different assumption of your own, but you won't question ChatGPT's assumptions, because it's working with a huge amount of data to give you the right answer. So you keep believing it. Now imagine this side of the story as well. I use a lot of weather apps; on a routine basis, I rely on them to book meetings and travel across zones to meet customers.

Just imagine that I need to go to New York. Heavy snowfall in New York at this time of year is pretty much a given. Say I plan almost a week ahead, and I'm going to New York in the third week of December. My weather app states that there is not going to be any snowfall in New York. The problem is, I take a flight, land, check in, and wait two or three days to meet my customer, but suddenly there's a huge snowfall. Snow keeps piling up, machines are out trying to clear it, and my entire week is gone without meeting the customer, because they cannot come out and I cannot get to their place.

And that is a very big problem statement. Now, is the weather app wrong? No; weather apps have been in use for more than a couple of decades, and I can't question that. The problem here is that the model used to forecast the weather pattern could have gone without any update, or the API the weather app relies on to collect the data that updates its models may have gone out of service. All these details are hidden gems, or hidden dragons, that do not get uncovered in most scenarios, and this is where we need observability for AI as such.


So what do we mean by observability for AI? Just as we monitor CPU, memory, and all those infrastructure-related metrics to make sure everything is healthy and sane, we have to keep a close eye, a third eye, on all the AI-related workflows and their behavior in a continuous manner. If AI is part of your tech stack, you need the same level of visibility and data collection that you already have for the rest of your tech stack, all the time.
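In practice, the starting point can be as simple as wrapping every model call in the same telemetry you already emit for servers. Below is a minimal sketch; the model interface (returning a label plus a confidence score) and the 0.6 alert threshold are assumptions for illustration, not any particular product's API.

```python
import json
import logging
import time
from datetime import datetime, timezone

log = logging.getLogger("ai_observability")

def observed_predict(model, features, model_version="v1"):
    """Wrap a model call so every prediction emits the same kind of
    telemetry we already collect for servers: latency, outcome, confidence."""
    start = time.perf_counter()
    label, confidence = model.predict(features)  # assumed (label, score) interface
    log.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        "prediction": str(label),
        "confidence": round(confidence, 4),
        "low_confidence": confidence < 0.6,  # assumed alerting threshold
    }))
    return label, confidence
```

Shipping these records into the same pipeline as your infrastructure metrics is what lets you alert on the AI layer with the tools you already trust.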


So how do we do this? I will take you through a classic example of how infrastructure auto-scaling is done nowadays in most scenarios. Consider that your operations team has created a blueprint for AI-driven auto-scaling across your entire Kubernetes ecosystem, and they have run it multiple times. It is bulletproof, stress-tested across multiple overnight scenarios. Your operations team has been loaded with work for the entire Black Friday. They are exhausted and want to take the Thanksgiving weekend off without being taxed. So they turn on the auto-scaling mode, enable this blueprint, and go on vacation for those couple of days.

All of a sudden, at some point, a lot of resources get provisioned by this auto-scaling bot, due to some pattern being misread by the AI as an auto-scaling scenario. And with this particular spike, if you consider the pattern, sometimes there could be periodic spikes within a single hour. There is a concept in the cloud ecosystem where, if you rotate resources continuously within a particular hour, you get charged for the entire hour for each of them. So say there are 10 auto-scaling incidents happening within an hour, and it continues for 48 hours. You're going to end up with 480 auto-scaling events by the time your operations team gets back on Monday morning. They'll be staring at a huge cloud bill, and I pity them.
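To put rough numbers on that bill, here is a back-of-the-envelope calculation. The $0.50 hourly rate is invented, and it assumes hour-granular billing as described above (many cloud SKUs now bill per second, so check your own pricing):

```python
# Flapping vs. sustained capacity under per-hour billing (all rates invented)
rate_per_hour = 0.50                  # assumed instance price, USD/hour
scale_events_per_hour = 10            # each short-lived instance bills a full hour
weekend_hours = 48

flapping_cost = scale_events_per_hour * weekend_hours * rate_per_hour
sustained_cost = weekend_hours * rate_per_hour  # one right-sized permanent instance

print(f"{scale_events_per_hour * weekend_hours} flapping events: ${flapping_cost:.2f}")
print(f"one sustained instance: ${sustained_cost:.2f}")  # 10x cheaper here
```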

So what could we have done differently with observability for AI here? We could have identified this anomalous pattern from the AI-based auto-scaling system earlier, merged that context with traditional monitoring data to recognize it as a sustained peak, and provisioned systems to run on a permanent basis. At the end of the day, we could have saved a lot more on the cloud bill.
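A guardrail for this can be almost embarrassingly simple: treat the auto-scaler's own behavior as a first-class monitoring signal. A sketch, where the per-hour threshold is an assumed policy knob rather than any vendor default:

```python
from collections import Counter

def flapping_hours(scaling_events, max_events_per_hour=4):
    """Observability for AI as a guardrail: if the auto-scaler fires more
    than `max_events_per_hour` times in any hour, it is probably flapping
    on a sustained peak, so pause the blueprint and page a human."""
    per_hour = Counter(int(ts // 3600) for ts, _direction in scaling_events)
    return sorted(hour for hour, n in per_hour.items() if n > max_events_per_hour)

# Hypothetical stream of (epoch_ts, direction) auto-scaling events: ten per hour
events = [(t * 360.0, "up" if t % 2 == 0 else "down") for t in range(20)]
if flapping_hours(events):
    print("Auto-scaler is flapping: freeze the blueprint and alert on-call")
```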


So this takes us to the heart of the subject. Measuring what really matters has got to be the agenda when you're taking on both sides, AI-powered observability and observability for AI. The first side is the classic one: observability metrics such as CPU and memory. I've already repeated those; you already know about them, and they help us make sure the system is healthy and available all the time.

The second part is something new that you've got to be a lot more interested in. It's the drift, prediction rate, confidence levels, and, last but not least, cost. These metrics help us make sure we can continue trusting AI to make decisions for us. And we should also consider whether, at scale, AI will run at a rate we consider sustainable. So sustainability is one more factor we need to weigh in the case of AI.
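Of these, drift is the least obvious to implement, so here is a minimal sketch of one common measure, the Population Stability Index, comparing the distribution a model was trained on with what it sees in production. The bin count and the alert bands are conventional rules of thumb, not hard standards:

```python
import numpy as np

def population_stability_index(baseline, live, bins=10):
    """PSI between the training-time distribution and live traffic.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant."""
    edges = np.percentile(baseline, np.linspace(0, 100, bins + 1))
    live = np.clip(live, edges[0], edges[-1])   # fold outliers into edge bins

    expected = np.histogram(baseline, bins=edges)[0] / len(baseline)
    actual = np.histogram(live, bins=edges)[0] / len(live)

    # Floor the proportions so empty bins don't blow up the log term
    expected = np.clip(expected, 1e-6, None)
    actual = np.clip(actual, 1e-6, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

rng = np.random.default_rng(42)
baseline = rng.normal(200, 30, 10_000)  # e.g. request latency the model learned on
live = rng.normal(260, 45, 2_000)       # shifted production traffic
print(f"PSI: {population_stability_index(baseline, live):.3f}")  # well above 0.25
```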


The Complete Picture: Integrating Both Sides of AI Observability at ManageEngine

This is one of the reasons we always feel FinOps is one of the major players in this world, as we all know we are never blessed with infinite budgets. Now, only when you bring these two sides, the two different paths I spoke about, together as a single coin, and consider these two aspects as a single entity, will you begin to see the real value. AI-powered observability gives us speed, scale, and foresight in most situations, and observability for AI gives us trust, reliability, and accountability.


Together you get the full picture of observability, and it's complete only when you consider both sides, because going forward, it's never going to be about infrastructure and applications alone. It's going to be infrastructure, applications, and AI, unified into a single intelligent system. This is the expectation we follow at ManageEngine, and we have built our IT suite with this exact thought process.

We use AI to seamlessly integrate and correlate all the signals across applications, users, and infrastructure in order to cut the noise for enterprises and help them focus on exactly what's needed. With a comprehensive suite of more than 60 products focused on covering all aspects of enterprise IT management in an integrated way, we follow a holistic approach to AI adoption, where we bring together all the distress signals from across the ecosystem and create value in a unified manner.

So let me leave you with this thought process for the day. The future of observability is not going to be about humans versus AI. It's going to be more about humans working with AI as active enablers. At the end of the day, we all know that we cannot fix what we don't measure, and AI is going to need the help of humans to measure what's really needed.

So I hope you got some valuable information today. Thanks for coming over. I'd be more than happy to meet you guys at our ManageEngine booth, which is at 147. Have a great week ahead.


This article is entirely auto-generated using Amazon Bedrock.
