Kazuya

AWS re:Invent 2025 - Intelligent Observability & Modernization w/ Amazon OpenSearch Service (ANT315)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Intelligent Observability & Modernization w/ Amazon OpenSearch Service (ANT315)

In this video, Sohaib Katariwala and Joshua Bright demonstrate how Amazon OpenSearch Service addresses observability challenges for modern microservices architectures. Using a fictional e-commerce company "Any Company," they showcase the complete observability stack: data ingestion via OpenTelemetry and Amazon OpenSearch Ingestion, storage with tiering capabilities, and analysis through the redesigned OpenSearch UI. Key highlights include enhanced Piped Processing Language (PPL) with doubled capabilities including joins and time analysis, streamlined workflows consolidating querying and visualization in the Discover experience, and AI-powered features like natural language queries and result summarization. The session culminates with a compelling demo of agentic AI using Model Context Protocol (MCP) and Amazon Q Developer CLI, where an AI agent autonomously investigates a Black Friday checkout failure by querying logs, traces, and metrics across multiple indexes, identifies database timeout as the root cause, calculates business impact using sales data, and generates comprehensive remediation plans—all within eight minutes, representing a 70-80% reduction in incident resolution time compared to manual investigation.


Note: This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Any Company's Observability Challenges in a Microservices Architecture

Welcome everybody. My name is Sohaib Katariwala. I'm a Senior OpenSearch Specialist with Amazon OpenSearch Service, and joining me today is Joshua Bright, Product Manager within the OpenSearch team. Today we're going to be talking about Intelligent Observability and Modernization with Amazon OpenSearch Service.

Thumbnail 30

The takeaways for today's session are first we're going to start with using Amazon OpenSearch Service for observability. Then we're going to go into the new analytic features that have been launched over the past year, and finally we're going to talk about agentic AI for observability, so you want to make sure you stick around for that one.

Thumbnail 40

But before we dive into the technical details, let's first talk about Any Company, a fictional company. Any Company is a fast-growing e-commerce platform that helps small and medium businesses create their own marketplaces. The goal is they're starting to become more and more popular with businesses looking to compete with major retailers. They have an architecture that consists of modern microservices, so their front end is React-based with a customer portal and a vendor dashboard. Their back end services include order and payment management, inventory management, and authentication. Their infrastructure is containerized services on AWS using EKS, and they are also using databases like RDS and DynamoDB for their data storage.

Thumbnail 90

However, they have some challenges with this architecture. The first challenge we're going to call microservices complexity. This is where Any Company's distributed architecture creates visibility challenges across the many microservices they have. When checkout failures spike during flash sales, teams are struggling to find whether it's a payment service that has an issue, the inventory service that has an issue, or some sort of connectivity problem. The second challenge we're going to call data pipeline chaos. With multiple data sources, databases, and data formats from different services coming in, the administrators and engineers face constant challenges making sure the pipelines are calibrated and performing up to par.

The third challenge is developer productivity. DevOps teams spend hours manually looking through logs and correlating issues across traces and metrics, and they need to write complex queries to understand what's going on and build custom visualizations for the different stakeholders that want to know what's going on across their different applications. This takes a lot of time for these developers to be productive. The fourth challenge is cost management. The observability data volume continues to grow and explode. For Any Company, over the past year it's grown over 300 percent, and they're continuing to scale, which makes traditional monitoring solutions cost prohibitive. Are any of you facing any similar challenges in your organization? I see some hands and some head shaking, so this is what we hear from a lot of our customers.

Thumbnail 210

Thumbnail 220

We've compiled these things from all the customer conversations that we have and we're representing it through this fictional company called Any Company. Now, what we see is the common challenges that customers tell us every day. Most modern applications like the e-commerce platform that Any Company has are distributed across many services and microservices. With these many services and microservices, the visibility into them can be extremely low, especially for inter-microservice interactions and interaction with other AWS services. When Any Company's checkout failures spike during their Black Friday traffic surge, the engineering team was left wondering what was the issue. Was it a bug in the checkout service? Was it a payment API failure? Was it a database connectivity issue? What's really going on? The reality is that these decoupled code and services are really hard to diagnose.

Thumbnail 260

Each of Any Company's components emits signals through logs, metrics, and traces, also called telemetry. Their checkout service logs errors, their payment service tracks transaction metrics, and so on. This creates many different signals, and manually correlating them to find the needle in the haystack becomes very challenging. This is where teams get stuck during a crisis. Recently, for example, Any Company's main website went down and engineers had to manually grep their logs, spending hours trying to find what was happening across their distributed system. To remediate these failures effectively, you need analysis of interactions and code across all these distributed components. This is where unified observability becomes critical and where a comprehensive approach is needed.

Thumbnail 330

Understanding Observability Platforms and AWS Monitoring Services

An observability platform transforms how Any Company's teams can monitor, troubleshoot, and optimize their modern cloud-native applications. So what is an observability platform and why is it useful? To recap, an observability platform collects information from your entire system in real time to help find and resolve unexpected or unknown issues. It helps builders and administrators efficiently detect, investigate, and remediate issues.

Usually this is done by creating insights over telemetry such as metrics, logs, and traces. An observability platform offers developers the ability to understand applications better and provides tools to analyze root causes in the event of failures. As you know, observability is a very important part of a workload, just like other things such as scalability and loose coupling. Observability is critical to get right.

Thumbnail 380

AWS as a whole provides many choices in monitoring and observability services that you can use to collect, store, investigate, and alarm on data from your infrastructure and your applications. Together, these services complement each other by providing insights and analytics using predefined instrumentation and visualizations. For example, Amazon CloudWatch is a service that monitors applications and responds to performance changes. It optimizes resource use and is useful for simpler real-time performance monitoring of your AWS environment.

You also have Amazon Managed Service for Prometheus, which is a managed monitoring and alerting service that provides data and actionable insights for container environments deployed at scale. Then there is Amazon OpenSearch Service, which allows for deep log analysis for complex search needs across logs, metrics, and traces, especially for longer-term data storage needs.

Thumbnail 460

Any Company has a very complex tech stack with many components that require deeper analysis capabilities and longer-term storage. This is what makes Amazon OpenSearch Service a great fit for Any Company. Today, we are going to focus on achieving a world-class level of data-driven insights using Amazon OpenSearch Service and all the exciting new enhancements and features that have been launched to make achieving this full-stack observability easier than ever.

OpenSearch Ecosystem: A Robust Foundation for Observability

OpenSearch itself is a community-driven open-source platform which is extremely versatile and covers many use cases that include lexical search, vector search, semantic search, and observability. Thousands of customers trust OpenSearch with their production workloads. In 2024, the OpenSearch project became part of the Linux Foundation under the newly formed OpenSearch Software Foundation, fostering open collaboration for search, analytics, and observability. Many companies have already joined to continue supporting the open-source project and the ecosystem.

Thumbnail 520

The OpenSearch project continues to grow with 1.3 billion downloads and more than 3,400 active contributors, now at 28 releases. Many members and contributors are joining, so if you are interested in contributing back, this is a great community to get involved with.

Thumbnail 550

Thumbnail 560

Amazon OpenSearch Service is an AWS managed service that lets you run and scale OpenSearch clusters without having to worry about managing the monitoring and maintenance of the infrastructure or having the expertise in operations of managing the cluster. That is what Amazon OpenSearch Service gives you.

OpenSearch and OpenSearch Service form a robust ecosystem of tools that makes it easy and fast to build an observability platform. To ingest, filter, transform, enrich, and route the data from your applications to an OpenSearch domain or serverless collection, you can use Amazon OpenSearch Ingestion, a feature of OpenSearch Service built for ingestion. To store the data, you can use either an open-source OpenSearch cluster that you have deployed and manage yourself, or, what we recommend, Amazon OpenSearch Service, which takes the administration out of deploying it yourself. To take it even further, you can use Amazon OpenSearch Serverless, which is completely zero administration, scales automatically, and you pay for what you use.

Thumbnail 640

Finally, you can use OpenSearch Dashboards to debug applications, build visualizations, and analyze application behavior. OpenSearch Dashboards is a purpose-built user experience to get insights out of your data. Now I'm going to go into each of these pieces of the architecture and talk about how Any Company can use OpenSearch Service to accomplish their goal of addressing those challenges that we talked about earlier.

Thumbnail 650

Thumbnail 670

Building the Data Pipeline: From Collection to Storage with OpenTelemetry and OpenSearch Ingestion

First, we'll talk about how Any Company can start collecting signals from their microservices and pieces of their infrastructure. Your applications, just like Any Company's applications, can be either AWS native or custom applications with infrastructure such as databases, containers, and virtual environments. Any Company collects these signals by running what are generally called agents. Agents are basically small software processes that run next to your application or in parallel to the containers to collect metrics, logs, and traces and export them forward to the observability solution.

Thumbnail 690

One of the most popular mechanisms for collecting the data that AWS customers are using is OpenTelemetry. It's a set of vendor-agnostic SDKs and libraries to instrument your applications. It supports collection of logs, metrics, and traces, and due to its popularity, many vendors have started to support it. AWS also offers AWS Distro for OpenTelemetry, which natively supports many AWS services and can collect telemetry from them and store it in native AWS solutions like OpenSearch Service.

Thumbnail 730

Thumbnail 760

Now Any Company sends their data from these collectors into Amazon OpenSearch Ingestion, a feature of OpenSearch Service that gathers the data and buffers it using its built-in buffering capability. Before writing it to OpenSearch, you can transform the data, so Any Company can format it the way they would like and then forward it to Amazon OpenSearch Service. They also enrich their data and can parse logs and metrics during ingestion using OpenSearch Ingestion.

Thumbnail 810

Now that the data is collected and transformed, OpenSearch Ingestion pipelines write this data to Amazon OpenSearch Service clusters, or domains for short, for short- to medium-term storage and analysis. OpenSearch Service also has built-in tiering that allows Any Company to retain data for longer periods of time at lower cost. You may decide to store some of the data in OpenSearch in the hot tier, then a warm tier, or directly in S3 as a cold tier and read it into OpenSearch Service on demand. All of this data has the insights that Any Company needs to debug or understand their applications.

Now that the data is stored, how do they surface it and allow their engineers and DevOps teams to get insights out of it? To surface these insights, OpenSearch Service comes with a built-in guided user experience called OpenSearch Dashboards, which is now called OpenSearch UI. We will discuss this in detail in the next section. We've seen how the data is collected, transformed, and stored; next we need to see how to measure and gain insights from it.

Thumbnail 860

Thumbnail 870

Gaining Insights with OpenSearch Dashboards and the Next Generation OpenSearch UI

Let's go back to Any Company's engineering team. They need to know exactly what to measure across their distributed architecture and how to interpret those measurements and turn them into performance improvements. The DevOps teams need to measure checkout latency and response times from the different services. How will they be able to do that? Well, they can use OpenSearch Dashboards, which is a purpose-built user experience to get the most out of your observability data. This is the landing place where you connect to OpenSearch UI and OpenSearch Dashboards. You see the dashboards that you've created. It comes preconfigured with widgets that give you the insights you need and can help you perform root cause analysis. It also offers the ability to create new visualizations with easy drag-and-drop capability. Once you're happy with your visualizations, you can embed them in your applications. It's multi-tenant, which means you can have multiple teams with access to different applications and data, ensuring the right people are looking into the right data that they're responsible for.

Thumbnail 910

Thumbnail 920

Thumbnail 940

Now analysts and developers can also use something called the Discover experience. Discover is used to explore data using a variety of different supported languages, including Dashboards Query Language (DQL), Lucene, standard SQL, and PPL (Piped Processing Language), as well as natural language queries powered by Amazon Q, which you can see in the screenshot below. When you're debugging or analyzing application behavior, you need the ability to filter, calculate statistics, and sort data to get the insights you're looking for. OpenSearch Service has a powerful Piped Processing Language which you can use to filter and measure various metrics or KPIs. For example, you can construct your desired outcome in a step-by-step manner with each step getting you closer to your desired result.
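As a sketch of that step-by-step style, a PPL query pipes each stage into the next. The index and field names below are invented for illustration, not taken from the session:

```
source = checkout_logs
| where status_code >= 500
| stats count() as errors, avg(latency_ms) as avg_latency by service
| sort - errors
| head 5
```

Each pipe narrows or reshapes the result: filter to server errors, aggregate per service, rank by error count, and keep the top five.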

Thumbnail 970

Dashboards and the Discover experience are really good if you're there looking at them and you know what to go after. But what if you're not around? What if you're not actively logged in and looking at this user experience, and you really want the observability solution to keep an eye on the application telemetry for you? OpenSearch Service has robust monitoring and anomaly detection capabilities which keep an eye on your data and send you an alert in the case of a failure or if it finds any unusual or anomalous patterns. You don't really need any machine learning experience to configure these anomaly detectors and set up alerts, which makes it really popular among Any Company's teams.

Thumbnail 1040

Speaking of alerts and notifications, you can receive these notifications in a variety of different ways or channels. You can send them to your favorite Slack channel, for example, which is where Any Company's teams route their alerts. You can also route them to mobile devices using built-in alerting and incident management tools such as PagerDuty or OpsGenie. Generally, the notification sent to these channels includes a link back to the dashboards, so developers can quickly click on that link and continue the investigation by logging into the UI.

Thumbnail 1050

The other experience we see is that microservices are very common amongst most of our customers. When you're working with microservices, there are many moving parts, and investigating all of these moving parts can be difficult. To make this simpler, we use what's called traces. Traces capture the communication between different services so that you can know what happens when a service calls another service, whether it was successful or not, and so on. OpenSearch Dashboards offers purpose-built visualizations that analyze this data. We have something called a service map with which you can see if there's any error in your application at a glance or if your applications are facing any issues with higher latency and error rates. You also have something called trace group visualization which groups related traces together into a single widget and allows you to see if a certain function in your application, for example checkout, is facing an issue.

Thumbnail 1120

With these different visualization capabilities, you can pinpoint exactly where to look, uncover the logs that cause the issues using traces, and then start analyzing those logs. As we hear from many customers, the volume of operational data they need to analyze is continuing to grow. OpenSearch Service already supports observability workloads up to 25 petabytes, and that scale will keep growing. To meet this growing scale, customers often store operational data across multiple OpenSearch Service deployments. Maybe they have multiple clusters, or maybe they have some clusters and some OpenSearch Serverless collections. Customers are increasingly asking OpenSearch to work across multiple data sources. They want the dashboard experience, but they don't want it tied to a specific cluster.

To centralize all this data management and give this view in a single place, we've launched the next generation OpenSearch UI. The next generation OpenSearch UI is an independent dashboard application designed to help customers aggregate comprehensive insights into a single unified view. It allows you to see and view data across multiple OpenSearch domains and collections. Currently, applications can be associated with multiple OpenSearch clusters, OpenSearch Serverless collections, and even other sources like direct query to S3.

Thumbnail 1210

Now that all these dashboard instances are consolidated, it becomes even more important to have a way to organize the data and the dashboards from the different sources, the different alerts, and the different saved queries. That's why we introduced workspaces in OpenSearch UI. With workspaces, you can easily create your dashboards and save them, as well as your alerts and queries, in a private space. This private space allows you to manage permissions tailored to how your team needs to share their data. Workspaces also give you a curated experience for popular use cases such as observability and security analytics, making it straightforward to build content for your use case. Workspaces also support collaborator management, so you can share your workspace only with your intended collaborators and manage permissions for each collaborator as you'd like.

What's New: Simplifying Log Analytics with Enhanced User Experience and Piped Processing Language

That was all launched last year. There have been some exciting updates and features that we've added to OpenSearch UI this year so far, and now I'll turn it over to Joshua to walk you through those new updates. Thanks. I really appreciate it. Is everybody able to hear me okay? Perfect. Before I get started talking about what's new within OpenSearch Service, I wanted to take a moment. We're on the heels of the Thanksgiving holiday, so I want to thank all of our OpenSearch customers as well as the open source community. We really appreciate your partnership and love working backwards with you, and I look forward to announcing more what's new features next year in conjunction with you.

The what's new section is going to be two parts. One of them is more of your visual deterministic analytics with OpenSearch UI, which will be my section, and the second section will be with Sohaib, who will be talking about agentic development. Let's get started. We looked into the issues that companies were calling out as problems, and we realized that it seemed like everyone was trying to solve this with more features and more complexity. We took the opposite approach. What if we made log analytics simpler, not more complex? What if we prioritized the user experience instead of adding additional features and bolting things on?

Thumbnail 1340

That's what we've done with OpenSearch UI's observability workspace. We've brought Piped Processing Language, the language that Sohaib was talking about a little bit earlier, to the forefront, and we've complemented it with AI in the form of natural language as well as a result summarization feature. Now you can query with PPL and supplement it with the natural language experience. In addition to that, we've also made it easier to ingest data from OpenTelemetry so that teams can go from their raw data to actual insights with very little effort.

Thumbnail 1440

Let me give you a concrete example of what I mean. Any Company's data admins were complaining that it takes a lot of time to set up these pipelines to ingest data into OpenSearch or into other log analytics tools. You have to configure parsers and all of your mappings, and you're debugging and playing whack-a-mole trying to figure out why data is not landing the way that you would expect. The setup is a lot of overhead and super frustrating when you're just trying to get something out the door.

Thumbnail 1480

Our approach flips this completely. We provide out-of-the-box blueprints that will get things set up for you. We cover popular AWS logs such as ALB logs, CloudTrail logs, and Lambda logs, as well as third-party logs like Jira integration, OpenTelemetry, and HTTP.

Thumbnail 1560

However, that wasn't enough. We also wanted to make the workflow fundamentally easier, so we have a new get started workflow you can see in the OpenSearch console. This new setup allows you to set up an OpenTelemetry pipeline with all the bells and whistles included. You point it at your cluster and it will automatically set up an OpenSearch UI instance for you. No more struggling to get your proofs of concept or new pipelines up; it's super easy. Any Company's teams can now get started analyzing their logs, whether that be their React front end, order processing service, or recommendation engine, in minutes instead of days.

Thumbnail 1570

Thumbnail 1600

But even when you get through the onboarding, you hit another wall, which is actually being able to utilize the tooling. Customers like Any Company told us that they have lots of queries and dashboards already set up in another tool. They don't want to relearn a new tool, a new language, or new workflows. So we made a fundamental decision. We made Piped Processing Language feel familiar. If you know pipe-delimited languages from Unix or other tools, you'll feel at home in OpenSearch Service. We aligned our syntax, our commands, and our functions to feel natural. The result is that Any Company's teams' existing knowledge becomes an asset, not a liability. Their migration, which was going to take six months to a year, is now something that can be accomplished within a few weeks.

Thumbnail 1640

Thumbnail 1720

But language is just the start. You also need the right words, or in this case commands, to express the complex ideas that you're interested in extracting. Over the past year, we've more than doubled Piped Processing Language capabilities. We've added joins and lookups to be able to join indices together. We've added comprehensive time analysis commands like timechart and eventstats that allow you to understand events over time. In addition, we've included the ability to extract unstructured data, which wasn't previously possible in OpenSearch. You can extract the data and create new fields for your analysis using regular expressions, SPL-style. This isn't about just adding more features; it's about having the right tools to ask sophisticated questions of your data. Now Any Company can correlate their checkout failures with their recommendation engine in a single query.

Let me show you what I mean. Anyone can write a query that says show me all the errors—that's table stakes. But observability isn't just about collecting data; it's about asking sophisticated questions. The real insights come when you can quickly identify errors that are occurring, quantify how large the impact is, and understand the next course of action. For Any Company, that means connecting checkout failures with their authentication service performance and payment processing latency. That's where you find the root cause, not just symptoms. Now you have all the tools that you need to gather those insights.
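As an illustration of that kind of correlating question, a single query could tie checkout errors to slow payment calls. This is a hedged sketch: the join syntax follows recent PPL releases, and the indexes and fields (checkout_logs, payment_logs, trace_id, latency_ms) are invented for the example:

```
source = checkout_logs
| where level = "ERROR"
| join ON checkout_logs.trace_id = payment_logs.trace_id payment_logs
| where payment_logs.latency_ms > 1000
| stats count() as slow_payment_errors by payment_logs.endpoint
```

Joining on a shared trace identifier is what lets one query connect symptoms in one service to causes in another.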

Any Company's teams were losing quite a few hours moving between workflows. We talked about a familiar syntax and the new commands and functions that exist. The last piece that we wanted to target was the improved user experience and streamlining. At Any Company, they were moving between the querying experience, the visualization building experience, and the dashboarding experience. What we've done is consolidate everything into the Discover experience.

Thumbnail 1840

Thumbnail 1850

Instead of moving between different areas of your logging tool, we've built out those different workflows into Discover. Now when you analyze your data, you're able to not only do that from a results perspective but also enhance and complement that with visualizations. You can very easily add that into your dashboard, capturing all of those critical things you need to set up and support your KPIs, all within Discover. Folks can stay in the flow from question to answer to action.

Thumbnail 1890

This applies to your data in OpenSearch, but certainly also applies to the data where it rests. We have integrations with Amazon S3 for historical and audit logs. Instead of piping your data from CloudWatch Logs into OpenSearch, you can analyze your data in CloudWatch Logs from OpenSearch. We also have integrations for doing security investigations with Amazon Security Lake. Let's bring this together. We built our solution on three pillars.

The first pillar is easy startup. You have easy startup with these new blueprints that allow you to get started with very common log types. We created a new get started workflow that allows you to very quickly create new pipelines and utilize the OpenSearch UI. We've made it easier to get started by building out a familiar syntax that everyone coming from pipe languages easily understands. We've also made it easy to get started by incorporating natural language prompts within the querying experience itself, so you can ask questions of your data and get that analysis back.

We also have the AI summarization feature, which allows you to understand your results. As you type in your query, it will summarize the results and provide you an understanding of what is in the results set. That's the easy startup section. The next section is that we added additional commands and functions in Piped Processing Language to really unlock insights like never before. Finally, we created a cohesive work experience. Now you no longer have to move through different workflows. You can accomplish all of your insights right within OpenSearch Discover and very easily create your visualizations and add those into a dashboard.

Thumbnail 2010

Thumbnail 2020

Thumbnail 2030

Live Demo: Investigating Errors and Creating Dashboards with PPL and AI Assistance

I'm going to move over to the demo now because there's a lot of talking, but I like to see action. We understand that there's a problem with the load generator service. We're going to query the load generator service and understand what kind of errors are coming through. I do a simple where statement which pulls all of the errors from the logs. Now we can see that the load generator service is showing up.
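A minimal version of that filter might look like this in PPL; the index and field names are assumed for illustration, not taken from the demo:

```
source = load_generator_logs
| where level = "ERROR"
```

The where command keeps only the documents matching the condition, which is the starting point for the rest of the investigation.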

Thumbnail 2050

For the next section, I want to show off the ability to pull out unstructured data. I'm going to show here that we have the rex command, which allows you to extract data using regular expressions. You can see in the table below that you have error type as well as error message. The nice thing is we have all of these different types of visualizations that you can select from right within the Discover experience. That's what I mean about having this comprehensive and cohesive experience. It's all within Discover with no need to move around to accomplish those tasks you were hoping to do before.
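A hedged sketch of that extraction: rex takes a regular expression with named capture groups that become new fields. The pattern and field names here are hypothetical:

```
source = load_generator_logs
| where level = "ERROR"
| rex field=message "(?<error_type>[A-Za-z]+Error): (?<error_message>.+)"
| fields error_type, error_message
```

The named groups error_type and error_message appear as new columns that later commands can aggregate or join on.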

Thumbnail 2090

Thumbnail 2100

Next, I would like to understand the error rate. I'm going to be filing a ticket for this investigation, so I need to understand what's happening from an error perspective. You can see I use eventstats, which allows me to track the error over time.
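One way to get an overall error rate with eventstats might look like the following sketch. This is hedged: the index and field names are invented, and the exact eventstats/eval behavior follows recent PPL releases:

```
source = load_generator_logs
| eventstats count() as total_events
| where level = "ERROR"
| stats count() as error_count, max(total_events) as total_events
| eval error_rate = round(error_count * 100.0 / total_events, 1)
```

Unlike stats, eventstats appends the aggregate to every row, so the total event count survives the error filter and can be used in the rate calculation.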

Thumbnail 2130

Thumbnail 2150

Calculating both the total events and the error count, as well as the error rate, I can see that a 16% error rate is not acceptable, so we need to do better. I'm going to assign this ticket to someone who can help us resolve it. In order to do that, because I don't inherently know where the ticket needs to go, I'm going to join our error data with our service catalog. Before we do that, we need to understand when the error occurred so that we can fill out the ticket properly. We use the timechart command to understand what happened, and with this visualization, I'm able to quickly understand when the error started and when it ended, so I can fill out the full case details of this ticket.
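The timechart step might look like this sketch; the span, index, and field names are assumptions for illustration:

```
source = load_generator_logs
| where level = "ERROR"
| timechart span=1m count() as errors
```

Bucketing errors into one-minute intervals makes the start and end of the incident visible at a glance.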

Thumbnail 2180

Then we get to the join. I have a service catalog on the side that has the details of who is responsible for what. Now I'm able to join the error logs with the service catalog data, and I see that Charlie is unfortunately in trouble: Charlie is going to get a ticket. The really great thing is that we have all sorts of new commands in the Piped Processing Language that weren't possible before with DQL, so we're able to unlock all sorts of new insights, and we're really excited about that.
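In plain Python terms, that join is a lookup from the error records into the catalog on the service name. A minimal sketch with hypothetical service and owner data (only the name Charlie comes from the demo):

```python
# Hypothetical error summary, as produced by the earlier aggregation
errors = [
    {"service": "load-generator", "error_count": 2000},
    {"service": "checkout", "error_count": 150},
]

# Hypothetical service catalog mapping each service to its owner
catalog = {
    "load-generator": {"owner": "Charlie"},
    "checkout": {"owner": "Dana"},
}

# Join on the service name, the same key the PPL join uses,
# to find out who should receive the ticket
assignments = [
    {**row, "owner": catalog[row["service"]]["owner"]}
    for row in errors
    if row["service"] in catalog
]
print(assignments[0]["owner"])  # Charlie
```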

Thumbnail 2220

Thumbnail 2240

But what if you're not a Piped Processing Language expert or guru? I talked a little earlier about the AI assistant that helps you build out your queries. I can very easily come in here and type an English prompt. The nice thing is that not only do I get the results, which are fantastic of course, but I also get the PPL statement, so you can learn the language while getting the results you need. I can iterate on it too: I can go over into the query bar and adjust the query if I want, so it's super easy.

Thumbnail 2260

Thumbnail 2270

Thumbnail 2290

Thumbnail 2300

Thumbnail 2330

But let's say it's 2 in the morning and I got paged, which is super frustrating; I know we've all been there. I have the AI summary feature as well. I can execute the query and then use AI summary to extract insights from the result set, very quickly giving me hints about what to do next and who to contact. I mentioned the visualizations and all the options we have with them, and really it is as easy as executing your query, analyzing your results, and getting the exact visualization you need as supplemental material for your ticket. You can add it to a dashboard right from within the Discover experience. Now we have a dashboard, and it's easy. That wraps up what we set out to show: an easy setup, a rich analytics experience, and a cohesive experience from discovery through visualization.

Agentic AI for Observability: Introducing Model Context Protocol with OpenSearch

Sohaib is going to talk to us about agentic development, so I'm looking forward to that. Thank you. Now for the exciting stuff. We've seen how Any Company's engineers and analysts can use this improved UI experience to easily get started, use familiar languages, and analyze data with a rich set of features and query languages to reduce the time to resolution. Next, a separate team in Any Company wants to take it even a step further. They want to know how they can use AI and agents to speed up the same kinds of investigations and reduce the time to resolution without needing engineers and analysts in this UI experience at all.

What we hear from customers falls into two sorts of requirements. The first is: how can we work without a team of engineers and analysts?

Thumbnail 2440

This is where teams are short-staffed and don't have the time to do the investigation with Piped Processing Language like we saw; they could use the help of an AI agent. Other teams already have engineers and analysts who are used to these tools and just want them to be easier; they prefer to build the queries and visualizations themselves. So they want the best of both: the easy UI experience and the easy agentic experience. That's the second requirement: how can we use AI agents to make this even easier?

Thumbnail 2450

Any Company wants to provide deep, actionable visibility to AI agents that can monitor, analyze, reason about, and improve observability processes within their organization. They want an AI agent that has access to tools such as the list of indexes in the OpenSearch clusters and other data sources like S3 and CloudWatch. They also want this AI agent to be able to get metadata from the OpenSearch cluster: what the indexes are, what they're called, and what fields they contain, so that it knows which fields to query and where to apply filters.

Thumbnail 2510

They want to get started very easily with a quick proof of concept to see if this works before deciding to move it into production, and to see what steps they can take to remediate future issues using AI agents. This is where one key piece of the OpenSearch ecosystem becomes helpful: OpenSearch supports the Model Context Protocol, or MCP. Traditionally, connecting multiple AI agents to different data sources required individual connections from each agent to each source. MCP simplifies this by introducing a centralized component that handles all the boilerplate connectivity code and makes the system more efficient and manageable.

Thumbnail 2520

MCP consists of two parts: an MCP server and an MCP client. The server is a lightweight program that invokes the REST APIs of services like OpenSearch, and the client is an adapter that lets the AI agent use the server's functionality. OpenSearch specifically has an MCP server maintained in the OpenSearch community, and that community-driven development and support is one major benefit of using it. It has flexible communications, supporting both the stdio transport and streaming. It's adaptable and comes with a comprehensive suite of tools, including read-only tools to search data, check cluster health, and check performance metrics. Finally, it has robust security options with different authentication methods.
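As a rough sketch of how an MCP client registers such a server: many MCP clients, Q CLI included, read a JSON configuration listing the servers to launch. The package name, endpoint URL, and environment variables below are assumptions for illustration, so consult the OpenSearch MCP server's own README for the actual setup.

```json
{
  "mcpServers": {
    "opensearch": {
      "command": "uvx",
      "args": ["opensearch-mcp-server-py"],
      "env": {
        "OPENSEARCH_URL": "https://my-domain.example.com:9200",
        "AWS_REGION": "us-east-1"
      }
    }
  }
}
```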

Thumbnail 2610

Thumbnail 2560

AI Agent in Action: Automated Root Cause Analysis and Remediation for Black Friday Incidents

We've put together a quick demo to show the power of using this MCP server to connect AI agents to OpenSearch. For this demo, I'm using Amazon Q Developer CLI, which has since been renamed Q CLI. The architecture is very simple: Q CLI, the MCP server stood up and connected to it, and the OpenSearch cluster that Q CLI knows about through the MCP server.

To set context, this is a proof-of-concept demo. It's Black Friday morning for Any Company. Traffic is ten times normal levels, and suddenly checkout failures start occurring. The engineering team gets alerts, but they're overwhelmed with data from the many different microservices also sending alerts. This is where they want to see if the AI observability agent powered by the MCP server can come to the rescue.

Thumbnail 2670

Thumbnail 2690

Thumbnail 2700

So here, I already have Q CLI launched and connected to the OpenSearch MCP server, which you can see loaded successfully at the top. Right away we can start asking questions. The first thing we do is look at the indexes that already exist in my OpenSearch cluster: there's an Any Company app logs index, an Any Company metrics index, and a traces index. We've put them in the same cluster, but they could be spread across different sources.

Thumbnail 2710

Thumbnail 2730

So the first question we ask is: we're seeing increased error rates across our checkout services; can you investigate what's happening and provide a root cause analysis? The agent uses all the tools it has to figure out what's going on. Immediately, we can see it start running queries against the indexes in the OpenSearch cluster to fetch data, and we can even see the query that it ran. We can see the index name, anycompany-app-logs, the cluster name, and additional details such as the filter clauses in the query itself: it's filtering for a service called checkout.
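The query the agent runs is ordinary OpenSearch query DSL. As a hedged reconstruction of its shape (the index name comes from the demo; the field names and time range are assumptions):

```python
# Index name shown in the demo; the field names here are assumptions
index = "anycompany-app-logs"

# A query of the shape the agent appears to run: filter the logs down
# to the checkout service within a recent time window
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"service": "checkout"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    "size": 100,
}

# With the opensearch-py client this would be sent roughly as:
#   client.search(index=index, body=query)
print(query["query"]["bool"]["filter"][0]["term"]["service"])  # checkout
```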

Thumbnail 2750

Thumbnail 2760

Thumbnail 2770

Thumbnail 2780

This is so useful for analysts because they don't have to write queries themselves. Usually when you write a query, you have to run it a few times and adjust it, and we can see the agent doing that as well. The first query didn't get the right results, so it went back to another tool called IndexMappingTool to learn what metadata exists and what fields are available, and then it rewrote the query using the correct field name. It does the same thing again for the other indexes that seem relevant: it checks traces and metrics too, going back and forth between running queries, checking the index metadata for available fields, and readjusting the filters to apply.

Thumbnail 2800

Thumbnail 2820

It runs multiple queries, which saves hours and hours of time that a human would have to spend doing all of this manually. Finally, it takes all the data gathered from the queries and synthesizes it into a single analysis. Based on the investigation of the data, it tells us what happened. First, we get a timeline of events, which is super useful: we know that at 10:00 a.m. things were normal, and at 10:30 there was an incident peak. It gives us the data points behind that conclusion, because it saw error rates spiking. Then it gives us an actual root cause, which is what we're looking for: the primary cause is that the checkout service became unavailable.

Thumbnail 2840

Thumbnail 2850

Thumbnail 2860

Thumbnail 2880

Now we know that the checkout service is what caused all the subsequent alerts and issues that started the whole snowball effect. From the trace analysis, it even gives us one specific operation name, process_order, where the errors started. Now we want to dive a little deeper. We say: the checkout errors seem to be related to payment processing; can you trace the request flow and identify where the actual bottleneck occurred that caused these issues? Again, it goes through basically the same process. It runs some queries, and you'll see the queries are a little different now: the filters are different, looking for payment first. Then it synthesizes the data to find the actual bottleneck that caused the service to fail and throw errors.

Thumbnail 2890

Thumbnail 2900

Thumbnail 2910

Thumbnail 2950

Now it gets the trace flow. We can even see trace IDs, and it identifies that traces 003 and 004 correlate with this issue. It digs deeper into those specific trace IDs and finds the correlated logs, which saves hours of manual analysis. Then it looks further into the logs and the metrics for anything related; if there's nothing, it can simply set that aside. Now it synthesizes all of that, and we see the bottleneck: a database timeout in the payment service, along with another helpful bit of data. But really, at the bottom is the root cause, which is the piece of information I was looking for: it tells me the root cause and the primary issue.

Thumbnail 2980

Thumbnail 3000

The primary issue is that the connection timed out after 5,000 milliseconds on our database service. Due to the spike of queries against the database, queries ran slower until they eventually hit the timeout, which caused the errors and the other downstream effects. It even shows us other correlated failures. This is super helpful, and then it summarizes it all; if you wanted, you could copy and paste this and send it to leadership to explain what's going on. Of course, now that we've identified the issue, we want to know the impact. What's the impact of this timeout, and how can we fix it? That's the next natural question, so we ask the agent: what's the business impact of this issue, how many customers are affected, and what's the revenue at risk?

Thumbnail 3020

Thumbnail 3030

Thumbnail 3040

Thumbnail 3050

Thumbnail 3060

Now let's look at the business impact. The interesting thing is that we've also given the agent access to non-log data: our sales tables with previous sales history. So it can not only correlate the logs, traces, and metrics, but also query our sales history to find the normal order volume and the average order value. That allows it to estimate, based on the number of orders we were seeing, how many errors we had, and the average dollar value, what the potential lost revenue might be. That's really helpful if you want to quickly send a leadership update saying this is how many orders we might have lost because of this issue, and then consider the future impact of customers churning because they couldn't complete their checkouts: they would have bought but didn't. The agent goes through all that math right here. You can see the average order value from the e-commerce data, along with normal traffic and incident-window figures. I'm speeding this up because I think you get the idea, but there's a lot of good data in there about the checkout service issues.
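The revenue-at-risk math the agent performs reduces to a few multiplications. All the baseline figures below are hypothetical placeholders rather than the demo's actual sales data; only the 10x traffic multiplier and the roughly 16% error rate come from the session.

```python
# Hypothetical baseline figures standing in for the sales-history tables
normal_orders_per_hour = 1_200   # typical hourly order volume
average_order_value = 85.0       # dollars, from historical e-commerce data
incident_duration_hours = 0.5    # assumed length of the outage window

# Figures mentioned in the session
traffic_multiplier = 10          # Black Friday traffic at 10x normal
checkout_error_rate = 0.16       # share of checkouts failing

# Orders attempted during the incident at Black Friday traffic levels
attempted = normal_orders_per_hour * traffic_multiplier * incident_duration_hours
failed_orders = attempted * checkout_error_rate

# Direct revenue at risk, before accounting for customer churn
revenue_at_risk = failed_orders * average_order_value
print(f"~{failed_orders:.0f} failed orders, ${revenue_at_risk:,.0f} at risk")
```

Churn effects would sit on top of this direct figure, which is why the agent calls them out separately in its leadership summary.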

Thumbnail 3070

Now we ask the agent: based on the analysis, what are the recommended remediation steps? How can I fix this issue now, and how can I prevent it from occurring again in the future? That's a different question to ask once we've established the business impact. And rather than just having it tell me, I ask it to create a document and save it locally, so I can share it with my engineers and with leadership. It should give me the steps to take right away, the steps for the next week, the steps for the next month, and the steps for the next few months.

Thumbnail 3170

Thumbnail 3180

Thumbnail 3190

Thumbnail 3200

It created a comprehensive remediation plan and even wrote it out to my local path. You can see it's called Payment Incident Runbook, and it goes through everything I need to do to fix this issue. It used the right tool to do this, and it also created a second document: a runbook for future issues that on-call engineers might face. Similar to the first one, I think you get the idea: it creates a runbook and saves it locally. I just want to show you what that looks like. Here's the payment processing remediation plan I've just opened up (I don't know why this monitor is jittering like that). It provides even the database ALTER statements needed to fix the connection issue we saw, which are the short-term fixes that need to happen right away, and it also suggests long-term fixes to consider. So it's giving the engineers the actual database commands to run.

Thumbnail 3210

Thumbnail 3220

Thumbnail 3240

Thumbnail 3260

Thumbnail 3270

Thumbnail 3280

I think we get the idea. That's a quick demo of what can be done. Of course, to productionize it you could use something like Amazon Bedrock AgentCore to build an agent with similar capabilities in production. What we saw today was automatic correlation: the observability agent was able to correlate and connect data from different indexes (logs, metrics, traces, and sales data) without any human input. It was able to explain what it found in natural language that humans could easily understand, along with the data points it used.

The AI agent also suggested preventative measures for future events like Black Fridays. In near real time, you could see it updating its database with the memory of this incident so that if something similar happens in the future, it can use that memory to help make the process faster. Since it can remember what happened in the past and what was done to fix it, it wouldn't even have to potentially do all this research the next time around.

Before the AI agent, this would have taken hours and hours and cost thousands of dollars while we figured it out. With the AI agent, it took just a few minutes: an eight-minute end-to-end process in the run I did to answer all our questions. For Any Company, this could mean a 70 to 80 percent reduction in the incident's impact given how little time it takes.

Thumbnail 3380

Thumbnail 3400

If you want to go deeper into this type of demonstration and get more hands-on experience, we have a chalk talk coming up later this week, ANT330, so make sure to check that out. You can ask questions and dive into the architecture for building something like this with even more in-depth capabilities. If you want to learn more about using the Model Context Protocol yourself, we have a blog about it, and we also have documentation on all the observability features we discussed. Make sure you check those out.

Thumbnail 3420

If you want to learn more and level up your skills on OpenSearch and other AWS services, check out AWS SkillBuilder. There are thousands of free resources there, and you can start learning right away. Thank you so much for attending our session. Remember to fill out the survey, and if you have any questions, we can be here for a few minutes afterwards. Thank you.


This article is entirely auto-generated using Amazon Bedrock.
