🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Unified multicloud observability (COP368)
In this video, AWS senior solutions architects Nilo Bustani and Santosh Mohanty present observability patterns for multi-cloud environments. They explain two main approaches: the centralized collection pattern, which consolidates telemetry data into a single location using tools like the OpenTelemetry Collector, Amazon CloudWatch, and Amazon Managed Grafana, and the federated query pattern, which queries data across multiple sources while keeping it in its original locations using Trino or Amazon Athena. The session covers implementation with both open source and cloud-native tools, discusses the trade-offs between speed and flexibility, and shares how Phillips 66 achieved 30% faster mean time to resolution using the centralized pattern with OpenTelemetry and AWS managed services.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Multi-Cloud Observability: Challenges and Strategic Foundations
Getting observability right in multi-cloud environments is critical for our customers. I'm Nilo Bustani, a senior solutions architect at AWS. I'm Santosh Mohanty, senior solutions architect at AWS. Santosh and I have both been working with our multi-cloud customers, and today we'll share some patterns that we see when implementing observability for multi-cloud workloads.
Before we get into that, a quick word about multi-cloud. At AWS, we define multi-cloud as the use of at least two cloud service providers to operate your IT solutions and workloads. AWS is committed to helping you succeed at your multi-cloud strategy.
Customers typically tell us the reason for selecting multi-cloud is either differentiated capabilities, as a result of mergers and acquisitions, or regulatory compliance. With AWS, you'll have the freedom to choose and make the choices that best suit your needs and innovate wherever your workloads are.
AWS's multi-cloud approach is based on the open standards-based cloud maturity model by the Open Alliance for Cloud Adoption. Multi-cloud is not just about technology; it's also a strategic decision. AWS aims to help you create capabilities in both technical and non-technical pillars of people, process, and technology. In this session, we'll be sharing some cloud-agnostic guidance and patterns to help you with your multi-cloud needs.
Coming to the foundations of observability, you're probably already familiar with this. We want to get good at detecting, investigating, and remediating issues with the end goal of reducing our mean time to resolution and understanding our system's behavior. To do this, we collect signals from our applications in the form of metrics, logs, and traces. Metrics, logs, and traces are the first building block of the observability strategy.
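As a minimal illustration of how the three signal types relate, here is a plain-Python sketch (no particular SDK; the field names are illustrative, not any vendor's schema). Logs and traces are correlated by a shared trace ID, which is what later lets a backend stitch one request's story back together:

```python
import json
import time
import uuid

# One request produces all three signal types. Logs and spans carry the
# same trace_id so a backend can join them; metrics are usually
# correlated by attributes rather than per-request IDs.
trace_id = uuid.uuid4().hex

metric = {"name": "http.server.duration_ms", "value": 42.0,
          "attributes": {"route": "/checkout", "cloud": "aws"}}

log = {"timestamp": time.time(), "severity": "INFO",
       "body": "checkout completed", "trace_id": trace_id}

span = {"trace_id": trace_id, "span_id": uuid.uuid4().hex[:16],
        "name": "POST /checkout", "duration_ms": 42.0}

print(json.dumps({"metric": metric, "log": log, "span": span}, indent=2))
```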
Here we may have to ask how we instrument our workloads across both clouds so that we can gather telemetry consistently. Then we process this raw data into useful information, and at this stage we may have to ask which signals to prioritize and keep as we scale.
Our storage strategy answers questions about cost-effective, performant storage for each data type, and then there's the question of which visualization tools we use to make sense of all of this data. These questions are part of any observability strategy, but when it comes to multi-cloud, there are a few challenges that get magnified.
One is complexity. Customers tell us that with multiple clouds, there are multiple tools and service dependencies, and teams find it cumbersome and time-consuming to go between dashboards to arrive at the root cause of issues. Then there's process overhead. Multiple clouds means multiple services to instrument, and that means more work.
Team members have specialized knowledge of their own tooling, so there's a learning curve to overcome. Customers also tell us that it may take months to get all of their observability tooling to talk to each other. Finally, there's scale. With multiple clouds, you have multiple streams of telemetry, and that increases the amount of data that we need to handle.
How do we tackle ingestion latency? How do we manage our query performance? And how do we make sure that our data egress charges don't spiral out of control? These are problems that impact customers as they adopt multi-cloud, and it doesn't matter which cloud providers you're using. They have a real impact on teams and eventually the business.
Centralized Collection Pattern: Implementation with Open Source and Cloud-Native Tools
This is exactly why we need to discuss systematic approaches and patterns like the ones we're going to do right now. We've seen two main patterns: the centralized collection pattern and the federated query pattern, and then some hybrid combination of these. We'll discuss these two patterns and then also implementation approaches using open source and cloud-native tooling.
You may be using partner solutions for your monitoring, and that's fine. When you look under the hood, you'll find the same patterns at work. Coming to the centralized collection pattern.
Here we're going to take an example with AWS and Google Cloud. We have workloads running in the two clouds; this is just for simplicity, and it could be any two clouds. We have telemetry being generated in AWS and Google. Now we need something to collect all the signals and export them, so we have a collector agent running on our compute workloads.
Now, the infrastructure telemetry usually goes into the native tooling of each cloud provider, so for this we may have to create configurations or build mechanisms to export the telemetry. Then we need to make sure that the signals get reliably to their destination, so we have an ingestion layer where we do buffering, routing, and back-pressure handling.
To make querying easier, we would like our data in a common format, so we have a normalization layer where we collect all of the data and convert it to a standard format. All of our data then ends up in the storage layer, which holds the full context of our requests, and we can query it from our visualization layer to build a global view of our application. So that was the high level for the centralized collection pattern.
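To make the normalization layer concrete, here is a small stdlib-only Python sketch. The provider-specific field names (`timestamp`, `receiveTimestamp`, `textPayload`, and so on) are illustrative stand-ins, not the actual CloudWatch or Cloud Logging schemas; the point is the mapping onto one common shape:

```python
from datetime import datetime, timezone

# Hypothetical provider-specific log records; field names are illustrative.
aws_record = {"timestamp": 1718000000000, "message": "order placed",
              "service": "Checkout-Service", "provider": "aws"}
gcp_record = {"receiveTimestamp": "2024-06-10T06:13:20Z",
              "textPayload": "order placed",
              "resource": {"labels": {"service_name": "checkout_service"}},
              "provider": "gcp"}

def normalize(record: dict) -> dict:
    """Map a provider-specific record onto one common schema:
    ISO-8601 UTC timestamp, lowercase underscore service name, plain body."""
    if record["provider"] == "aws":
        ts = datetime.fromtimestamp(record["timestamp"] / 1000, tz=timezone.utc)
        service, body = record["service"], record["message"]
    else:  # gcp
        ts = datetime.fromisoformat(record["receiveTimestamp"].replace("Z", "+00:00"))
        service = record["resource"]["labels"]["service_name"]
        body = record["textPayload"]
    return {"timestamp": ts.isoformat(),
            "service": service.lower().replace("-", "_"),
            "body": body}

print(normalize(aws_record))
print(normalize(gcp_record))
```

After normalization, the two records agree on timestamp format and service naming, so a single query in the storage layer covers both clouds.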
Now let's look at an open source implementation of this pattern. For the collector we've chosen the OpenTelemetry Collector. A quick word about OpenTelemetry: it's a standard by CNCF. It's gained popularity because it's vendor neutral and it provides a standard way of collecting and processing your telemetry. It has support for multiple coding languages and auto instrumentation, and there's a growing ecosystem of support around this.
For our ingestion layer, we can choose Kong as the API gateway and router, and we can have Kafka as the durable buffer to handle large volumes of requests. For normalization we can have OpenSearch Data Prepper to standardize timestamps, normalize service names, do aggregation, and then we'll index our logs and traces with OpenSearch. For metrics, Prometheus specializes in time series, so that's what we have. Then Grafana for our visualization layer to correlate everything and to have a global view of your multi-cloud application.
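A minimal OpenTelemetry Collector configuration following this shape might look like the sketch below. The receiver, processor, and exporter component names are real Collector components, but the broker addresses, topic, and endpoints are placeholders, not real infrastructure:

```yaml
# Sketch only: receive OTLP, batch, then fan out logs/traces to Kafka
# (toward Data Prepper/OpenSearch) and metrics to Prometheus remote write.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  kafka:
    brokers: ["kafka.example.internal:9092"]   # placeholder broker
    topic: otel-signals                        # placeholder topic
  prometheusremotewrite:
    endpoint: https://prometheus.example.internal/api/v1/write  # placeholder
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [kafka]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [kafka]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
```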
Now, if we want to eliminate a little more complexity, we can use managed open source services: Amazon Managed Service for Prometheus, Amazon OpenSearch Service, and Amazon Managed Grafana. Customers love the out-of-the-box security and scalability that come with these services; for instance, Amazon Managed Service for Prometheus can handle one billion time series. You'll find that the ingestion layer is no longer needed because it's built into the managed services, and all of this still keeps with open source and maintains interoperability.
So let's see a cloud native implementation approach. Here we're going to choose Amazon CloudWatch as our centralized storage. This would be really straightforward for AWS workloads. The infrastructure telemetry is already going into Amazon CloudWatch, and we have the Amazon CloudWatch agent for our compute workloads. We can use the same Amazon CloudWatch agent for our Google Cloud workloads as well.
Now for the infrastructure telemetry in Google Cloud, here we need to build a pipeline to export this, so we can have Google Pub/Sub and then we can do the normalization in AWS Lambda and have all of our telemetry end up in CloudWatch. CloudWatch is also our visualization layer where we build dashboards, we do troubleshooting, we can do anomaly detection and then have automated remediations.
So to sum up the centralized collection pattern, essentially we reduced our complexity by having a standard collector and by normalizing all of our data to a common format. Our process overhead is reduced because teams have fewer tools to learn and implement. And then we have a global and unified view of our multi-cloud application. Now Santosh will cover the federated query pattern. Thank you, Nilo.
Federated Query Pattern: Keeping Data in Place While Achieving Unified Visibility
So we just saw how we can use the centralized collection pattern to consolidate all your data from different sources into one single space to get that unified observability. However, this approach may not be possible for all customers in all scenarios. There may be some use cases where you need to keep your data in its original location. It may be due to compliance reasons or cost reasons. In those use cases, we can use the federated query pattern.
Let's see what the federated query pattern is. As I mentioned before, the federated query pattern allows you to query data across multiple sources while keeping the data in its original location. Let's take Nilo's example. What we have is your application as well as your infrastructure telemetry data that has been generated and collected using collectors and stored locally in AWS and in GCP. The end goal is to create that unified observability for your multicloud application.
In the federated query pattern, we will have a federation layer that will take the query from your visualization tool and split it into multiple subqueries for each cloud provider. Once the queries are executed in each specific cloud provider, the results are then aggregated and presented to the visualization tool. To process and plan the queries, the federation layer uses a metadata catalog to store the schema information.
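The mechanics of the federation layer can be sketched in a few lines of plain Python: fan one logical query out as per-cloud subqueries, run them in parallel, then aggregate. The two fetchers here are stubs standing in for real per-cloud query engines:

```python
from concurrent.futures import ThreadPoolExecutor

# Stub fetchers; in practice each would run a subquery against that
# provider's own storage, planned using the metadata catalog.
def query_aws(start, end):
    return [{"ts": 1, "cloud": "aws", "errors": 3}]

def query_gcp(start, end):
    return [{"ts": 1, "cloud": "gcp", "errors": 1}]

def federated_query(start, end):
    """Split one query into per-cloud subqueries, run them in parallel,
    and aggregate the results; the underlying data never moves."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(f, start, end) for f in (query_aws, query_gcp)]
        rows = [r for fut in futures for r in fut.result()]
    return {"total_errors": sum(r["errors"] for r in rows), "rows": rows}

print(federated_query(0, 10))
```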
We just learned what the federated query pattern is. Let's see how we can implement this using open source as well as cloud native tools. Let's first look at the open source approach. For storage, we will store the telemetry data locally, which means we will store all your AWS telemetry data in Amazon S3 and all your GCP telemetry data in GCP storage. In the federation layer, we will use Trino, which is an open source tool, as our query coordinator. We will use Hive Metastore as our metadata catalog to store the schema information. We will also use Grafana as our visualization tool.
Looking at the flow, Grafana will send queries to Trino. Trino will break down these queries into subqueries, and we can use Trino connectors to execute these queries in each cloud provider and retrieve the data from AWS and GCP. Once the data is retrieved from each cloud provider, Trino can aggregate those results and present them to Grafana for visualization. That was the open source approach.
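For a feel of what such a query looks like, here is a federated Trino statement as a Python string. The catalog, schema, and table names (`aws_logs`, `gcp_logs`, `telemetry.app_logs`) are hypothetical; in practice they come from the Trino connector definitions and the Hive Metastore schemas:

```python
# One logical query spanning two catalogs; Trino plans the per-catalog
# subqueries and aggregates the results itself.
federated_sql = """
SELECT service, COUNT(*) AS error_count
FROM (
    SELECT service FROM aws_logs.telemetry.app_logs WHERE level = 'ERROR'
    UNION ALL
    SELECT service FROM gcp_logs.telemetry.app_logs WHERE level = 'ERROR'
)
GROUP BY service
ORDER BY error_count DESC
"""
print(federated_sql)
```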
Now let's look at how we can implement this using cloud native as well as managed open source tools. We will keep the storage the same. We will store all your AWS telemetry data in Amazon S3 and your GCP telemetry data in GCP storage. For the federation layer, we will use Amazon Athena as our query coordinator, and we will use AWS Glue as our metadata catalog, which will help the federation layer to plan and process these queries. We will use Amazon Managed Grafana as our visualization tool.
Looking at the flow, Amazon Managed Grafana will send the query to Athena. Athena will process these queries and break them down into multiple subqueries for each cloud provider. Once the data is retrieved from AWS and GCP, Athena will process these results and aggregate and present them to Amazon Managed Grafana for visualization. We just learned about these two patterns and how we can implement them using cloud native as well as open source tools. Now let's look at some key decision points between these two patterns.
Decision Framework, Real-World Success at Phillips 66, and Next Steps
When we look at these two patterns, the choice really comes down to your business drivers. If you're looking for a central or single source of truth, you should go for the centralized collection pattern. Think about how you can get faster analytics, better analytics, or better governance. Those things will be achieved with the centralized collection pattern.
However, if you have compliance or data residency requirements, then you should be looking at the federated query pattern. With a centralized hub, you're optimized for speed and simplicity, which means faster queries and easier governance. The federated query pattern, on the other hand, gives you flexibility and data freshness. Your data stays put, your storage costs are optimized, and you're always working with the latest data.
Let's look at some drawbacks and trade-offs. The centralized collection pattern increases your storage cost. However, the federated query pattern introduces query latency as well as more complex orchestration. While you make your decision, it's critical to consider these trade-offs early in your decision-making process. When you make your decision, consider factors like your compliance requirements, your data freshness requirements, and how cost-sensitive you are. All these questions will lead you to the right design pattern.
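Those decision factors can be condensed into a deliberately simplified helper; real decisions weigh more dimensions than three booleans, so treat this as a mnemonic for the trade-offs, not a rule:

```python
def choose_pattern(data_residency_required: bool,
                   needs_fast_central_queries: bool,
                   cost_sensitive_to_duplication: bool) -> str:
    """Simplified mapping of the trade-offs: residency pushes toward
    federation, speed pushes toward centralization, both suggest a hybrid."""
    if data_residency_required and needs_fast_central_queries:
        return "hybrid"  # federate regulated data, centralize the rest
    if data_residency_required or cost_sensitive_to_duplication:
        return "federated query"  # data stays put; accept query latency
    return "centralized collection"  # speed and simplicity; accept storage cost

print(choose_pattern(False, True, False))
```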
Remember, this is not an either-or decision. You can always do a hybrid solution depending on your use case. Now, let's look at a customer story. Phillips 66, a global energy company, had challenges managing thousands of applications across multiple cloud providers. They have around 70% of their workload in AWS, but they also have footprints in other cloud providers as well as on premises. As a solution, they implemented the centralized collection pattern and adopted OpenTelemetry for their data collection. They used tools like Amazon Managed Service for Prometheus and Amazon Managed Grafana to get the unified visibility they were looking for. With that, they were able to achieve 30% faster mean time to resolution, which is really great.
Now, let's look at some next steps and resources. Let me highlight some of the breakout sessions at re:Invent 2025 that will deepen your multicloud expertise. These sessions are recorded as well, so they will be available after re:Invent. Also, do not miss our multicloud kiosk at AWS Village. It's a space where you can see live demos, deep dive into specific use cases, and talk to our experts who can guide you and answer any questions you have. If you're tackling any problems, they can help you solve them as well.
I also wanted to highlight some resources that will help you in your cloud journey. Think of them as your toolkit if you're strategizing your multicloud architectures, creating new architectures, or optimizing your existing architectures. These resources can help you move faster. Thank you very much for your time. I hope this session was useful for you.
; This article is entirely auto-generated using Amazon Bedrock.