🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Scaling open source observability stack feat. Warner Bros Discovery (COP333)
In this video, AWS Solutions Architect Vikram Venkataraman, Product Manager Abhi Khanna, and Warner Bros Discovery's Hans Robert discuss scaling open source observability stacks at enterprise level. They explain the three pillars of observability—metrics, logs, and traces—and introduce AWS managed services including Amazon Managed Service for Prometheus, Amazon OpenSearch Service, and Amazon Managed Grafana. Hans shares Warner Bros Discovery's journey supporting 128 million+ subscribers, detailing strategies like operational metadata (OMD) tagging, unified event schemas (OWL), geo-sharding across nine regions, and a custom cost metering framework. The session concludes with best practices for collecting telemetry signals using OpenTelemetry and Fluent Bit, emphasizing sampling, batching, compression, and proper resource management to reduce mean time to identify and resolve operational issues.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: The Critical Role of Observability at Scale
Good morning, everyone. We have a humble request: if you could wear your headsets and give us a thumbs up, we'd be good to go. Thank you so much. Welcome to COP333: Scaling Open Source Observability Stack featuring Warner Bros Discovery. My name is Vikram Venkataraman, and I'm a Principal Solutions Architect here at AWS. I work with some of our strategic customers that operate AWS at large scale. Along with me, I have Abhi Khanna, Principal Product Manager at AWS, and Hans Robert, Director at Warner Bros Discovery.
Let me set the scene for you. Imagine it's the night of the season finale of your favorite OTT show, and you, along with millions of users, are waiting to watch this show. You press the play button simultaneously, only to find that annoying revolving circle that just rotates and never stops. Meanwhile, the provider is thinking: what's wrong with my platform? Is it the authentication system? Is it the payment system? Or is it both? Unfortunately, that's the state of running systems at large scale. When you have hundreds of thousands of microservices running on various compute platforms, the challenge isn't about collecting metrics, logs, and traces. It's all about making insights on top of these metrics, logs, and traces. That's exactly where modern observability can help us. Rather than just being a mere monitoring tool, it can help you identify where the problem is, what the root cause was, and how you can fix it. That's going to be the focus of today's session.
In a few minutes' time, you'll find Hans here sharing the stage with us. He's going to talk about how Warner Bros Discovery derived insights out of these metrics, logs, and traces stored at large scale and was able to bring down the mean time to identify and remediate operational issues. As for the agenda, we'll start with observability evolution, where we try to understand the basic blocks of observability: metrics, logs, and traces. Then we'll dive deep into some of our managed open source services that can help you store these signals in a reliable and durable way. After that, we'll hand it over to Hans to talk about Warner Bros Discovery's observability journey to AWS, and we'll wrap up the session with best practices for operating these observability tools at scale.
Quick show of hands: how many of you here operate 50 or more monitoring dashboards in your environment? Keep your hands raised if you have 100 or more dashboards. 500? 1,000? Wow, that's the challenge when you have too many dashboards. Correlating the insights out of these dashboards isn't going to be that easy. Hopefully, we can learn something from this session and be able to apply some of those best practices in your environment.
I really wanted to emphasize the importance of having a solid observability stack. Look at these examples: e-commerce platforms, healthcare providers, commercial banks. They all suffered huge financial losses just because they were flying blind. They didn't have a good observability stack, and it took them hours to identify where the problem was and fix the issue. As our CTO says, everything fails all the time, but if you engineer your applications for those failures—meaning if you have the right KPIs in place and have the ability to correlate these telemetry signals—that's going to put you in a peaceful spot.
Understanding Observability Fundamentals: A Layered Approach to Metrics, Logs, and Traces
Let's understand the fundamental blocks of observability: metrics, logs, and traces. Logs, in simple terms, are nothing but how a human would record day-to-day stuff in a diary. It has a timestamp and a detailed description of the event. That's pretty much it. Metrics, on the other hand, give you the ability to look at an application's pulse. Is my application running fast? Is it slow? How many 4XX errors? How many 5XX errors? It's going to help you dive deep into a specific component within your application. Traces give you the ability to capture your end user's request as it flows through the complex microservices architecture—from, let's say, a user ordering something on an e-commerce website to getting that product delivered to them. It captures everything.
But this talk is not going to be about collecting metrics, logs, and traces. It's about how you can reduce the mean time to identify and resolve any operational issue. We thought about it from the lens of an on-call engineer. Any on-call engineers or SREs in the house? There you go. From your perspective, you wake up early one morning and find thousands of alerts waiting to be attended to. To get to the root cause, you may have to go through 50,000 time series of metrics and 2 petabytes of logs, and remember, you're already racing against time. When a 5-minute fix takes 5 hours to troubleshoot, we aren't talking about a monitoring crisis or a monitoring challenge. The challenge is correlating all these telemetry signals.
We thought about this and came up with a layered approach. The first layer is all about starting with something: use the out-of-the-box monitoring. You get a page alert, a down alert, a 100% CPU utilization alert. That's your starting point. It tells you that a particular application is impacted, and then you dive deeper into the second layer.
Get to the logs of that specific application and see what's happening. Maybe you're getting a lot of 5XX errors. Let's dive deep into the logs and understand what could be the probable root cause of the 5XX errors. Then comes the third layer where you enhance this information with trace information. Oftentimes it's not that specific application that is the root cause. Let's say you get a down alert. It's not maybe related to that application, but it might be related to your dependent services. That's exactly where trace can be handy. The 5XX errors I'm talking about could be a result of a backend database failing out, and trace can help you identify that information because it's going to give you the time spent by each of these requests in each of the microservices that forms your complex microservices architecture.
Finally comes the critical piece, which is the correlating piece. This is exactly where you have to inject some kind of metadata so that when you troubleshoot the issues you can correlate all the information from metrics, logs, and traces and you can pinpoint what could be the probable root cause. This is also an area where you could use AI because AI out of the box can pull all the information from your telemetry signals and can lead you to a root cause of what you might be facing. With that, I'm going to hand it over to Abhi Khanna to talk about some of our managed open source services that can help you store these signals reliably.
AWS Managed Open Source Observability Solutions: Building an Integrated Stack
Thank you so much, Vikram. Here at AWS we've built a variety of open source observability tools. For collection, we support OpenTelemetry as well as managed collectors for Prometheus metrics. For ingestion, storage, and analysis, we've got Amazon Managed Service for Prometheus for Prometheus metrics and Amazon OpenSearch Service for logs and traces, plus a variety of visualization capabilities stretching across Amazon Managed Grafana as well as OpenSearch Dashboards to stitch it all together.
Many customers have told us that they prefer open source. When we ask why, the things that resonate are open standards support, which allows a no-vendor-lock-in strategy; getting the best the community has to offer; having transparency and control over what you're running; and, of course, being able to tweak things at will to get the cost efficiencies you need in your observability platform.
But unfortunately it's not as easy as just a Helm install, put it into your production stack, and be ready to go. Things don't scale, they're not reliable. They tend to get overloaded when you have too many teams using them at the same time. All of this creates friction. You have to deal with resource management and capacity planning issues, high availability problems, tenancy controls and isolation, just dealing with the sheer number of servers you have to manage, patch them, and upgrade them.
And then of course you have to monitor all of these systems to make sure that they stay up and running so that your teams can use them to monitor their actual applications. All of this adds operational overhead that gets in your way of leveraging these open source technologies to help you solve these problems.
Here at AWS we've looked at how we provide managed open source solutions. Firstly, we're looking at how we make them end-to-end so that you're not sitting there cobbling together all of the pieces. How we bake in correlation so you can correlate across metrics, logs, and traces directly through the product experiences. How we make sure that you have visibility into the cost elements and that you have cost controls that you can use to tweak your cost efficiency. And how we provide this at the AWS guarantee for security at the scale you need to support your enterprise workloads seamlessly integrated with all the other AWS technologies such as IAM, CloudFormation, CloudTrail, and of course all the improvements we make, how we make sure we're giving them back to the community so those communities prosper and get the same benefits with regards to scaling and availability.
Just in case you're not familiar, here's a quick overview of the Amazon Managed Service for Prometheus. It's a highly available, secure, and managed metric offering. It's serverless, fully Prometheus compatible, supports the same open source Prometheus data model and query language. It's all pay-as-you-go and of course supports any environment at any scale. Last year we launched support for a billion active time series per workspace. You can have as many workspaces as you'd like, allowing you to have billions of metrics stored and analyzed in one system.
Of course we also have Amazon Managed Grafana, which is our managed offering for the open source Grafana project that allows you to pull from a variety of different data sources and build beautiful dashboards, creating that single pane of glass feel.
And then finally, we have the OpenSearch Service, which gives you real-time search and analytics capability across your logs and trace data. It's fully managed, secured to AWS standards, supports log analytics and observability use cases with built-in experiences, and is deeply integrated into the AWS ecosystem.
Putting it all together, you now have the ability to use OpenTelemetry to collect logs, metrics, and traces, store your metrics in Amazon Managed Service for Prometheus, logs and traces in OpenSearch, and then visualize everything across Grafana and OpenSearch dashboards. Regardless of where in the AWS ecosystem you're collecting this data from, you now have the ability to source that data, collect it, refine it, ingest it properly, analyze it in these systems, and pull it all together in a visualization layer of your choice, allowing you to create that single pane of glass view.
Our customers deploy this in two different modes. One is a centralized mode where across all of their regions and accounts, they pull all the data into one central location with these central open source observability stacks, which allows them to get a cohesive view of all the different places they're running their applications across their AWS footprint in one single location. Other customers have developed a more geo-local strategy where they have a single stack per region, centralizing across all the accounts in that region, and then use a common visualization layer to stitch it all back together at the end.
Warner Bros Discovery's Journey: Scaling Observability for 128 Million Subscribers
With that, I'll hand it off to Hans to talk about the Warner Bros. Discovery story. So we just heard about those amazing tools and products that AWS has made available for us. They've never been more powerful than they are today, and yet tools have limits in terms of how far they scale. To really succeed at the enterprise level, you need scaling strategies, and that's what I want to talk about today.
Good morning, everyone. I'm Hans Robert, Director of Observability at Warner Bros. Discovery. Warner Bros. Discovery is fundamentally a story of a company of storytellers. We connect audiences with powerful stories and narratives, and we're the home of iconic brands like CNN, Food Network, HBO, and TNT, among others. We operate a very complex ecosystem that is crystallized in our streaming world, comprised of HBO Max and Discovery Plus. We support 128 million-plus subscribers across 100-plus markets, all on one multi-tenant platform. The volume of logs and metrics that we collect on a daily basis is quite staggering.
For us, collecting all those logs, metrics, and traces is more than just something we have to do. It's really an operational imperative and a retention must-have. Today, I'm going to share some of the learnings and strategies that we've used to scale the observability platform that Abhi was talking about to actually meet that enterprise challenge and scale to those numbers.
When we got started a few years ago, right after the merger of Discovery and Warner Bros., we were facing the classic technology merger challenge of having multiple different observability solutions. Different teams were using different tools and looking at different data, which meant very slow troubleshooting and identification of issues. We also suffered from a lack of data standardization: inconsistent data formats and taxonomy made correlating insights across teams complicated. And finally, and most critically for today's topic, we had scale challenges, which made every event, every traffic spike, and even just keeping up with subscriber growth a headache we needed to solve.
This legacy state was inefficient and unsustainable. It didn't align with our desire to reduce MTTI and resolve incidents as quickly as we wanted to. So we built a completely new platform based on those open source technologies that Abhi was talking about. Before I talk about the strategies, I want to talk a little bit about open source and why open source.
Abhi mentioned a few of those reasons, but I want to double down on some of them. First, we wanted to avoid what we call the one-way door: the kind of decisions or choices that are very hard to revert or evolve as your needs change, or being locked into one specific proprietary vendor technology or strategy. Second, we wanted to build an opinionated platform where we build a lot of the guardrails and golden pathways straight into the platform. The guardrails are there to prevent user mistakes, errors, and data spikes that would cause problems for the platform, and the golden pathways are there to help teams and engineers onboard onto the platform and leverage it as best as they can.
Then finally, and maybe most critically from a cost perspective, we wanted to have full control over our data and the pipeline. From both a data pipeline and a data retention and storage perspective, we wanted to know exactly how and when we collect the data and where we store it. Let's talk about some of the strategies that allowed us to scale our platform to what we have today. I'll talk about three things that matter to us a fair amount: organizing our data, partitioning and optimizing our system through sharding, and finally cost metering and how we manage our costs. My goal today is to give you strategies that you can apply in your own industry, whichever one it is. None of this is specific to video streaming, and you can hopefully apply it the same way we have to scale up.
Data Organization Strategy: Operational Metadata and Unified Event Schemas
So let's start with data organizing and how we've tamed the chaotic volume of information by organizing it. I'll touch on a couple of things, but maybe first of all, the industry cliché is that data is the new oil, but frankly, in my cautionary view, data at scale is chaotic by default unless we organize it. Today we focus on two pillars of how we organize the data. First, establishing operational metadata, or OMD for short, to tag and track our data flows. And secondly, implementing a unified event schema across both logs and traces.
Operational metadata is not an observability-specific strategy, but as you'll see later, it's foundational to a number of things that we've done to both shard and scale. I want to make sure that you get a good grasp of that. We wanted to establish a functional hierarchy for our streaming ecosystem for all of the applications and infrastructure that we have. You can see on the screen a simple three-tiered hierarchy of systems with business services at the top, like video playback in this case, which represents a high-level business capability. Below that we have the OMD service, like markers, which is a logical feature within that capability. And then the last one, the OMD component, is really the final microservice that implements a specific feature within that capability. In case you're wondering, markers is how we track what and how you've been watching your streaming choices so that if you pause and want to resume later, we know where you were.
To establish this hierarchy, we needed to distribute the authorship and collection of that data. We solved that by having engineers put the information into a standard file format in GitHub alongside their code, with automated checks to enforce data quality. On top of that, we scrape all that data, collect it, and store it in Amazon Aurora, and on top of that we built a graph and a dashboard called Service Catalog that gives us the entire topology of our system. All of this provides a single source of truth: the operational and functional map of our entire platform, meaning both the application platform and the streaming platform. But as you'll see, this has great relevance for observability later.
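To make the idea concrete, here is a minimal sketch, assuming a hypothetical descriptor format and field names, of the kind of OMD record a team might check in alongside its code and the automated quality check that could run against it. The session does not show Warner Bros Discovery's actual file format or tooling.

```python
# Hypothetical OMD descriptor an engineering team might keep in GitHub next to
# its code, plus the kind of automated check described above. The field names
# (business_service, service, component) mirror the three-tier hierarchy from
# the talk; the exact file format and fields WBD uses are not shown here.

OMD_DESCRIPTOR = {
    "business_service": "video-playback",   # high-level business capability
    "service": "markers",                    # logical feature within it
    "component": "markers-collector",        # the microservice implementing it
    "owner": "playback-observability-team",
}

REQUIRED_FIELDS = ("business_service", "service", "component", "owner")

def validate_omd(descriptor: dict) -> list[str]:
    """Return a list of data-quality problems; an empty list means the descriptor passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if not descriptor.get(f)]
    for field in ("business_service", "service", "component"):
        value = descriptor.get(field, "")
        if value and not value.replace("-", "").isalnum():
            problems.append(f"{field} must be alphanumeric/hyphenated: {value!r}")
    return problems

if __name__ == "__main__":
    issues = validate_omd(OMD_DESCRIPTOR)
    print("OK" if not issues else "\n".join(issues))
```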
Now that we have our operational metadata established, the second critical goal was to map all of our observability data, whether logs, traces, or metrics, to this functional hierarchy. Here's an example of a log to which we've applied the business service, service, and component tags: video playback, markers, and collectors. That data is now tagged and organized.
To scale this tagging, automation was essential. We engineered a solution to enforce standardization at the collector level. In our Kubernetes clusters, we run Fluentd to collect the logs. Alongside Fluentd, we have an add-on that automatically tags the incoming data with information collected at the node level: the business service, service, and component. With all of this tagged directly onto the message we send to the observability account, we know exactly where the data comes from, where it belongs, and what it's relevant for.
We've done something very similar for metrics and traces, so all of our metrics carry the operational metadata business service, service, and component, and traces have the same. This automation guarantees that every piece of data destined for our observability account is correctly tagged regardless of volume. That's important for scale and is fundamental to the sharding and cost metering strategies that I'll discuss shortly.
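As an illustration only, the sketch below shows the enrichment step conceptually. In the stack described above it happens inside Fluentd and the metric/trace collectors at the node level, not in application code; the field values are the same hypothetical ones used earlier.

```python
# Conceptual illustration of what the node-level add-on does: attach the OMD
# fields to each outgoing record before it is shipped to the observability
# account. The same idea applies to metric labels and span attributes.
NODE_OMD = {
    "business_service": "video-playback",
    "service": "markers",
    "component": "markers-collector",
}

def enrich(record: dict, omd: dict = NODE_OMD) -> dict:
    """Merge the operational metadata into a log record (or metric labels / span attributes)."""
    return {**record, **omd}

print(enrich({"timestamp": "2025-12-01T09:00:00Z",
              "severity": "WARN",
              "message": "resume position not found"}))
```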
Before I get there, I want to talk about the other aspect of organizing data, particularly for event data like logs or traces. Without a strong event schema or standard, every service and component, meaning every team or developer within your company, operates independently. This typically leads to schema drift, where even a simple entity like a client identifier can be represented in multiple different ways: client_id, client-id, clid, and so on. This is not just a semantic issue; it's an operational blocker.
You cannot join data with different identifiers easily and efficiently. It also impacts indexing and cross-service correlation, and it massively increases query complexity and the cost of storing that data. Solving schema proliferation was important to scaling our observability platform. We introduced the organization-wide logging schema, which we've shortened to OWL. OWL is built upon a mandatory core log schema: every log, no matter what, has non-negotiable fields like timestamp and severity, plus things like message and the operational metadata fields I mentioned a moment ago.
On top of that, we can have business service aligned schemas that define things more relevant to each business service. For the payment log schema, you might want to add the payment method type or payment references. For video playback, you might add video ID or stream ID. The core log schemas and business schemas are then merged into a specific business-oriented schema that gets injected into the OpenSearch index. We now have a very dedicated schema but with common fields originating from the core schema.
This enforced mapping ensures that if an event attempts to ingest a field not defined in the schema, that field can be rejected. We can reject the whole message or just the irrelevant fields, or store them in S3 for further review, and then send a notification to the author of those logs telling them to fix it and try again. One thing this has changed is the culture and the conversation. Knowing that events might be rejected if a log goes outside the boundaries has made people think more carefully about what to log and when, and to send information to the observability platform much more intentionally rather than just filling a data swamp with stuff.
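Here is a minimal sketch of the OWL idea: a mandatory core schema, a business-service schema merged on top, and a gate that rejects fields outside the result. The field names beyond the ones mentioned in the talk are assumptions, and the real enforcement described above happens at the OpenSearch mapping level rather than in Python.

```python
# Core schema plus a business-service schema, and a split of each incoming
# event into accepted fields and rejected fields (which could be dropped,
# parked in S3 for review, or bounced back to the author).

CORE_SCHEMA = {"timestamp", "severity", "message",
               "business_service", "service", "component"}

PAYMENT_SCHEMA = CORE_SCHEMA | {"payment_method_type", "payment_reference"}

def split_event(event: dict, schema: set[str]) -> tuple[dict, dict]:
    """Separate fields accepted by the schema from fields to reject."""
    accepted = {k: v for k, v in event.items() if k in schema}
    rejected = {k: v for k, v in event.items() if k not in schema}
    return accepted, rejected

event = {
    "timestamp": "2025-12-01T09:00:00Z",
    "severity": "ERROR",
    "message": "card authorization failed",
    "business_service": "payment",
    "service": "checkout",
    "component": "card-auth",
    "payment_method_type": "credit_card",
    "cLiEnT-iD": "12345",   # schema drift: not part of OWL, gets rejected
}

ok, bad = split_event(event, PAYMENT_SCHEMA)
print("indexed fields:", sorted(ok))
print("rejected fields:", sorted(bad))
```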
By combining operational metadata and our schema, we've achieved several critical outcomes. The reduced MTTR comes from data quality, but also from data standardization and the ability to connect different services and microservices together. Intentional consumption means people now think twice about which logs and metrics to emit rather than growing the volume of observability telemetry indefinitely. Finally, it's a strong foundation for scale, and you'll see how we use the operational metadata itself for sharding and for cost metering.
Sharding and Abstraction: Building a Flexible Multi-Region Architecture
That gets us to sharding. We've adopted a geolocation approach, which we'll get to in a moment, along with the logical partitioning we needed so that nothing becomes too massive. We also need to discuss abstracting the shards, because of the problems that shards create. To serve our 100+ markets, our business applications are distributed across nine regions globally, and most of our components and compute run in EKS with a supporting cast of services like databases and storage that are all monitored by CloudWatch.
Looking at logs first, we have Fluentd collecting logs directly from the pods in EKS and then forwarding that data to a regional OpenSearch domain in the same region the data originated from. We replicate that setup in every region we have around the world. Metrics have a very similar path. We use the Grafana Agent to scrape EKS metrics directly from each pod and then send them to an Amazon Managed Service for Prometheus workspace that lives in the same region. Finally, to bridge the gap with Amazon CloudWatch, we've leveraged YACE, Yet Another CloudWatch Exporter, which extracts data from CloudWatch and forwards it to that same regional Amazon Managed Service for Prometheus workspace so we can correlate infrastructure and application metrics, which takes us a step closer to a single pane of glass.
While geo-sharding handles physical location and already provides some partitioning, we also need a strategy to organize logically so that nothing within a region becomes too massive. For instance, if a Prometheus workspace supports a billion active time series and you need 1.1 billion, you need more than one workspace. For metrics, the Grafana Agent reads the OMD labels and uses them to make routing decisions. Rather than sending everything to the same huge backend, we shard by business service, service, or component as needed to get an optimal distribution. We apply the exact same logic to logs and traces, in this case with Fluentd forwarding data through Firehose to the relevant OpenSearch domain, where it's automatically indexed into the logical index for that business service, service, or component.
With this strategy, we prevent things from becoming too large, and we also prevent one business service, service, or component from creating a traffic spike that becomes a noisy neighbor and impacts the ingestion performance of another service that happens to live in the same ecosystem. For risky components, or things that are very spiky in nature, we tend to isolate them onto their own dedicated shards so they can only impact themselves.
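In practice this routing lives in Grafana Agent and Fluentd/Firehose configuration, not in application code. The sketch below illustrates only the decision logic: spiky or risky services get a dedicated shard, and everything else is spread across shared shards by hashing the OMD labels. All shard and service names are hypothetical.

```python
# Illustration of OMD-based shard routing: dedicated shards for isolated
# workloads, hash-based spreading across shared shards for everything else.
import hashlib

SHARED_SHARDS = ["amp-workspace-a", "amp-workspace-b", "amp-workspace-c"]
DEDICATED_SHARDS = {
    ("video-playback", "live-events"): "amp-workspace-live-events",  # spiky, isolated
}

def pick_shard(business_service: str, service: str, component: str) -> str:
    if (business_service, service) in DEDICATED_SHARDS:
        return DEDICATED_SHARDS[(business_service, service)]
    key = f"{business_service}/{service}/{component}".encode()
    index = int(hashlib.sha256(key).hexdigest(), 16) % len(SHARED_SHARDS)
    return SHARED_SHARDS[index]

print(pick_shard("video-playback", "markers", "markers-collector"))
print(pick_shard("video-playback", "live-events", "ingest"))
```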
I talked a little bit about shard limitations. The main issue with shards and the partitioning that we explored here is that you can no longer just have Grafana doing one query. It really has to know intimately the topology of your system in the backend and query the data from where it belongs. Knowing which workspace actually contains your data—whether it's the one in US East, Europe, or the one for video service—can lead to something that's quite complex and complicated to manage.
We have the same friction with logging and tracing, where the OpenSearch Dashboard also needs to know where a transaction or log was ingested in order to retrieve that specific message from the exact OpenSearch cluster that contains it. As you can imagine, this is not really a practical way of working. It gets tedious pretty quickly and works against the MTTR reduction that Vikram was talking about, which is what we're really after.
So we introduced a couple of abstractions. Starting with metrics at the top, we introduced Promxy, an open source aggregating proxy that sits in front of your data but, from Grafana's perspective, looks exactly like a Prometheus instance. Promxy acts as one giant metrics workspace, and under the hood, just like shards, it distributes the query across all of the different workspaces it knows about. It brings back all the data series, puts them back together, and feeds the result back to Grafana.
With Promxy, we effectively have a complete abstraction, so Grafana doesn't need to know how many workspaces there are behind the scenes, whether it's one or seventy. For logs and traces, we leveraged OpenSearch's native cross-cluster search capability. It does pretty much the same thing Promxy does: it distributes the request to the different underlying OpenSearch clusters behind the scenes and re-aggregates the results before handing them back to the OpenSearch Dashboard.
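Once cross-cluster search is configured, a caller addresses remote clusters with a cluster-alias prefix on the index pattern and the coordinating cluster fans the query out and merges the results. The sketch below shows what such a query can look like with the opensearch-py client; the endpoint, cluster aliases, and index names are placeholders, and authentication is omitted for brevity.

```python
# One federated query spanning playback log indices in two regional clusters,
# using the "<cluster-alias>:<index-pattern>" convention of cross-cluster search.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "search-central.example.com", "port": 443}],
    use_ssl=True,
)

response = client.search(
    index="us-east-1-logs:logs-video-playback-*,eu-west-1-logs:logs-video-playback-*",
    body={
        "query": {
            "bool": {
                "filter": [
                    {"term": {"component": "markers-collector"}},
                    {"range": {"timestamp": {"gte": "now-1h"}}},
                ]
            }
        },
        "size": 20,
    },
)
print(response["hits"]["total"])
```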
Not only have we made it easy and efficient for people to write queries, dashboards, alerts, or any kind of visualization on top of the data, but importantly, we've also completely disconnected the lifecycle of the backend (restructuring and rearchitecting how many shards there are and where they live) from consumption on the front end. We're able to change all of that without impacting consumers. A dashboard written against this architecture with three workspaces will work tomorrow with two or five workspaces as we scale, because it's still talking to the exact same Promxy instance. The same applies to the OpenSearch Dashboard, which only knows about the single cross-cluster search endpoint that does the federation.
To recap, this combination of geo-sharding, logical partitioning, and the abstractions we've built has given us a very flexible and scalable architecture. We've effectively future-proofed the platform. If the business decides to launch in a new region tomorrow, if we expect a specific service to see a tenfold increase in traffic, or if new capabilities generate a ton of logs and metrics, we can handle it simply by adding a new shard or launching a new cluster without disrupting any of the SREs or engineers who rely on Grafana and OpenSearch Dashboards to consume that data.
Cost Metering Framework: Financial Accountability in Shared Observability Environments
That brings us to the final strategy: cost metering. If you recognize the person on the screen, you know what this section is about. If you don't, it's Logan Roy from Succession, a show I would highly recommend watching on HBO. And if you still don't know who that is, think of the CFO or CEO of your company. Those people don't care about high cardinality or index fragmentation. They care about the bottom line. To survive at scale, we had to implement some kind of financial showback or chargeback.
This approach helps us understand where the costs are coming from and ensures that observability doesn't become the most expensive line item on the AWS bill. There's a specific challenge within AWS to do cost attribution in a shared environment. Imagine we have two services: payment and video playback. Both send logs and traces to one OpenSearch service backend. Similarly, they're producing metrics, most likely at a very different cardinality with completely different time series numbers, and they all end up in that same AMP workspace or AMP instance.
Today, the native strategy within AWS is to tag the instance. This means that if you navigate to the Cost Explorer, you're going to know the total amount that we spend for OpenSearch or for Prometheus, but not the detail of who's actually generating that cost. Is it payment? Is it video playback? How do we sort that out? To close this attribution gap, we've engineered a bespoke cost metering framework.
This framework essentially triangulates three different types of data. First, the cost; we can't live without that. We extract it from AWS and store it in an analytics platform. Second, the usage of both OpenSearch Service and AMP, which we get by collecting specific volume metrics such as log gigabytes, number of traces, and number of active time series in AMP; we then generate new time series that we store in Prometheus. Third, we allow teams to enter allocation rules defining how they want that spend to be partitioned and mathematically divided across them.
The magic of the reconciliation happens in the middle, in Lambda. It runs on a trigger and essentially merges all these inputs, mathematically dividing the cost we've collected for each instance so that whoever spends more pays more. We then create another time series that we store in Amazon Timestream. Finally, we built dashboards in Grafana, with a dedicated dashboard for each business service, so each one knows exactly the costs associated with its use of the observability platform.
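A minimal sketch of that reconciliation step follows, assuming the simplest possible allocation rule: split a shared instance's cost in proportion to each business service's measured usage. The handler shape, input names, and numbers are illustrative, not Warner Bros Discovery's actual implementation.

```python
# Proportional cost split: each business service pays its share of a shared
# instance's cost based on its measured usage volume.

def allocate_cost(total_cost: float, usage_by_service: dict[str, float]) -> dict[str, float]:
    """Divide total_cost across services in proportion to their usage share."""
    total_usage = sum(usage_by_service.values()) or 1.0
    return {
        service: round(total_cost * usage / total_usage, 2)
        for service, usage in usage_by_service.items()
    }

def lambda_handler(event, context):
    # In the real pipeline these inputs come from the extracted AWS cost data
    # and the volume metrics (log GB, trace counts, active series) stored in
    # Prometheus; the allocation rules may be more elaborate than a pro-rata split.
    total_opensearch_cost = event["total_cost"]           # e.g. monthly spend in USD
    usage = event["ingested_gb_by_service"]               # e.g. GB ingested per business service
    allocation = allocate_cost(total_opensearch_cost, usage)
    # The result would then be written back as a time series (the talk mentions
    # Amazon Timestream) and surfaced in per-business-service Grafana dashboards.
    return allocation

print(lambda_handler(
    {"total_cost": 42000.0,
     "ingested_gb_by_service": {"payment": 1200, "video-playback": 4800}},
    None,
))
```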
On top of that transparency and visualization, we've also built an accountability framework. It's not sufficient just to have the data available; there also needs to be accountability behind it. This has changed the culture a fair amount. We've moved from a conversation about why the AWS bill is so high to a more constructive discussion about how each team can contribute to managing the cost efficiently and appropriately.
Now that we have the three strategies implemented, we've built a platform for the future of streaming. By implementing the strategies we shared today, we established three powerful pillars. First, a scalable global observability solution leveraging managed services like Amazon Managed Service for Prometheus and Amazon OpenSearch Service. Second, we transformed our data into a strategic asset by standardizing operational metadata and schemas, ensuring that every metric, log, and trace is immediately valuable, searchable, and trusted by the engineers who rely on it. Third, we unlocked scalability through intelligent sharding and our cost metering framework, giving us an infrastructure that scales alongside our subscriber growth and lets us expand to new regions or handle traffic spikes when we need to.
If your organization is facing similar growth challenges or issues, I highly encourage you to look at those strategies.
Best Practices for Operating Observability Tools: Collection, Ingestion, and Optimization
Before we conclude the session, I want to hand it back to Vikram, who is going to talk a little bit more about how to operationalize a scalable observability platform. Thanks. Alright, we're going to wrap up by sharing some best practices for operating observability tools at scale. I've divided this into two sections. First, best practices for collecting the telemetry signals, namely metrics, logs, and traces, using the collector of your choice: OpenTelemetry for metrics and traces, and for logs perhaps Fluent Bit; you also have OpenSearch Ingestion pipelines. Second, how to ingest these telemetry signals into Amazon managed open source services in a reliable and durable way.
It's a no-brainer to start with sampling and batching, and many of you might be doing this already. With sampling, we keep only the percentage of data that we care about, which goes a long way when you're dealing with huge volumes of traces. With batching, we group the signals before we ingest them into the backend. By doing both, you make sure you're not overwhelming the backend, and you reduce the overall network overhead of ingesting these telemetry signals.
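The talk describes these as collector-level practices (for example the probabilistic sampler and batch processors in the OpenTelemetry Collector). As one concrete, hedged illustration, the sketch below applies the same two ideas with the OpenTelemetry Python SDK; the service name, endpoint, and sampling ratio are assumptions.

```python
# Sampling (keep ~10% of traces) and batching before export, using the
# OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    # honor the parent span's decision, otherwise sample 10% of trace IDs
    sampler=ParentBased(TraceIdRatioBased(0.10)),
    resource=Resource.create({"service.name": "checkout"}),
)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True),
        max_queue_size=2048,          # buffer size before spans are dropped
        max_export_batch_size=512,    # spans grouped per export call
        schedule_delay_millis=5000,   # flush interval
    )
)
trace.set_tracer_provider(provider)
# gzip compression for the OTLP exporter can be enabled with the
# OTEL_EXPORTER_OTLP_COMPRESSION=gzip environment variable.
```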
When it comes to metrics, labels can carry high-cardinality information. If you think a label isn't adding much value, drop it. The OpenTelemetry Collector and Prometheus both have action configuration for this; it's very straightforward: you specify the label and simply drop it. That keeps your time series database smaller and reduces the network traffic from your collector to the backend managed services.
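The talk describes doing this in the collector's relabel/attribute configuration; as an SDK-side illustration of the same idea, the sketch below uses an OpenTelemetry metrics View that keeps only the attribute keys you care about and drops everything else before export. Instrument and attribute names are assumptions.

```python
# Drop high-cardinality metric attributes at the SDK with a View allow-list.
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.metrics.view import View
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

keep_low_cardinality = View(
    instrument_name="http.server.request.duration",
    # keep route and status code; anything else (user IDs, request IDs, ...)
    # is dropped before aggregation and export
    attribute_keys={"http.route", "http.response.status_code"},
)

provider = MeterProvider(
    metric_readers=[PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint="http://otel-collector:4317", insecure=True)
    )],
    views=[keep_low_cardinality],
)

meter = provider.get_meter("checkout")
latency = meter.create_histogram("http.server.request.duration")
# "user.id" is not in attribute_keys, so it is dropped from the exported series
latency.record(0.123, attributes={"http.route": "/play",
                                  "http.response.status_code": 200,
                                  "user.id": "12345"})
```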
Many of our collectors have these best practices built in. Compression is one of them: OpenTelemetry supports gzip compression out of the box, so you can compress your telemetry signals before you send them out. You can also enrich your telemetry by adding useful metadata, such as the deployment, the pod name, or the namespace name, which makes the signals you're collecting more valuable.
When it comes to logs, you definitely have to monitor and optimize your buffer usage. Configure a dead letter queue so that if some requests fail you don't drop them; instead, you move them to the dead letter queue and retry them or troubleshoot what happened. Finally, and most importantly, monitor your monitoring tool. OpenTelemetry exposes a lot of its own metrics out of the box that you should keep an eye on; make sure your collector has sufficient resources configured and some kind of autoscaling strategy.
Lesson two is about how to better ingest these telemetry signals into Amazon Managed Service for Prometheus and Amazon OpenSearch Service. Again, these are out-of-the-box features. For complex queries, it is highly recommended to use recording rules so that you don't run an expensive query on demand and overwhelm your Prometheus backend. Instead, the recording rules are evaluated at regular intervals and materialized as separate time series.
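As a hedged sketch of how this looks in practice, the example below loads a recording rule into an Amazon Managed Service for Prometheus workspace with boto3. The workspace ID, rule group name, and expression are illustrative; the point is that the expensive aggregation is evaluated on a schedule and stored as its own series instead of being recomputed on every dashboard refresh.

```python
# Upload a Prometheus recording-rule group to an AMP workspace via boto3.
import boto3

RULES_YAML = b"""
groups:
  - name: api-latency
    rules:
      - record: job:http_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
"""

amp = boto3.client("amp", region_name="us-east-1")
amp.create_rule_groups_namespace(
    workspaceId="ws-12345678-aaaa-bbbb-cccc-1234567890ab",  # hypothetical workspace ID
    name="api-latency-rules",
    data=RULES_YAML,
)
```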
Amazon Managed Service for Prometheus also announced label-based active time series limits for multi-tenant environments. Where you have mission-critical and non-mission-critical applications running on the same clusters, you can specify a cap on how many active time series a specific application is allowed to use, scoped by labels. For non-mission-critical applications, maybe I only need 20,000 time series; for mission-critical ones, maybe I need 1 million. You can scope them based on the labels.
You can also implement query controls and query logging. These let you specify how many samples a query is allowed to traverse. You wouldn't want a runaway query scanning a huge number of samples, so you can cap the number of samples a query may use.
When it comes to Amazon OpenSearch Service, the first thing you have to do is calculate your storage requirements. Account for index overhead, replicas, and your data growth, and make sure you select the right instance types and storage. You absolutely must have a sharding strategy; a general rule of thumb is to keep fewer than 25 shards per gigabyte of heap space. You also have to control the ingest flow and buffering, and OpenSearch Ingestion pipelines can take care of that.
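A back-of-the-envelope sizing sketch follows. The ~10% indexing overhead and the reserved-space factors are common rules of thumb and are assumptions here (check the current service guidance for your workload); the shard bounds follow the numbers above, roughly 10 to 50 GiB per primary shard and fewer than 25 shards per GiB of JVM heap.

```python
# Rough storage and shard-count estimate for an OpenSearch log domain.
def opensearch_sizing(daily_source_gib: float, retention_days: int,
                      replicas: int = 1, heap_gib_per_node: float = 32.0) -> dict:
    source = daily_source_gib * retention_days
    # replicas, ~10% indexing overhead, ~5% reserved space, ~20% service overhead (assumed factors)
    minimum_storage = source * (1 + replicas) * 1.10 / 0.95 / 0.80
    shard_count = max(1, round(source / 30))           # target ~30 GiB per primary shard
    max_shards_per_node = int(25 * heap_gib_per_node)  # from the rule of thumb above
    return {
        "minimum_storage_gib": round(minimum_storage),
        "suggested_primary_shards": shard_count,
        "max_shards_per_node": max_shards_per_node,
    }

print(opensearch_sizing(daily_source_gib=500, retention_days=14, replicas=1))
```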
We recommend using an OpenSearch Ingestion pipeline in tandem with your Amazon OpenSearch Service domain so that you can control the log flow. Finally, optimize bulk requests and compression: use the bulk API so that you can batch your log writes and compress them before sending them to the backend Amazon OpenSearch Service.
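For illustration, here is what batching and compression look like with the opensearch-py client: http_compress gzips request bodies, and the bulk helper groups documents into _bulk calls. Endpoint, auth, and index names are placeholders, and in the architecture above this work is typically done by the log collector or an OpenSearch Ingestion pipeline rather than hand-written code.

```python
# Batched, compressed writes to Amazon OpenSearch Service with opensearch-py.
from opensearchpy import OpenSearch, helpers

client = OpenSearch(
    hosts=[{"host": "search-logs.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
    http_compress=True,   # gzip request bodies before they leave the host
)

def log_actions(events):
    for event in events:
        yield {"_index": "logs-video-playback-2025.12.01", "_source": event}

events = [{"timestamp": "2025-12-01T09:00:00Z", "severity": "INFO",
           "message": "segment served", "component": "markers-collector"}] * 1000

# chunk_size controls how many documents go into each _bulk request
helpers.bulk(client, log_actions(events), chunk_size=500)
```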
I also wanted to highlight some of our recent announcements we made on AWS managed open source services this year. If any of you have questions on these features, we'll be more than happy to answer your questions in the hallway. We covered some of those important announcements as part of our best practices discussion. Thank you all for stopping by and attending our session.
This article is entirely auto-generated using Amazon Bedrock.