Kazuya

Posted on Dec 5, 2025 • Edited on Dec 8, 2025

AWS re:Invent 2025 - From Metrics to Management: Practical Observability on EKS (DEV202)

Overview

In this video, Dale Orders and Emil Lubikowski present practical observability on EKS using AWS managed services. They explain observability as a continuous cycle of detecting, investigating, remediating, and assessing application health through four signals: metrics, traces, logs, and profiles. The session demonstrates how Amazon Managed Service for Prometheus and Amazon Managed Grafana work together to monitor containerized workloads, using OpenTelemetry collectors to aggregate data. Best practices include using PrivateLink, IAM for security, adjusting scrape intervals, and creating dashboards that tell clear stories. A live demo shows configuring scrapers, setting retention periods, establishing IAM roles, and building visualizations in Grafana to achieve SLO and SLA compliance targets.

This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Understanding Observability: The Four Signals and Their Role in Application Health

Welcome to our session called From Metrics to Management: Practical Observability on EKS. My name is Dale Orders, and I'm here with Emil Lubikowski, who will be joining me in presenting this session today. We have an agenda for today's session that you can see on the screen. We're going to start by introducing ourselves, then discuss what observability is with a brief introduction to the concept. We'll then explore how we can make our applications observable and delve into two AWS services: Amazon Managed Service for Prometheus and Amazon Managed Grafana. We'll conclude our presentation with a short demo.

My name is Dale Orders, and I'm a software engineer at Five9. I'm joined by Emil Lubikowski, who is an AWS DevOps engineer. We're both community builders, and we're very excited to be here today with you to discuss observability.

We're going to first define exactly what we mean when we talk about observability. Observability is about extracting actionable insights—insights that you can generate to assess and improve the performance, health, and behavior of your application. That's a really important point: we want to look at the performance, behavior, and health of the application.

Observability should be a continuous cycle. It's not something you do one time and then finish. It's something you should be doing in continuous iteration. As you see on the screen, we start by detecting the specific metrics we are looking for. We then have to investigate what those metrics mean. We have to remediate any issues we find through our investigations. Finally, we have to assess whether those remediations have been successful. It's a continuous cycle that we must do to ensure observability is truly effective.

Why is observability important? There are many reasons. For us, it enables us to gain insight and visibility into the health of our application. It means we are able to better troubleshoot any issues that arise. Importantly, it allows us to deliver a superior customer experience, and that is something very important to us since we build for the customer. Finally, it helps us control our costs, and that too is equally important.

When we say observability, what signals are we talking about? We are talking about metrics. We can think about metrics as answering the question: do I have a problem? It also includes traces. Traces answer the question: where is the problem? It also includes logs, which answer: what is causing the problem? Finally, it includes profiles, which answer: how is the code behaving? Profiles give you insight into the specifics around your codebase. Those are the four observability signals we need to be aware of when we conduct observability or design an observability system.

How exactly does that relate to our application? When we have metrics, we're thinking about questions like: why is my application 25% slower than last week? We have to be able to interpret that information and understand that it refers to a metric in our application. With traces, we ask: what are the dependencies for this service? We're looking at how one specific service relates to the performance of another service in our application. It also includes logs. For example: why won't my database start after this update? There we are trying to troubleshoot an issue that has arisen with our application. And it includes profiles. For instance: could there be a memory leak in my code?

All of these observability signals allow us to better track and achieve our SLO and SLA compliance targets. It's not something we do simply for the sake of it; we do it because it has a broader purpose behind it.
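To make the link between signals and SLO targets concrete, here is a minimal sketch of an availability check against an SLO. It is not from the talk: the function name, the 99.9% target, and the request counts are all illustrative assumptions.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compare a measured availability SLI against an SLO target (illustrative only)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # SLI: fraction of requests that succeeded in the measurement window.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: how many failures the SLO tolerates in the same window.
    allowed_failures = (1.0 - slo_target) * total_requests
    budget_remaining = 1.0 - failed_requests / allowed_failures if allowed_failures else 0.0
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_remaining": budget_remaining,
    }


# Hypothetical numbers: a 99.9% target, one million requests, 400 failures.
report = error_budget_report(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(report)
```

With these made-up numbers the SLO is met and roughly 60% of the error budget is left. In practice the request counts would come from the metrics signal rather than being hard-coded.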

Making Applications Observable with Amazon Managed Service for Prometheus and Amazon Managed Grafana

But then the question becomes: how do we make our applications observable? We know observability is important, but how do we actually introduce observability into our application? One way we can do that is by using a particular AWS service known as Amazon Managed Service for Prometheus. This is a fully managed, serverless, Prometheus-compatible monitoring service.

It is based on Prometheus, which is an open source project, but the managed service itself is built and operated by AWS. It provides high availability, it is multi-AZ by design, and it gives you greater insight into the metrics produced by your application. It uses the Prometheus data model and query language, PromQL, which allows for seamless monitoring of your containerized workloads. We will see that in the demo very shortly.
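As a rough illustration of what a PromQL query against the service looks like, the sketch below builds the URL for an instant query on a workspace's Prometheus-compatible query endpoint. The region, workspace ID, metric name, and helper name are hypothetical, and the sketch only builds the URL. A real request would also need to be SigV4-signed with IAM credentials.

```python
from urllib.parse import urlencode


def amp_query_url(region: str, workspace_id: str, promql: str) -> str:
    """Build the instant-query URL for an Amazon Managed Service for Prometheus workspace."""
    base = f"https://aps-workspaces.{region}.amazonaws.com/workspaces/{workspace_id}"
    # The PromQL expression is passed URL-encoded in the 'query' parameter.
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical region, workspace ID, and metric name, for illustration only.
url = amp_query_url(
    "us-east-1",
    "ws-EXAMPLE",
    "sum(rate(http_requests_total[5m])) by (namespace)",
)
print(url)
```

The query itself asks for the per-namespace request rate over the last five minutes, which is the kind of expression you would later paste into a Grafana panel.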

What does that look like in terms of our architecture? You'll see here on the screen we have our application code, which we instrument using the Prometheus SDK. We expose the metrics to the Prometheus server, which is running on the Kubernetes cluster. We're going to be using OpenTelemetry, so you can use either an OpenTelemetry (OTel) collector or an AWS Distro for OpenTelemetry collector to collect and aggregate the metrics. Then you can use a remote write operation to send those metrics to Amazon Managed Service for Prometheus.
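To show what "exposing metrics" means in practice, here is a small standard-library sketch that renders a counter in the Prometheus text exposition format, the format a scraper reads from an application's metrics endpoint. The `RequestCounter` class and the metric name are illustrative stand-ins. A real application would normally use an official Prometheus client library rather than hand-rolling this.

```python
from collections import Counter


class RequestCounter:
    """Tiny stand-in for a Prometheus counter with a single 'path' label."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values = Counter()

    def inc(self, path: str) -> None:
        # Counters only ever go up; one increment per handled request.
        self._values[path] += 1

    def render(self) -> str:
        # HELP and TYPE lines describe the metric, followed by one line per label set.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for path, value in sorted(self._values.items()):
            lines.append(f'{self.name}{{path="{path}"}} {value}')
        return "\n".join(lines) + "\n"


# Hypothetical metric and request paths.
requests_total = RequestCounter("http_requests_total", "Total HTTP requests handled.")
for path in ["/", "/cart", "/"]:
    requests_total.inc(path)
print(requests_total.render(), end="")
```

The rendered text is what the collector scrapes and then forwards via remote write.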

If we look at that in a little bit more detail, we see that we have our collector. This collector uses receivers and exporters to receive, process, and export the data. In this case, I am defining two pipelines. On the left-hand side, I have a traces pipeline, which exports the data to AWS X-Ray. On the right-hand side, I have a metrics pipeline, which exports the data to Amazon Managed Service for Prometheus. This is one way that you could design your application to be more observable.
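The two-pipeline layout described above can be sketched as a data structure. Collector configuration is normally written in YAML. The Python dictionary below mirrors that shape so it can be checked programmatically. The component names (`otlp`, `prometheus`, `batch`, `awsxray`, `prometheusremotewrite`) are common collector components, but the exact set used in the talk is not shown, so treat this as an assumption. Component settings such as endpoints and authentication are deliberately left empty.

```python
# Assumed layout: one traces pipeline to X-Ray, one metrics pipeline to remote write.
collector_config = {
    "receivers": {"otlp": {}, "prometheus": {}},
    "processors": {"batch": {}},
    "exporters": {"awsxray": {}, "prometheusremotewrite": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["awsxray"],
            },
            "metrics": {
                "receivers": ["prometheus"],
                "processors": ["batch"],
                "exporters": ["prometheusremotewrite"],
            },
        }
    },
}


def undefined_components(config: dict) -> list:
    """Return a message for every pipeline entry that references an undefined component."""
    problems = []
    for pipeline_name, pipeline in config["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            for component in pipeline.get(kind, []):
                if component not in config.get(kind, {}):
                    problems.append(f"{pipeline_name}: {kind[:-1]} '{component}' is not defined")
    return problems


print(undefined_components(collector_config))  # an empty list means every reference resolves
```

Keeping each signal in its own pipeline means a change to the metrics exporter cannot accidentally affect trace delivery.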

What are some of the best practices when it comes to using this specific service? You should be using PrivateLink and IAM so that your metrics are sent securely and only from trusted sources. You can reduce your costs by increasing the scrape interval and by adjusting the relabel config filters to better suit the specific requirements of your application.

You can use a retention period that is appropriate for the data that you're storing. If you're not going to be needing your data for long periods of time, then it's more appropriate to reduce that retention period. You can run more than one container to ensure high availability of your pipeline. Finally, you can use consistent identifiers to enable correlation of data, so you can better correlate data from one pipeline to the next.
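The scrape-interval advice can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes every active series produces exactly one sample per scrape and that ingestion cost grows with the number of samples. The series count of 10,000 is a made-up figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(active_series: int, scrape_interval_s: int) -> int:
    """Samples ingested per day if every series is scraped once per interval."""
    return active_series * (SECONDS_PER_DAY // scrape_interval_s)


# Hypothetical workload with 10,000 active series.
for interval in (15, 30, 60):
    print(f"{interval}s interval: {samples_per_day(10_000, interval):,} samples/day")
```

Doubling the interval halves the ingested volume, so the trade-off is purely between cost and how quickly a change shows up on a dashboard.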

With that said, there is one other service that you can use, and I'm going to pass it to Emil to tell you about that specific service.

Thanks for the intro. The next fully managed service is Amazon Managed Grafana, abbreviated AMG. Grafana is a tool you possibly already know, and some of you have already been using it. As I mentioned, AWS fully manages it for us. It's a widely used open source analytics and visualization platform. We'll be showing how to create some visualizations with it.

It is also used for alerting and for visualizing metrics and traces. It provides an interface for us. Amazon Managed Grafana works on top of the setup we've already described. We have our application, and we are scraping data with OpenTelemetry or AWS Distro for OpenTelemetry. Then we send it to Amazon Managed Service for Prometheus and configure Grafana to display that information to the user and to us.

As previously mentioned, we can use OpenTelemetry and AWS Distro for OpenTelemetry to get those metrics in a few different ways. We can push that data to multiple services such as OpenSearch or OpenSearch Cloud. We can also push to AWS X-Ray. However, we will focus on the approach where we send metrics to Amazon Managed Service for Prometheus and then visualize them in Grafana.

Practical Demo: Implementing Best Practices and Visualizing Metrics on EKS

For both Amazon Managed Service for Prometheus and Amazon Managed Grafana, there are best practices we follow. We implement these practices to make our dashboards useful and clear, and to answer simple questions. One best practice I really focus on is that when you create dashboards, each one should tell a clear story and answer a specific question.

Now let's move to the demo. We are going to look at our web-quickstart cluster, which has some already provisioned resources and observability. In the observability section at the top, we already see the Prometheus metrics section. We can add a scraper here. We can also create a new workspace and add other settings, but we mostly focus on the scraper configuration, a YAML file where we apply the best practices we mentioned, including the scrape interval.

We also have relabel configs for label mapping between agents, and keep filters for selecting only the labels we are really interested in. After creating the scraper, we can see that it is already visible in Amazon Managed Service for Prometheus, along with the workspace we created. In our workspace, we can write and query the data that has already been collected. We can also set the retention period here, which is really important from the cost and management perspective. Now we can set up the ingestion for our workspace.
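To illustrate what a keep filter does, here is a simplified simulation of a relabel rule with a `keep` action: samples whose source label does not fully match the regular expression are dropped before ingestion. The sample data and function name are hypothetical. Prometheus uses RE2 syntax with fully anchored matching, and Python's `re.fullmatch` only approximates that for simple patterns like this one.

```python
import re


def apply_keep(samples: list, source_label: str, regex: str) -> list:
    """Keep only samples whose source label fully matches the regex (like action: keep)."""
    pattern = re.compile(regex)
    return [s for s in samples if pattern.fullmatch(s.get(source_label, ""))]


# Hypothetical scraped series, identified by their label sets.
samples = [
    {"__name__": "http_requests_total", "namespace": "web"},
    {"__name__": "go_gc_duration_seconds", "namespace": "web"},
    {"__name__": "http_request_duration_seconds_bucket", "namespace": "web"},
]

# Keep only metrics whose name starts with "http_".
kept = apply_keep(samples, "__name__", r"http_.*")
print([s["__name__"] for s in kept])
```

Everything that is filtered out here is never ingested, which is why keep filters are a cost control as well as a way to reduce noise.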

The AWS documentation describes these steps well. There are two main steps. First, we need to create an IAM role, which allows the services to communicate with each other. The next step is to verify the role: after running all the steps, we check that it exists and that we can use it. If it is there, we can see the permissions allowing us to work with this service. There is another step for deploying the agents. Once again, there is pretty extensive documentation available.
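The talk does not show the role's policy, so the following is only a sketch of what a least-privilege policy for writing metrics to a workspace might look like. The account ID, region, and workspace ID are placeholders, and the helper name is hypothetical. Follow the AWS documentation for the exact permissions and trust relationship your setup needs.

```python
import json


def remote_write_policy(region: str, account_id: str, workspace_id: str) -> dict:
    """Build an IAM policy document that allows remote write to a single workspace."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # aps:RemoteWrite is the action used when ingesting metrics.
                "Action": ["aps:RemoteWrite"],
                "Resource": f"arn:aws:aps:{region}:{account_id}:workspace/{workspace_id}",
            }
        ],
    }


# Placeholder identifiers, for illustration only.
policy = remote_write_policy("us-east-1", "111122223333", "ws-EXAMPLE")
print(json.dumps(policy, indent=2))
```

Scoping the resource to one workspace, rather than using a wildcard, keeps the role in line with the "trusted sources" best practice mentioned earlier.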

Even if you are doing this for the first time, the documentation makes it easy to practice and set things up quickly. For this purpose, we just used the default naming so it will be quickly available and already there in our EKS deployment.

As we mentioned, the setup works with Prometheus agents that must be deployed on our cluster alongside our app to write our metric data to our AMP service. AMP ingests that data, and Grafana reads it from there.

Now we have Grafana, which is another fully managed and simple-to-configure service in AWS. There are just a couple of things to configure here. We can focus on our data source, which will be AMP. All we have to create is a workspace. After creation, we see that it's already done. Now we have to configure Grafana itself, which is provisioned for us after these couple of steps.

After logging in, the first thing to do is set the data source. As you can see, it's already been done previously, but we can add a new one. The only thing we have to focus on here is the connection; the other settings are straightforward. After deployment, we just have to point to our AMP workspace, where we are getting the data from. For authentication, we use SigV4, which is usually set by default. It uses the IAM role to provide sufficient permissions.

The last step is creating the visualizations on our dashboards. Before we can do that, let's focus once again on the deployments. On our deployments, we can see there are some labels like namespace and image selectors. You can see that there is an up 2000 and up 2048. In the very basic visualization view, you can see this is already created. But let's edit it and play a little with those filters. The labels we saw previously are here. Let's clean it up a little bit. You can refresh and test your queries, and see the difference with and without that filter. It looks like it is showing some historic data that we would like to filter out because it only adds clutter. This way, we can experiment further and add more custom, more specific labels.
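The label filtering shown in the demo corresponds to adding label matchers to a PromQL selector. As a small sketch, the helper below builds a selector with and without a filter. The function name and the `namespace="web"` value are hypothetical, and `up` is used here only because it is the standard Prometheus metric that reports whether a scrape target is reachable.

```python
def _escape(value: str) -> str:
    """Escape backslashes and double quotes so the value is a valid PromQL string."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def selector(metric: str, **labels: str) -> str:
    """Build a PromQL instant-vector selector with equality matchers for each label."""
    if not labels:
        return metric
    matchers = ",".join(f'{key}="{_escape(value)}"' for key, value in sorted(labels.items()))
    return f"{metric}{{{matchers}}}"


print(selector("up"))                    # every target, including stale or unrelated ones
print(selector("up", namespace="web"))   # only targets in the hypothetical "web" namespace
```

Comparing the two queries in a panel is the same with-and-without-filter check performed in the demo.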

So this way, we can see that we can pretty easily create observability using only fully managed AWS services. Prometheus and Grafana are fully managed by AWS, so, as with other managed services, operating them is really not our problem. At this point, we can really focus on our observability, on our SLO, SLA, and SLI.

Great. And as Emil said, once we have achieved those metrics, we can ensure that our customer remains happy, which is ultimately the goal of observability: making sure that we can deliver a quality product. With that being said, we want to thank you for coming to today's talk. If you wish to connect with either of us, our LinkedIn profiles are on the screen here with the QR code. If you have any other questions, you're welcome to stay and approach either of us. We are very happy to take any questions that you may have. Thank you.

This article is entirely auto-generated using Amazon Bedrock.