🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Secure Amazon ECS observability with CDK and Grafana (DEV338)
In this video, Chibuike Nwachukwu presents a secure observability solution for Amazon ECS using CDK and Grafana. He explains how he addressed the challenge of enabling non-technical healthcare staff to monitor application logs without relying on CloudWatch. The architecture leverages AWS Client VPN for secure access, ADOT sidecars for collecting metrics and traces to AWS X-Ray and Prometheus, and FireLens for sending logs to Loki. Key highlights include implementing security-first design with private subnets, using OpenTelemetry for vendor-neutral instrumentation, and automating infrastructure deployment with AWS CDK and GitHub Actions. The demo showcases auto-instrumentation, custom metrics collection, and error tracing through Grafana dashboards, demonstrating how the system identifies and diagnoses a simulated 404 error in DynamoDB operations.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
The Challenge: Making Application Observability Accessible to Non-Technical Users
Good day everyone. I'll be giving a talk on secure Amazon ECS observability with CDK and Grafana. As I mentioned, my name is Chibuike Nwachukwu, so let's begin. You may ask yourself what prompted this talk and why I came up with this particular topic. Sometime in June, I sat down with my CTO and he came to me with a particular problem. He said, "Chibuike, our non-technical staff are having issues seeing what's happening within our applications."
To give you context, I worked at a healthcare startup, so they wanted to see how patients were signing up and interacting with doctors during live calls. However, these non-technical users had to interact with CloudWatch to see logs, and they were having issues doing that. They had to always go back to the technical team to really understand what was happening. So he really wanted me to try and fix this particular problem and explore the open source space to come up with a solution that would enable non-technical users to see live happenings, logs, and other metrics. I went down the rabbit hole of figuring out the particular tools I should use. I chose Loki and Grafana. The reason behind this was to find a solution that complemented CloudWatch but still gave me the opportunity to have real insight, which was a very important challenge that the CTO mentioned.
I saw that Grafana had beautiful, interactive, and intuitive dashboards that allow non-technical users to really explore and understand what's happening. It ensured vendor independence and had open source flexibility. So we went on to architect and build the whole Grafana Loki dashboard, and we have non-technical users actually seeing what's happening. To go beyond what was required at work, I wanted to have what I call the full observability framework. At that point, we were already focusing on logs. I wanted to add metrics and traces. What better way to do that than with open source, but also to follow the Infrastructure as Code approach? I wanted to use AWS CDK, and based on my own experience, I found that using Infrastructure as Code actually reduced development time and time to production.
What are we going to be looking at today? This is a high-level architectural diagram. We'll go deeply into this within the talk, but to give you an idea, we're going to be talking about security, AWS Client VPN, Amazon ECS, Prometheus, and X-Ray for application signals. Who exactly am I? As I mentioned, I'm Chibuike Nwachukwu, a proud Nigerian. I work as a full stack engineer at Micro1.ai, building software solutions for clients. Sometime in 2021, I became tired of always sitting in front of my laptop, and I became a certified chef. I had to add that image because most of my friends actually doubted that I could be a certified chef.
This is my quest for knowledge. I went on to take up the AWS certification challenge for two years and four months, became fully certified, and got the AWS golden jacket. You'll find me generally writing articles, traveling, and speaking across the world. What are we going to be exploring today? We have seen what actually prompted this talk and why this talk exists. We will next go into observability in detail, the AWS way. Next, we'll look at the three core pillars that actually guided this talk: security, CDK, and Grafana. Then we'll look at a demo on how the application works. Finally, we'll look at actions and resources for us to further explore on our own.
Before we really go into this talk, we need to understand exactly what observability is. I see it as the ability to see what's happening within a system from its external outputs. It gives us a better view into the inner workings of a particular system without actually being inside that system. It allows us to ask questions about how our application is responding to user interactions or what exactly is happening in our application. The three core pillars, in terms of how you emit telemetry signals, are logs, metrics, and traces.
Observability is a very broad topic, and that's why we have OpenTelemetry: a standard framework for instrumenting, collecting, and exporting telemetry data. Why it's so important is that it decouples our application from any particular backend for observability data. It allows us to instrument once within our application and send the data to multiple different destinations. It could be AWS services like X-Ray or Amazon Managed Prometheus, or it could be other third-party AWS partners. Because of this broad adoption, OpenTelemetry is the second most popular project in the CNCF space, just after Kubernetes, and it's really community-driven and open source.
So what are the three core pillars of observability? I mentioned logs, metrics, and traces. What exactly are logs? We all build applications likely, and we see logs as mostly the default. Logs allow us to know what exactly happened within our system.
We see a log as a time-stamped text record of events. Metrics allow us to understand how many times a particular event occurred within our applications; you can see them basically as quantitative measurements of a particular service. Tracing, on the other hand, gives us a more holistic view of how a particular incident took place within our application. I call it the ability to show us how a request flows through our system.
Building a Secure Observability Framework with AWS ADOT, CDK, and Grafana
So we have now heard about observability. We have seen how OpenTelemetry helps make observability more accommodating. But how does AWS come into play in this particular space? AWS has what they call the AWS Distro for OpenTelemetry, also called ADOT for short. Basically, what ADOT does is provide a secure, production-ready, open-source distribution of OpenTelemetry supported by AWS. If you are running it on Amazon ECS, it will run as a sidecar close to your application. If it is on Lambda, it will run as a Lambda layer.
It allows us to auto-instrument our applications by adding small code snippets; it then automatically captures a lot of detail and pushes it down to a particular AWS service. So it is a very simple, streamlined way to deploy and push metrics, logs, and traces to AWS services. Like I mentioned, it collects traces and application metrics, correlates them, and allows us to move them to multiple AWS partner solutions or AWS X-Ray.
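To make this concrete, here is a minimal CDK sketch of how an ADOT collector sidecar could be added to an ECS task definition. This is not the stack from the talk; the construct names and sizes are placeholder assumptions, and the collector image is the public ADOT distribution on Amazon ECR Public.

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import { Construct } from 'constructs';

export class ObservabilityTaskStack extends Stack {
  public readonly taskDef: ecs.FargateTaskDefinition;

  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Task definition that will hold the application plus its sidecars.
    this.taskDef = new ecs.FargateTaskDefinition(this, 'AppTaskDef', {
      cpu: 512,
      memoryLimitMiB: 1024,
    });

    // ADOT collector sidecar: receives OTLP data from the application over the
    // task-local network and forwards traces to X-Ray and metrics to Prometheus.
    this.taskDef.addContainer('adot-collector', {
      image: ecs.ContainerImage.fromRegistry(
        'public.ecr.aws/aws-observability/aws-otel-collector:latest'
      ),
      essential: true,
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'adot' }),
    });
  }
}
```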
Let us go into details on how this architecture works. The first core pillar I want to discuss is the user. If you look at the right-hand side, we have two particular subnets. We have the private subnet and the public subnet. A user interacts with the public-facing application and sees the public URL and interacts with the login application. The login application has two sidecars at this point. It has the ADOT collector that moves traces and metrics to AWS services, and it also has a sidecar for FireLens that moves logs to Loki, which runs in a particular private subnet.
This same login application also interacts with AWS SQS as well as AWS DynamoDB. The second layer is when an authorized user, mostly a developer, tries to see what is happening within the Grafana space. They connect using the AWS Client VPN through the private subnets, and they can then access the Amazon ECS, Grafana, and Loki instances running securely. Both of them are connected to both Amazon EFS and Amazon S3 for persistent storage.
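As a rough sketch of that subnet layout, the CDK `Vpc` construct can describe the public and private halves of the architecture in a few lines. The names and CIDR sizes below are illustrative assumptions, not the exact values from the talk.

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

export class NetworkStack extends Stack {
  public readonly vpc: ec2.Vpc;

  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Public subnets host the internet-facing login application; Grafana and
    // Loki stay in private subnets, reachable only through the Client VPN.
    this.vpc = new ec2.Vpc(this, 'ObservabilityVpc', {
      maxAzs: 2,
      subnetConfiguration: [
        { name: 'public', subnetType: ec2.SubnetType.PUBLIC, cidrMask: 24 },
        { name: 'private', subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS, cidrMask: 24 },
      ],
    });
  }
}
```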
Let us look into the details. First of all, the security-first approach. We mentioned AWS Client VPN, which gives us a secure, encrypted tunnel into the private subnet. We also implemented what I mentioned earlier, where we have private subnets and public subnets. Segmenting the various subnets gives us security, and we also leverage security groups for least-privilege access. IAM roles are likewise used to make sure particular services have only the required permissions to access other AWS services.
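Sketching those controls in CDK might look like the following. The certificate ARNs, the peer security groups, and the VPN client CIDR are assumptions standing in for values the talk does not spell out; port 3100 is Loki's default HTTP port.

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

// Wires up the Client VPN and a least-privilege security group for Loki.
export function addSecureAccess(
  scope: Construct,
  vpc: ec2.Vpc,
  serverCertArn: string,
  clientCertArn: string,
  appSg: ec2.ISecurityGroup,
  grafanaSg: ec2.ISecurityGroup,
): ec2.SecurityGroup {
  // Encrypted tunnel into the private subnets for authorized developers.
  vpc.addClientVpnEndpoint('DevVpn', {
    cidr: '10.100.0.0/22',                 // client address pool (assumption)
    serverCertificateArn: serverCertArn,
    clientCertificateArn: clientCertArn,
  });

  // Only the application tasks and Grafana may reach the Loki port;
  // nothing is exposed to the internet.
  const lokiSg = new ec2.SecurityGroup(scope, 'LokiSg', { vpc, allowAllOutbound: true });
  lokiSg.addIngressRule(appSg, ec2.Port.tcp(3100), 'FireLens log shipping from app tasks');
  lokiSg.addIngressRule(grafanaSg, ec2.Port.tcp(3100), 'Grafana queries to Loki');
  return lokiSg;
}
```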
I want to dwell more on the availability and security. We have two layers. Like I mentioned, the ADOT sidecar runs close to your application. It collects metrics and traces and moves them to AWS X-Ray and Prometheus. For the first part, the data moves from the sidecar to AWS X-Ray, and this is secured using IAM roles to make sure no one can actually access the data going through; that part is secured by AWS. The second layer of security happens within the application, which accesses the ADOT collector on localhost 1337. Because this connection never leaves the local container network, there are no interactions from outside the network. This is a secure connection moving traces and metrics through localhost 1337 down to the collector, which then moves them straight up to the AWS services upstream.
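One way to express that wiring is an environment variable on the application container pointing the OpenTelemetry SDK at the sidecar over localhost. This is a hedged fragment, not the talk's code: the image name is a placeholder, `taskDef` is the task definition from the earlier sketch, and the port matches the one mentioned in the talk (the ADOT collector's default OTLP gRPC receiver otherwise listens on 4317).

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';

// Application container added to the task definition from the earlier sketch.
// OTEL_EXPORTER_OTLP_ENDPOINT is the standard variable the OpenTelemetry SDK
// reads to locate its exporter endpoint; here it points at the sidecar.
taskDef.addContainer('app', {
  image: ecs.ContainerImage.fromRegistry('my-org/login-app:latest'), // placeholder image
  environment: {
    OTEL_EXPORTER_OTLP_ENDPOINT: 'http://localhost:1337', // port as described in the talk
  },
  logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'app' }),
});
```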
This is an example of a FireLens sidecar. The particular Amazon ECS task is defined using a task definition, and we see how we add that sidecar next to our application. If you look at the URL closely, you will see we are actually accessing a private route within the internal network. The local.internal.com hostname can only be resolved when you are connected to the particular VPC. Then we can move logs securely to Loki.
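A hedged CDK sketch of that sidecar is below. The Fluent Bit image is the public AWS for Fluent Bit distribution, the Loki host shown is a placeholder for the private, VPC-only hostname used in the talk, and the `loki` output options follow Fluent Bit's Loki plugin; `taskDef` is again the task definition from the earlier sketch.

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';

// FireLens (Fluent Bit) log router runs as another sidecar in the same task.
taskDef.addFirelensLogRouter('log-router', {
  image: ecs.ContainerImage.fromRegistry(
    'public.ecr.aws/aws-observability/aws-for-fluent-bit:stable'
  ),
  firelensConfig: { type: ecs.FirelensLogRouterType.FLUENTBIT },
  logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'firelens' }),
});

// The application container would then use the FireLens log driver instead of
// awslogs, routing its stdout/stderr to Loki inside the private subnet.
const appLogging = ecs.LogDrivers.firelens({
  options: {
    Name: 'loki',
    host: 'loki.internal.example',   // placeholder for the private hostname
    port: '3100',
    labels: 'job=login-app',
    line_format: 'json',
  },
});
```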
I mentioned AWS CDK and Infrastructure as Code. This improves developer productivity and lets us easily manage resources by just changing lines of code. I also want to explore how dependency sharing works with the AWS CDK, along with the idea of avoiding circular dependencies; we are going to see examples shortly. Another very important point is that we have Amazon ECR scan on push, which we can enable with just a single line of code in the application.
Once we push code through GitHub to the particular AWS environment, ECR scans the image on push to make sure there are no vulnerabilities in the stack. This is an example of access points. If you don't know what access points are, they allow us to attach permissions to our EFS volumes. I created the access points and imported them into the ECS task definition as a storage method.
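The two pieces mentioned here could look roughly like this in CDK. This is a sketch inside the same stack, assuming the `vpc` and `taskDef` from the earlier snippets; the resource names and the Grafana POSIX user (472 is the Grafana container's default UID) are illustrative assumptions, not the exact code from the talk.

```typescript
import * as ecr from 'aws-cdk-lib/aws-ecr';
import * as efs from 'aws-cdk-lib/aws-efs';

// Vulnerability scanning on every image push is a single property on the repository.
const repo = new ecr.Repository(this, 'AppRepo', { imageScanOnPush: true });

// EFS access point that pins Grafana's file access to one path and POSIX user.
const fileSystem = new efs.FileSystem(this, 'GrafanaFs', { vpc });
const grafanaAp = fileSystem.addAccessPoint('GrafanaAp', {
  path: '/grafana',
  posixUser: { uid: '472', gid: '472' },
  createAcl: { ownerUid: '472', ownerGid: '472', ownerPermissions: '755' },
});

// Import the access point into the ECS task definition as persistent storage.
taskDef.addVolume({
  name: 'grafana-data',
  efsVolumeConfiguration: {
    fileSystemId: fileSystem.fileSystemId,
    transitEncryption: 'ENABLED',
    authorizationConfig: { accessPointId: grafanaAp.accessPointId, iam: 'ENABLED' },
  },
});
```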
GitHub is the source code repository that we use for this project. I want to emphasize that you should use OIDC rather than storing access keys in your GitHub repository. OIDC allows us to have roles scoped to that particular repository and enables us to connect to a particular AWS service. I integrated GitHub Actions for running tests: once you push the code, it actually runs some tests to make sure everything works. The automated flow gives us central management when deploying our code.
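On the AWS side, the OIDC trust can itself be declared in CDK. This is a minimal sketch of the standard GitHub Actions OIDC pattern inside a stack constructor, not the project's actual code; the repository path is a placeholder, and the role would still need deployment permissions attached.

```typescript
import * as iam from 'aws-cdk-lib/aws-iam';

// OIDC identity provider for GitHub Actions, so workflows can assume a role
// without long-lived access keys stored as repository secrets.
const githubOidc = new iam.OpenIdConnectProvider(this, 'GithubOidc', {
  url: 'https://token.actions.githubusercontent.com',
  clientIds: ['sts.amazonaws.com'],
});

// Deploy role that only workflows from the named repository can assume.
const deployRole = new iam.Role(this, 'GithubDeployRole', {
  assumedBy: new iam.WebIdentityPrincipal(githubOidc.openIdConnectProviderArn, {
    StringLike: {
      'token.actions.githubusercontent.com:sub': 'repo:my-org/my-repo:*', // placeholder repo
    },
  }),
});
```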
Demonstration: OpenTelemetry Implementation and Error Tracking in Action
Let's look at how the application works with a very short demo. The first part is importing the OpenTelemetry packages. I tried to add lots of logs so we could see what was happening in real time. In summary, I imported all the necessary packages, and the green checkmarks show that we have everything up and running. I mentioned at the beginning of my talk about auto-instrumentation, and this is a very simple example where adding these particular code snippets gives us instrumentation already added to our application.
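The talk does not show the exact snippet, but for a Node.js service the auto-instrumentation it describes typically boils down to a few lines with the OpenTelemetry Node SDK. This is a generic sketch, not the speaker's code; the exporter reads its endpoint from `OTEL_EXPORTER_OTLP_ENDPOINT`, which the task definition points at the ADOT sidecar.

```typescript
// tracing.ts - loaded before the application code (e.g. `node -r ./tracing.js app.js`).
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter(),            // endpoint from OTEL_EXPORTER_OTLP_ENDPOINT
  instrumentations: [getNodeAutoInstrumentations()], // auto-instruments HTTP, Express, AWS SDK, ...
});

sdk.start();
```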
For instance, we are using S3 and DynamoDB. With auto-instrumentation, we get detailed metrics and detailed traces that involve both S3 and DynamoDB. However, most times we want to have other custom metrics. We don't just want to focus on the predefined metrics that AWS gives us through auto-instrumentation. We want to have other metrics, perhaps counters or how many times a user comes to our application that we're counting internally. That's why we have the ability to capture custom metrics and metadata within the OpenTelemetry system.
If you look at the right-hand side, we're actually capturing the message queue operation and recording a success status. We'll see this in the next slide when we head over to Grafana and look at the Prometheus metrics. Grafana, with Amazon Managed Prometheus as the data source, lets us see that we are counting and that we have the two particular results. So what we pushed in the code, we can actually see happening, and see the results, on Grafana. We can also go back to CloudWatch to see the traces.
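A hedged sketch of such a custom metric with the OpenTelemetry API is shown below. The meter and counter names are illustrative, and it assumes a metric reader/exporter has been configured (for example through the ADOT collector) so the values actually reach Prometheus.

```typescript
import { metrics } from '@opentelemetry/api';

// Custom counter alongside the auto-instrumented metrics: count how many
// messages the application pushes to the queue, labelled with the outcome.
const meter = metrics.getMeter('login-app');
const queueCounter = meter.createCounter('message_queue_operations', {
  description: 'Messages sent to the queue, by result',
});

// ...after a successful SendMessage call:
queueCounter.add(1, { status: 'success' });
```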
Now we have two options for seeing the traces. We can see the traces in CloudWatch through Application Signals, or we could go back to the Grafana dashboard where we have Application Signals plugged in to see the particular traces. We can really drill down within each trace and see each metadata that each trace actually contains. It's one thing to talk a lot about theory, and it's another thing to actually see the application and how it reacts when we have a particular error or issue. So I tried my best to simulate a 404 error to see how the application responds.
I have a particular endpoint that creates a basic record in DynamoDB. The primary key is basically the ID and the timestamp. Then I made a GET request for that ID. But if you look closely, there's a significant change: I'm trying to get a timestamp of "latest", and this is not going to match anything in our DynamoDB table, so it's going to give us a 404 error. Once we do that and head over to Application Signals on Grafana, we begin to see some red around our greens. That means we have a particular error happening in our application. If we click on that, we see the particular error message is actually a 404 error.
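The failing lookup follows roughly the pattern below (a hypothetical reconstruction, not the demo's actual code; the table and field names are placeholders). Passing the literal string "latest" as the sort key never matches a stored timestamp, so DynamoDB returns no item and the handler responds with 404, which is the error that then surfaces in the traces.

```typescript
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, GetCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Look up a record by its composite key (id + timestamp). A timestamp of
// "latest" matches nothing, so result.Item is undefined and we return 404.
export async function getRecord(id: string, timestamp: string) {
  const result = await ddb.send(
    new GetCommand({
      TableName: 'Records',        // placeholder table name
      Key: { id, timestamp },
    })
  );

  if (!result.Item) {
    return { statusCode: 404, body: JSON.stringify({ message: 'Record not found' }) };
  }
  return { statusCode: 200, body: JSON.stringify(result.Item) };
}
```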
We can actually drill down to understand what exactly is happening. If you look closely on the right-hand side, you can see the main error started happening from when we had the GET DynamoDB call. So we see the local application was working properly, but the red error started happening when we did the DynamoDB get operation. That allows us to have understanding in terms of traces on what actually caused that particular error. We can also go to the metrics to see exactly what's happening.
If you look down, we see that the login application itself actually works, but then the error happens. The yellow line shows it happened on the GetItem request: we have a not-found and we have a success. The issue happened for the not-found case, which is the 404 error. So we see that both metrics and traces allow us to see exactly what happened within our application.
In summary, what are the lessons learned from this particular architecture approach? We focus mainly on security, and the key point is to build security in from the inception, not as an afterthought; if you go the other route, you're going to have a lot of problems. The second approach is to use OpenTelemetry. I don't know if you know, but AWS made an announcement two or three months ago that they're actually ditching the AWS X-Ray daemon in favor of OpenTelemetry. This is because OpenTelemetry has gotten a lot of traction within the open source community, and AWS wants to go down that same route. So we want to leverage OpenTelemetry for vendor neutrality and because it has a large open source contribution base. Lastly, we want to learn to automate everything. For this example, I used CDK and GitHub; you could use CDK with other tools like GitLab or Bitbucket. This talk stemmed from a two-part technical article I wrote. You can find the GitHub repository for the project as well as the Medium article for the application; it's the two-part Medium article on the right-hand side.
Thank you for coming to my session. You can connect with me on LinkedIn at Chibuike Nwachukwu. Please remember to fill out the survey on the mobile app. If you have any questions, I'll be at the back to answer them. Thank you once more for coming to my session.
; This article is entirely auto-generated using Amazon Bedrock.