Indika_Wimalasuriya for AWS Community Builders

Posted on Jan 17

Datadog: Observability Lessons from 50+ AWS Apps

#datadog #aws #observability #sre

This post shares 15 lessons learned while enabling observability and reliability using Datadog across 50+ large-scale AWS-hosted applications. It covers what worked, what mattered, and what actually improved customer experience.
For quick background: over the last few years, I have been involved in setting up observability for applications that were almost all hosted in AWS. These included frontend apps, middleware, and backend apps, on web and mobile, all of them distributed with complex dependencies. Most were directly customer-facing, while others supported critical internal operations. They were mainly in the Telco, Media, and Banking & Finance domains. Now let me get straight to the topic. While the following is a tidy list, I learned some of these lessons the hard way.

Lesson 1 – Datadog goes beyond observability: it’s a reliability tool

While I call myself an observability practitioner, I’m very much an SRE. My end goal is to enable a world-class customer experience for end users. To do that, I rely heavily on Site Reliability Engineering (SRE) concepts. In the world of SRE, we focus on a few pillars:

Architecture – Reliability comes from strong architectures and design patterns
Observability – Full-stack visibility across systems
SLI/SLO & Error Budgets – Measuring customer experience
Release & Incident Engineering – Treating operations as a software problem
Automation – Eliminate, reduce, simplify, and automate
Resilience Engineering – Chaos engineering and failure testing
People & Awareness – The human factor in reliability

What this means is that observability is one key pillar in the larger scheme of reliability engineering. We enable observability so we can measure customer experience. If we can measure customer experience, then more often than not, when it degrades, we can isolate the root cause and resolve it quickly, and ideally eliminate it for good. Datadog supports you in all of the above pillars. That is why I call it a reliability-enhancing tool rather than just an observability tool.

Lesson 2 – Datadog is your partner: Observability is a journey

Generally, we start by keeping the lights on, then make systems observable, then correlate signals, and finally enable AIOps. It’s a journey. I have published a complete guide to the AWS observability maturity model V2. Datadog is well equipped to support this journey.
People generally like to start with infrastructure visibility. These days we rely heavily on AWS Lambda, AWS ECS, AWS EKS, or a combination of these, and Datadog provides integrations that give you infrastructure visibility for them.
Once you have infrastructure visibility, you can use Datadog to collect logs, metrics, and traces, which gives you observability for your apps. Datadog Service Catalog and Systems bring these together so correlation is quick. Metric anomaly detection, metric forecasting, and log anomaly detection keep you one step ahead. Use Watchdog too; it looks across your entire service scope to identify anomalies for you. Together, these give you full-stack visibility across your entire AWS estate, from code and infrastructure to the business perspective.

Lesson 3 – Datadog SLOs – What drives it is the ability to measure customer experience

I like to treat observability as a byproduct of trying to measure customer experience. I generally start by defining Service Level Indicators (SLIs) for an app, then turn them into Service Level Objectives (SLOs). Once you enable Application Performance Monitoring (APM) with Datadog and you have logs, metrics, and traces, the next step is to build an SLI dashboard: a single source of truth for your system. Then convert it into meaningful SLOs in Datadog. Datadog provides three types of SLOs:

By count – measure the SLO as good events divided by total events.
By monitor uptime – use the uptime of a monitor, for example a synthetic test.
By time slices – use a custom uptime definition.

The first target of the observability journey should be the ability to build Datadog SLOs. If you have SLOs, you are already measuring customer experience and you're way ahead of the game.
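To make the "by count" idea concrete, here is a minimal sketch of the arithmetic behind a count-based SLO and its error budget. It is plain Python, not the Datadog API; the names `SloReport` and `count_based_slo` and the example numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float                 # observed ratio of good events (0.0 to 1.0)
    error_budget_total: float  # bad events the target allows in this window
    error_budget_left: float   # bad events still allowed (negative if overspent)
    met: bool                  # True when the SLI is at or above the target


def count_based_slo(good_events: int, total_events: int, target: float) -> SloReport:
    """Evaluate a count-based SLO: SLI = good events / total events."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be between 0 and 1, e.g. 0.999")

    sli = good_events / total_events
    allowed_bad = (1.0 - target) * total_events  # the error budget, in events
    actual_bad = total_events - good_events
    return SloReport(
        sli=sli,
        error_budget_total=allowed_bad,
        error_budget_left=allowed_bad - actual_bad,
        met=sli >= target,
    )
```

For example, `count_based_slo(999_500, 1_000_000, 0.999)` describes a window with one million requests and 500 failures against a 99.9% target: the SLO is met and roughly half of the error budget is still unspent.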

Lesson 4 – Datadog Real User Monitoring (RUM) – You need to know what the heck your end users are doing

Observability is great; it gives you an understanding of your system's internal state. But you also need to know what your end users are doing. That’s where RUM comes into play. Not only does it show the metrics related to end-user experience, but capabilities such as Session Replay let you watch what customers are doing. When a customer complains that something is not working, you're a few steps away from finding out what happened using Datadog RUM.

Lesson 5 – Datadog loves when you enhance inbuilt telemetry with code changes

We love Datadog because it enables most things without any code changes, but when you do make code changes, the benefits are significant. One example is attaching important details, in protected form, to sessions, so that when you're troubleshooting with Datadog RUM you can filter by user details, product details, and so on. Going slightly beyond the defaults has massive benefits. The same applies to APM. If there are deep corners where you're not getting enough detail, try a few small code changes. You will see the magic.
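As a rough illustration of preparing such details before attaching them to a session, here is a minimal sketch. It swaps in a keyed hash (HMAC-SHA256) rather than encryption, so the same user always maps to the same token and stays filterable without exposing the raw identifier. The function names and attribute keys are hypothetical, not part of any Datadog SDK, and the step that actually hands the attributes to RUM or APM is deliberately left out.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable keyed hash of a sensitive value (not reversible)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def build_session_attributes(
    user_id: str, plan: str, product_id: str, secret_key: bytes
) -> dict[str, str]:
    """Build filterable session attributes (hypothetical keys).

    Only the user identifier is hashed; plan and product are assumed
    to be non-sensitive and are passed through unchanged.
    """
    return {
        "user_token": pseudonymize(user_id, secret_key),
        "plan": plan,
        "product_id": product_id,
    }
```

The design choice here is determinism: because the token is stable for a given key, you can compute it from a support ticket's user ID and filter sessions by it.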

Lesson 6 – There are all kinds of monitors provided by Datadog; use them wisely

At a high level, I group them as:

Infrastructure & Host Reliability: Metric, Host, Process Check, Live Process, Service Check, Change, Integration
Application Performance & Error Detection: APM, Error Tracking, Anomaly, Outlier, Forecast, Composite
User Experience & Frontend Reliability: Real User Monitoring, CI & Tests, Network Check
Logs, Events & Operational Intelligence: Logs, Event, Watchdog, LLM Observability
Network & Dependency Reliability: NDM NetFlow
Reliability Objectives & Governance: SLO
Observability Data Quality: Data Quality (preview)

Lesson 7: Datadog scorecards for observability governance

We like to define Datadog Systems, leverage the Datadog Service Catalog, and then enable Datadog scorecards. It’s a great automated way to measure where you are. The built-in capabilities are great, and you can always extend them with custom rules using the provided APIs.
Datadog scorecards cover:

Observability Best Practices: Ensures services emit the right signals by validating deployment tracking, log ingestion, and log–trace correlation so changes and runtime behavior are fully observable.
Ownership & Documentation: Confirms every service has clear ownership through defined teams, contacts, code repositories, and documentation to enable fast escalation and effective incident response.
Production Readiness: Verifies services are operationally ready for production by checking recent deployments, active monitors, on-call coverage, and defined SLOs.

Lesson 8: Build your incident management with Datadog On-Call & Incident Management

Datadog On-Call is your one-stop place for incident and escalation management. You can define teams, on-call schedules, and escalations. It handles on-call alerting and provides a lot of useful metrics. When you first start, you will see a lot of noise, but over time you can cut it down to a bare minimum. If you are already on Datadog, you don't need another on-call management tool. Datadog Incident Management lets you create incidents and track them to closure. With On-Call and Incident Management together, you can easily measure Mean Time to Acknowledge (MTTA) and Mean Time to Resolve (MTTR).
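For readers who want the definitions spelled out, here is a minimal sketch of how MTTA and MTTR can be computed from incident timestamps. It is plain Python, not how Datadog calculates them internally; the `Incident` record is hypothetical, and both means are measured from the trigger time, which is an assumption since teams define MTTR in different ways.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def _mean(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    if not deltas:
        raise ValueError("need at least one incident")
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean Time to Acknowledge: trigger -> acknowledgement."""
    return _mean([i.acknowledged_at - i.triggered_at for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time to Resolve: trigger -> resolution."""
    return _mean([i.resolved_at - i.triggered_at for i in incidents])
```

Tracking these two numbers per team over time is a simple way to see whether the noise reduction described above is actually paying off.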

Lesson 9: Datadog synthetic tests to proactively test your AWS infrastructure

You get telemetry only when your end users are using the system. Synthetic tests let you test your application by mimicking end users. It’s not just a URL check; you can use Datadog to automate your smoke tests easily. Datadog also provides test locations around the world, so you can run your tests from multiple regions.

Lesson 10: Datadog CI visibility and software changes – Keep track of what the developers are doing

Integrating your pipeline lets Datadog know what teams are deploying to production. By enabling deployment version tracking in Datadog APM, you can compare releases, including response times across versions. Act on those insights proactively.
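The kind of release comparison described above can be sketched as follows. This is a simplified illustration in plain Python and not what Datadog does under the hood; `compare_releases`, the 10% tolerance, and the choice of p95 latency as the comparison metric are all assumptions.

```python
from statistics import quantiles


def p95(latencies_ms: list[float]) -> float:
    """95th percentile of a latency sample (needs at least two values)."""
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    return quantiles(latencies_ms, n=100, method="inclusive")[94]


def compare_releases(
    baseline_ms: list[float], candidate_ms: list[float], tolerance: float = 0.10
) -> dict:
    """Flag a candidate release whose p95 latency grew beyond the tolerance."""
    base = p95(baseline_ms)
    cand = p95(candidate_ms)
    if base <= 0:
        raise ValueError("baseline p95 must be positive")
    change = (cand - base) / base  # e.g. 0.25 means 25% slower
    return {
        "baseline_p95_ms": base,
        "candidate_p95_ms": cand,
        "relative_change": change,
        "regressed": change > tolerance,
    }
```

A check like this could run as a pipeline gate after a canary deployment, feeding it response times for the previous and the new version.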

Lesson 11 – Datadog Workflow Automation – A great way to automate remediation

Datadog Workflow Automation is a solid place to build complex remediation solutions. It lets you automate tedious tasks and have monitors kick off some of them. It’s the first step toward automating your job away. Workflow Automation has integrations with a wide range of AWS services, which makes it a great way to automate AWS infrastructure and other operational workflows.

Lesson 12 – Datadog Code Security – yes, it’s a capability you can use to secure your AWS-based systems

Datadog Code Security covers libraries (SCA), static code (SAST), runtime code (IAST), secret scanning (Secrets), and Infrastructure as Code (IaC), and these capabilities do a really good job of keeping you secure. All you have to do is integrate your code base with Datadog Code Security. That is the first step to getting the help you need from Datadog.

Lesson 13 – Datadog AI Observability – You will use this heavily in the future

Every system is now getting integrated with LLMs, so you need a way to measure how those AI components perform. Datadog AI Observability is a great capability for bringing full-stack AI observability into your systems now.

Lesson 14: Datadog Bits AI – SRE Agent, your new on-call teammate

Datadog has released the Bits AI SRE Agent and it’s awesome. It's now available, and it has some great capabilities. It can cut root cause analysis down to a few short minutes. That makes sense: when Datadog has access to your entire telemetry data, your internal system state, what your end users are doing, and how your code is working, it can use that data to identify root causes much faster. From what I have seen, it is very good at correlating signals quickly.

Lesson 15 – Datadog UI – it’s the best UI in town – it provides business visibility to everyone

The Datadog UI is great. It’s simple, it's easy, and it tames complexity in a really cool way. It lets all your stakeholders, from SREs and developers to senior business executives and even CTOs, use it easily. There is a view to be built for every persona. This is a game changer, since you can now open up business visibility to everyone in your organization.

These are some of the key lessons I have learned. There are many more, but I think it’s time to stop the list. Datadog is a great observability partner for AWS, with built-in integrations. Give it a try with the 14-day Datadog free trial. Yes, it’s expensive, but in my experience it’s worth every penny. If your goal is not just visibility but reliability at scale in AWS, Datadog provides the tooling and, more importantly, the operational leverage to get there.
