AI is transforming the way we work at an unprecedented pace, more like a high-speed train than a gradual evolution. As systems become more dynamic and autonomous, the way we think about observability must evolve just as quickly. When I revisited my observability maturity model from last year, it was clear that it no longer reflects today’s reality. The assumptions we made even a year ago are already outdated. So I decided to take another pass and propose a new approach, one that aligns with AI-driven systems and modern cloud environments.
As with my previous work, this model is framed around AWS, and I reference Datadog for implementation examples due to its mature and comprehensive observability capabilities.
The previous observability maturity model
Last year, the observability journey looked something like this:
- Monitored – Keeping the lights on
- Observable – Deeper insights
- Correlated – A holistic view
- Predictive – Proactive monitoring
- Autonomous – Intelligent automation
That model made sense at the time. But today, I believe the “Monitored” stage no longer qualifies as a baseline. Simply knowing that systems are up is no longer enough, not in a world of distributed architectures, rapid deployments, and AI-assisted operations.
A revised observability maturity model
The new baseline shifts expectations upward. Observability must start with context, not just metrics:
- Operational Observability – Telemetry from day one
- Contextual Observability – From visibility to understanding
- Decision Intelligence – AI-guided insights and recommendations
- Autonomous Operations – Acting safely within guardrails
- Adaptive Operations – Continuous learning and self-optimization
In the sections that follow, I’ll break down each stage, explain what changes in an AI-driven environment, and show how these concepts can be implemented in practice.
Operational Observability
Observability is no longer optional or something to “add later.” It must exist from day one; anything less is simply not acceptable in modern, AI-driven environments. Observability provides the telemetry foundation that powers AIOps, automation, and intelligent decision making. Without high-quality signals flowing early, the downstream capabilities of context, intelligence, and autonomy cannot exist. This is why observability must sit at the forefront of system design, not as an afterthought.

At this stage, the goal is enablement, not maturity. We focus on ensuring that the right telemetry is consistently captured and flowing as soon as workloads are deployed. The emphasis is on coverage, standardization, and reliability of signals, not advanced analytics or automation.
A practical implementation on AWS using Datadog typically includes:
- Enable Datadog APM for compute platforms such as EC2, ECS, EKS, and AWS Lambda
- Enable Real User Monitoring (RUM) for all customer-facing frontend applications
- Centralize application logs in Datadog to support signal correlation across logs, metrics, and traces
- Enable AWS infrastructure metrics to gain baseline visibility into hosts, containers, and managed services
- Define standard alerts aligned with the Golden Signals (traffic, errors, latency, saturation); see the monitor sketch after this list
- Implement basic business and service health checks where applicable
- Leverage Datadog Scorecards as a governance framework that helps these standards scale across teams
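To make this concrete, here is a minimal sketch of codifying one Golden Signals alert through the Datadog Monitors API, using plain `requests`. The service name (`checkout`), metric, threshold, and notification handle are illustrative assumptions, not prescriptions; adapt them to your own tagging and routing conventions.

```python
# Minimal sketch: codify a Golden Signals latency alert via the Datadog Monitors API.
# The DD_API_KEY / DD_APP_KEY env vars, service name, metric, and threshold are
# illustrative assumptions; adapt them to your environment.
import os

import requests

DD_API = "https://api.datadoghq.com"  # use your Datadog site's endpoint
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}

monitor = {
    "name": "[Golden Signals] High request latency on checkout",
    "type": "query alert",
    # APM latency metric; replace with the trace metric your service emits.
    "query": "avg(last_5m):avg:trace.http.request.duration{service:checkout,env:prod} > 0.5",
    "message": "Average request latency above 500 ms on checkout. @slack-oncall-channel",
    "tags": ["team:payments", "managed-by:code"],
    "options": {"notify_no_data": False, "renotify_interval": 30},
}

resp = requests.post(f"{DD_API}/api/v1/monitor", headers=HEADERS, json=monitor)
resp.raise_for_status()
print("Created monitor", resp.json()["id"])
```

Managing monitors as code like this reinforces the standardization goal of this stage: alert definitions live in version control and are reviewed like any other change.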
At this level, success is measured by signal availability and consistency, not by sophisticated insights. Once observability is operational and reliable, the foundation is in place to move toward contextual and intelligent capabilities.
Contextual Observability
Once observability is operational, the next step is to add context. Raw telemetry alone is not enough. Everyone involved, from developers to SREs to operators, must understand system intent. In an AI-driven world, intent is everything. Without understanding why a system behaves the way it does, teams end up reacting to symptoms rather than causes. Contextual observability ensures that telemetry is enriched with change, ownership, dependencies, and business meaning, enabling faster and more accurate decisions. At this stage, observability evolves from visibility to understanding.
A practical approach on AWS using Datadog includes the following capabilities:
- Change and deployment visibility: Leverage Datadog CI Visibility and Deployment Tracking to surface changes happening across AWS environments. Change velocity and frequency provide critical context when diagnosing incidents.
- Service Level Indicators (SLIs): Identify, define, and publish SLIs that represent how the system is actually performing. These metrics should be surfaced on a shared dashboard that acts as the single source of truth for application health.
- Service Level Objectives (SLOs) and error budgets: Define SLOs and error budgets and visualize them in dashboards. This establishes a clear, shared definition of what “good” looks like for both the business and end users. (A sketch of a metric-based SLO follows this list.)
- Service maps and dependency visualization: Use Datadog Service Maps (available once APM is enabled) to simplify the complexity of distributed systems and make dependencies explicit.
- System and software catalog: Build on Datadog’s Software Catalog to centralize metadata such as ownership, environments, runtime details, and dependencies. This creates a powerful control plane for managing systems at scale.
- Comprehensive monitoring and alerting: Leverage Datadog’s wide range of monitor types to build a holistic monitoring and alerting strategy that aligns with service health and business impact.
- Synthetic monitoring: Use Datadog Synthetic tests (browser, API, and mobile) to simulate real user behavior and validate system intent continuously.
- Security signal integration: Leverage Datadog’s security capabilities, including built-in code and runtime security signals, to enrich operational context with security posture.
- Incident management and on-call integration: Use Datadog On-Call and Incident Management to ensure alerts, context, and ownership are tightly integrated during incidents.
- Governance and guardrails: As systems scale, governance becomes critical. Use Datadog Scorecards to enforce standards, surface gaps, and provide guardrails across teams and services.
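As referenced above, here is a minimal sketch of publishing a metric-based SLO (and with it an error budget) through the Datadog SLO API. The numerator/denominator queries, target, and tags are illustrative assumptions built on APM trace metrics; your SLI queries will differ.

```python
# Minimal sketch: publish a metric-based SLO with a 30-day error budget via the
# Datadog SLO API. Queries, target, and tags are illustrative assumptions.
import os

import requests

DD_API = "https://api.datadoghq.com"
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}

slo = {
    "name": "Checkout availability",
    "type": "metric",
    "description": "Share of successful checkout requests, with a 99.9% monthly target.",
    "query": {
        # SLI = good events / total events, from APM trace metrics (illustrative names).
        "numerator": (
            "sum:trace.http.request.hits{service:checkout}.as_count()"
            " - sum:trace.http.request.errors{service:checkout}.as_count()"
        ),
        "denominator": "sum:trace.http.request.hits{service:checkout}.as_count()",
    },
    "thresholds": [{"timeframe": "30d", "target": 99.9}],
    "tags": ["team:payments"],
}

resp = requests.post(f"{DD_API}/api/v1/slo", headers=HEADERS, json=slo)
resp.raise_for_status()
print("Created SLO", resp.json()["data"][0]["id"])
```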
At this level, success is measured by shared understanding. When incidents occur, teams should immediately know what changed, who owns the service, how it impacts users, and where to focus. This contextual foundation is what enables the transition to Decision Intelligence.
Decision Intelligence
At this stage, observability evolves into intelligence. The goal is no longer just understanding what is happening, but guiding decisions and recommending actions using AI. Decision Intelligence builds on the strong foundations of operational and contextual observability. With high-quality telemetry, clear intent, and rich context already in place, systems can begin to explain themselves, highlighting what is abnormal, why it matters, and what actions should be considered next. This is where AI-guided insights start to meaningfully reduce cognitive load for engineers and operators.
A practical approach on AWS using Datadog includes the following capabilities:
- Watchdog (AI-powered change detection): Datadog Watchdog is one of the earliest and most comprehensive AI-driven capabilities in the platform. Instead of relying solely on manually configured monitors, Watchdog continuously analyzes APM, RUM, logs, and metrics to detect deviations from normal behavior and surface unexpected changes automatically.
- Anomaly detection: Leverage Datadog’s metric and log anomaly detection to identify shifts in baselines and unusual patterns. This helps teams focus on meaningful signals rather than static thresholds.
- Forecasting and capacity insights: Use Datadog’s metric forecasting capabilities to anticipate future resource constraints, such as capacity exhaustion or traffic growth, enabling proactive planning instead of reactive firefighting. (Sketches of anomaly and forecast monitors follow this list.)
- Bits AI (incident summaries, RCA, and insights): Bits AI is one of Datadog’s most recent advancements in agentic AI. It analyzes existing telemetry to generate incident summaries, form and validate hypotheses, and assist with root cause analysis. This significantly accelerates incident response and reduces time to resolution.
- SLO risk and burn rate tracking: Define and track SLOs to continuously assess risk and error budget burn rates. This provides a clear, quantitative view of whether systems are delivering the experience they are expected to provide.
- Business and user impact correlation: Incorporate business metrics and user experience signals (such as RUM KPIs and XLAs) to correlate technical behavior with business outcomes. These metrics can be translated into SLIs and SLOs, enabling teams to measure success in terms that matter to both users and the business.
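To illustrate the anomaly detection and forecasting bullets, here is a minimal sketch that registers one anomaly monitor and one forecast monitor through the Datadog API. Metric names, algorithms (`agile`, `linear`), windows, and thresholds are illustrative assumptions; tune them against your own baselines.

```python
# Minimal sketch: register an anomaly monitor and a forecast monitor via the
# Datadog API. Metric names, algorithms, windows, and thresholds are illustrative.
import os

import requests

DD_API = "https://api.datadoghq.com"
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}

monitors = [
    {
        # Flag traffic that deviates from the learned seasonal baseline.
        "name": "[Anomaly] Unusual checkout traffic",
        "type": "query alert",
        "query": (
            "avg(last_4h):anomalies("
            "sum:trace.http.request.hits{service:checkout}.as_count(), 'agile', 2"
            ") >= 1"
        ),
        "message": "Checkout traffic is outside its expected baseline. @slack-oncall-channel",
    },
    {
        # Alert when disk usage is forecast to cross 90% within the next week.
        "name": "[Forecast] Disk approaching capacity",
        "type": "query alert",
        "query": "max(next_1w):forecast(avg:system.disk.in_use{env:prod} by {host}, 'linear', 1) >= 0.9",
        "message": "Disk usage is forecast to exceed 90% within a week. @slack-oncall-channel",
    },
]

for m in monitors:
    resp = requests.post(f"{DD_API}/api/v1/monitor", headers=HEADERS, json=m)
    resp.raise_for_status()
    print("Created", resp.json()["id"], "-", m["name"])
```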
At this level, success is measured by clarity and confidence in decision making. Teams are no longer overwhelmed by data; instead, they are guided by AI-assisted insights that highlight risk, recommend focus areas, and connect system behavior to real-world impact. This sets the stage for the transition to Autonomous Operations, where systems begin to act on these insights automatically.
Autonomous Operations
This is the stage organizations should actively strive to reach. It is where systems begin to operate autonomously, requiring progressively less human intervention while still remaining safe, observable, and governed.
Autonomous Operations is not about removing humans from the loop; it is about elevating human involvement. Engineers shift from manual responders to system designers, defining guardrails, policies, and confidence thresholds that allow systems to act decisively and safely.
Reaching this stage takes effort. It requires strong foundations in observability, context, and decision intelligence. But once achieved, the payoff is significant: faster remediation, reduced operational toil, and systems that can respond to change at machine speed.
A practical approach on AWS using Datadog includes the following capabilities:
- Workflow automation as the automation backbone: Datadog Workflow Automation provides a rich set of integrations with AWS and third-party tools, making it the primary mechanism for building operational automations. It becomes the control plane for repeatable, policy-driven actions.
- Event-driven remediation: Leverage Datadog events and signals to trigger automated remediations. This event-driven approach is one of the most common and effective patterns in AWS-based environments. (A webhook-to-Lambda sketch follows this list.)
- SLO-driven automation: Use SLOs not just for visibility, but as automation triggers. When error budgets are burning or SLOs are breached, workflows can be invoked automatically to initiate remediation actions or escalate to deeper analysis using tools such as Bits AI SRE.
- Automated recovery actions: Implement workflows for common corrective actions such as:
- Automatic rollback of deployments
- Automatic scaling of infrastructure
- Traffic shaping or failover
These actions can be executed automatically on AWS using predefined, tested workflows.
- Human-in-the-loop safety controls: Automation must always operate within defined guardrails. Approval steps, confidence thresholds, and progressive rollouts ensure that actions are safe, explainable, and reversible. Humans remain in control; automation simply executes faster and more consistently.
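As a concrete example of the event-driven remediation pattern, here is a minimal AWS Lambda sketch that could sit behind a Datadog webhook (for example, via API Gateway) and scale out an ECS service when a saturation monitor fires. The cluster and service names, the `alert_transition` payload field (which you define yourself in the webhook template), and the scaling guardrail are all illustrative assumptions.

```python
# Minimal sketch: event-driven remediation. A Datadog monitor notifies a custom
# @webhook that fronts this Lambda (e.g. via API Gateway); the handler scales out
# an ECS service. Cluster/service names, the alert_transition field (defined by
# your webhook payload template), and the guardrail are illustrative assumptions.
import json

import boto3

ecs = boto3.client("ecs")

MAX_DESIRED_COUNT = 20  # guardrail: never scale past this without human approval


def handler(event, context):
    """Scale out the checkout service when a saturation alert triggers."""
    payload = json.loads(event["body"])  # webhook body as templated in Datadog

    # Act only on newly triggered alerts; ignore warnings and recoveries.
    if payload.get("alert_transition") != "Triggered":
        return {"statusCode": 200, "body": "no action"}

    service = ecs.describe_services(
        cluster="prod-cluster", services=["checkout"]
    )["services"][0]

    desired = min(service["desiredCount"] + 2, MAX_DESIRED_COUNT)
    ecs.update_service(cluster="prod-cluster", service="checkout", desiredCount=desired)

    return {"statusCode": 200, "body": f"scaled checkout to {desired} tasks"}
```

The hard-coded ceiling is the human-in-the-loop piece: the automation can act fast within its envelope, but anything beyond it escalates to an engineer.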
At this level, success is measured by resilience and speed. Incidents are resolved automatically or partially mitigated before users are impacted, and human intervention becomes the exception rather than the rule. This sets the foundation for the final stage: Adaptive Operations, where systems continuously learn and improve over time.
Adaptive Operations
Reaching Autonomous Operations is a huge achievement; it’s like a plane flying on autopilot. But true excellence requires more: systems must not only act autonomously, they must also learn, adapt, and withstand stress. This is the final stage of observability maturity, where systems continuously improve and become resilient to changing conditions. At this stage, the focus shifts from reacting and remediating to continuous adaptation and self-optimization. Systems evolve based on operational experience, business impact, and AI-driven insights, enabling them to prevent issues before they occur and optimize performance over time.
A practical approach on AWS using Datadog includes:
- Incident retrospectives and prevention: Combine Datadog Incident Management with Bits AI SRE to analyze incidents, identify root causes, and implement prevention strategies that reduce recurrence.
- Continuous alert tuning: Leverage Datadog Watchdog and anomaly detection to automatically adjust alerts based on changing system behavior, ensuring signals remain meaningful and actionable.
- Predictive SLO management: Use forecasting and historical trends to anticipate SLO risks and preemptively adjust systems, workloads, or resources before they impact users. (A burn-rate sketch follows this list.)
- Self-healing workflows: Integrate Datadog Scorecards, Bits AI SRE, and Workflow Automation to implement closed-loop remediation and optimization. This enables AWS workloads to automatically correct deviations, scale intelligently, and maintain business continuity.

At this level, success is measured by resilience, adaptability, and continuous improvement. Systems learn from experience, optimize themselves over time, and maintain business objectives without constant human intervention, truly embodying the vision of AI-driven, self-managing operations.
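As a sketch of predictive SLO management, the script below polls an SLO’s history from the Datadog API, estimates a short-window error-budget burn rate, and posts a Datadog event that an automation workflow could react to. The SLO id, lookback window, burn-rate threshold, and the `sli_value` response field (verify the history response shape for your SLO type) are illustrative assumptions.

```python
# Minimal sketch: estimate a short-window SLO burn rate from the Datadog SLO
# history API and emit a Datadog event for automation to pick up. The SLO id,
# window, threshold, and the sli_value response field are illustrative
# assumptions; verify the response shape for your SLO type.
import os
import time

import requests

DD_API = "https://api.datadoghq.com"
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}

SLO_ID = "abc123def456"  # hypothetical SLO id
TARGET = 99.9            # must match the SLO's 30d target
WINDOW_HOURS = 6         # short lookback used to estimate the burn rate

now = int(time.time())
resp = requests.get(
    f"{DD_API}/api/v1/slo/{SLO_ID}/history",
    headers=HEADERS,
    params={"from_ts": now - WINDOW_HOURS * 3600, "to_ts": now},
)
resp.raise_for_status()
sli = resp.json()["data"]["overall"]["sli_value"]  # achieved SLI over the window

# Burn rate: error consumed in the window relative to the allowed error budget.
error_budget = 100.0 - TARGET
burn_rate = (100.0 - sli) / error_budget

if burn_rate > 10:  # illustrative "fast burn" threshold
    requests.post(
        f"{DD_API}/api/v1/events",
        headers=HEADERS,
        json={
            "title": f"SLO {SLO_ID} burning fast ({burn_rate:.1f}x)",
            "text": "Error budget at risk; consider preemptive scaling or rollback.",
            "alert_type": "warning",
            "tags": ["slo:burn-rate", "automation:candidate"],
        },
    ).raise_for_status()
```

Run on a schedule (for example, from a Lambda or a Datadog workflow), this closes the loop: SLO risk becomes a signal that drives action before users feel the impact.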
I hope this updated maturity model helps you design and operate more powerful, resilient systems. Remember: observability is not an afterthought; it sits at the forefront of the AI revolution.
Maturity begins once telemetry is consistently captured, but it is truly measured by how much safe decision-making authority we can delegate to the platform. And never lose sight of the ultimate goal: Autonomous, Adaptive Operations, where systems continuously learn, optimize, and act with minimal human intervention.
I’m running a hands-on video series on Datadog Full Stack Observability on AWS, where you can learn step by step, from beginner to advanced.
