
Implementing SLO Error Budget Monitoring with AWS Services Only

In this article, I’ll introduce an SLO error budget monitoring system that I built entirely using AWS services.

System Overview

This system is based on the multi-window, multi-burn-rate alerting method proposed in the Site Reliability Workbook. It sends alerts to Slack when the SLO error budget burn rate exceeds a certain threshold.

expr: (
        job:slo_errors_per_request:ratio_rate1h{job="myjob"} > (14.4*0.001)
      and
        job:slo_errors_per_request:ratio_rate5m{job="myjob"} > (14.4*0.001)
      )
    or
      (
        job:slo_errors_per_request:ratio_rate6h{job="myjob"} > (6*0.001)
      and
        job:slo_errors_per_request:ratio_rate30m{job="myjob"} > (6*0.001)
      )
severity: page

expr: (
        job:slo_errors_per_request:ratio_rate24h{job="myjob"} > (3*0.001)
      and
        job:slo_errors_per_request:ratio_rate2h{job="myjob"} > (3*0.001)
      )
    or
      (
        job:slo_errors_per_request:ratio_rate3d{job="myjob"} > 0.001
      and
        job:slo_errors_per_request:ratio_rate6h{job="myjob"} > 0.001
      )
severity: ticket

https://sre.google/workbook/alerting-on-slos/

The Site Reliability Workbook recommends combining multiple time windows and burn rates to create more accurate alerts. For example, if you only alert based on a short time window, you may get too many false positives. On the other hand, if you rely on a longer time window, it may take too long for an alert to trigger or recover. By using both short and long windows, the system can generate more precise alerts.
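To make the thresholds above concrete, here is a small sketch of the arithmetic, assuming the Workbook's standard setup of a 99.9% SLO measured over a 30-day (720-hour) window: a burn rate of 14.4 sustained for 1 hour consumes 2% of the monthly error budget, while a burn rate of 3 sustained for 24 hours consumes 10%.

# Error-budget consumption for each burn-rate/window pair used in the rules
# above, assuming a 99.9% SLO measured over a 30-day (720-hour) period.
SLO_PERIOD_HOURS = 30 * 24

def budget_consumed(burn_rate: float, window_hours: float) -> float:
    """Fraction of the total error budget consumed if the burn rate
    is sustained for the entire window."""
    return burn_rate * window_hours / SLO_PERIOD_HOURS

for burn_rate, window_hours in [(14.4, 1), (6, 6), (3, 24), (1, 72)]:
    print(f"{burn_rate:>4}x over {window_hours:>2}h "
          f"-> {budget_consumed(burn_rate, window_hours):.0%} of budget")

# 14.4x over  1h -> 2% of budget
#    6x over  6h -> 5% of budget
#    3x over 24h -> 10% of budget
#    1x over 72h -> 10% of budget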

For more detailed explanations of this alerting method, I encourage you to check out the Site Reliability Workbook, which is available for free. Personally, I believe that strictly following these guidelines is the best way to monitor service reliability.

Why I Implemented This

The primary motivation behind this monitoring system was to solve the problem of "alert fatigue." As an operations engineer, I was constantly bombarded with alerts that didn’t actually require any action.

Previously, we monitored CPU and memory usage metrics directly, which led to alerts even when the service itself was running smoothly. For example, we would receive a Slack alert if the CPU usage of a web server exceeded 90%, regardless of whether there were any issues with availability or latency.

However, as long as the service remained available and responsive, high CPU usage alone required no intervention. Over time, receiving alerts for issues that didn't require action caused fatigue and increased the risk of missing critical alerts. Missing an important alert could damage service reliability and lead to a loss of users.

To address this, I decided to focus only on critical indicators—such as availability and latency—and set up an SLO-based monitoring system that tracks the error budget burn rate.

Why I Used AWS Services Only

At first, I considered using third-party monitoring tools like New Relic or Datadog. However, after investigating these tools in early 2023, I found that it was either impossible or very difficult to implement the kind of SLO monitoring I needed with their existing features. I worked with support teams from both platforms, but we couldn't come up with a viable solution.

On the other hand, by combining AWS services such as Amazon CloudWatch, I could implement the necessary monitoring at a much lower cost of around $20 per month.

Since most of our workloads are hosted on AWS, building the monitoring system directly on AWS made it easier to integrate everything into a single console. AWS also offers a high degree of flexibility in terms of combining services, giving us more room to scale and customize compared to third-party tools like New Relic or Datadog.

Architecture

Here’s an overview of the system’s architecture:

Diagram showing an AWS-based SLO error budget monitoring system. Key services include EventBridge, Lambda, Athena, S3, CloudWatch, RDS, and SNS. The system periodically queries access logs, aggregates data, and triggers alerts based on custom metrics.

Although it may look a bit complex at first glance, the core workflow is simple:

  • EventBridge Scheduler triggers a Lambda function at regular intervals.
  • The Lambda function runs an Athena query to aggregate ALB access logs.
  • The results are stored in both CloudWatch custom metrics and RDS.
  • CloudWatch composite alarms use these metrics to trigger alerts based on the SLO conditions.

The Lambda function is triggered by EventBridge rather than by S3 object-created events, because under heavy traffic many log files are written to S3, and that would run the aggregation multiple times unnecessarily.
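To make the flow more tangible, here is a minimal sketch of what such a Lambda handler could look like with boto3. The query, database, output location, metric namespace, and dimension values are all hypothetical placeholders rather than the actual implementation, and error handling is omitted:

import time
import boto3

athena = boto3.client("athena")
cloudwatch = boto3.client("cloudwatch")

# Hypothetical aggregation over the standard ALB access log table for one
# critical user journey; the real queries and window length will differ.
QUERY = """
SELECT
  sum(CASE WHEN elb_status_code < 500 THEN 1 ELSE 0 END) * 100.0 / count(*) AS availability_pct
FROM alb_logs
WHERE request_verb = 'POST'
  AND url_extract_path(request_url) = '/try/input1_submit'
  AND from_iso8601_timestamp(time) > now() - interval '5' minute
"""

def handler(event, context):
    # Kick off the Athena query against the ALB access logs.
    execution = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "alb_logs_db"},                  # placeholder
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes (simplified for brevity).
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    # Row 0 holds the column headers, row 1 the aggregated value.
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    availability = float(rows[1]["Data"][0]["VarCharValue"])

    # Publish the aggregate as a CloudWatch custom metric (the RDS write is omitted here).
    cloudwatch.put_metric_data(
        Namespace="SLO/CriticalUserJourneys",  # placeholder
        MetricData=[{
            "MetricName": "AvailabilityPercent",
            "Dimensions": [{"Name": "Journey", "Value": "input1_submit"}],
            "Value": availability,
            "Unit": "Percent",
        }],
    )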

We use RDS to store the aggregated results in order to reduce Athena scan costs. For more details on these optimizations and custom metric definitions, please refer to my presentation at JAWS PANKRATION 2024.
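The composite alarms then reproduce the AND/OR structure of the Prometheus rules quoted earlier by combining per-window burn-rate alarms. Here is a hedged sketch using boto3; the child alarm names, composite alarm name, and SNS topic ARN are hypothetical, and in practice this part lives inside the Terraform module described below:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Page when both the 1h and 5m burn rates exceed 14.4x, or both the 6h and
# 30m burn rates exceed 6x, mirroring the "severity: page" rule above.
cloudwatch.put_composite_alarm(
    AlarmName="slo-burn-rate-page-input1_submit",  # placeholder
    AlarmRule=(
        "(ALARM(burn-rate-1h-gt-14_4) AND ALARM(burn-rate-5m-gt-14_4)) "
        "OR "
        "(ALARM(burn-rate-6h-gt-6) AND ALARM(burn-rate-30m-gt-6))"
    ),
    ActionsEnabled=True,
    AlarmActions=[
        "arn:aws:sns:ap-northeast-1:123456789012:cto-incident-enechange"  # placeholder ARN
    ],
)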

Results of the Implementation

Since we introduced this system, alert fatigue has significantly decreased. I personally noticed a reduction in the number of unnecessary alerts that required no action.

Additionally, the development team has become more invested in the reliability of the services. Now, when an alert is triggered, the team takes the initiative to tune performance or even revise the SLOs when necessary.

As of September 2023, this system is monitoring 7 services and 12 critical user journeys.

Operational Tips

To streamline operations, I developed an internal Terraform module that lets us quickly accommodate additional monitoring requests from the development team. While the module itself is private, here is an actual example of its interface, which may be a useful reference if you plan to build something similar.

module "slomon" {
  source = "git@github.com:enechange/terraform-modules.git//slomon-for-alb?ref=v0.53.0"

  environment_name              = "prod-enechange"
  alb_access_logs_s3_url        = local.alb_access_logs_s3_url
  sns_topic_names_for_paging    = ["cto-incident-enechange"]
  sns_topic_names_for_ticketing = ["cto-alert-enechange"]

  critical_user_journeys = {
    input1_submit = {
      http_method     = "POST"
      path            = "/try/input1_submit"
      dashboard_order = 1

      slo = {
        availability_target   = 95.0
        latency_p95_threshold = 4.0
        latency_p50_threshold = 3.0
      }
    }
  }
}

This Terraform module also includes Athena queries for analyzing historical data, allowing us to quickly set SLOs based on past performance.
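As an illustration of that kind of historical analysis, here is a hypothetical Athena (Presto) query, shown as a Python constant to match the sketches above, that derives a journey's availability and latency percentiles from the last 30 days of ALB access logs. The table and column names assume the standard ALB access log table from the AWS documentation; adjust them to your setup:

# Hypothetical query for deriving a starting SLO from 30 days of ALB access logs.
HISTORICAL_SLO_QUERY = """
SELECT
  count(*)                                                                    AS total_requests,
  sum(CASE WHEN elb_status_code < 500 THEN 1 ELSE 0 END) * 100.0 / count(*)   AS availability_pct,
  approx_percentile(target_processing_time, 0.50)                             AS latency_p50_seconds,
  approx_percentile(target_processing_time, 0.95)                             AS latency_p95_seconds
FROM alb_logs
WHERE request_verb = 'POST'
  AND url_extract_path(request_url) = '/try/input1_submit'
  AND from_iso8601_timestamp(time) > now() - interval '30' day
"""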

Future Outlook

At present, this monitoring system meets all of our needs. However, if CloudWatch's Application Signals feature evolves to support the required conditions for SLO monitoring, this custom system may no longer be necessary. In that case, we would likely transition to a fully managed solution.

Likewise, if New Relic or Datadog introduces features that support the kind of SLO monitoring we need, we may consider switching to them. These APM tools generally offer better capabilities for diagnosing the root causes of performance degradation than CloudWatch does.

Ultimately, we will continue to assess the best solution for our needs and adapt our system as new tools and services emerge.

Conclusion

In this article, I introduced an SLO error budget monitoring system that I built using AWS services only.

By implementing this system, we were able to reduce alert fatigue and increase the development team’s focus on service reliability.

I hope this article provides useful insights for those dealing with similar challenges around alert fatigue.
