Martin Nanchev for AWS Community Builders

Centralizing CloudWatch observability - Past, Present and Future

What is observability?

According to Wikipedia, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs.

In simple words, we measure how the internal Lego blocks of a system (the AWS services) perform by using their outputs, or by trying to monitor from the customer's perspective. CloudWatch plays a central role in AWS for monitoring, logging, alerting and auditing a system, but there is one catch: when you have 200 accounts, it is really difficult to develop an observability strategy at scale.

The Past - cross-account observability with CloudWatch

The dark ages before OAM

  1. You need to go to every single account and check the dashboards and logs.
  2. Alternatively, you can stream CloudWatch logs to S3 using Kinesis Data Firehose, create a schema over the S3 objects with Glue, and query them via Athena (sketched after this list). If your bucket is encrypted with KMS and you forget to switch off the option to decrypt the results, this can lead to an unexpected rise in the bill.
  3. You can always run OpenSearch to store the logs.
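
A minimal Terraform sketch of that pre-OAM pipeline; the names, role ARNs and log group below are hypothetical placeholders, and the IAM roles and bucket are assumed to exist elsewhere:

variable "firehose_role_arn" {
  type        = string
  description = "IAM role that Firehose assumes to write to S3"
}

variable "cwlogs_role_arn" {
  type        = string
  description = "IAM role that CloudWatch Logs assumes to write to Firehose"
}

variable "logs_bucket_arn" {
  type        = string
  description = "ARN of the S3 bucket that stores the exported logs"
}

# Firehose delivery stream that lands the log records in S3
resource "aws_kinesis_firehose_delivery_stream" "logs_to_s3" {
  name        = "cw-logs-to-s3"
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = var.firehose_role_arn
    bucket_arn = var.logs_bucket_arn
  }
}

# Subscribe a log group to the delivery stream
resource "aws_cloudwatch_log_subscription_filter" "to_firehose" {
  name            = "to-firehose"
  log_group_name  = "/aws/lambda/example" # hypothetical log group
  filter_pattern  = ""                    # empty pattern forwards everything
  destination_arn = aws_kinesis_firehose_delivery_stream.logs_to_s3.arn
  role_arn        = var.cwlogs_role_arn
}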

To summarise, cross-account observability was difficult without OAM (although you could use a cross-account role to send traces to X-Ray, via PutTraceSegments, from ECS tasks).

The Present - OAM

Last year, a new feature called Observability Access Manager (OAM) was introduced. The idea is to create a sink account, which receives the shared CloudWatch logs, metrics and traces. Each source account then creates a link to the sink account to share its logs, metrics and traces.
Costs: it is free for logs and metrics, but not for traces - the copies of the traces are billed. The standard fees for creating dashboards, alarms, etc. still apply.

How will this look if we have multiple accounts?

  1. There are two sink accounts that collect observability data: one for production and one for non-production accounts.
  2. There are four source accounts that share data: development and QA share with the non-production monitoring account, while production and staging share with the production monitoring account.

To show it in a more graphical way:

Observability access manager account structure

Everything sounds perfect, but can we automate the OAM sink and the sources that share observability data?

First, we need to create a sink in the central monitoring account using Terraform. This account will collect observability data from the sources:


locals {
  # Default to sharing all three telemetry types unless var.services overrides it
  resource_types = coalescelist(var.services, [
    "AWS::Logs::LogGroup",
    "AWS::CloudWatch::Metric",
    "AWS::XRay::Trace",
  ])
}

resource "aws_oam_sink" "sink" {
  name = var.name
  tags = var.tags
}


resource "aws_oam_sink_policy" "observability_data_sink_policy" {
  sink_identifier = aws_oam_sink.sink.id
  policy          = data.aws_iam_policy_document.sink_policy.json
}

data "aws_iam_policy_document" "sink_policy" {

  dynamic "statement" {
    for_each = length(var.account_list) > 0 ? [1] : []
    content {
      actions = ["oam:CreateLink", "oam:UpdateLink"]
      principals {
        type        = "AWS"
        identifiers = var.account_list
      }
      resources = ["*"]
      effect    = "Allow"
      condition {
        test     = "ForAllValues:StringEquals"
        values   = local.resource_types
        variable = "oam:ResourceTypes"
      }

    }
  }

  dynamic "statement" {
    for_each = length(var.org_unit_list) > 0 ? [1] : []
    content {
      actions = ["oam:CreateLink", "oam:UpdateLink"]
      principals {
        type        = "*"
        identifiers = ["*"]
      }
      resources = ["*"]
      effect    = "Allow"
      condition {
        test     = "ForAllValues:StringEquals"
        values   = local.resource_types
        variable = "oam:ResourceTypes"
      }
      condition {
        test     = "ForAnyValue:StringEquals"
        values   = var.org_unit_list
        variable = "aws:PrincipalOrgPaths"
      }
    }
  }
}


variable "name" {
  type        = string
  description = "The name of the observability access manager sink, which collects observability data - logs, metrics, traces"
  default     = "monitoring-sink"
}

variable "tags" {
  type        = map(string)
  description = "Tags to be added to the specific resource"
  default     = {}
}

variable "org_unit_list" {
  type        = list(string)
  description = "list of Organizational units"
  default = [
    "o-aaausawxze/ou-tv7w-dd1211vs",
  ]
}

variable "account_list" {
  description = "List of accounts"
  type        = list(string)
  default     = ["123456789102"]
}

variable "services" {
  type        = list(string)
  default     = []
  description = "List of services to be shared. Possible values are: AWS::Logs::LogGroup, AWS::CloudWatch::Metric, AWS::XRay::Trace"
}


The resource-based policy specifies that OUs or account IDs can be defined as sources and share data with the sink.

After the sink is ready, we can output its ARN:

output "sink_arn" {
value = aws_oam_sink.sink.id
}

The ARN is needed by the link (source) to connect to the sink (monitoring account). A terraform_remote_state data source can be used to read the output from the sink's Terraform configuration.
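
For example, a minimal sketch of reading that output via terraform_remote_state, assuming the sink's state lives in an S3 backend (the bucket and key below are hypothetical):

data "terraform_remote_state" "monitoring" {
  backend = "s3"

  config = {
    bucket = "my-terraform-state"          # hypothetical state bucket
    key    = "oam-sink/terraform.tfstate"  # hypothetical state key
    region = "eu-west-1"
  }
}

# The link below could then use:
# sink_identifier = data.terraform_remote_state.monitoring.outputs.sink_arn
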
A simple hardcoded example looks like this:

resource "aws_oam_link" "source_to_sink" {
  label_template  = "$AccountName"
  resource_types  = local.resource_types
  sink_identifier = var.sink_identifier
  tags            = var.tags
}
locals {
  resource_types = coalescelist(var.services, ["AWS::Logs::LogGroup",
    "AWS::CloudWatch::Metric",
  "AWS::XRay::Trace"])
}

variable "sink_identifier" {
  description = "Sink identifier"
  default     = "arn:aws:oam:eu-west-1:123456789102:sink/ed278766-dc6f-4417-ae11-8bf09e9dc329"
  type        = string
}

variable "tags" {
  type        = map(string)
  description = "Tags to be added to the specific resource"
  default     = {}
}

variable "services" {
  type        = list(string)
  default     = []
  description = "List of services to be shared. Possible values are: AWS::Logs::LogGroup, AWS::CloudWatch::Metric, AWS::XRay::Trace"
}


What will the monitoring account look like after it receives the data?

A standard CloudWatch dashboard could look like this:

Standard CloudWatch dashboard

If we use cross-account metrics from the central monitoring account, the dashboard looks the same, but there is a cross-account indicator on the right side of the metric, and the metric itself is labelled with the source account name:

Cross account metric

All source accounts that share data with the monitoring account (the sink) are then visible in the CloudWatch Settings menu:

Account sources

Future

The present looks awesome. You can share observability data, create centralized dashboards in the monitoring account and copy traces to the central account.

  • The possibility to create more dynamic dashboards would be awesome. At the moment I use a JSON template, which is then deployed via Terraform (see the sketch after this list); the catch is that you need to add every new resource to the template yourself. There is Resource Explorer, but it still lacks some of the features that I would like to have.

  • Also, with generative AI waiting behind the door, I would expect suggestions about which metrics are important (for example, for Kafka) and automatic dashboard proposals, although you can do the research yourself and define baseline metrics for monitoring the cluster and brokers. This is really difficult, because AWS gives you building blocks to create an architecture: you build it, you own it. A plan for observability is needed before the workload lands in production.
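
As a minimal sketch, the JSON-template approach mentioned in the first point could look like this (the template file name and its variables are hypothetical):

# Render the dashboard body from a JSON template checked into the module
resource "aws_cloudwatch_dashboard" "central" {
  dashboard_name = "central-monitoring"
  dashboard_body = templatefile("${path.module}/dashboard.json.tpl", {
    region = "eu-west-1" # hypothetical template variable
  })
}

Every new resource still has to be added to the template by hand, which is exactly the catch mentioned above.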
