Samson Tanimawo
Observability as Code: Managing Dashboards and Alerts with Terraform

The Problem with Click-Ops Dashboards

Your team has 200 dashboards. You don't know who owns them. Half are broken. The rest show yesterday's reality.

This is click-ops debt, and it compounds faster than code debt.

Observability as Code

Every dashboard, alert, and SLO definition should live in a Git repository alongside your service code.

```hcl
resource "datadog_dashboard" "api_gateway" {
  title       = "API Gateway - Golden Signals"
  description = "Owner: @platform-team"
  layout_type = "ordered"

  widget {
    timeseries_definition {
      title = "Request Rate (per second)"
      request {
        q = "sum:api.requests{service:gateway}.as_rate()"
      }
    }
  }

  widget {
    timeseries_definition {
      title = "P99 Latency"
      request {
        # A percentile query, not max/as_count(), gives an actual p99
        q = "p99:api.latency{service:gateway}"
      }
    }
  }
}
```

This lives next to main.tf for your service. When you deploy the service, you deploy the observability.

Benefits That Compound

1. Ownership is clear. The file has a CODEOWNERS entry. PRs require review.

2. Dashboards auto-update. Renaming a service? Terraform refactor propagates to all dashboards.

3. Drift detection. Someone clicked "save as" in the UI and now that dashboard is out of sync. terraform plan catches it.

4. Review before production. Alert changes go through PR review. No more "who set this threshold?"
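Benefit 4 is easiest to see with an alert in code: the threshold, evaluation window, and notification target all sit in one reviewable block. A minimal sketch (the resource name, metric names, and 1% threshold are illustrative, not from the article's stack):

```hcl
resource "datadog_monitor" "gateway_error_rate" {
  name    = "API Gateway error rate is high"
  type    = "query alert"
  message = "Error rate above 1% for 5 minutes. Notify @slack-platform-team"

  # Errors as a percentage of total requests over the last 5 minutes
  query = "sum(last_5m):sum:api.errors{service:gateway}.as_count() / sum:api.requests{service:gateway}.as_count() * 100 > 1"

  monitor_thresholds {
    critical = 1
  }
}
```

A reviewer can now ask "why 1%?" on the PR instead of archaeology in the UI.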

Tooling by Platform

```yaml
datadog:
  provider: DataDog/datadog
  resources: datadog_monitor, datadog_dashboard, datadog_slo

grafana:
  provider: grafana/grafana
  resources: grafana_dashboard, grafana_alert_rule

prometheus:
  approach: YAML files in Git, deployed by ArgoCD
  resources: alert rules, recording rules

new_relic:
  provider: newrelic/newrelic
  resources: newrelic_alert_policy, newrelic_dashboard
```

Pick one source of truth. Don't mix.
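Whichever platform you pick, pin its provider version so every engineer and CI run renders the same plan. A sketch for Datadog (the version constraint is an example, not a recommendation):

```hcl
terraform {
  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.0"
    }
  }
}

provider "datadog" {
  # Credentials come from the DD_API_KEY / DD_APP_KEY environment
  # variables in CI; never hardcode them here.
}
```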

A Real Example

We have a module that takes a service name and generates a complete observability stack:

```hcl
module "service_observability" {
  source = "./modules/observability"

  service_name = "payment-processor"
  team_slack   = "#payments"

  severity_map = {
    error_rate_pct = 1.0
    p99_latency_ms = 500
    saturation_pct = 80
  }

  slo_targets = {
    availability = 0.9995
    latency_p99  = 0.99
  }
}
```

One module call creates: 3 dashboards, 8 alerts, 2 SLOs, a Slack channel binding, and a PagerDuty escalation policy.
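Inside such a module, each alert reads its threshold from the map, so per-service tuning happens entirely at the call site. A hedged sketch of one of the eight alerts (variable names mirror the call above; the query and message are illustrative):

```hcl
variable "service_name" { type = string }
variable "team_slack"   { type = string }
variable "severity_map" { type = map(number) }

resource "datadog_monitor" "p99_latency" {
  name    = "${var.service_name} p99 latency"
  type    = "query alert"
  message = "p99 latency breached its budget. Notify ${var.team_slack}"

  # Threshold is injected from the caller's severity_map
  query = "avg(last_5m):p99:api.latency{service:${var.service_name}} > ${var.severity_map["p99_latency_ms"]}"

  monitor_thresholds {
    critical = var.severity_map["p99_latency_ms"]
  }
}
```

Tightening the payment-processor latency alert is then a one-line diff in the module call, not an edit to the monitor itself.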

The Hardest Part

The code is easy. The hard part is:

  1. Migrating existing click-ops dashboards: budget 2 weeks
  2. Getting engineers to edit YAML/HCL instead of the UI: budget 3 months of reminders
  3. Blocking UI edits: some tools let you set dashboards to read-only
  4. Reviewing alert changes: PR reviewers need context

The Anti-Pattern to Avoid

Don't write Terraform for every custom chart an engineer wants. That leads to 500-line dashboard modules nobody understands.

Instead, define standard dashboards (golden signals, RED/USE, SLO burn rate) as modules. Let engineers add their own custom dashboards in the UI if they want, but mark them as "explore-only" (not alert-worthy).

Core observability = code. Experimental exploration = UI.

Migration Strategy

Week 1: Pick 1 service, convert its dashboards to Terraform
Week 2: Add alerts + SLOs to Terraform
Week 3: Delete the UI versions
Week 4: Create a module from the patterns
Month 2: Roll out to 10 more services
Month 3: Require all new services to use the module
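For the week-1 conversion, Terraform 1.5+ import blocks let you adopt an existing dashboard without recreating it (the ID below is a placeholder; read yours from the dashboard's URL):

```hcl
import {
  to = datadog_dashboard.api_gateway
  id = "abc-123-def" # placeholder: the ID from the existing dashboard's URL
}
```

Running `terraform plan -generate-config-out=generated.tf` will draft the matching resource body for you, which is usually a better starting point than transcribing widgets by hand.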

Six months in, your click-ops debt is gone and your observability is reproducible.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
