The Problem with Click-Ops Dashboards
Your team has 200 dashboards. You don't know who owns them. Half are broken. The rest show yesterday's reality.
This is click-ops debt, and it compounds faster than code debt.
Observability as Code
Every dashboard, alert, and SLO definition should live in a Git repository alongside your service code.
resource "datadog_dashboard" "api_gateway" {
title = "API Gateway - Golden Signals"
description = "Owner: @platform-team"
layout_type = "ordered"
widget {
timeseries_definition {
title = "Request Rate (per second)"
request {
q = "sum:api.requests{service:gateway}.as_rate()"
}
}
}
widget {
timeseries_definition {
title = "P99 Latency"
request {
q = "max:api.latency{service:gateway}.as_count()"
}
}
}
}
This lives next to main.tf for your service. When you deploy the service, you deploy the observability.
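A minimal sketch of the provider wiring that sits in the same directory (the version pin is just an example; credentials come from environment variables rather than being hardcoded):

terraform {
  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.0" # example pin, not a recommendation
    }
  }
}

provider "datadog" {
  # No keys here: the provider reads DD_API_KEY and DD_APP_KEY from the
  # environment, so the same CI pipeline that deploys the service can
  # apply the dashboards.
}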
Benefits That Compound
1. Ownership is clear. The file has a CODEOWNERS entry. PRs require review.
2. Dashboards auto-update. Renaming a service? A Terraform refactor propagates the change to every dashboard that references it.
3. Drift detection. Someone clicked "save as" in the UI and now that dashboard is out of sync. terraform plan catches it.
4. Review before production. Alert changes go through PR review. No more "who set this threshold?"
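As a sketch of what a reviewable alert looks like, here is a Datadog monitor with its threshold right there in the diff. The metric names and the 1% threshold are illustrative, not recommendations:

resource "datadog_monitor" "gateway_error_rate" {
  name    = "[gateway] Error rate above 1%"
  type    = "query alert"
  message = "Error rate crossed the reviewed threshold. Notify @slack-platform-team"
  query   = "avg(last_5m):(sum:api.errors{service:gateway}.as_rate() / sum:api.requests{service:gateway}.as_rate()) * 100 > 1"

  monitor_thresholds {
    warning  = 0.5
    critical = 1
  }

  tags = ["service:gateway", "team:platform", "managed-by:terraform"]
}

The threshold now has a commit, an author, and a review thread attached to it.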
Tooling by Platform
datadog:
  provider: DataDog/datadog
  resources: datadog_monitor, datadog_dashboard, datadog_service_level_objective
grafana:
  provider: grafana/grafana
  resources: grafana_dashboard, grafana_rule_group
prometheus:
  approach: YAML files in Git, deployed by ArgoCD
  resources: alert rules, recording rules
new_relic:
  provider: newrelic/newrelic
  resources: newrelic_alert_policy, newrelic_one_dashboard
Pick one source of truth. Don't mix.
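If Grafana is your source of truth, the same pattern holds: version the dashboard JSON in the repo and let Terraform own it. A minimal sketch (the folder name and JSON path are placeholders):

resource "grafana_folder" "platform" {
  title = "Platform"
}

resource "grafana_dashboard" "api_gateway" {
  folder      = grafana_folder.platform.id
  config_json = file("${path.module}/dashboards/api_gateway.json")
}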
A Real Example
We have a module that takes a service name and generates a complete observability stack:
module "service_observability" {
source = "./modules/observability"
service_name = "payment-processor"
team_slack = "#payments"
severity_map = {
error_rate_pct = 1.0
p99_latency_ms = 500
saturation_pct = 80
}
slo_targets = {
availability = 0.9995
latency_p99 = 0.99
}
}
One module call creates: 3 dashboards, 8 alerts, 2 SLOs, a Slack channel binding, and a PagerDuty escalation policy.
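Inside the module, each of those resources is just more of the same HCL, parameterized by the inputs above. As a sketch, here is roughly what the availability SLO piece might look like (the metric names are assumptions about your telemetry):

variable "service_name" {
  type = string
}

variable "slo_targets" {
  type = map(number)
}

resource "datadog_service_level_objective" "availability" {
  name = "${var.service_name} availability"
  type = "metric"

  query {
    numerator   = "sum:api.requests{service:${var.service_name},status:ok}.as_count()"
    denominator = "sum:api.requests{service:${var.service_name}}.as_count()"
  }

  thresholds {
    timeframe = "30d"
    target    = var.slo_targets.availability * 100 # 0.9995 -> 99.95
  }

  tags = ["service:${var.service_name}", "managed-by:terraform"]
}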
The Hardest Part
The code is easy. The hard part is:
- Migrating existing click-ops dashboards: budget 2 weeks
- Getting engineers to edit YAML/HCL instead of the UI: budget 3 months of reminders
- Blocking UI edits: some tools let you set dashboards to read-only (see the sketch after this list)
- Reviewing alert changes: PR reviewers need context
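For the "blocking UI edits" item, Datadog dashboards can be restricted to specific roles so only the pipeline's role can modify them. A sketch (the role name is a placeholder; restricted_roles takes role IDs):

data "datadog_role" "platform_admin" {
  filter = "Datadog Admin Role"
}

resource "datadog_dashboard" "api_gateway_locked" {
  title            = "API Gateway - Golden Signals"
  layout_type      = "ordered"
  restricted_roles = [data.datadog_role.platform_admin.id]

  widget {
    timeseries_definition {
      title = "Request Rate (per second)"
      request {
        q = "sum:api.requests{service:gateway}.as_rate()"
      }
    }
  }
}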
The Anti-Pattern to Avoid
Don't write Terraform for every custom chart an engineer wants. That leads to 500-line dashboard modules nobody understands.
Instead, define standard dashboards (golden signals, RED/USE, SLO burn rate) as modules. Let engineers add their own custom dashboards in the UI if they want, but mark them as "explore-only" (not alert-worthy).
Core observability = code. Experimental exploration = UI.
Migration Strategy
Week 1: Pick 1 service, convert its dashboards to Terraform (the import sketch after this plan shows how)
Week 2: Add alerts + SLOs to Terraform
Week 3: Delete the UI versions
Week 4: Create a module from the patterns
Month 2: Roll out to 10 more services
Month 3: Require all new services to use the module
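For Week 1, you rarely want to rebuild dashboards from scratch. Terraform 1.5+ can adopt an existing click-ops dashboard: declare an import block pointing at the dashboard ID from the UI, then run terraform plan -generate-config-out=generated.tf to draft the HCL. A sketch (the ID is a placeholder):

# Adopt an existing UI-built dashboard into state (Terraform 1.5+).
import {
  to = datadog_dashboard.payment_processor
  id = "abc-123-def" # dashboard ID from the UI URL (placeholder)
}

# After `terraform plan -generate-config-out=generated.tf`, review and clean up
# the generated resource, then commit it next to the service's main.tf.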
Six months in, your click-ops debt is gone and your observability is reproducible.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com