TL;DR: I built and open sourced a Kubernetes operator that manages Grafana Cloud dashboards, alert rules, and SLOs as code — with automatic cleanup when services are decommissioned. It solves the "100 orphaned alerts" problem by coupling Grafana resource lifecycle to Kubernetes resource lifecycle.
It was a Tuesday afternoon when someone on the team noticed that Grafana was still sending alerts for a service we'd decommissioned four months ago.
Not one alert. Not five. We found over 100 alert rules in Grafana Cloud that had no corresponding live service. Some went back almost a year. No one cleaned them up — ownership was unclear after teams changed. The alerts just stayed there, quietly firing, quietly getting ignored, quietly eroding trust in the entire alerting system.
That's when I started building the Grafana Cloud Operator.
The Problem With Managing Grafana Manually
If you've worked on a platform team, this scenario is probably familiar. Grafana is great for interactive exploration, but nothing in that workflow guarantees versioning or long-term cleanup. You log in, build a dashboard, tweak an alert, and when you're done, those resources live in Grafana, disconnected from the code and infrastructure they're supposed to be monitoring.
This creates a few compounding problems over time:
Drift. Someone edits a dashboard during an incident at 2am. They add a panel, change a threshold, rename something. That change is never reviewed, never documented, and three months later nobody knows whether the dashboard reflects reality or that one sleep-deprived investigation.
Orphaned resources. Services get renamed, decommissioned, replaced. Grafana doesn't know. Alerts keep firing. Dashboards keep showing flatlines. The noise builds until people stop trusting alerts entirely.
No history. Who changed this alert last Thursday? What did it look like before? Grafana's audit log tells you that something changed, but your Git history — where all your other infrastructure lives — tells you nothing.
We wanted the same workflow for observability resources that we already had for everything else: YAML in Git, PR review, automatic deployment, automatic cleanup.
The Idea: Treat Grafana Resources Like Kubernetes Resources
The solution was conceptually simple. If you can define a Kubernetes Deployment in YAML and have a controller reconcile it to the desired state, you should be able to do the same for a Grafana dashboard or alert rule.
So I built three Kubernetes CRDs — GrafanaAlertRule, GrafanaDashboard, and GrafanaSLO:
# Your alert lives next to your service's Helm chart
apiVersion: monitoring.grafana-operator.io/v1alpha1
kind: GrafanaAlertRule
metadata:
  name: high-error-rate
  namespace: payments-service
spec:
  title: "High Error Rate"
  folder: "payments-service"
  datasourceUid: "grafanacloud-prom"
  condition: "C"
  for: "5m"
  notificationSettings:
    receiver: "payments-pagerduty"
  data:
    # ... Grafana alert query blocks
When you kubectl apply this, the operator syncs it to Grafana Cloud. When the payments-service namespace gets removed, the alert goes with it. No manual cleanup. No orphaned resources.
The alert appears in Grafana Cloud immediately, tagged as operator-managed, with a "This alert rule cannot be edited through the UI" banner — enforcing that all changes must go through Git. The dashboard and SLO resources work the same way, each with their own lifecycle guarantees.
The Part That Actually Solved the Problem: Lifecycle Coupling
The most useful thing the operator does isn't creating resources — it's deleting them.
Every resource the operator creates gets tagged with two things: createdby=operator and cluster=<your-cluster-id>. When a CRD is deleted from Kubernetes (because a service was decommissioned, a namespace was cleaned up, a Helm release was removed), the operator's reconciler fires, checks those tags, and deletes the corresponding resource from Grafana.
No more orphaned alerts. No more dead dashboards.
On top of that, the operator runs a periodic cleanup job (configurable, defaults to every hour) that scans Grafana for any operator-owned resources that don't have a corresponding CRD in the cluster anymore. This catches anything that slipped through — say, if the cluster crashed mid-delete and the reconciler never got to fire.
// Simplified: what happens when a CRD is deleted
func handleAlertDelete(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    uid := grafanautil.GenerateStableUID(req.Namespace, req.Name)

    // Check ownership before touching anything
    existing := fetchAlertFromGrafana(uid)
    if existing.Labels["createdby"] != "operator" {
        return ctrl.Result{}, nil // Not ours, don't touch it
    }
    if existing.Labels["cluster"] != os.Getenv("CLUSTER_ID") {
        return ctrl.Result{}, nil // Different cluster owns this
    }
    if err := deleteAlertFromGrafana(uid); err != nil {
        return ctrl.Result{}, err // requeue so the delete is retried
    }
    return ctrl.Result{}, nil
}
The cluster scoping was important for us because multiple clusters point at the same Grafana Cloud org. Without it, deleting a service from staging would delete its dashboards from production too. Now each cluster only manages resources it created.
Idempotency: Not Spamming the Grafana API
Early versions of the operator would send every reconciled resource to Grafana on every loop, even if nothing had changed. This hammered the API and cluttered the audit log.
The fix was hash-based idempotency. Before making an API call, the reconciler computes a SHA-1 hash of the full payload and stores it in the CRD's status. On the next reconcile, it recomputes the hash and skips the API call if nothing changed:
hash := computeSHA1(payload)
if alert.Status.AlertHash == hash {
    logger.Info("No change in alert rule, skipping sync")
    return ctrl.Result{}, nil
}
A real-world reconciler fires far more often than you'd expect: on pod restarts, leader elections, periodic resyncs. Without this check, every one of those events would hit the Grafana API unnecessarily.
Multi-Cluster Safety
This one bit us during testing. We had two clusters — prod and staging — both managed by their own operator instances, both pointing at the same Grafana Cloud org.
When we deleted a service from staging, its operator dutifully deleted the Grafana dashboard. The problem: prod had a dashboard with the same name, created by prod's operator, and the stable UID generation (SHA1 of namespace + name) produced the same UID for both.
The fix was the cluster label. Every resource created by the operator is tagged with the CLUSTER_ID environment variable. Before any delete or update operation, the operator checks whether the remote resource's cluster tag matches its own. If not, it skips:
if existingCluster != os.Getenv("CLUSTER_ID") {
    logger.Info("Skipping: resource belongs to different cluster",
        "remoteCluster", existingCluster)
    return ctrl.Result{}, nil
}
Simple, but it saved us from a painful production incident.
Grafana Cloud Support — Including the SLO Plugin
Most open source Grafana tooling is built for OSS Grafana. We run Grafana Cloud with the paid SLO plugin, so the operator needed to support that too.
SLOs have their own API (/api/plugins/grafana-slo-app/resources/v1/slo), completely separate from the alert and dashboard APIs. The GrafanaSLO CRD lets you define them declaratively:
apiVersion: monitoring.grafana-operator.io/v1alpha1
kind: GrafanaSLO
metadata:
  name: api-availability
  namespace: payments-service
spec:
  title: "API Availability"
  target: 99.9
  timeWindow: "30d"
  indicator:
    type: ratio
    source: "grafanacloud-prom"
    params:
      good: 'http_requests_total{status!~"5.."}'
      total: "http_requests_total"
Same lifecycle guarantees apply — the SLO gets cleaned up when the service goes away. Grafana auto-generates a companion SLO dashboard when the SLO is created; the operator manages the SLO definition but intentionally leaves that auto-generated dashboard alone.
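If you want to see what the operator is working with, you can hit that endpoint directly. A minimal example, assuming the same GRAFANA_CLOUD_URL and API key the operator uses (Grafana Cloud accepts the key as a bearer token):

# List existing SLOs the way the operator sees them
curl -s -H "Authorization: Bearer $GRAFANA_CLOUD_API_KEY" \
  "$GRAFANA_CLOUD_URL/api/plugins/grafana-slo-app/resources/v1/slo"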
This is the feature that differentiates this operator most from existing tools — the official grafana/grafana-operator doesn't cover the Grafana Cloud SLO plugin, and alert rule support has been an open feature request since 2022.
Drift Correction: Enforcing Git as the Source of Truth
One of the most satisfying things to build was drift correction — the guarantee that if someone manually edits a resource in Grafana, the operator will revert it.
For dashboards, the approach is hash-based: the operator computes a SHA256 of the dashboard JSON spec and stores it in the CRD status. On the next reconcile, if the hash matches the last sync, nothing happens. If it differs, the operator re-pushes the version from Git, overwriting any manual changes.
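In sketch form, with illustrative helper and field names (this mirrors the flow described above, not the operator's exact code):

// Simplified dashboard reconcile: hash the desired spec, skip if unchanged,
// otherwise push Git's version and record the new hash in status.
specJSON := dashboard.Spec.Json // raw dashboard JSON defined in Git
hash := fmt.Sprintf("%x", sha256.Sum256([]byte(specJSON)))
if dashboard.Status.DashboardHash == hash {
    return ctrl.Result{}, nil // matches the last sync, nothing to do
}
if err := pushDashboardToGrafana(ctx, specJSON); err != nil {
    return ctrl.Result{}, err
}
dashboard.Status.DashboardHash = hash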
To force an immediate revert without changing the YAML, there's a force-sync annotation:
kubectl annotate grafanadashboard my-service-overview \
force-sync="$(date)" --overwrite -n my-service
The controller watches for annotation changes (not just spec changes) and bypasses the hash check when it sees a new force-sync value. Within seconds, the dashboard reverts to what's in Git.
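With controller-runtime, watching annotations as well as spec changes is a small amount of wiring. A sketch of how that setup can look (the reconciler and CRD type names are illustrative; the predicates are real controller-runtime types):

import (
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/predicate"
)

// Reconcile on spec (generation) changes OR annotation changes, so a new
// force-sync value triggers a reconcile even when the spec is untouched.
func (r *GrafanaDashboardReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&monitoringv1alpha1.GrafanaDashboard{}).
        WithEventFilter(predicate.Or(
            predicate.GenerationChangedPredicate{},
            predicate.AnnotationChangedPredicate{},
        )).
        Complete(r)
}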
Alert rules behave differently — they're created via Grafana's provisioning API, which locks them from UI editing entirely. You'll see "This alert rule cannot be edited through the UI" in Grafana. That's intentional — the provisioning API enforces that the operator owns the resource, not the UI.
SLOs occupy a middle ground: the operator manages the SLO definition (target, indicator, burn rate rules), and force-sync will revert any changes to those. The auto-generated companion dashboard is Grafana's territory.
The CLI Generator: Lowering the Barrier to Entry
Writing Grafana alert YAML by hand is painful. The query data structure alone has five nested levels. To make it easier for developers to add alerts to their services without needing to understand the full schema, the operator ships with an interactive CLI generator:
$ go run . --generate-alert
🔤 Alert name: payment-timeout
📂 Folder name: payments-service
📝 Title: Payment Processing Timeout
🔌 Datasource (default: grafanacloud-prom): grafanacloud-prom
➕ Add Step B? (y/n): y
📣 Use contact point? (y/n): y
🔔 Contact point name: payments-pagerduty
✅ Alert YAML written to payment-timeout-alert.yaml
It's not glamorous, but it cut the time for a developer to add a new alert from "ask the platform team" to about three minutes.
Debugging When Something Goes Wrong
If a resource isn't appearing in Grafana or isn't getting cleaned up, these four commands cover 90% of cases:
# Operator logs — shows every reconcile, API call, and error
kubectl -n grafana-operator-system logs deploy/grafana-cloud-operator --follow
# Kubernetes events for a specific CRD
kubectl describe grafanaalertrule <name> -n <namespace>
# All events in the operator namespace, newest first
kubectl -n grafana-operator-system get events --sort-by='.lastTimestamp'
# List all operator-managed CRDs across the cluster
kubectl get grafanaalertrule,grafanadashboard,grafanaslo -A
The most common issues are a wrong GRAFANA_CLOUD_URL (no trailing slash), an API key without Editor permissions, or a CLUSTER_ID mismatch causing the operator to skip resources it thinks belong to a different cluster.
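A quick sanity check for the first of those (and the same pattern works for CLUSTER_ID) is to decode what the operator is actually reading from its secret:

# Print the URL the operator uses; make sure there's no trailing slash
kubectl -n grafana-operator-system get secret grafana-operator-secret \
  -o jsonpath='{.data.GRAFANA_CLOUD_URL}' | base64 -d; echo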
What I'd Do Differently
Read the API spec more carefully before writing code. The field for the rule group name in Grafana's alert payload is ruleGroup, not group. The interval belongs to the rule group, not the individual rule. These look obvious in hindsight but cost hours of debugging against the real API.
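For reference, a trimmed sketch of the payload shape that finally worked (only the fields relevant to the gotchas above, not the full schema):

// Trimmed sketch of the alert provisioning payload. Note "ruleGroup",
// and note that there is no interval field on the rule at all.
type AlertRulePayload struct {
    Title     string `json:"title"`
    FolderUID string `json:"folderUID"`
    RuleGroup string `json:"ruleGroup"` // not "group"
    Condition string `json:"condition"`
    For       string `json:"for"`
    // The evaluation interval is set on the rule group itself,
    // via a separate rule-group endpoint.
}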
Add HTTP timeouts on day one. All the Grafana API calls use a default HTTP client with no timeout. If the Grafana API is slow, the operator goroutine hangs indefinitely. It hasn't caused issues yet, but it's a known gap that should be fixed.
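The eventual fix is small; something like a shared client with a hard deadline (the 30-second value is an assumption, not something the operator sets today):

// A slow Grafana API now fails the reconcile instead of hanging it forever.
var grafanaHTTPClient = &http.Client{Timeout: 30 * time.Second}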
Write integration tests earlier. The current test coverage only hits the dry-run paths. The real reconcile logic — ownership checks, deletion, orphan cleanup — isn't tested against a real API. Given that the operator can delete things in production Grafana, that's not a comfortable place to be.
How It Fits Alongside the Official Grafana Operator
The official grafana/grafana-operator is a solid project, and this isn't trying to replace it.
The official operator is built for managing self-hosted Grafana instances on Kubernetes — deploying Grafana itself, managing datasources, plugins, contact points. It's great at what it does.
This operator solves a different problem: managing resources inside an existing Grafana Cloud org — alert rules, dashboards, SLOs — as Kubernetes CRDs with GitOps lifecycle coupling. If you use self-hosted Grafana, the official operator is what you want. If you're on Grafana Cloud and want GitOps observability, this fills the gap.
Try It
The operator is open sourced at github.com/nidhirai968/grafana-cloud-operator. It's v1alpha1: functional and running in production for us, but there's still plenty to improve.
Getting started takes about 5 minutes:
# Install CRDs
kubectl apply -f config/crd/bases/
# Create credentials secret
kubectl create secret generic grafana-operator-secret \
  --from-literal=GRAFANA_CLOUD_URL=https://your-instance.grafana.net \
  --from-literal=GRAFANA_CLOUD_API_KEY=your-api-key \
  --from-literal=CLUSTER_ID=your-cluster-name \
  -n grafana-operator-system
# Deploy
kubectl apply -f config/rbac/
kubectl apply -f config/manager/deployment.yaml
Then try the full lifecycle in one minute:
# 1. Apply a sample alert rule
kubectl apply -f testdata/alert-rule.yaml
# 2. Confirm it synced to Grafana Cloud
kubectl get grafanaalertrule -A
# 3. Check operator logs to see reconcile activity
kubectl -n grafana-operator-system logs deploy/grafana-cloud-operator --follow
# 4. Delete it — watch Grafana Cloud clean up automatically
kubectl delete -f testdata/alert-rule.yaml
Open your Grafana Cloud instance after step 1 and after step 4 — the alert appears and disappears without you touching Grafana.
If you want to contribute — a Helm chart, integration tests, notification policy support — the CONTRIBUTING.md has the full list of known gaps and good starting points.
The thing I didn't expect when building this was how much calmer incident response became once people trusted that Grafana alerts actually meant something. When you know that every alert in the system is owned by a live service, defined in code, and reviewable in Git — you stop ignoring the noise, because there is no noise.
I'm a platform engineer with eight years of experience in DevOps and cloud infrastructure. If this was useful, follow me for more posts on Kubernetes, observability, and platform engineering.
Tags: kubernetes grafana devops platform-engineering observability gitops go slo alerting