<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: infra tools</title>
    <description>The latest articles on DEV Community by infra tools (@infra_tools_97d10de984ee0).</description>
    <link>https://dev.to/infra_tools_97d10de984ee0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3838381%2F01522278-aed9-4f4d-8f2a-8438e2c6cc80.png</url>
      <title>DEV Community: infra tools</title>
      <link>https://dev.to/infra_tools_97d10de984ee0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/infra_tools_97d10de984ee0"/>
    <language>en</language>
    <item>
      <title>We Had 100 Dead Alerts Firing for Services That No Longer Existed. So I Built a Kubernetes Operator.</title>
      <dc:creator>infra tools</dc:creator>
      <pubDate>Sat, 04 Apr 2026 18:37:28 +0000</pubDate>
      <link>https://dev.to/infra_tools_97d10de984ee0/we-had-100-dead-alerts-firing-for-services-that-no-longer-existed-so-i-built-a-kubernetes-operator-5e6d</link>
      <guid>https://dev.to/infra_tools_97d10de984ee0/we-had-100-dead-alerts-firing-for-services-that-no-longer-existed-so-i-built-a-kubernetes-operator-5e6d</guid>
      <description>&lt;p&gt;&lt;em&gt;TL;DR: I built and open sourced a Kubernetes operator that manages Grafana Cloud dashboards, alert rules, and SLOs as code — with automatic cleanup when services are decommissioned. It solves the "100 orphaned alerts" problem by coupling Grafana resource lifecycle to Kubernetes resource lifecycle.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;It was a Tuesday afternoon when someone on the team noticed that Grafana was still sending alerts for a service we'd decommissioned four months ago.&lt;/p&gt;

&lt;p&gt;Not one alert. Not five. We found over 100 alert rules in Grafana Cloud that had no corresponding live service. Some went back almost a year. No one cleaned them up — ownership was unclear after teams changed. The alerts just stayed there, quietly firing, quietly getting ignored, quietly eroding trust in the entire alerting system.&lt;/p&gt;

&lt;p&gt;That's when I started building the Grafana Cloud Operator.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Managing Grafana Manually
&lt;/h2&gt;

&lt;p&gt;If you've worked on a platform team, this scenario is probably familiar. Grafana is great for interactive exploration, but nothing in that workflow versions what you build or cleans it up when it's no longer needed. You log in, build a dashboard, tweak an alert — and when you're done, those resources live only in Grafana, disconnected from the code and infrastructure they're supposed to be monitoring.&lt;/p&gt;

&lt;p&gt;This creates a few compounding problems over time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drift.&lt;/strong&gt; Someone edits a dashboard during an incident at 2am. They add a panel, change a threshold, rename something. That change is never reviewed, never documented, and three months later nobody knows whether the dashboard reflects reality or that one sleep-deprived investigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orphaned resources.&lt;/strong&gt; Services get renamed, decommissioned, replaced. Grafana doesn't know. Alerts keep firing. Dashboards keep showing flatlines. The noise builds until people stop trusting alerts entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No history.&lt;/strong&gt; Who changed this alert last Thursday? What did it look like before? Grafana's audit log tells you &lt;em&gt;that&lt;/em&gt; something changed, but your Git history — where all your other infrastructure lives — tells you nothing.&lt;/p&gt;

&lt;p&gt;We wanted the same workflow for observability resources that we already had for everything else: YAML in Git, PR review, automatic deployment, automatic cleanup.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Idea: Treat Grafana Resources Like Kubernetes Resources
&lt;/h2&gt;

&lt;p&gt;The solution was conceptually simple. If you can define a Kubernetes &lt;code&gt;Deployment&lt;/code&gt; in YAML and have a controller reconcile it to the desired state, you should be able to do the same for a Grafana dashboard or alert rule.&lt;/p&gt;

&lt;p&gt;So I built three Kubernetes CRDs — &lt;code&gt;GrafanaAlertRule&lt;/code&gt;, &lt;code&gt;GrafanaDashboard&lt;/code&gt;, and &lt;code&gt;GrafanaSLO&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Your alert lives next to your service's Helm chart&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring.grafana-operator.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GrafanaAlertRule&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high-error-rate&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;High&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Error&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Rate"&lt;/span&gt;
  &lt;span class="na"&gt;folder&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payments-service"&lt;/span&gt;
  &lt;span class="na"&gt;datasourceUid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grafanacloud-prom"&lt;/span&gt;
  &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C"&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5m"&lt;/span&gt;
  &lt;span class="na"&gt;notificationSettings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payments-pagerduty"&lt;/span&gt;
  &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ... Grafana alert query blocks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you &lt;code&gt;kubectl apply&lt;/code&gt; this, the operator syncs it to Grafana Cloud. When the &lt;code&gt;payments-service&lt;/code&gt; namespace gets removed, the alert goes with it. No manual cleanup. No orphaned resources.&lt;/p&gt;

&lt;p&gt;The alert appears in Grafana Cloud immediately, tagged as operator-managed, with a "This alert rule cannot be edited through the UI" banner — enforcing that all changes must go through Git. The dashboard and SLO resources work the same way, each with their own lifecycle guarantees.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Part That Actually Solved the Problem: Lifecycle Coupling
&lt;/h2&gt;

&lt;p&gt;The most useful thing the operator does isn't creating resources — it's deleting them.&lt;/p&gt;

&lt;p&gt;Every resource the operator creates gets tagged with two things: &lt;code&gt;createdby=operator&lt;/code&gt; and &lt;code&gt;cluster=&amp;lt;your-cluster-id&amp;gt;&lt;/code&gt;. When a CRD is deleted from Kubernetes (because a service was decommissioned, a namespace was cleaned up, a Helm release was removed), the operator's reconciler fires, checks those tags, and deletes the corresponding resource from Grafana.&lt;/p&gt;

&lt;p&gt;No more orphaned alerts. No more dead dashboards.&lt;/p&gt;

&lt;p&gt;On top of that, the operator runs a periodic cleanup job (configurable, defaults to every hour) that scans Grafana for any operator-owned resources that don't have a corresponding CRD in the cluster anymore. This catches anything that slipped through — say, if the cluster crashed mid-delete and the reconciler never got to fire.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Simplified: what happens when a CRD is deleted&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;handleAlertDelete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;uid&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;grafanautil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GenerateStableUID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Check ownership before touching anything&lt;/span&gt;
    &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fetchAlertFromGrafana&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"createdby"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;"operator"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="c"&gt;// Not ours, don't touch it&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"cluster"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CLUSTER_ID"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="c"&gt;// Different cluster owns this&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;deleteAlertFromGrafana&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
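&lt;p&gt;The periodic cleanup pass boils down to a set difference: everything in Grafana that this operator owns, minus everything the cluster still declares. Here's a minimal, stdlib-only sketch of that decision; &lt;code&gt;RemoteAlert&lt;/code&gt; and the in-memory lists are hypothetical stand-ins for the operator's actual Grafana client and CRD lister:&lt;/p&gt;

```go
package main

import "fmt"

// RemoteAlert is a stand-in for an alert rule fetched from the Grafana API.
type RemoteAlert struct {
	UID    string
	Labels map[string]string
}

// orphanedUIDs returns the UIDs of operator-owned, this-cluster alerts in
// Grafana that no longer have a matching CRD in the cluster.
func orphanedUIDs(remote []RemoteAlert, liveCRDUIDs map[string]bool, clusterID string) []string {
	var orphans []string
	for _, r := range remote {
		if r.Labels["createdby"] != "operator" {
			continue // not managed by any operator instance
		}
		if r.Labels["cluster"] != clusterID {
			continue // owned by a different cluster
		}
		if !liveCRDUIDs[r.UID] {
			orphans = append(orphans, r.UID) // owned by us, but no CRD backs it
		}
	}
	return orphans
}

func main() {
	remote := []RemoteAlert{
		{UID: "a1", Labels: map[string]string{"createdby": "operator", "cluster": "prod"}},
		{UID: "a2", Labels: map[string]string{"createdby": "operator", "cluster": "staging"}},
		{UID: "a3", Labels: map[string]string{"createdby": "human"}},
	}
	live := map[string]bool{} // no CRDs left in the cluster
	fmt.Println(orphanedUIDs(remote, live, "prod")) // only a1 qualifies
}
```

&lt;p&gt;The real job would then issue a delete for each returned UID, guarded by the same ownership checks the reconciler uses.&lt;/p&gt;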



&lt;p&gt;The cluster scoping was important for us because multiple clusters point at the same Grafana Cloud org. Without it, deleting a service from staging would delete its dashboards from production too. Now each cluster only manages resources it created.&lt;/p&gt;




&lt;h2&gt;
  
  
  Idempotency: Not Spamming the Grafana API
&lt;/h2&gt;

&lt;p&gt;Early versions of the operator would send every reconciled resource to Grafana on every loop, even if nothing had changed. This hammered the API and cluttered the audit log.&lt;/p&gt;

&lt;p&gt;The fix was hash-based idempotency. Before making an API call, the reconciler computes a SHA-1 hash of the full payload and stores it in the CRD's status. On the next reconcile, it recomputes the hash and skips the API call if nothing changed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;computeSHA1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;alert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AlertHash&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"No change in alert rule, skipping sync"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A real-world reconciler fires a lot more than you'd expect — on pod restarts, leader elections, periodic resyncs. Without this check, every one of those events would hit the Grafana API unnecessarily.&lt;/p&gt;
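&lt;p&gt;The hash itself is nothing exotic. Here's a sketch of what a &lt;code&gt;computeSHA1&lt;/code&gt; over the payload could look like; the &lt;code&gt;alertPayload&lt;/code&gt; struct is a hypothetical stand-in, and the approach relies on Go's &lt;code&gt;encoding/json&lt;/code&gt; emitting struct fields in declaration order, so equal payloads always marshal to identical bytes:&lt;/p&gt;

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// alertPayload is a hypothetical stand-in for the body sent to Grafana.
type alertPayload struct {
	Title     string `json:"title"`
	Condition string `json:"condition"`
	For       string `json:"for"`
}

// computeSHA1 hashes the JSON encoding of the payload. Struct fields marshal
// in declaration order, so equal payloads hash identically.
func computeSHA1(p alertPayload) string {
	b, _ := json.Marshal(p)
	sum := sha1.Sum(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	a := alertPayload{Title: "High Error Rate", Condition: "C", For: "5m"}
	b := a
	b.For = "10m"
	fmt.Println(computeSHA1(a) == computeSHA1(a)) // true: stable across reconciles
	fmt.Println(computeSHA1(a) == computeSHA1(b)) // false: spec changed, sync needed
}
```

&lt;p&gt;Collision resistance doesn't matter here; the hash is a change detector, not a security boundary.&lt;/p&gt;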




&lt;h2&gt;
  
  
  Multi-Cluster Safety
&lt;/h2&gt;

&lt;p&gt;This one bit us during testing. We had two clusters — prod and staging — both managed by their own operator instances, both pointing at the same Grafana Cloud org.&lt;/p&gt;

&lt;p&gt;When we deleted a service from staging, its operator dutifully deleted the Grafana dashboard. The problem: prod had a dashboard with the same name, created by prod's operator, and the stable UID generation (SHA1 of namespace + name) produced the same UID for both.&lt;/p&gt;
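&lt;p&gt;For context, a stable UID derivation like the one described above can be as small as this sketch (the operator's actual &lt;code&gt;GenerateStableUID&lt;/code&gt; may differ in detail), and it makes the collision obvious: the cluster never enters the hash.&lt;/p&gt;

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// GenerateStableUID derives a deterministic Grafana UID from a CRD's
// namespace and name, so create, update, and delete can all address the
// remote resource without storing any state.
func GenerateStableUID(namespace, name string) string {
	sum := sha1.Sum([]byte(namespace + "/" + name))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Same inputs always yield the same UID...
	fmt.Println(GenerateStableUID("payments-service", "high-error-rate"))
	// ...which also means two clusters with identical namespace+name
	// collide, hence the cluster label check.
	fmt.Println(GenerateStableUID("payments-service", "high-error-rate") == GenerateStableUID("payments-service", "high-error-rate"))
}
```

&lt;p&gt;A SHA-1 hex digest is 40 characters, which fits Grafana's 40-character UID limit exactly.&lt;/p&gt;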

&lt;p&gt;The fix was the &lt;code&gt;cluster&lt;/code&gt; label. Every resource created by the operator is tagged with the &lt;code&gt;CLUSTER_ID&lt;/code&gt; environment variable. Before any delete or update operation, the operator checks whether the remote resource's cluster tag matches its own. If not, it skips:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existingCluster&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CLUSTER_ID"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Skipping: resource belongs to different cluster"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"remoteCluster"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;existingCluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, but it saved us from a painful production incident.&lt;/p&gt;




&lt;h2&gt;
  
  
  Grafana Cloud Support — Including the SLO Plugin
&lt;/h2&gt;

&lt;p&gt;Most open source Grafana tooling is built for OSS Grafana. We run Grafana Cloud with the paid SLO plugin, so the operator needed to support that too.&lt;/p&gt;

&lt;p&gt;SLOs have their own API (&lt;code&gt;/api/plugins/grafana-slo-app/resources/v1/slo&lt;/code&gt;) which is completely separate from the alert and dashboard APIs. The &lt;code&gt;GrafanaSLO&lt;/code&gt; CRD lets you define them declaratively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring.grafana-operator.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GrafanaSLO&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-availability&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Availability"&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;99.9&lt;/span&gt;
  &lt;span class="na"&gt;timeWindow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;30d"&lt;/span&gt;
  &lt;span class="na"&gt;indicator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ratio&lt;/span&gt;
    &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grafanacloud-prom"&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;good&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http_requests_total{status!~"5.."}'&lt;/span&gt;
      &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_requests_total"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same lifecycle guarantees apply — the SLO gets cleaned up when the service goes away. Grafana auto-generates a companion SLO dashboard when the SLO is created; the operator manages the SLO definition but intentionally leaves that auto-generated dashboard alone.&lt;/p&gt;

&lt;p&gt;This is the feature that differentiates this operator most from existing tools — the &lt;a href="https://github.com/grafana/grafana-operator" rel="noopener noreferrer"&gt;official grafana/grafana-operator&lt;/a&gt; doesn't cover the Grafana Cloud SLO plugin, and &lt;a href="https://github.com/grafana/grafana-operator/issues/911" rel="noopener noreferrer"&gt;alert rule support has been an open feature request since 2022&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Drift Correction: Enforcing Git as the Source of Truth
&lt;/h2&gt;

&lt;p&gt;One of the most satisfying things to build was drift correction — the guarantee that if someone manually edits a resource in Grafana, the operator will revert it.&lt;/p&gt;

&lt;p&gt;For dashboards, the approach is hash-based: the operator computes a SHA256 of the dashboard JSON spec and stores it in the CRD status. On the next reconcile, if the hash matches the last sync, nothing happens. If it differs, the operator re-pushes the version from Git, overwriting any manual changes.&lt;/p&gt;

&lt;p&gt;To force an immediate revert without changing the YAML, there's a &lt;code&gt;force-sync&lt;/code&gt; annotation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl annotate grafanadashboard my-service-overview &lt;span class="se"&gt;\&lt;/span&gt;
  force-sync&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--overwrite&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; my-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The controller watches for annotation changes (not just spec changes) and bypasses the hash check when it sees a new &lt;code&gt;force-sync&lt;/code&gt; value. Within seconds, the dashboard reverts to what's in Git.&lt;/p&gt;
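&lt;p&gt;In controller-runtime terms that's an update predicate. Stripped of the machinery, the decision is just an annotation comparison; this sketch assumes the annotation key is literally &lt;code&gt;force-sync&lt;/code&gt;, as in the command above:&lt;/p&gt;

```go
package main

import "fmt"

const forceSyncKey = "force-sync"

// shouldForceSync reports whether the force-sync annotation changed between
// two observed versions of the object, which bypasses the hash check.
func shouldForceSync(oldAnnotations, newAnnotations map[string]string) bool {
	return oldAnnotations[forceSyncKey] != newAnnotations[forceSyncKey]
}

func main() {
	before := map[string]string{"force-sync": "Mon Jan 5"}
	after := map[string]string{"force-sync": "Tue Jan 6"}
	fmt.Println(shouldForceSync(before, after))  // true: re-push from Git
	fmt.Println(shouldForceSync(before, before)) // false: nothing to do
}
```

&lt;p&gt;Wired into the controller's update predicate, any new value (a timestamp, a ticket ID, anything) forces one reconcile that skips the hash comparison.&lt;/p&gt;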

&lt;p&gt;Alert rules behave differently — they're created via Grafana's provisioning API, which locks them from UI editing entirely. You'll see "This alert rule cannot be edited through the UI" in Grafana. That's intentional — the provisioning API enforces that the operator owns the resource, not the UI.&lt;/p&gt;

&lt;p&gt;SLOs occupy a middle ground: the operator manages the SLO definition (target, indicator, burn rate rules), and force-sync will revert any changes to those. The auto-generated companion dashboard is Grafana's territory.&lt;/p&gt;




&lt;h2&gt;
  
  
  The CLI Generator: Lowering the Barrier to Entry
&lt;/h2&gt;

&lt;p&gt;Writing Grafana alert YAML by hand is painful. The query data structure alone has five nested levels. To make it easier for developers to add alerts to their services without needing to understand the full schema, the operator ships with an interactive CLI generator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;go run &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--generate-alert&lt;/span&gt;

🔤 Alert name: payment-timeout
📂 Folder name: payments-service
📝 Title: Payment Processing Timeout
🔌 Datasource &lt;span class="o"&gt;(&lt;/span&gt;default: grafanacloud-prom&lt;span class="o"&gt;)&lt;/span&gt;: grafanacloud-prom
➕ Add Step B? &lt;span class="o"&gt;(&lt;/span&gt;y/n&lt;span class="o"&gt;)&lt;/span&gt;: y
📣 Use contact point? &lt;span class="o"&gt;(&lt;/span&gt;y/n&lt;span class="o"&gt;)&lt;/span&gt;: y
🔔 Contact point name: payments-pagerduty

✅ Alert YAML written to payment-timeout-alert.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's not glamorous, but it cut the time for a developer to add a new alert from "ask the platform team" to about three minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Debugging When Something Goes Wrong
&lt;/h2&gt;

&lt;p&gt;If a resource isn't appearing in Grafana or isn't getting cleaned up, these four commands cover 90% of cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Operator logs — shows every reconcile, API call, and error&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; grafana-operator-system logs deploy/grafana-cloud-operator &lt;span class="nt"&gt;--follow&lt;/span&gt;

&lt;span class="c"&gt;# Kubernetes events for a specific CRD&lt;/span&gt;
kubectl describe grafanaalertrule &amp;lt;name&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &amp;lt;namespace&amp;gt;

&lt;span class="c"&gt;# All events in the operator namespace, newest first&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; grafana-operator-system get events &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'.lastTimestamp'&lt;/span&gt;

&lt;span class="c"&gt;# List all operator-managed CRDs across the cluster&lt;/span&gt;
kubectl get grafanaalertrule,grafanadashboard,grafanaslo &lt;span class="nt"&gt;-A&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most common issues are a wrong &lt;code&gt;GRAFANA_CLOUD_URL&lt;/code&gt; (no trailing slash), an API key without Editor permissions, or a &lt;code&gt;CLUSTER_ID&lt;/code&gt; mismatch causing the operator to skip resources it thinks belong to a different cluster.&lt;/p&gt;
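&lt;p&gt;The trailing-slash problem is cheap to rule out at startup. A defensive sketch; the helper name and the startup check are illustrative, not the operator's actual validation:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeBaseURL trims the trailing slash that breaks path joining
// against the Grafana API, and fails fast when the URL is missing.
func normalizeBaseURL(raw string) (string, error) {
	if raw == "" {
		return "", fmt.Errorf("GRAFANA_CLOUD_URL is not set")
	}
	return strings.TrimSuffix(raw, "/"), nil
}

func main() {
	u, _ := normalizeBaseURL("https://your-instance.grafana.net/")
	fmt.Println(u) // https://your-instance.grafana.net
}
```

&lt;p&gt;Failing fast at startup turns a misconfigured URL into one obvious log line instead of a stream of confusing API errors.&lt;/p&gt;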




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Read the API spec more carefully before writing code.&lt;/strong&gt; The field for the rule group name in Grafana's alert payload is &lt;code&gt;ruleGroup&lt;/code&gt;, not &lt;code&gt;group&lt;/code&gt;. The interval belongs to the rule group, not the individual rule. These look obvious in hindsight but cost hours of debugging against the real API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add HTTP timeouts on day one.&lt;/strong&gt; All the Grafana API calls use a default HTTP client with no timeout. If the Grafana API is slow, the operator goroutine hangs indefinitely. It hasn't caused issues yet, but it's a known gap that should be fixed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write integration tests earlier.&lt;/strong&gt; The current test coverage only hits the dry-run paths. The real reconcile logic — ownership checks, deletion, orphan cleanup — isn't tested against a real API. Given that the operator can delete things in production Grafana, that's not a comfortable place to be.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Fits Alongside the Official Grafana Operator
&lt;/h2&gt;

&lt;p&gt;The official &lt;a href="https://github.com/grafana/grafana-operator" rel="noopener noreferrer"&gt;grafana/grafana-operator&lt;/a&gt; is a solid project and this isn't trying to replace it.&lt;/p&gt;

&lt;p&gt;The official operator is built for managing &lt;strong&gt;self-hosted Grafana instances&lt;/strong&gt; on Kubernetes — deploying Grafana itself, managing datasources, plugins, contact points. It's great at what it does.&lt;/p&gt;

&lt;p&gt;This operator solves a different problem: managing &lt;strong&gt;resources inside an existing Grafana Cloud org&lt;/strong&gt; — alert rules, dashboards, SLOs — as Kubernetes CRDs with GitOps lifecycle coupling. If you use self-hosted Grafana, the official operator is what you want. If you're on Grafana Cloud and want GitOps observability, this fills the gap.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The operator is open sourced at &lt;strong&gt;&lt;a href="https://github.com/nidhirai968/grafana-cloud-operator" rel="noopener noreferrer"&gt;github.com/nidhirai968/grafana-cloud-operator&lt;/a&gt;&lt;/strong&gt;. It's &lt;code&gt;v1alpha1&lt;/code&gt; — functional and running in production, but there's still plenty to improve.&lt;/p&gt;

&lt;p&gt;Getting started takes about 5 minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install CRDs&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; config/crd/bases/

&lt;span class="c"&gt;# Create credentials secret&lt;/span&gt;
kubectl create secret generic grafana-operator-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;GRAFANA_CLOUD_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://your-instance.grafana.net &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;GRAFANA_CLOUD_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-api-key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;CLUSTER_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-cluster-name &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; grafana-operator-system

&lt;span class="c"&gt;# Deploy&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; config/rbac/
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; config/manager/deployment.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then try the full lifecycle in one minute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Apply a sample alert rule&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; testdata/alert-rule.yaml

&lt;span class="c"&gt;# 2. Confirm it synced to Grafana Cloud&lt;/span&gt;
kubectl get grafanaalertrule &lt;span class="nt"&gt;-A&lt;/span&gt;

&lt;span class="c"&gt;# 3. Check operator logs to see reconcile activity&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; grafana-operator-system logs deploy/grafana-cloud-operator &lt;span class="nt"&gt;--follow&lt;/span&gt;

&lt;span class="c"&gt;# 4. Delete it — watch Grafana Cloud clean up automatically&lt;/span&gt;
kubectl delete &lt;span class="nt"&gt;-f&lt;/span&gt; testdata/alert-rule.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open your Grafana Cloud instance after step 1 and after step 4 — the alert appears and disappears without you touching Grafana.&lt;/p&gt;

&lt;p&gt;If you want to contribute — a Helm chart, integration tests, notification policy support — the &lt;a href="https://github.com/nidhirai968/grafana-cloud-operator/blob/main/CONTRIBUTING.md" rel="noopener noreferrer"&gt;CONTRIBUTING.md&lt;/a&gt; has the full list of known gaps and good starting points.&lt;/p&gt;




&lt;p&gt;The thing I didn't expect when building this was how much calmer incident response became once people trusted that Grafana alerts actually meant something. When you know that every alert in the system is owned by a live service, defined in code, and reviewable in Git — you stop ignoring the noise, because there is no noise.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm a platform engineer with 8 years building DevOps and cloud infrastructure. If this was useful, follow me for more posts on Kubernetes, observability, and platform engineering.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;kubernetes&lt;/code&gt; &lt;code&gt;grafana&lt;/code&gt; &lt;code&gt;devops&lt;/code&gt; &lt;code&gt;platform-engineering&lt;/code&gt; &lt;code&gt;observability&lt;/code&gt; &lt;code&gt;gitops&lt;/code&gt; &lt;code&gt;go&lt;/code&gt; &lt;code&gt;slo&lt;/code&gt; &lt;code&gt;alerting&lt;/code&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>grafana</category>
      <category>devops</category>
      <category>go</category>
    </item>
  </channel>
</rss>
