<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Soumyajyoti Mahalanobish</title>
    <description>The latest articles on DEV Community by Soumyajyoti Mahalanobish (@soumyajyoti-devops).</description>
    <link>https://dev.to/soumyajyoti-devops</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3138245%2Fa0497aef-784e-49d7-b94a-126c4f8fd20a.png</url>
      <title>DEV Community: Soumyajyoti Mahalanobish</title>
      <link>https://dev.to/soumyajyoti-devops</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/soumyajyoti-devops"/>
    <language>en</language>
    <item>
      <title>[Boost]</title>
      <dc:creator>Soumyajyoti Mahalanobish</dc:creator>
      <pubDate>Tue, 01 Jul 2025 07:57:38 +0000</pubDate>
      <link>https://dev.to/soumyajyoti-devops/-30jg</link>
      <guid>https://dev.to/soumyajyoti-devops/-30jg</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/soumyajyoti-devops/monitoring-celery-workers-with-flower-your-tasks-need-babysitting-3ime" class="crayons-story__hidden-navigation-link"&gt;Monitoring Celery Workers with Flower: Your Tasks Need Babysitting&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/soumyajyoti-devops" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3138245%2Fa0497aef-784e-49d7-b94a-126c4f8fd20a.png" alt="soumyajyoti-devops profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/soumyajyoti-devops" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Soumyajyoti Mahalanobish
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Soumyajyoti Mahalanobish
                
              
              &lt;div id="story-author-preview-content-2642299" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/soumyajyoti-devops" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3138245%2Fa0497aef-784e-49d7-b94a-126c4f8fd20a.png" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Soumyajyoti Mahalanobish&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/soumyajyoti-devops/monitoring-celery-workers-with-flower-your-tasks-need-babysitting-3ime" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jul 1 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/soumyajyoti-devops/monitoring-celery-workers-with-flower-your-tasks-need-babysitting-3ime" id="article-link-2642299"&gt;
          Monitoring Celery Workers with Flower: Your Tasks Need Babysitting
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/opensource"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;opensource&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/python"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;python&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/monitoring"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;monitoring&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/sitereliabilityengineering"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;sitereliabilityengineering&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/soumyajyoti-devops/monitoring-celery-workers-with-flower-your-tasks-need-babysitting-3ime#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            8 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>opensource</category>
      <category>python</category>
      <category>monitoring</category>
      <category>sitereliabilityengineering</category>
    </item>
    <item>
      <title>Monitoring Celery Workers with Flower: Your Tasks Need Babysitting</title>
      <dc:creator>Soumyajyoti Mahalanobish</dc:creator>
      <pubDate>Tue, 01 Jul 2025 07:57:22 +0000</pubDate>
      <link>https://dev.to/soumyajyoti-devops/monitoring-celery-workers-with-flower-your-tasks-need-babysitting-3ime</link>
      <guid>https://dev.to/soumyajyoti-devops/monitoring-celery-workers-with-flower-your-tasks-need-babysitting-3ime</guid>
      <description>&lt;p&gt;So you've got Celery workers happily executing tasks in your Kubernetes cluster, but you're flying blind. Your workers could be on fire, stuck in an endless queue, and you'd be the one to blame here. &lt;/p&gt;

&lt;p&gt;We've all been there&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia4.giphy.com%2Fmedia%2Fv1.Y2lkPTc5MGI3NjExendjdWRybXFydmpycDA2YXFhcDd3YTFuM283aDJ3cWEzNGtuaGt0dyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw%2FgNWHtg6zaW8uMjdPlw%2Fgiphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia4.giphy.com%2Fmedia%2Fv1.Y2lkPTc5MGI3NjExendjdWRybXFydmpycDA2YXFhcDd3YTFuM283aDJ3cWEzNGtuaGt0dyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw%2FgNWHtg6zaW8uMjdPlw%2Fgiphy.gif" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where we're staring at logs hoping to divine the health of our distributed systems. Time to set up some proper monitoring.&lt;/p&gt;

&lt;p&gt;Celery is one way of doing distributed task processing, but it's opaque when it comes to observability. You can see logs, but logs don't tell you if workers are healthy, how long tasks are taking, or whether your queue is backing up. That's where Flower comes in, it's the one of the monitoring tools for Celery environments.&lt;/p&gt;

&lt;p&gt;This guide covers integrating Flower with Prometheus and Grafana to get proper metrics-driven monitoring. Whether you're using Grafana Cloud, self-hosted Grafana, the k8s-monitoring Helm chart, or individual components, we'll walk through the setup, explain why each piece matters, and tackle the gotchas.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You'll Need
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes cluster with Celery workers already running&lt;/li&gt;
&lt;li&gt;Some form of Prometheus-compatible metrics collection (Alloy, Prometheus Operator, plain Prometheus, etc.)&lt;/li&gt;
&lt;li&gt;Grafana instance (cloud or self-hosted)&lt;/li&gt;
&lt;li&gt;Basic Kubernetes knowledge&lt;/li&gt;
&lt;li&gt;Patience for the inevitable configuration mysteries&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Understanding the Architecture
&lt;/h2&gt;

&lt;p&gt;Before diving into configuration, let's understand what we're building. Flower sits between your Celery workers and your monitoring system. It connects to your message broker (Redis/RabbitMQ), watches worker activity, and exposes metrics in Prometheus format.&lt;/p&gt;

&lt;p&gt;The flow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Celery workers process tasks from the broker&lt;/li&gt;
&lt;li&gt;Flower monitors the broker and worker activity&lt;/li&gt;
&lt;li&gt;Flower exposes metrics at &lt;code&gt;/metrics&lt;/code&gt; endpoint&lt;/li&gt;
&lt;li&gt;Your metrics collector (Prometheus/Alloy) scrapes these metrics&lt;/li&gt;
&lt;li&gt;Grafana visualizes the data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight is that Flower doesn't directly monitor workers, it monitors the broker's state and worker events, which is why it can give you a complete picture of your distributed system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup: Flower with Prometheus Metrics
&lt;/h2&gt;

&lt;p&gt;Here's the thing about Flower, it's great at showing you pretty graphs in its web UI, but getting it to export metrics for Prometheus requires a specific flag that's easy to miss. By default, Flower only exposes basic Python process metrics, which are useless for understanding your Celery workload.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy Flower (the Right Way)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flower&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flower&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flower&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flower&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flower&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mher/flower:latest&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5555&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CELERY_BROKER_URL&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis://your-redis-service:6379/0"&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;celery&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;flower&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--broker=redis://your-redis-service:6379/0&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--port=5555&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--prometheus_metrics&lt;/span&gt;  &lt;span class="c1"&gt;# This is the magic flag&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;--prometheus_metrics&lt;/code&gt; flag is doing the heavy lifting here. Without it, you'll get basic Python process metrics (memory usage, GC stats, etc.) but none of the Celery-specific goodness like worker status, task counts, or queue depths. This flag tells Flower to export its internal monitoring data in Prometheus format.&lt;/p&gt;

&lt;p&gt;The broker URL needs to match exactly what your Celery workers are using. Flower connects to the same broker to observe worker activity and task flow. If there's a mismatch, Flower won't see your workers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flower-service&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flower&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flower&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;metrics&lt;/span&gt;  &lt;span class="c1"&gt;# Named ports help with service discovery&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5555&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5555&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The named port (&lt;code&gt;metrics&lt;/code&gt;) is crucial for ServiceMonitor configurations later. Many monitoring setups rely on port names rather than numbers for service discovery, making your configuration more resilient to port changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics Collection: Choose Your Adventure
&lt;/h2&gt;

&lt;p&gt;How you get these metrics into your monitoring system depends entirely on your infrastructure setup. Kubernetes monitoring has evolved into several different patterns, each with its own tradeoffs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: ServiceMonitor (Prometheus Operator/k8s-monitoring)
&lt;/h3&gt;

&lt;p&gt;ServiceMonitors are part of the Prometheus Operator ecosystem and provide declarative configuration for scrape targets. They're the cleanest approach if you're using Prometheus Operator or the k8s-monitoring Helm chart.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring.coreos.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceMonitor&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flower-metrics&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flower&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flower&lt;/span&gt;
  &lt;span class="na"&gt;endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;metrics&lt;/span&gt;      &lt;span class="c1"&gt;# References the named port&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/metrics&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
    &lt;span class="na"&gt;scrapeTimeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;20s&lt;/span&gt;
  &lt;span class="na"&gt;namespaceSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchNames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;your-namespace&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical detail here is &lt;code&gt;port: metrics&lt;/code&gt; vs &lt;code&gt;targetPort: metrics&lt;/code&gt;. ServiceMonitors reference the service's port definition, not the container port directly. This indirection allows you to change container ports without updating monitoring configs.&lt;/p&gt;

&lt;p&gt;Getting this configuration right requires the same attention to detail as any other infrastructure code.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia4.giphy.com%2Fmedia%2Fv1.Y2lkPTc5MGI3NjExb3R3eXJxNGJxajZzc2RkbTF3bWUxM241MzEzMjUzNXppNWQ1MnJpNyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw%2FJWnXY237vWeX3zx64V%2Fgiphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia4.giphy.com%2Fmedia%2Fv1.Y2lkPTc5MGI3NjExb3R3eXJxNGJxajZzc2RkbTF3bWUxM241MzEzMjUzNXppNWQ1MnJpNyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw%2FJWnXY237vWeX3zx64V%2Fgiphy.gif" width="450" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One character difference can mean the difference between working monitoring and hours of debugging.&lt;/p&gt;

&lt;p&gt;Here, the &lt;code&gt;namespaceSelector&lt;/code&gt; restricts which namespaces this ServiceMonitor applies to. Without it, the ServiceMonitor tries to find matching services across all namespaces, which can cause confusion in multitenant clusters.&lt;/p&gt;
&lt;h3&gt;
  
  
  Option 2: Prometheus Annotations
&lt;/h3&gt;

&lt;p&gt;If you're using vanilla Prometheus with annotation based discovery, you configure scraping through service annotations. This is simpler but less flexible than ServiceMonitors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flower-service&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flower&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;prometheus.io/scrape&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;prometheus.io/port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5555"&lt;/span&gt;
    &lt;span class="na"&gt;prometheus.io/path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/metrics"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flower&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;metrics&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5555&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5555&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The annotations tell Prometheus to scrape this service. Your Prometheus configuration needs to include a job that discovers services with these annotations. This approach is more straightforward but offers less control over scraping behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 3: Alloy Configuration (Manual)
&lt;/h3&gt;

&lt;p&gt;Grafana Alloy offers more flexibility than traditional Prometheus. You can configure complex discovery and relabeling rules to handle dynamic environments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add to your Alloy config&lt;/span&gt;
&lt;span class="s"&gt;discovery.kubernetes "flower_pods" {&lt;/span&gt;
  &lt;span class="s"&gt;role = "pod"&lt;/span&gt;
  &lt;span class="s"&gt;selectors {&lt;/span&gt;
    &lt;span class="s"&gt;role = "pod"&lt;/span&gt;
    &lt;span class="s"&gt;label = "app=flower"&lt;/span&gt;
  &lt;span class="s"&gt;}&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;

&lt;span class="s"&gt;discovery.relabel "flower_pods" {&lt;/span&gt;
  &lt;span class="s"&gt;targets = discovery.kubernetes.flower_pods.targets&lt;/span&gt;
  &lt;span class="s"&gt;rule {&lt;/span&gt;
    &lt;span class="s"&gt;source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"]&lt;/span&gt;
    &lt;span class="s"&gt;action = "keep"&lt;/span&gt;
    &lt;span class="s"&gt;regex = "true"&lt;/span&gt;
  &lt;span class="s"&gt;}&lt;/span&gt;
  &lt;span class="s"&gt;rule {&lt;/span&gt;
    &lt;span class="s"&gt;source_labels = ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"]&lt;/span&gt;
    &lt;span class="s"&gt;action = "replace"&lt;/span&gt;
    &lt;span class="s"&gt;regex = "([^:]+)(?:\\d+)?;(\\d+)"&lt;/span&gt;
    &lt;span class="s"&gt;replacement = "${1}:${2}"&lt;/span&gt;
    &lt;span class="s"&gt;target_label = "__address__"&lt;/span&gt;
  &lt;span class="s"&gt;}&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;

&lt;span class="s"&gt;prometheus.scrape "flower_metrics" {&lt;/span&gt;
  &lt;span class="s"&gt;targets    = discovery.relabel.flower_pods.output&lt;/span&gt;
  &lt;span class="s"&gt;forward_to = [prometheus.remote_write.your_destination.receiver]&lt;/span&gt;
  &lt;span class="s"&gt;scrape_interval = "30s"&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration discovers pods with the &lt;code&gt;app=flower&lt;/code&gt; label, applies relabeling rules to construct proper scrape targets, and forwards metrics to your storage backend. The relabeling rules transform Kubernetes metadata into the format Prometheus expects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 4: Static Prometheus Config
&lt;/h3&gt;

&lt;p&gt;For simple setups or development environments, static configuration is the most straightforward approach.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# prometheus.yml&lt;/span&gt;
&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;flower'&lt;/span&gt;
    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;flower-service.your-namespace.svc.cluster.local:5555'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;metrics_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/metrics'&lt;/span&gt;
    &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This hardcodes the service endpoint, which works fine for stable environments but doesn't handle dynamic scaling or service changes gracefully.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verification: Making Sure It Actually Works
&lt;/h2&gt;

&lt;p&gt;Before diving into dashboard creation, verify that metrics are flowing correctly. This saves hours of troubleshooting later when you're wondering why your graphs are empty.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check the Metrics Endpoint
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/flower-service 5555:5555
curl http://localhost:5555/metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see metrics that look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flower_worker_online{worker="celery@worker-1"} 1.0
flower_events_total{task="process_data",type="task-sent"} 127.0
flower_worker_number_of_currently_executing_tasks{worker="celery@worker-1"} 3.0
flower_task_prefetch_time_seconds{task="process_data",worker="celery@worker-1"} 0.001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're only seeing basic Python metrics (&lt;code&gt;python_gc_objects_collected_total&lt;/code&gt;, &lt;code&gt;process_resident_memory_bytes&lt;/code&gt;, etc.), you're missing the &lt;code&gt;--prometheus_metrics&lt;/code&gt; flag. The Celery-specific metrics are what make this whole exercise worthwhile.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check Your Monitoring System
&lt;/h3&gt;

&lt;p&gt;The verification process depends on your monitoring setup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For ServiceMonitor setups&lt;/strong&gt;: Check the Prometheus Operator or Alloy UI for discovered targets. Look for your Flower service in the targets list with status "UP".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For annotation-based&lt;/strong&gt;: Navigate to your Prometheus targets page (&lt;code&gt;/targets&lt;/code&gt;) and verify the Flower job appears with healthy status.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For manual configs&lt;/strong&gt;: Check your collector's logs for any scraping errors and verify the target appears in the monitoring system's target list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Metrics
&lt;/h2&gt;

&lt;p&gt;Flower exports several categories of metrics, each providing different insights into your Celery system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worker Metrics&lt;/strong&gt;: &lt;code&gt;flower_worker_online&lt;/code&gt; tells you which workers are active. &lt;code&gt;flower_worker_number_of_currently_executing_tasks&lt;/code&gt; shows current load per worker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event Metrics&lt;/strong&gt;: &lt;code&gt;flower_events_total&lt;/code&gt; tracks task lifecycle events (sent, received, started, succeeded, failed). These form the basis for throughput and success rate calculations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timing Metrics&lt;/strong&gt;: &lt;code&gt;flower_task_runtime_seconds&lt;/code&gt; (histogram) shows task execution duration. &lt;code&gt;flower_task_prefetch_time_seconds&lt;/code&gt; measures queue wait time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Queue Metrics&lt;/strong&gt;: Various metrics help you understand queue depth and processing patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Useful Dashboards
&lt;/h2&gt;

&lt;p&gt;Now for the payoff - turning those metrics into actionable insights. The key is building dashboards that help you answer specific operational questions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Essential Queries
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Worker Health Questions&lt;/strong&gt;: "Are my workers running? How many are active?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Total online workers
sum(flower_worker_online)

# Per-worker status
flower_worker_online
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Throughput Questions&lt;/strong&gt;: "How many tasks are we processing? Is throughput increasing?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Tasks being sent to workers (per second)
rate(flower_events_total{type="task-sent"}[5m])

# Tasks being processed (per second)
rate(flower_events_total{type="task-received"}[5m])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Queue Health Questions&lt;/strong&gt;: "Is my queue backing up? How long do tasks wait?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Tasks currently executing
sum(flower_worker_number_of_currently_executing_tasks)

# Time tasks spend waiting in queue
flower_task_prefetch_time_seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Performance Questions&lt;/strong&gt;: "How long do tasks take? Are they getting slower?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 95th percentile task duration
histogram_quantile(0.95, rate(flower_task_runtime_seconds_bucket[5m]))

# Median task duration
histogram_quantile(0.50, rate(flower_task_runtime_seconds_bucket[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Dashboard Design Philosophy specifically for celery
&lt;/h3&gt;

&lt;p&gt;Start with high-level health indicators, then provide drill-down capabilities. A good Celery dashboard answers these questions in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;System Health&lt;/strong&gt;: Are workers running? Is the system processing tasks?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: How much work are we doing? Is it increasing or decreasing?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: How fast are tasks completing? Are there performance regressions?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queue Health&lt;/strong&gt;: Are tasks backing up? Where are the bottlenecks?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Scaling Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Multiple Worker Types
&lt;/h3&gt;

&lt;p&gt;Real Celery deployments often have specialized workers for different task types. CPU-intensive tasks, I/O-bound tasks, and priority queues all need separate monitoring.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CPU-intensive work monitor&lt;/span&gt;
&lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celery"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-A"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tasks.cpu"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flower"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--port=5555"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--prometheus_metrics"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# I/O-bound work monitor  &lt;/span&gt;
&lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celery"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-A"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tasks.io"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flower"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--port=5555"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--prometheus_metrics"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each Flower instance monitors a specific Celery app, giving you granular visibility into different workload types. You'll need separate services and scrape configurations for each instance.&lt;/p&gt;

&lt;p&gt;This approach lets you set different SLAs and alerting thresholds for different workload types. Your real-time fraud detection tasks might need sub-second response times, while your batch report generation can tolerate longer delays.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resource Allocation
&lt;/h3&gt;

&lt;p&gt;Flower itself is lightweight, but its resource needs scale with worker count and task frequency. A busy system with hundreds of workers and thousands of tasks per minute will use more memory to track state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;256Mi&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500m&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Self-Hosted Prometheus
&lt;/h3&gt;

&lt;p&gt;For self-hosted setups, configure Grafana to read from your Prometheus instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# grafana datasource config&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;datasources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prometheus&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus-server.monitoring.svc.cluster.local:9090&lt;/span&gt;
    &lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This assumes Prometheus and Grafana are in the same cluster. For cross-cluster or external access, you'll need appropriate networking and authentication configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Considerations
&lt;/h2&gt;

&lt;p&gt;Production Flower deployments need proper security controls. Flower's web interface shows detailed information about your task processing, which could be sensitive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Authentication
&lt;/h3&gt;

&lt;p&gt;Enable basic authentication at minimum:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FLOWER_BASIC_AUTH&lt;/span&gt;
  &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin:your_secure_password"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For production systems, consider OAuth integration or running Flower behind an authentication proxy. Celery-exporter provides similar metrics without the web interface overhead. It's purpose-built for Prometheus integration and might use fewer resources than Flower. However, you lose Flower's web interface for ad-hoc investigation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Caution!
&lt;/h2&gt;

&lt;p&gt;Getting Celery monitoring right requires attention to several key details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;--prometheus_metrics&lt;/code&gt; flag transforms Flower from a simple web interface into a proper metrics exporter&lt;/li&gt;
&lt;li&gt;Your metrics collection method should match your infrastructure setup and operational preferences&lt;/li&gt;
&lt;li&gt;ServiceMonitor port configuration matters - &lt;code&gt;port&lt;/code&gt; references service ports, &lt;code&gt;targetPort&lt;/code&gt; references container ports&lt;/li&gt;
&lt;li&gt;Label matching between ServiceMonitors, services, and pods must be exact&lt;/li&gt;
&lt;li&gt;Your monitoring system's target discovery UI is invaluable for debugging configuration issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The setup might seem complicated, but each piece serves a specific purpose in building a robust monitoring system. Once you have this foundation, you can extend it with alerting rules, additional dashboards, and integration with your incident response workflow.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>python</category>
      <category>monitoring</category>
      <category>sitereliabilityengineering</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Soumyajyoti Mahalanobish</dc:creator>
      <pubDate>Thu, 05 Jun 2025 05:16:27 +0000</pubDate>
      <link>https://dev.to/soumyajyoti-devops/-23nj</link>
      <guid>https://dev.to/soumyajyoti-devops/-23nj</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/soumyajyoti-devops/a-young-engineers-guide-to-slis-slos-slas-2g48" class="crayons-story__hidden-navigation-link"&gt;A Young Engineer's Guide to SLIs, SLOs, SLAs&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/soumyajyoti-devops" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3138245%2Fa0497aef-784e-49d7-b94a-126c4f8fd20a.png" alt="soumyajyoti-devops profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/soumyajyoti-devops" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Soumyajyoti Mahalanobish
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Soumyajyoti Mahalanobish
                
              
              &lt;div id="story-author-preview-content-2561193" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/soumyajyoti-devops" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3138245%2Fa0497aef-784e-49d7-b94a-126c4f8fd20a.png" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Soumyajyoti Mahalanobish&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/soumyajyoti-devops/a-young-engineers-guide-to-slis-slos-slas-2g48" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jun 5 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/soumyajyoti-devops/a-young-engineers-guide-to-slis-slos-slas-2g48" id="article-link-2561193"&gt;
          A Young Engineer's Guide to SLIs, SLOs, SLAs
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/soumyajyoti-devops/a-young-engineers-guide-to-slis-slos-slas-2g48#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            16 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>This one's a long one 👀</title>
      <dc:creator>Soumyajyoti Mahalanobish</dc:creator>
      <pubDate>Thu, 05 Jun 2025 05:08:00 +0000</pubDate>
      <link>https://dev.to/soumyajyoti-devops/this-ones-a-long-one-2c03</link>
      <guid>https://dev.to/soumyajyoti-devops/this-ones-a-long-one-2c03</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/soumyajyoti-devops/a-young-engineers-guide-to-slis-slos-slas-2g48" class="crayons-story__hidden-navigation-link"&gt;A Young Engineer's Guide to SLIs, SLOs, SLAs&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/soumyajyoti-devops" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3138245%2Fa0497aef-784e-49d7-b94a-126c4f8fd20a.png" alt="soumyajyoti-devops profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/soumyajyoti-devops" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Soumyajyoti Mahalanobish
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Soumyajyoti Mahalanobish
                
              
              &lt;div id="story-author-preview-content-2561193" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/soumyajyoti-devops" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3138245%2Fa0497aef-784e-49d7-b94a-126c4f8fd20a.png" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Soumyajyoti Mahalanobish&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/soumyajyoti-devops/a-young-engineers-guide-to-slis-slos-slas-2g48" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jun 5 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/soumyajyoti-devops/a-young-engineers-guide-to-slis-slos-slas-2g48" id="article-link-2561193"&gt;
          A Young Engineer's Guide to SLIs, SLOs, SLAs
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/soumyajyoti-devops/a-young-engineers-guide-to-slis-slos-slas-2g48#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            16 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>A Young Engineer's Guide to SLIs, SLOs, SLAs</title>
      <dc:creator>Soumyajyoti Mahalanobish</dc:creator>
      <pubDate>Thu, 05 Jun 2025 05:06:16 +0000</pubDate>
      <link>https://dev.to/soumyajyoti-devops/a-young-engineers-guide-to-slis-slos-slas-2g48</link>
      <guid>https://dev.to/soumyajyoti-devops/a-young-engineers-guide-to-slis-slos-slas-2g48</guid>
      <description>&lt;p&gt;If you're an early-career engineer drowning in observability dashboards, wrestling with Prometheus queries in Grafana Explore page, and wondering what exatly lies beyond your observability stack setup, this guide is for you. This is a primer to proactive reliability engineering using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).&lt;/p&gt;

&lt;p&gt;Let's assume a fictional SaaS platform - &lt;strong&gt;PaymentPro&lt;/strong&gt;, a payment processing service - and we will learn how to implement proper step by step reliability practices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But before diving into PromQL queries and YAML configs, let's clarify what these acronyms actually mean&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLIs (Service Level Indicators)&lt;/strong&gt; are the metrics that matter to your users. Instead of CPU usage, think "percentage of payments processed successfully." They're quantitative measurements of service behavior as experienced by users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLOs (Service Level Objectives)&lt;/strong&gt; are your internal promises. "We aim to process 99.9% of payments successfully" - that's an SLO. It's what you're shooting for, not what you're contractually obligated to deliver.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLAs (Service Level Agreements)&lt;/strong&gt; are the legal promises you make to customers, usually with financial consequences. "We guarantee 99.5% payment success rate or you get service credits" - that's an SLA. Always set your SLAs looser than your SLOs to give yourself breathing room.&lt;/p&gt;

&lt;p&gt;Think of it this way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SLI = What you measure&lt;/li&gt;
&lt;li&gt;SLO = What you promise yourself&lt;/li&gt;
&lt;li&gt;SLA = What you promise customers (with lawyers involved)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The two faces of SLIs: Infrastructure vs Application
&lt;/h3&gt;

&lt;p&gt;Here's what many engineers miss: SLIs exist at multiple layers. You need both infrastructure SLIs and application functionality SLIs to get the complete picture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure SLIs&lt;/strong&gt; answer the question: "Is my service reachable and responsive?"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API endpoint availability&lt;/li&gt;
&lt;li&gt;Response time&lt;/li&gt;
&lt;li&gt;Error rates&lt;/li&gt;
&lt;li&gt;Throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Application Functionality SLIs&lt;/strong&gt; answer the question: "Is my service doing what users expect?"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business transactions completed successfully&lt;/li&gt;
&lt;li&gt;Data consistency maintained&lt;/li&gt;
&lt;li&gt;User workflows functioning correctly&lt;/li&gt;
&lt;li&gt;Business rules properly enforced&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For PaymentPro, here's the difference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Infrastructure SLI: "Can users reach the payment API?"
sum(rate(http_requests_total{status!~"5.."}[5m])) / 
sum(rate(http_requests_total[5m]))

# Application SLI: "Can users actually complete payments?"
sum(rate(payments_completed_total{status="success"}[5m])) /
sum(rate(payments_initiated_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API might be up (infrastructure ✓) but payments could be failing due to a third-party service issue (application ✗).&lt;/p&gt;

&lt;h2&gt;
  
  
  The SLI discovery process: Just ask questions
&lt;/h2&gt;

&lt;p&gt;Before you write a single PromQL query, you need to understand what actually matters to your users and business. Here's a framework for discovering the right SLIs through conversations with different stakeholders.&lt;/p&gt;

&lt;h3&gt;
  
  
  Questions for product folks
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;"What does success look like for a user of this service?"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expected answer: "Users can process payments quickly and reliably"&lt;/li&gt;
&lt;li&gt;Your follow-up: "Define 'quickly' - is that 200ms? 2 seconds? What's the threshold where users complain?"&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;"What are the critical user journeys?"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Map out step-by-step what users do&lt;/li&gt;
&lt;li&gt;Identify which steps are mandatory vs optional&lt;/li&gt;
&lt;li&gt;Understand dependencies between steps&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;"What complaints do we get from users?"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Look for patterns: "payments are slow" → latency SLI&lt;/li&gt;
&lt;li&gt;"I can't see my payment history" → data availability SLI&lt;/li&gt;
&lt;li&gt;"payments sometimes disappear" → consistency SLI&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;"What would make a customer leave us for a competitor?"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This reveals your true SLAs&lt;/li&gt;
&lt;li&gt;Often different from what engineering thinks&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Questions to developers
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;"What keeps you up at night about this service?"&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Developer: "The payment webhook system. If it fails, merchants don't get notifications."
   You: "How often does it fail? How do we know when it fails?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;"What are the critical code paths?"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have them trace through the code&lt;/li&gt;
&lt;li&gt;Identify external dependencies&lt;/li&gt;
&lt;li&gt;Understand retry logic and failure modes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;"What metrics do you check during an incident?"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;These are your SLI candidates&lt;/li&gt;
&lt;li&gt;Ask: "Why this metric and not others?"&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;"What's the difference between the service being 'up' vs 'working correctly'?"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This reveals the gap between infrastructure and application SLIs&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Questions for customer support
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;"What are the top 3 issues customers report?"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quantify: how many tickets per week?&lt;/li&gt;
&lt;li&gt;Severity: how angry are customers?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;"How do you know when something is wrong before customers complain?"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They often have informal monitoring&lt;/li&gt;
&lt;li&gt;These instincts can become SLIs&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The metric audit worksheet
&lt;/h3&gt;

&lt;p&gt;Create a spreadsheet during discovery:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;User Impact&lt;/th&gt;
&lt;th&gt;Measurable?&lt;/th&gt;
&lt;th&gt;Candidate?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Payment success rate&lt;/td&gt;
&lt;td&gt;Devs&lt;/td&gt;
&lt;td&gt;Application&lt;/td&gt;
&lt;td&gt;Direct - payment fails&lt;/td&gt;
&lt;td&gt;Yes - payment_status&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API latency&lt;/td&gt;
&lt;td&gt;Ops&lt;/td&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Indirect - UX degraded&lt;/td&gt;
&lt;td&gt;Yes - histogram&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database replication lag&lt;/td&gt;
&lt;td&gt;Ops&lt;/td&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Indirect - stale data&lt;/td&gt;
&lt;td&gt;Yes - seconds_behind&lt;/td&gt;
&lt;td&gt;Maybe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Background job queue depth&lt;/td&gt;
&lt;td&gt;Devs&lt;/td&gt;
&lt;td&gt;Application&lt;/td&gt;
&lt;td&gt;Delayed - notifications late&lt;/td&gt;
&lt;td&gt;Yes - queue_size&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory usage&lt;/td&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;None until OOM&lt;/td&gt;
&lt;td&gt;Yes - percentage&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The SLI selection framework
&lt;/h3&gt;

&lt;p&gt;Not every metric should become an SLI. Here's a decision framework to evaluate each candidate:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Does it have user impact?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users directly experience this when it fails&lt;/li&gt;
&lt;li&gt;It's purely internal with no user visibility&lt;/li&gt;
&lt;li&gt;&lt;em&gt;If no → Stop here, it's not an SLI&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Can we measure it reliably?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We have consistent, accurate data&lt;/li&gt;
&lt;li&gt;Data is sporadic, estimated, or manually collected&lt;/li&gt;
&lt;li&gt;&lt;em&gt;If no → Stop here, find a measurable proxy&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Is it actionable?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When it degrades, we know what to fix&lt;/li&gt;
&lt;li&gt;It's just an interesting number with no clear response&lt;/li&gt;
&lt;li&gt;&lt;em&gt;If no → Stop here, it won't drive improvements&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Is it a cause or just a symptom?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's the root cause of user issues&lt;/li&gt;
&lt;li&gt;It's a symptom of deeper problems&lt;/li&gt;
&lt;li&gt;&lt;em&gt;If symptom → Consider tracking the underlying cause instead&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Does it cover critical user journeys?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It represents core functionality users depend on&lt;/li&gt;
&lt;li&gt;It's a nice-to-have feature&lt;/li&gt;
&lt;li&gt;&lt;em&gt;If yes → Strong SLI candidate!&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How this looks in practice:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Many teams codify this decision tree for consistency
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_be_sli&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 1: User impact
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;has_user_impact&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No direct user impact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: Measurability
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_measurable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cannot measure reliably&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: Actionability
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_actionable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No clear action when it degrades&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 4: Is it a symptom or a cause?
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_symptom&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Symptoms can be indicators, but prefer causes
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Maybe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Consider the underlying cause instead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 5: Coverage
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;covers_critical_user_journey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Covers critical functionality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Maybe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Evaluate against other candidates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Real-world example: Payment service SLI discovery
&lt;/h3&gt;

&lt;p&gt;Let's walk through discovering SLIs for our PaymentPro service:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session with Product Manager:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: "What does success look like for payment processing?"
PM: "Merchants can accept payments 24/7, funds arrive quickly, and they get real-time notifications."

You: "Let's break that down. What does '24/7' mean exactly?"
PM: "The API should always accept payment attempts. Even if a payment fails due to insufficient funds, the API itself should respond."

You: "How quickly should funds arrive?"
PM: "For card payments, authorization within 2 seconds. Settlement is daily batch, not real-time."

You: "What about notifications?"
PM: "Webhooks should fire within 30 seconds of payment completion. Merchants build their whole flow around these."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Extracted SLI candidates:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;API availability (infrastructure)&lt;/li&gt;
&lt;li&gt;Payment authorization latency (application)&lt;/li&gt;
&lt;li&gt;Webhook delivery time (application)&lt;/li&gt;
&lt;li&gt;Webhook delivery success rate (application)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Session with Lead Developer:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: "Walk me through a payment request."
Dev: "Request hits API gateway → auth service → fraud check → payment processor → webhook queue → notification service."

You: "What can go wrong at each step?"
Dev: "Gateway: rare issues. Auth: tokens expire. Fraud check: sometimes times out. Payment processor: most failures here. Webhook: queue can back up."

You: "How do you know the payment actually processed?"
Dev: "We check the payment_status in the database, but also reconcile with the processor daily."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Additional SLI candidates:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Payment reconciliation accuracy (application)&lt;/li&gt;
&lt;li&gt;Fraud check availability (infrastructure)&lt;/li&gt;
&lt;li&gt;Queue processing time (application)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Categorizing your SLIs
&lt;/h3&gt;

&lt;p&gt;Once you have candidates, organize them into three main categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure SLIs&lt;/strong&gt; - The foundation layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;API Availability&lt;/strong&gt;: Is the payment API responding to requests?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metric: HTTP request success rate&lt;/li&gt;
&lt;li&gt;Threshold: Less than 0.1% 5xx errors&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;API Latency&lt;/strong&gt;: How fast does the API respond?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metric: Response time distribution&lt;/li&gt;
&lt;li&gt;Threshold: 95th percentile under 200ms&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Application SLIs&lt;/strong&gt; - The business logic layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Payment Success Rate&lt;/strong&gt;: Are payments completing successfully?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metric: Ratio of completed vs initiated payments&lt;/li&gt;
&lt;li&gt;Threshold: 99.9% success rate&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Webhook Delivery&lt;/strong&gt;: Are merchants getting notifications?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metric: Time from payment to webhook delivery&lt;/li&gt;
&lt;li&gt;Threshold: 99th percentile under 30 seconds&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Payment Reconciliation&lt;/strong&gt;: Do our records match the payment processor?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metric: Daily reconciliation mismatches&lt;/li&gt;
&lt;li&gt;Threshold: Zero tolerance for mismatches&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business SLIs&lt;/strong&gt; - The customer experience layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Merchant Activation Time&lt;/strong&gt;: How quickly can new merchants start accepting payments?

&lt;ul&gt;
&lt;li&gt;Metric: Time from signup to first successful payment&lt;/li&gt;
&lt;li&gt;Threshold: 90% activated within 24 hours&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How teams typically document this:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# sli-inventory.yaml - Many teams maintain their SLI catalog in version control&lt;/span&gt;
&lt;span class="na"&gt;sli_inventory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;infrastructure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api_availability&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payment&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;responding&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;requests"&lt;/span&gt;
      &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_requests_total"&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;500"&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api_latency&lt;/span&gt;  
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;time"&lt;/span&gt;
      &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_request_duration_seconds"&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p95&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;200ms"&lt;/span&gt;

  &lt;span class="na"&gt;application&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment_success_rate&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payments&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;successfully"&lt;/span&gt;
      &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_status"&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'completed'"&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;webhook_delivery&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Webhooks&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;delivered&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;within&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SLA"&lt;/span&gt;
      &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;webhook_delivery_duration_seconds"&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p99&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;30s"&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment_reconciliation&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payment&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;match&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;processor"&lt;/span&gt;
      &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reconciliation_mismatches_total"&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mismatches&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;

  &lt;span class="na"&gt;business&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;merchant_activation_time&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Time&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;signup&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;first&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;payment"&lt;/span&gt;
      &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;merchant_activation_hours"&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p90&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;24h"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This structured format makes it easy to review SLIs with stakeholders and import into monitoring tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  The metric instrumentation checklist
&lt;/h3&gt;

&lt;p&gt;For each chosen SLI, ensure you can answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What's the exact definition?&lt;/strong&gt; (e.g., "successful payment" means status=completed AND reconciled=true)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where is it measured?&lt;/strong&gt; (at the API gateway? In the application? At the database?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What are the edge cases?&lt;/strong&gt; (retries? partial failures? timeouts?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How do we handle missing data?&lt;/strong&gt; (gaps in metrics should fail open or closed?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's the unit and precision?&lt;/strong&gt; (milliseconds? percentage to 2 decimal places?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can we simulate failures?&lt;/strong&gt; (for testing alerts and dashboards)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Choosing meaningful SLIs for your SaaS
&lt;/h2&gt;

&lt;p&gt;Now that we've discovered what matters through stakeholder interviews, let's implement both infrastructure and application SLIs for PaymentPro.&lt;/p&gt;

&lt;h3&gt;
  
  
  The four golden signals (with both perspectives)
&lt;/h3&gt;

&lt;p&gt;Google's SRE book identifies four golden signals. Let's implement each from both infrastructure and application viewpoints:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Latency
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure perspective&lt;/strong&gt;: How fast does the API respond?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Infrastructure latency - API response time
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Application perspective&lt;/strong&gt;: How fast do payments complete end-to-end?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Application latency - Full payment processing time
histogram_quantile(0.95,
  sum(rate(payment_processing_duration_seconds_bucket[5m])) by (le)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Traffic
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure perspective&lt;/strong&gt;: How many requests are we receiving?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Infrastructure traffic - Requests per second
sum(rate(http_requests_total[5m])) by (service)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Application perspective&lt;/strong&gt;: How many business transactions are happening?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Application traffic - Payment attempts per minute
sum(rate(payment_attempts_total[1m])) by (payment_type, merchant_tier)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Errors
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure perspective&lt;/strong&gt;: HTTP errors&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Infrastructure errors - 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / 
sum(rate(http_requests_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Application perspective&lt;/strong&gt;: Business logic failures&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Application errors - Payment failures (including valid declines)
sum(rate(payment_attempts_total{result!="success"}[5m])) by (failure_reason) /
sum(rate(payment_attempts_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4. Saturation
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure perspective&lt;/strong&gt;: Resource utilization&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Infrastructure saturation - Database connection pool
(db_connections_active / db_connections_max) &amp;gt; 0.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Application perspective&lt;/strong&gt;: Business capacity limits&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Application saturation - Merchant transaction limits
(sum(rate(payment_volume_dollars[1h])) by (merchant_id) / 
 merchant_transaction_limit) &amp;gt; 0.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Implementing comprehensive SLIs
&lt;/h3&gt;

&lt;p&gt;Let's create a complete SLI implementation that captures both layers. For each SLI, we need to define:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For ratio-based SLIs (like availability):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Good events&lt;/strong&gt;: Requests that succeeded&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total events&lt;/strong&gt;: All requests attempted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For threshold-based SLIs (like latency):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Threshold metric&lt;/strong&gt;: The specific boundary we're measuring against&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's how to structure your SLIs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure SLIs:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;API Availability&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Good events: All non-5xx responses&lt;/li&gt;
&lt;li&gt;Total events: All HTTP requests&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;API Latency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Threshold: 95th percentile &amp;lt; 200ms&lt;/li&gt;
&lt;li&gt;Measured at: Load balancer or API gateway&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Application SLIs:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Payment Success Rate&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Good events: Payments with status "completed"&lt;/li&gt;
&lt;li&gt;Total events: All payment attempts&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Payment Consistency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Good events: Reconciliation checks that match&lt;/li&gt;
&lt;li&gt;Total events: All reconciliation checks&lt;/li&gt;
&lt;li&gt;Note: Run every 30 minutes for accuracy&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Webhook Delivery&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Good events: Webhooks delivered within 30s&lt;/li&gt;
&lt;li&gt;Total events: All webhook attempts&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;In practice, teams often define these in a structured format for their SLO tools:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# payment-service-slis.yaml&lt;/span&gt;
&lt;span class="c1"&gt;# This format works with tools like Sloth, Pyrra, or OpenSLO&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-processor&lt;/span&gt;
&lt;span class="na"&gt;slis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Infrastructure SLIs&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api_availability&lt;/span&gt;
    &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;infrastructure&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;reachability"&lt;/span&gt;
    &lt;span class="na"&gt;implementation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;good_events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;sum(rate(http_requests_total{status!~"5.."}[5m]))&lt;/span&gt;
      &lt;span class="na"&gt;total_events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;sum(rate(http_requests_total[5m]))&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api_latency&lt;/span&gt;
    &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;infrastructure&lt;/span&gt;  
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;95th&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;percentile"&lt;/span&gt;
    &lt;span class="na"&gt;implementation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;threshold_metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;histogram_quantile(0.95,&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(http_request_duration_seconds_bucket[5m])) by (le)&lt;/span&gt;
        &lt;span class="s"&gt;) &amp;lt; bool 0.2  # 200ms threshold&lt;/span&gt;

  &lt;span class="c1"&gt;# Application SLIs&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment_success_rate&lt;/span&gt;
    &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;application&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successful&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;completion&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate"&lt;/span&gt;
    &lt;span class="na"&gt;implementation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;good_events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;sum(rate(payments_total{status="completed"}[5m]))&lt;/span&gt;
      &lt;span class="na"&gt;total_events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;sum(rate(payments_total[5m]))&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment_consistency&lt;/span&gt;
    &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;application&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payment&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;consistency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;processor"&lt;/span&gt;
    &lt;span class="na"&gt;implementation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;good_events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;sum(rate(payment_reconciliation_matches_total[30m]))&lt;/span&gt;
      &lt;span class="na"&gt;total_events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;sum(rate(payment_reconciliation_checks_total[30m]))&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;webhook_delivery_sli&lt;/span&gt;
    &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;application&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Webhooks&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;delivered&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;within&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;30&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;seconds"&lt;/span&gt;
    &lt;span class="na"&gt;implementation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;good_events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;sum(rate(webhook_deliveries_total{delivered_within_sla="true"}[5m]))&lt;/span&gt;
      &lt;span class="na"&gt;total_events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;sum(rate(webhook_deliveries_total[5m]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This structured approach ensures consistency across teams and makes it easy to generate dashboards and alerts automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advanced application-level SLIs
&lt;/h3&gt;

&lt;p&gt;Some SLIs require custom business logic to calculate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Instrumenting application-specific SLIs in your code&lt;/span&gt;
&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;

&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c"&gt;// Business transaction SLI&lt;/span&gt;
    &lt;span class="n"&gt;paymentCompleteness&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewCounterVec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CounterOpts&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"payment_business_completeness_total"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Help&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Tracks if payment completed all business requirements"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"complete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"missing_step"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Data quality SLI&lt;/span&gt;
    &lt;span class="n"&gt;paymentDataQuality&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewGaugeVec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GaugeOpts&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"payment_data_quality_score"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Help&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Score indicating payment data completeness"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"merchant_id"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;ProcessPayment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Payment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Track infrastructure metric&lt;/span&gt;
    &lt;span class="n"&gt;startTime&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;httpDuration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;startTime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Seconds&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;

    &lt;span class="c"&gt;// Business logic&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;validatePayment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Process payment...&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;processWithProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Track application metrics&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Success&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Check business completeness&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasReceipt&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasWebhook&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsReconciled&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;paymentCompleteness&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithLabelValues&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"none"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Inc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;missing&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;identifyMissingSteps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;paymentCompleteness&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithLabelValues&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Inc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c"&gt;// Calculate data quality score&lt;/span&gt;
        &lt;span class="n"&gt;qualityScore&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;calculateDataQuality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;paymentDataQuality&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithLabelValues&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MerchantID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qualityScore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Helper to calculate business-specific quality metrics&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;calculateDataQuality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Payment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0.0&lt;/span&gt;
    &lt;span class="n"&gt;maxScore&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;5.0&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasFullAddress&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasValidEmail&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasPhoneNumber&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FraudScoreAvailable&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Has3DSecure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;maxScore&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Composite SLIs for complex user journeys
&lt;/h3&gt;

&lt;p&gt;Real user experiences often span multiple services. Here's how to create SLIs that reflect this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# End-to-end payment flow SLI
# User can: authenticate → create payment → receive confirmation
(
  # Auth service available
  avg_over_time(sli:auth_availability:5m[30s]) *
  # Payment service can process
  avg_over_time(sli:payment_success_rate:5m[30s]) *
  # Notification service delivers
  avg_over_time(sli:notification_delivery:5m[30s])
) &amp;gt; 0.995  # All three must work 99.5% of the time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Choosing between infrastructure and application SLIs
&lt;/h3&gt;

&lt;p&gt;Use this decision matrix:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Infrastructure SLI&lt;/th&gt;
&lt;th&gt;Application SLI&lt;/th&gt;
&lt;th&gt;Both&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API is up but returning empty data&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Payments process but webhooks fail&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database is slow&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Best&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Third-party API degraded&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory leak causing OOM&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business logic bug&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Better&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Pro tip: Start with infrastructure SLIs (easier to implement) but quickly add application SLIs (more meaningful).&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting realistic SLOs (and not shooting yourself in the foot)
&lt;/h2&gt;

&lt;p&gt;Here's where many teams mess up: they set SLOs based on current system performance rather than user needs. Just because your system &lt;em&gt;can&lt;/em&gt; deliver 99.99% availability doesn't mean you &lt;em&gt;should&lt;/em&gt; promise it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The cost of nines
&lt;/h3&gt;

&lt;p&gt;Let's do some &lt;strong&gt;error budget&lt;/strong&gt; math. Error budget is the amount of unreliability you're allowed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error Budget = 100% - SLO Target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For different SLO targets over 30 days:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;99% = 7.2 hours of downtime allowed&lt;/li&gt;
&lt;li&gt;99.9% = 43.2 minutes of downtime allowed
&lt;/li&gt;
&lt;li&gt;99.95% = 21.6 minutes of downtime allowed&lt;/li&gt;
&lt;li&gt;99.99% = 4.32 minutes of downtime allowed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each additional nine roughly costs 10x more in engineering effort. That jump from 99.9% to 99.99%? That's the difference between "we can deploy during business hours" and "every deployment requires a war room."&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical SLO setting
&lt;/h3&gt;

&lt;p&gt;For PaymentPro, let's be smart about our SLOs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# payment-service-slos.yaml&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-service&lt;/span&gt;
&lt;span class="na"&gt;slos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-availability&lt;/span&gt;
    &lt;span class="na"&gt;sli&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Percentage&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;successful&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;calls"&lt;/span&gt;
    &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
        &lt;span class="na"&gt;objective&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;99.9&lt;/span&gt;  &lt;span class="c1"&gt;# 43 minutes of error budget per month&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;  
        &lt;span class="na"&gt;objective&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;99.0&lt;/span&gt;  &lt;span class="c1"&gt;# More relaxed for testing&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-latency&lt;/span&gt;
    &lt;span class="na"&gt;sli&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;95th&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;percentile&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;time"&lt;/span&gt;
    &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
        &lt;span class="na"&gt;objective&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;200ms&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;
        &lt;span class="na"&gt;objective&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implementing error budgets that actually work
&lt;/h2&gt;

&lt;p&gt;Error budgets transform the traditional dev vs. ops conflict into a shared objective. Here's the revolutionary idea: &lt;strong&gt;unreliability is a feature, not a bug&lt;/strong&gt;. You "spend" your error budget on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feature deployments&lt;/li&gt;
&lt;li&gt;Infrastructure migrations&lt;/li&gt;
&lt;li&gt;Experiments&lt;/li&gt;
&lt;li&gt;Accepted risk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the budget is exhausted, you stop feature work and focus on reliability. It's that simple.&lt;/p&gt;

&lt;p&gt;Let's implement error budget tracking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Error budget remaining (30-day window)
# Assuming 99.9% SLO
(
  (0.001) -  # Total budget (1 - 0.999)
  (1 - (sum(rate(payment_requests_total{status!~"5.."}[30d])) / 
        sum(rate(payment_requests_total[30d]))))
) / 0.001 * 100  # Percentage of budget remaining
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Burn rate alerts - Your new best friend
&lt;/h3&gt;

&lt;p&gt;Traditional threshold alerts are like smoke detectors that only go off when the house is already on fire. Burn rate alerts warn you when you're consuming error budget too quickly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-slo-alerts&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Alert if we're burning budget 10x too fast&lt;/span&gt;
      &lt;span class="c1"&gt;# This would exhaust the monthly budget in 3 days&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PaymentHighErrorBudgetBurn&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;# 1-hour burn rate&lt;/span&gt;
            &lt;span class="s"&gt;(1 - sum(rate(payment_requests_total{status!~"5.."}[1h])) / &lt;/span&gt;
                 &lt;span class="s"&gt;sum(rate(payment_requests_total[1h]))) &amp;gt; (10 * 0.001)&lt;/span&gt;
          &lt;span class="s"&gt;) and (&lt;/span&gt;
            &lt;span class="s"&gt;# 5-minute burn rate (to avoid flapping)&lt;/span&gt;
            &lt;span class="s"&gt;(1 - sum(rate(payment_requests_total{status!~"5.."}[5m])) / &lt;/span&gt;
                 &lt;span class="s"&gt;sum(rate(payment_requests_total[5m]))) &amp;gt; (10 * 0.001)&lt;/span&gt;
          &lt;span class="s"&gt;)&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
          &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payment&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;burning&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;budget&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;10x&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;too&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fast"&lt;/span&gt;
          &lt;span class="na"&gt;dashboard&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://grafana.paymentpro.com/d/slo-payment"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The magic of multi-window burn rates: combining short (5m) and long (1h) windows prevents alert flapping while ensuring quick detection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recording rules for SLI efficiency
&lt;/h2&gt;

&lt;p&gt;Calculating SLIs on every dashboard load is expensive. Recording rules pre-calculate and store results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sli_recording_rules&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# 5-minute payment success rate&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sli:payment_success_rate:5m&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(payment_requests_total{status!~"5.."}[5m])) by (service, environment) /&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(payment_requests_total[5m])) by (service, environment)&lt;/span&gt;

      &lt;span class="c1"&gt;# 30-day payment success rate for error budget&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sli:payment_success_rate:30d&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(payment_requests_total{status!~"5.."}[30d])) by (service, environment) /&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(payment_requests_total[30d])) by (service, environment)&lt;/span&gt;

      &lt;span class="c1"&gt;# 95th percentile latency&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sli:payment_latency_p95:5m&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;histogram_quantile(0.95,&lt;/span&gt;
            &lt;span class="s"&gt;sum(rate(payment_request_duration_seconds_bucket[5m])) by (service, environment, le)&lt;/span&gt;
          &lt;span class="s"&gt;)&lt;/span&gt;

      &lt;span class="c1"&gt;# Error budget consumption rate&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error_budget:payment:consumption_rate&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;(1 - sli:payment_success_rate:5m) / 0.001  # 0.001 = 1 - 0.999 SLO&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Aggregating service SLOs into product SLOs
&lt;/h2&gt;

&lt;p&gt;PaymentPro isn't just one service - it's multiple microservices working together. How do we create product-level SLOs?&lt;/p&gt;

&lt;h3&gt;
  
  
  Critical path aggregation
&lt;/h3&gt;

&lt;p&gt;For user journeys that require multiple services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# User can complete payment only if ALL services work
min(
  sli:payment_success_rate:5m{service="auth"},
  sli:payment_success_rate:5m{service="fraud-check"},
  sli:payment_success_rate:5m{service="payment-processor"},
  sli:payment_success_rate:5m{service="notification"}
) by (environment)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Weighted aggregation
&lt;/h3&gt;

&lt;p&gt;When services have different importance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Weighted by request volume
sum(
  sli:payment_success_rate:5m * 
  sum(rate(payment_requests_total[5m])) by (service)
) / sum(sum(rate(payment_requests_total[5m])) by (service))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Building a complete SLO hierarchy
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# product-slos.yaml&lt;/span&gt;
&lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PaymentPro&lt;/span&gt;
&lt;span class="na"&gt;components&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auth&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
    &lt;span class="na"&gt;criticality&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high&lt;/span&gt;
    &lt;span class="na"&gt;slo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;99.95&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-processor&lt;/span&gt;  
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.5&lt;/span&gt;
    &lt;span class="na"&gt;criticality&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
    &lt;span class="na"&gt;slo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;99.9&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notification&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.2&lt;/span&gt;  
    &lt;span class="na"&gt;criticality&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;medium&lt;/span&gt;
    &lt;span class="na"&gt;slo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;99.5&lt;/span&gt;

&lt;span class="na"&gt;product_slo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;availability&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;99.9&lt;/span&gt;  &lt;span class="c1"&gt;# Calculated from components&lt;/span&gt;
  &lt;span class="na"&gt;latency_p95&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;300ms&lt;/span&gt;  &lt;span class="c1"&gt;# End-to-end latency&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Practical implementation with sloth
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/slok/sloth" rel="noopener noreferrer"&gt;Sloth&lt;/a&gt; generates Prometheus rules from simple SLO definitions. Here's a complete example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# sloth-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prometheus/v1"&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment-processor"&lt;/span&gt;
&lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payments"&lt;/span&gt;
  &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;paymentpro"&lt;/span&gt;

&lt;span class="na"&gt;slos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment-requests-availability"&lt;/span&gt;
    &lt;span class="na"&gt;objective&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;99.9&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PaymentPro&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;availability"&lt;/span&gt;
    &lt;span class="na"&gt;sli&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;error_query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(payment_requests_total{job="payment-api",code=~"5.."}[5m]))&lt;/span&gt;
        &lt;span class="na"&gt;total_query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(payment_requests_total{job="payment-api"}[5m]))&lt;/span&gt;
    &lt;span class="na"&gt;alerting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PaymentAvailabilityAlert&lt;/span&gt;
      &lt;span class="na"&gt;page_alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
      &lt;span class="na"&gt;ticket_alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;  
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
    &lt;span class="na"&gt;windows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30m&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1h&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2h&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;6h&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1d&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3d&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate Prometheus rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sloth generate &lt;span class="nt"&gt;-i&lt;/span&gt; sloth-config.yaml &lt;span class="nt"&gt;-o&lt;/span&gt; prometheus-rules.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Moving from reactive to proactive
&lt;/h2&gt;

&lt;p&gt;Here's pseudocode for implementing SLO-based decision making:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_deploy_new_feature&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;error_budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_error_budget_remaining&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;error_budget&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Less than 10% remaining
&lt;/span&gt;        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error budget critical - no deployments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;error_budget&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Less than 30% remaining  
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_critical_security_fix&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error budget low - only critical fixes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

    &lt;span class="c1"&gt;# Healthy error budget
&lt;/span&gt;    &lt;span class="n"&gt;risk_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_deployment_risk&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;budget_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;estimate_error_budget_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;risk_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;budget_cost&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;error_budget&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Use max 10% of remaining
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;requires_architecture_review&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implementing SLO-based on-call
&lt;/h2&gt;

&lt;p&gt;Implement your on-call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Alert routing based on burn rate&lt;/span&gt;
&lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;group_by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alertname'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;team'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;routes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Critical: Will exhaust budget in &amp;lt; 6 hours&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
        &lt;span class="na"&gt;burn_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high&lt;/span&gt;
      &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pagerduty-immediate&lt;/span&gt;

    &lt;span class="c1"&gt;# Warning: Will exhaust budget in &amp;lt; 3 days  &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
        &lt;span class="na"&gt;burn_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;medium&lt;/span&gt;
      &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slack-channel&lt;/span&gt;

    &lt;span class="c1"&gt;# Info: Budget consumption above normal&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;info&lt;/span&gt;
        &lt;span class="na"&gt;burn_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;low&lt;/span&gt;
      &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;email-daily-digest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Negotiating SLAs that don't bankrupt you
&lt;/h2&gt;

&lt;p&gt;When sales promises 99.99% availability without asking engineering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_sla_penalty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sla_target&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Real-world SLA penalty calculation
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;sla_target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="n"&gt;breach_severity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sla_target&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;breach_severity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Minor breach
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;monthly_revenue&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;breach_severity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Major breach
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;monthly_revenue&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt;  
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Severe breach
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;monthly_revenue&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;realistic_sla_target&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;internal_slo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Set SLA based on SLO with safety margin
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;safety_margin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;internal_slo&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;internal_slo&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;safety_margin&lt;/span&gt;

&lt;span class="c1"&gt;# Example: 99.9% SLO -&amp;gt; 99.5% SLA
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Advanced patterns for mature teams
&lt;/h2&gt;

&lt;h3&gt;
  
  
  SLO-based capacity planning
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Predict when you'll hit capacity based on growth
predict_linear(
  sum(rate(payment_requests_total[1d])) by (service)[30d:1d],
  86400 * 30  # 30 days forward
) &amp;gt; bool 1000  # Capacity limit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Composite SLIs for complex user journeys
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Complete payment flow SLI
(
  # User can log in
  sli:auth_success_rate:5m *
  # Payment is processed
  sli:payment_success_rate:5m *  
  # User receives confirmation
  sli:notification_delivery_rate:5m *
  # All within acceptable time
  (sli:end_to_end_latency_p95:5m &amp;lt; bool 2.0)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Error budget policies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;error_budget_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;triggers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;budget_remaining&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;75%&lt;/span&gt;
      &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;normal_deployments&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;feature_experiments&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;budget_remaining&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;50%&lt;/span&gt;
      &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deployments_require_approval&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;no_experiments&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;increase_testing&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;budget_remaining&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;25%&lt;/span&gt;
      &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;only_critical_fixes&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mandatory_postmortems&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;reliability_sprint_planning&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;budget_remaining&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10%&lt;/span&gt;
      &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deployment_freeze&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;all_hands_reliability&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;executive_escalation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Putting it all together: Your SLO journey
&lt;/h2&gt;

&lt;p&gt;Week 1-2: &lt;strong&gt;Instrument your code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Add metrics to your service&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;requestDuration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewHistogramVec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HistogramOpts&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"payment_request_duration_seconds"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Help&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Payment request duration"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Buckets&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="m"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;processPayment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"success"&lt;/span&gt;

    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Seconds&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;requestDuration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithLabelValues&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;

    &lt;span class="c"&gt;// Payment logic here&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Week 3-4: &lt;strong&gt;Define your first SLOs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with availability (it's easiest)&lt;/li&gt;
&lt;li&gt;Add latency once you understand percentiles&lt;/li&gt;
&lt;li&gt;Don't overthink the targets - you'll adjust them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Week 5-6: &lt;strong&gt;Implement error budgets&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create dashboards showing budget consumption&lt;/li&gt;
&lt;li&gt;Set up burn rate alerts&lt;/li&gt;
&lt;li&gt;Practice saying "no" when budget is low&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Week 7-8: &lt;strong&gt;Automate policies&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integrate with CI/CD&lt;/li&gt;
&lt;li&gt;Create runbooks for budget exhaustion&lt;/li&gt;
&lt;li&gt;Train the team on the new process&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Month 3: &lt;strong&gt;Iterate and improve&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review SLO targets against user feedback&lt;/li&gt;
&lt;li&gt;Adjust based on business needs&lt;/li&gt;
&lt;li&gt;Consider product-level SLOs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The SLI validation framework
&lt;/h2&gt;

&lt;p&gt;Before committing to an SLI, validate it meets these criteria:&lt;/p&gt;

&lt;h3&gt;
  
  
  The VALID test for SLIs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_sli_candidate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    V - Valuable to users
    A - Actionable by the team  
    L - Logically complete
    I - Implementable with current tools
    D - Defensible to stakeholders
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Valuable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;does_metric_impact_user_experience&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Actionable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;can_team_improve_this_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Logically_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;does_metric_cover_all_failure_modes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Implementable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;can_we_measure_this_accurately&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Defensible&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;can_we_explain_why_this_matters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Strong SLI candidate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Consider with modifications&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Not suitable as SLI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example validation process
&lt;/h3&gt;

&lt;p&gt;Let's validate two potential SLIs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Candidate 1: CPU Usage &amp;lt; 80%&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Valuable? Users don't care about CPU&lt;/li&gt;
&lt;li&gt;Actionable? Can optimize code or scale&lt;/li&gt;
&lt;li&gt;Logically complete? High CPU doesn't always mean problems&lt;/li&gt;
&lt;li&gt;Implementable? Easy to measure&lt;/li&gt;
&lt;li&gt;Defensible? Hard to justify to business
&lt;strong&gt;Verdict&lt;/strong&gt;: Not suitable (2/5)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Candidate 2: Payment success rate &amp;gt; 99.9%&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Valuable? Direct user impact&lt;/li&gt;
&lt;li&gt;Actionable? Can fix bugs, improve retry logic&lt;/li&gt;
&lt;li&gt;Logically complete? Covers all payment failures&lt;/li&gt;
&lt;li&gt;Implementable? Clear success/failure states&lt;/li&gt;
&lt;li&gt;Defensible? Easy to explain revenue impact
&lt;strong&gt;Verdict&lt;/strong&gt;: Strong SLI candidate (5/5)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The SLI hierarchy of needs
&lt;/h3&gt;

&lt;p&gt;Not all SLIs are created equal. Prioritize them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 1: Core Functionality (Must have)
├── Service reachable (availability)
├── Basic operations work (success rate)
└── Acceptable performance (latency)

Level 2: Data Integrity (Should have)
├── Data consistency
├── Durability guarantees
└── Reconciliation accuracy

Level 3: User Experience (Nice to have)
├── Feature completeness
├── Cross-service workflows
└── Advanced functionality
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Implement Level 1 SLIs first, then expand based on user feedback and incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your next steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start the conversation&lt;/strong&gt; - Schedule interviews with product, development, and support teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit your metrics&lt;/strong&gt; - List everything you currently measure and categorize as infrastructure vs application&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick one service&lt;/strong&gt; - Don't boil the ocean&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement both perspectives&lt;/strong&gt; - Start with infrastructure SLIs, quickly add application SLIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure current performance&lt;/strong&gt; - You need a baseline for both layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set conservative SLOs&lt;/strong&gt; - You can always tighten later&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement burn rate alerts&lt;/strong&gt; - Replace those threshold alerts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track error budgets&lt;/strong&gt; - Make them visible to everyone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate based on reality&lt;/strong&gt; - SLOs are living documents&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Remember: Perfect is the enemy of good. Your first SLOs will be wrong, and that's okay. The goal isn't perfection; it's continuous improvement based on user outcomes rather than system metrics.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Soumyajyoti Mahalanobish</dc:creator>
      <pubDate>Wed, 14 May 2025 16:55:53 +0000</pubDate>
      <link>https://dev.to/soumyajyoti-devops/-4d9b</link>
      <guid>https://dev.to/soumyajyoti-devops/-4d9b</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/soumyajyoti-devops/managing-shared-environments-with-grace-2f5i" class="crayons-story__hidden-navigation-link"&gt;Managing Shared Environments with Grace&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/soumyajyoti-devops" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3138245%2Fa0497aef-784e-49d7-b94a-126c4f8fd20a.png" alt="soumyajyoti-devops profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/soumyajyoti-devops" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Soumyajyoti Mahalanobish
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Soumyajyoti Mahalanobish
                
              
              &lt;div id="story-author-preview-content-2488502" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/soumyajyoti-devops" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3138245%2Fa0497aef-784e-49d7-b94a-126c4f8fd20a.png" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Soumyajyoti Mahalanobish&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/soumyajyoti-devops/managing-shared-environments-with-grace-2f5i" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;May 14 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/soumyajyoti-devops/managing-shared-environments-with-grace-2f5i" id="article-link-2488502"&gt;
          Managing Shared Environments with Grace
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/soumyajyoti-devops/managing-shared-environments-with-grace-2f5i" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;1&lt;span class="hidden s:inline"&gt; reaction&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/soumyajyoti-devops/managing-shared-environments-with-grace-2f5i#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              1&lt;span class="hidden s:inline"&gt; comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            4 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>sre</category>
      <category>workplace</category>
      <category>softwaredevelopment</category>
      <category>devops</category>
    </item>
    <item>
      <title>Managing Shared Environments with Grace</title>
      <dc:creator>Soumyajyoti Mahalanobish</dc:creator>
      <pubDate>Wed, 14 May 2025 16:55:34 +0000</pubDate>
      <link>https://dev.to/soumyajyoti-devops/managing-shared-environments-with-grace-2f5i</link>
      <guid>https://dev.to/soumyajyoti-devops/managing-shared-environments-with-grace-2f5i</guid>
      <description>&lt;h2&gt;
  
  
  1. The Problem: A Single Test Environment, Too Many Developers
&lt;/h2&gt;

&lt;p&gt;Every day, developers from multiple squads need access to a shared environment for staging deployments and end‑to‑end tests. Conflicts are routine. People hop into the environment mid‑session or, worse, forget to release it.&lt;/p&gt;

&lt;p&gt;Imagine this: Developer A is debugging a flaky integration test. Mid‑way, Developer B deploys a different branch, unaware that the environment is in use. Chaos ensues. Slack threads grow long. Tempers rise. No one's to blame, but everyone's affected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia2.giphy.com%2Fmedia%2Fv1.Y2lkPTc5MGI3NjExMXJ3dnFlOG9kOWd6dzduaHdmc2pkcW1zbDIydmxtZDU2dHBnMnFtaSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw%2FfbViHHmKsUYRCG7oa4%2Fgiphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia2.giphy.com%2Fmedia%2Fv1.Y2lkPTc5MGI3NjExMXJ3dnFlOG9kOWd6dzduaHdmc2pkcW1zbDIydmxtZDU2dHBnMnFtaSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw%2FfbViHHmKsUYRCG7oa4%2Fgiphy.gif" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This scenario calls for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A way to reserve the environment&lt;/li&gt;
&lt;li&gt;Visibility into who's using it&lt;/li&gt;
&lt;li&gt;Notifications for the next person in line&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. The Idea: A Self‑Service Queue with Visibility
&lt;/h2&gt;

&lt;p&gt;This prototype envisions a lean and elegant system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers click to join the queue&lt;/li&gt;
&lt;li&gt;When their turn comes, they get notified on Slack&lt;/li&gt;
&lt;li&gt;They can lock the environment with a single click&lt;/li&gt;
&lt;li&gt;A reason is recorded to inform others of usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No separate access control, no complicated UIs—just a small, robust system that integrates into developer tools and Slack.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. System Architecture
&lt;/h2&gt;

&lt;p&gt;The architecture comprises:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Twirp‑powered Go backend – lightweight RPCs over HTTP&lt;/li&gt;
&lt;li&gt;Redis – queue state and locking with TTL&lt;/li&gt;
&lt;li&gt;Slack – direct user notifications&lt;/li&gt;
&lt;li&gt;Frontend – embedded in internal dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1yi5la73gc50ikuirs3d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1yi5la73gc50ikuirs3d.png" alt="Minimal-Architecture" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Redis stores per‑application queues as lists (e.g., &lt;code&gt;queue:env:shared&lt;/code&gt;) and lock state as a hash (e.g., &lt;code&gt;lock:env:shared&lt;/code&gt;). Queue position, TTLs, and Slack IDs are all stored as structured JSON entries.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Protobuf Definitions
&lt;/h2&gt;

&lt;p&gt;Interfaces are defined via &lt;code&gt;buf.gen.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight protobuf"&gt;&lt;code&gt;&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;QueueEntry&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;user_email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;user_name&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;slack_id&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;int64&lt;/span&gt;  &lt;span class="na"&gt;timestamp&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;JoinQueueRequest&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt;     &lt;span class="na"&gt;app_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;QueueEntry&lt;/span&gt; &lt;span class="na"&gt;entry&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;LeaveQueueRequest&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;app_name&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;user_email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;LockState&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;user_email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;int64&lt;/span&gt;  &lt;span class="na"&gt;timestamp&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;int64&lt;/span&gt;  &lt;span class="na"&gt;expires_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;service&lt;/span&gt; &lt;span class="n"&gt;Deployments&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;rpc&lt;/span&gt; &lt;span class="n"&gt;JoinQueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;JoinQueueRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;JoinQueueResponse&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;rpc&lt;/span&gt; &lt;span class="n"&gt;LeaveQueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LeaveQueueRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LeaveQueueResponse&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;rpc&lt;/span&gt; &lt;span class="n"&gt;GetQueueStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GetQueueStatusRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GetQueueStatusResponse&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Go Implementation Highlights
&lt;/h2&gt;

&lt;p&gt;The backend implements Twirp RPCs and Redis atomic operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Joining the Queue
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Marshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TxPipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LRange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queueKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RPush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queueKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cmds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;entries&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;cmds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringSliceCmd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Val&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Entry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UserEmail&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"already in queue"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Promoting Lock on Leave
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TxPipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LRem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queueKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;leavingEntry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queueKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;nextEntryJSON&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringCmd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Val&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;nextEntryJSON&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;nextEntry&lt;/span&gt; &lt;span class="n"&gt;deployment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QueueEntry&lt;/span&gt;
  &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unmarshal&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nextEntryJSON&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;nextEntry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HSet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lockKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{}{&lt;/span&gt;
    &lt;span class="s"&gt;"user_email"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;nextEntry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UserEmail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"reason"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;nextEntry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unix&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="n"&gt;notifySlack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nextEntry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. Slack Integration
&lt;/h2&gt;

&lt;p&gt;Slack alerts are essential to ensure seamless transitions. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4q39v430tbx9tmoq9ha9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4q39v430tbx9tmoq9ha9.png" alt="Slack-Bot" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notifications are enhanced with Block Kit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;blocks&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;BlockElement&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;TextBlock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"You are now first in the test queue! Reason: *%s*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reason&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
  &lt;span class="n"&gt;Button&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Take Lock"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lockURL&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;SendSlackMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SlackId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Retries and exponential back‑off are handled via a wrapper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sendToSlackAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;break&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. Frontend Integration
&lt;/h2&gt;

&lt;p&gt;A single button toggles lock intent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handleToggle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`/api/toggle-lock`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nf"&gt;refreshQueue&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;UX states:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not in queue → Join&lt;/li&gt;
&lt;li&gt;First in queue → Take lock&lt;/li&gt;
&lt;li&gt;Has lock → Release and notify next&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real‑time updates are supported via Redis pub/sub.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Testing: Critical for Internal Developer Platforms
&lt;/h2&gt;

&lt;p&gt;Thorough testing is essential for internal developer platforms for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trust and Adoption&lt;/strong&gt;: Developers will only use tools they can rely on. A flaky queue management system will be abandoned quickly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure Amplification&lt;/strong&gt;: A bug in a developer platform affects the productivity of entire engineering teams, not just individual users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex State Transitions&lt;/strong&gt;: Queue management involves multiple state transitions that must be tested rigorously to prevent deadlocks or lost queue positions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed Consistency&lt;/strong&gt;: Ensuring consistent state between Redis, the application, and Slack notifications requires robust integration testing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Recovery&lt;/strong&gt;: Systems must be resilient to network issues, Redis failures, or Slack API outages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unit + integration tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestLockTransition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c"&gt;// Test Redis LRem, HSet, LIndex chain&lt;/span&gt;
  &lt;span class="c"&gt;// Validate Slack calls were triggered&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestQueueConcurrency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c"&gt;// Validate queue integrity with concurrent join/leave operations&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestSlackNotificationRetries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c"&gt;// Ensure backoff strategy properly handles temporary Slack outages&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This setup demonstrates how simple components like Redis, Slack, and Go can come together to orchestrate shared environment access with minimal friction. The queue model introduces fairness and predictability to what was once a chaotic workflow. If your developers are wrestling for staging access, it might be time to queue up—with grace.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The 3 AM Alert That Wasn't Actually a Problem</title>
      <dc:creator>Soumyajyoti Mahalanobish</dc:creator>
      <pubDate>Thu, 08 May 2025 16:03:55 +0000</pubDate>
      <link>https://dev.to/soumyajyoti-devops/the-3-am-alert-that-wasnt-actually-a-problem-1ah5</link>
      <guid>https://dev.to/soumyajyoti-devops/the-3-am-alert-that-wasnt-actually-a-problem-1ah5</guid>
      <description>&lt;p&gt;&lt;em&gt;By Soumyajyoti, Senior Software Engineer @ ProtectAI&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia4.giphy.com%2Fmedia%2Fv1.Y2lkPTc5MGI3NjExbnU2NmJxOXp0cnhmb2ZhaG1qYXdsb2Y3cDdoNmw1NWoxODBicGh6diZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw%2F5qoRdabXeT4GY%2Fgiphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia4.giphy.com%2Fmedia%2Fv1.Y2lkPTc5MGI3NjExbnU2NmJxOXp0cnhmb2ZhaG1qYXdsb2Y3cDdoNmw1NWoxODBicGh6diZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw%2F5qoRdabXeT4GY%2Fgiphy.gif" width="439" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's 3 AM. Your phone buzzes aggressively with an alert notification:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;DatasourceNoData | 𝗙𝗜𝗥𝗜𝗡𝗚&lt;/strong&gt;&lt;br&gt;
→ Memory usage of [no value] in [no value] namespace is at %!f(string=)% (&amp;gt;80%)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Half-asleep, you grab your laptop, VPN into your infrastructure, and... everything's fine. No pods are crashing, no services are down, and CPU/memory usage across the cluster is normal. You've just been woken up by a &lt;em&gt;false alert&lt;/em&gt; caused by a temporary blip in your monitoring system.&lt;/p&gt;

&lt;p&gt;If this sounds familiar, you're not alone. Many Kubernetes operators and SREs face this exact challenge: how do you create robust monitoring alerts that catch real issues but don't wake you up for temporary glitches?&lt;/p&gt;

&lt;p&gt;In this post, I'll share practical techniques our team developed for building reliable Grafana alerts for Kubernetes environments that strike the perfect balance between sensitivity and resilience.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Costs of Alert Fatigue
&lt;/h2&gt;

&lt;p&gt;Before diving into solutions, let's acknowledge why this problem matters.&lt;/p&gt;

&lt;p&gt;False alerts aren't just annoying—they're expensive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Engineer time and focus:&lt;/strong&gt; Each unnecessary interruption costs ~23 minutes of recovery time to get back to productive work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diminished alert credibility:&lt;/strong&gt; Teams start ignoring alerts after too many false positives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SRE burnout:&lt;/strong&gt; Nobody wants to be the on-call engineer for a system that cries wolf&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At our organization, we discovered that over 60% of our after-hours alerts were false positives caused by monitoring system hiccups rather than actual infrastructure issues. That's why we invested time in refining our alerting strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Pillars of Reliable Kubernetes Alerts
&lt;/h2&gt;

&lt;p&gt;Through trial and error, we've identified four key strategies that have dramatically reduced our false-positive rate while maintaining our ability to catch real issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Targeting the Right Workloads: Filtering for What Matters
&lt;/h3&gt;

&lt;p&gt;When monitoring a Kubernetes cluster with dozens of namespaces and hundreds of deployments, alerting on everything is a recipe for noise. Instead, focus on what truly matters.&lt;/p&gt;

&lt;p&gt;Here's how we built targeted CPU utilization alerts for critical deployments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# CPU usage as percentage of limits for critical containers
sum by(namespace, container) (
  rate(container_cpu_usage_seconds_total{container=~"app-api|myworker"}[5m])
) / (
  sum by(namespace, container) (
    kube_pod_container_resource_limits{resource="cpu"}
  )
) * 100
or vector(0)  # Prevent "no data" scenarios
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query targets specific containers that your application depends on. The &lt;code&gt;or vector(0)&lt;/code&gt; ensures the query always returns data, even when no matches are found.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Beyond Simple Thresholds: Intelligent Alert Conditions
&lt;/h3&gt;

&lt;p&gt;Alerting when any pod hits 80% CPU for a split second will bury you in notifications. Instead, use these techniques for more intelligent conditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add time requirements:&lt;/strong&gt; Require conditions to persist for a meaningful period&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider rate of change:&lt;/strong&gt; Alert on rapid increases, not just absolute values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use contextual thresholds:&lt;/strong&gt; Different workloads have different "normal" ranges&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our Grafana configuration, we set a 2-minute "Pending period" before firing alerts:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj01gvaplpydfekj3fa5r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj01gvaplpydfekj3fa5r.png" alt="Image description" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means a threshold must be exceeded continuously for 2 minutes before an alert fires, eliminating most transient spikes.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The &lt;code&gt;or vector(0)&lt;/code&gt; Pattern: Ensuring Queries Always Return Data
&lt;/h3&gt;

&lt;p&gt;One of the most frustrating Grafana alert issues occurs when your query returns no data, resulting in &lt;code&gt;[no value]&lt;/code&gt; placeholders in alert notifications. This commonly happens when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics are temporarily unavailable&lt;/li&gt;
&lt;li&gt;The data source connection experiences a brief hiccup&lt;/li&gt;
&lt;li&gt;A label selector matches nothing (e.g., a pod is no longer running)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The solution? Add &lt;code&gt;or vector(0)&lt;/code&gt; to your PromQL queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Without vector(0) - can lead to "no data" alerts:
sum(kube_pod_status_phase{namespace="production", phase="Pending"})

# With vector(0) - always returns data:
sum(kube_pod_status_phase{namespace="production", phase="Pending"}) or vector(0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern ensures your query always returns values (with zeroes when no match), preventing those cryptic &lt;code&gt;[no value]&lt;/code&gt; notifications.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The Secret Weapon: Proper No-Data and Error Handling
&lt;/h3&gt;

&lt;p&gt;Even with perfect queries, temporary issues with Prometheus or Grafana can cause alert evaluation to fail. That's where Grafana's built-in error handling comes in.&lt;/p&gt;

&lt;p&gt;In your alert configuration, find the "Configure no data and error handling" section and set:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"Alert state if no data or all values are null" to "Normal"&lt;/li&gt;
&lt;li&gt;"Alert state if execution error or timeout" to "Normal"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5kanf86ioz5o19pfjp9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5kanf86ioz5o19pfjp9.png" alt="Image description" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This configuration tells Grafana to treat temporary data issues as normal conditions rather than alerting triggers. It's like telling your alert system, "If you're not sure, don't wake me up."&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Sleep Better with Robust Alerting
&lt;/h2&gt;

&lt;p&gt;Remember that alert tuning is an iterative process. Start with your most problematic alerts, implement these patterns, and observe the results before proceeding to others.&lt;/p&gt;

&lt;p&gt;By implementing this, you'll create a monitoring system that respects your team's time and attention while still providing the critical safety net you need for production systems.&lt;/p&gt;

&lt;p&gt;The next time you're on call, you might actually get some sleep!&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
