DEV Community: Soumyajyoti Mahalanobish

Soumyajyoti Mahalanobish — Tue, 01 Jul 2025 07:57:38 +0000

Monitoring Celery Workers with Flower: Your Tasks Need Babysitting

#opensource #python #monitoring #sitereliabilityengineering

Soumyajyoti Mahalanobish — Tue, 01 Jul 2025 07:57:22 +0000

So you've got Celery workers happily executing tasks in your Kubernetes cluster, but you're flying blind. Your workers could be on fire or stuck behind an endless queue, and you'd be the one who gets blamed.

We've all been there, staring at logs and hoping to divine the health of our distributed systems. Time to set up some proper monitoring.

Celery is one way of doing distributed task processing, but it's opaque when it comes to observability. You can see logs, but logs don't tell you whether workers are healthy, how long tasks are taking, or whether your queue is backing up. That's where Flower comes in: it's one of the monitoring tools built for Celery environments.

This guide covers integrating Flower with Prometheus and Grafana to get proper metrics-driven monitoring. Whether you're using Grafana Cloud, self-hosted Grafana, the k8s-monitoring Helm chart, or individual components, we'll walk through the setup, explain why each piece matters, and tackle the gotchas.

What You'll Need

Kubernetes cluster with Celery workers already running
Some form of Prometheus-compatible metrics collection (Alloy, Prometheus Operator, plain Prometheus, etc.)
Grafana instance (cloud or self-hosted)
Basic Kubernetes knowledge
Patience for the inevitable configuration mysteries

Understanding the Architecture

Before diving into configuration, let's understand what we're building. Flower sits between your Celery workers and your monitoring system. It connects to your message broker (Redis/RabbitMQ), watches worker activity, and exposes metrics in Prometheus format.

The flow looks like this:

Celery workers process tasks from the broker
Flower monitors the broker and worker activity
Flower exposes metrics at /metrics endpoint
Your metrics collector (Prometheus/Alloy) scrapes these metrics
Grafana visualizes the data

The key insight is that Flower doesn't instrument the worker processes directly. It watches the broker's state and the events workers publish through it, which is why it can give you a complete picture of your distributed system.

The Setup: Flower with Prometheus Metrics

Here's the thing about Flower: it's great at showing you pretty graphs in its web UI, but getting it to export metrics for Prometheus requires a specific flag that's easy to miss. By default, Flower only exposes basic Python process metrics, which are useless for understanding your Celery workload.

Deploy Flower (the Right Way)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flower
  labels:
    app: flower
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flower
  template:
    metadata:
      labels:
        app: flower
    spec:
      containers:
      - name: flower
        image: mher/flower:latest
        ports:
        - containerPort: 5555
        env:
        - name: CELERY_BROKER_URL
          value: "redis://your-redis-service:6379/0"
        command: 
        - celery
        - flower
        - --broker=redis://your-redis-service:6379/0
        - --port=5555
        - --prometheus_metrics  # This is the magic flag

That --prometheus_metrics flag is doing the heavy lifting here. Without it, you'll get basic Python process metrics (memory usage, GC stats, etc.) but none of the Celery-specific goodness like worker status, task counts, or queue depths. This flag tells Flower to export its internal monitoring data in Prometheus format.

The broker URL needs to match exactly what your Celery workers are using. Flower connects to the same broker to observe worker activity and task flow. If there's a mismatch, Flower won't see your workers.

Service Configuration

apiVersion: v1
kind: Service
metadata:
  name: flower-service
  labels:
    app: flower
spec:
  selector:
    app: flower
  ports:
  - name: metrics  # Named ports help with service discovery
    port: 5555
    targetPort: 5555

The named port (metrics) is crucial for ServiceMonitor configurations later. Many monitoring setups rely on port names rather than numbers for service discovery, making your configuration more resilient to port changes.

Metrics Collection: Choose Your Adventure

How you get these metrics into your monitoring system depends entirely on your infrastructure setup. Kubernetes monitoring has evolved into several different patterns, each with its own tradeoffs.

Option 1: ServiceMonitor (Prometheus Operator/k8s-monitoring)

ServiceMonitors are part of the Prometheus Operator ecosystem and provide declarative configuration for scrape targets. They're the cleanest approach if you're using Prometheus Operator or the k8s-monitoring Helm chart.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: flower-metrics
  labels:
    app: flower
spec:
  selector:
    matchLabels:
      app: flower
  endpoints:
  - port: metrics      # References the named port
    path: /metrics
    interval: 30s
    scrapeTimeout: 20s
  namespaceSelector:
    matchNames:
    - your-namespace

The critical detail here is using port: metrics rather than targetPort: metrics. In a ServiceMonitor endpoint, port refers to the Service's named port, not the container port directly. This indirection allows you to change container ports without updating monitoring configs.

Getting this configuration right requires the same attention to detail as any other infrastructure code. One character difference can mean the difference between working monitoring and hours of debugging.

Here, the namespaceSelector controls which namespaces this ServiceMonitor searches for matching Services. Without it, the ServiceMonitor only looks in its own namespace. Setting any: true instead would match Services across all namespaces, which can cause confusion in multitenant clusters.

Option 2: Prometheus Annotations

If you're using vanilla Prometheus with annotation-based discovery, you configure scraping through service annotations. This is simpler but less flexible than ServiceMonitors.

apiVersion: v1
kind: Service
metadata:
  name: flower-service
  labels:
    app: flower
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "5555"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: flower
  ports:
  - name: metrics
    port: 5555
    targetPort: 5555

The annotations tell Prometheus to scrape this service. Your Prometheus configuration needs to include a job that discovers services with these annotations. This approach is more straightforward but offers less control over scraping behavior.

Option 3: Alloy Configuration (Manual)

Grafana Alloy offers more flexibility than traditional Prometheus. You can configure complex discovery and relabeling rules to handle dynamic environments.

# Add to your Alloy config
discovery.kubernetes "flower_pods" {
  role = "pod"
  selectors {
    role = "pod"
    label = "app=flower"
  }
}

discovery.relabel "flower_pods" {
  targets = discovery.kubernetes.flower_pods.targets
  rule {
    source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"]
    action = "keep"
    regex = "true"
  }
  rule {
    source_labels = ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"]
    action = "replace"
    regex = "([^:]+)(?::\\d+)?;(\\d+)"
    replacement = "${1}:${2}"
    target_label = "__address__"
  }
}

prometheus.scrape "flower_metrics" {
  targets    = discovery.relabel.flower_pods.output
  forward_to = [prometheus.remote_write.your_destination.receiver]
  scrape_interval = "30s"
}

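This configuration discovers pods with the app=flower label, applies relabeling rules to construct proper scrape targets, and forwards metrics to your storage backend. The relabeling rules transform Kubernetes metadata into the format Prometheus expects. Note that the keep rule drops every pod that lacks the prometheus.io/scrape: "true" annotation, and the address rule reads prometheus.io/port, so both annotations must be set on the Flower pod template (the Deployment shown earlier doesn't set them).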
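To see what the address-rewriting rule does, here's a rough Python sketch of the same logic. It assumes the conventional form of the pattern, ([^:]+)(?::\d+)?;(\d+), and the usual relabeling behavior of joining source labels with ";" and matching the whole string. The function name and the sample addresses are made up for illustration.

```python
import re

# Conventional pattern for "<host>[:port];<annotation port>".
# Relabel regexes are fully anchored, so re.fullmatch mirrors that behavior.
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, annotation_port: str) -> str:
    """Mimic the 'replace' rule: join source labels with ';', then substitute."""
    joined = f"{address};{annotation_port}"
    match = ADDRESS_RULE.fullmatch(joined)
    if match is None:
        # A replace rule that does not match leaves the target label untouched.
        return address
    return f"{match.group(1)}:{match.group(2)}"


print(rewrite_address("10.42.0.17:8080", "5555"))  # port swapped for the annotation port
print(rewrite_address("10.42.0.17", "5555"))       # port appended
print(rewrite_address("10.42.0.17:8080", ""))      # no annotation, address unchanged
```

If the colon inside the optional group is missing, the first case stops matching and the scrape address silently keeps its original port, which is exactly the kind of one-character bug that costs hours.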

Option 4: Static Prometheus Config

For simple setups or development environments, static configuration is the most straightforward approach.

# prometheus.yml
scrape_configs:
  - job_name: 'flower'
    static_configs:
      - targets: ['flower-service.your-namespace.svc.cluster.local:5555']
    metrics_path: '/metrics'
    scrape_interval: 30s

This hardcodes the service endpoint, which works fine for stable environments but doesn't handle dynamic scaling or service changes gracefully.

Verification: Making Sure It Actually Works

Before diving into dashboard creation, verify that metrics are flowing correctly. This saves hours of troubleshooting later when you're wondering why your graphs are empty.

Check the Metrics Endpoint

kubectl port-forward svc/flower-service 5555:5555
curl http://localhost:5555/metrics

You should see metrics that look like this:

flower_worker_online{worker="celery@worker-1"} 1.0
flower_events_total{task="process_data",type="task-sent"} 127.0
flower_worker_number_of_currently_executing_tasks{worker="celery@worker-1"} 3.0
flower_task_prefetch_time_seconds{task="process_data",worker="celery@worker-1"} 0.001

If you're only seeing basic Python metrics (python_gc_objects_collected_total, process_resident_memory_bytes, etc.), you're missing the --prometheus_metrics flag. The Celery-specific metrics are what make this whole exercise worthwhile.
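If you'd rather not eyeball the output, a small script can do the check. This is a minimal sketch that assumes you already have the /metrics response as a string; the helper names are hypothetical and the sample text is trimmed from the output shown above.

```python
def metric_names(exposition_text: str) -> set[str]:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def has_celery_metrics(exposition_text: str) -> bool:
    """True if at least one flower_* metric is present."""
    return any(name.startswith("flower_") for name in metric_names(exposition_text))


sample = """\
# TYPE flower_worker_online gauge
flower_worker_online{worker="celery@worker-1"} 1.0
flower_events_total{task="process_data",type="task-sent"} 127.0
process_resident_memory_bytes 5.2e+07
"""
print(has_celery_metrics(sample))
print(has_celery_metrics('python_gc_objects_collected_total{generation="0"} 42.0\n'))
```

The first call reports True and the second False, which is the "only basic Python metrics" situation described above.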

Check Your Monitoring System

The verification process depends on your monitoring setup:

For ServiceMonitor setups: Check the targets page of your Prometheus or Alloy UI for discovered targets. Look for your Flower service in the targets list with status "UP".

For annotation-based: Navigate to your Prometheus targets page (/targets) and verify the Flower job appears with healthy status.

For manual configs: Check your collector's logs for any scraping errors and verify the target appears in the monitoring system's target list.

Understanding the Metrics

Flower exports several categories of metrics, each providing different insights into your Celery system:

Worker Metrics: flower_worker_online tells you which workers are active. flower_worker_number_of_currently_executing_tasks shows current load per worker.

Event Metrics: flower_events_total tracks task lifecycle events (sent, received, started, succeeded, failed). These form the basis for throughput and success rate calculations.

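Timing Metrics: flower_task_runtime_seconds (histogram) shows task execution duration. flower_task_prefetch_time_seconds measures how long a task waits at the worker between being received and starting to execute, which makes it a useful proxy for queue wait time.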

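Queue Metrics: flower_worker_prefetched_tasks counts tasks a worker has reserved but not yet started. Combined with the event counters, it helps you understand queue depth and processing patterns.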

Building Useful Dashboards

Now for the payoff - turning those metrics into actionable insights. The key is building dashboards that help you answer specific operational questions.

Essential Queries

Worker Health Questions: "Are my workers running? How many are active?"

# Total online workers
sum(flower_worker_online)

# Per-worker status
flower_worker_online

Throughput Questions: "How many tasks are we processing? Is throughput increasing?"

# Tasks being sent to workers (per second)
rate(flower_events_total{type="task-sent"}[5m])

# Tasks being received by workers (per second)
rate(flower_events_total{type="task-received"}[5m])
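If rate() feels like magic, here's roughly what it computes. This is a simplified sketch, not Prometheus's actual implementation: it skips the extrapolation to the window edges and returns 0.0 where Prometheus would return no data. The sample values are invented.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp, value) samples."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter reset (e.g. Flower restarted), so count from zero.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical task-sent counter scraped every 30s, with one restart in between.
scrapes = [(0, 100.0), (30, 130.0), (60, 160.0), (90, 10.0), (120, 40.0)]
print(simple_rate(scrapes))  # 100 tasks over 120 seconds
```

The reset handling matters for Flower in particular: its counters start from zero again whenever the Flower pod restarts.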

Queue Health Questions: "Is my queue backing up? How long do tasks wait?"

# Tasks currently executing
sum(flower_worker_number_of_currently_executing_tasks)

# Time tasks spend waiting at the worker before they start executing
flower_task_prefetch_time_seconds

Performance Questions: "How long do tasks take? Are they getting slower?"

# 95th percentile task duration
histogram_quantile(0.95, rate(flower_task_runtime_seconds_bucket[5m]))

# Median task duration
histogram_quantile(0.50, rate(flower_task_runtime_seconds_bucket[5m]))
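It helps to know how histogram_quantile arrives at its numbers, because the answer is an estimate, not an exact percentile. The sketch below is a simplified version of the idea (find the bucket containing the target rank, then interpolate linearly inside it). The bucket boundaries and counts are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with +Inf,
    like Prometheus *_bucket series.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # +Inf bucket (or an empty first bucket): report the last finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical task runtimes: 100 tasks, cumulative counts per bucket.
runtime_buckets = [(0.5, 60.0), (1.0, 90.0), (5.0, 98.0), (math.inf, 100.0)]
print(histogram_quantile(0.95, runtime_buckets))  # lands between 1s and 5s
print(histogram_quantile(0.50, runtime_buckets))  # lands inside the first bucket
```

The takeaway for dashboards: the estimate is only as precise as the bucket it lands in, so a wide 1s to 5s bucket gives a fuzzy p95.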

Dashboard Design Philosophy for Celery

Start with high-level health indicators, then provide drill-down capabilities. A good Celery dashboard answers these questions in order:

System Health: Are workers running? Is the system processing tasks?
Throughput: How much work are we doing? Is it increasing or decreasing?
Performance: How fast are tasks completing? Are there performance regressions?
Queue Health: Are tasks backing up? Where are the bottlenecks?

Scaling Considerations

Multiple Worker Types

Real Celery deployments often have specialized workers for different task types. CPU-intensive tasks, I/O-bound tasks, and priority queues all need separate monitoring.

# CPU-intensive work monitor
command: ["celery", "-A", "tasks.cpu", "flower", "--port=5555", "--prometheus_metrics"]

# I/O-bound work monitor  
command: ["celery", "-A", "tasks.io", "flower", "--port=5555", "--prometheus_metrics"]

Each Flower instance monitors a specific Celery app, giving you granular visibility into different workload types. You'll need separate services and scrape configurations for each instance.

This approach lets you set different SLAs and alerting thresholds for different workload types. Your real-time fraud detection tasks might need sub-second response times, while your batch report generation can tolerate longer delays.

Resource Allocation

Flower itself is lightweight, but its resource needs scale with worker count and task frequency. A busy system with hundreds of workers and thousands of tasks per minute will use more memory to track state.

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Self-Hosted Prometheus

For self-hosted setups, configure Grafana to read from your Prometheus instance:

# grafana datasource config
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus-server.monitoring.svc.cluster.local:9090
    access: proxy

This assumes Prometheus and Grafana are in the same cluster. For cross-cluster or external access, you'll need appropriate networking and authentication configuration.

Security Considerations

Production Flower deployments need proper security controls. Flower's web interface shows detailed information about your task processing, which could be sensitive.

Authentication

Enable basic authentication at minimum:

env:
- name: FLOWER_BASIC_AUTH
  value: "admin:your_secure_password"

For production systems, consider OAuth integration or running Flower behind an authentication proxy.

If you don't need the web interface at all, celery-exporter provides similar metrics without the web interface overhead. It's purpose-built for Prometheus integration and might use fewer resources than Flower. However, you lose Flower's web interface for ad-hoc investigation.

Caution!

Getting Celery monitoring right requires attention to several key details:

The --prometheus_metrics flag transforms Flower from a simple web interface into a proper metrics exporter
Your metrics collection method should match your infrastructure setup and operational preferences
ServiceMonitor port configuration matters - port references service ports, targetPort references container ports
Label matching between ServiceMonitors, services, and pods must be exact
Your monitoring system's target discovery UI is invaluable for debugging configuration issues

The setup might seem complicated, but each piece serves a specific purpose in building a robust monitoring system. Once you have this foundation, you can extend it with alerting rules, additional dashboards, and integration with your incident response workflow.

Soumyajyoti Mahalanobish — Thu, 05 Jun 2025 05:16:27 +0000

A Young Engineer's Guide to SLIs, SLOs, SLAs

This one's a long one 👀

Soumyajyoti Mahalanobish — Thu, 05 Jun 2025 05:08:00 +0000

Soumyajyoti Mahalanobish — Thu, 05 Jun 2025 05:06:16 +0000

If you're an early-career engineer drowning in observability dashboards, wrestling with Prometheus queries in Grafana's Explore page, and wondering what exactly lies beyond your observability stack setup, this guide is for you. This is a primer on proactive reliability engineering using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).

Let's assume a fictional SaaS platform - PaymentPro, a payment processing service - and learn how to implement proper reliability practices step by step.

But before diving into PromQL queries and YAML configs, let's clarify what these acronyms actually mean:

SLIs (Service Level Indicators) are the metrics that matter to your users. Instead of CPU usage, think "percentage of payments processed successfully." They're quantitative measurements of service behavior as experienced by users.

SLOs (Service Level Objectives) are your internal promises. "We aim to process 99.9% of payments successfully" - that's an SLO. It's what you're shooting for, not what you're contractually obligated to deliver.

SLAs (Service Level Agreements) are the legal promises you make to customers, usually with financial consequences. "We guarantee 99.5% payment success rate or you get service credits" - that's an SLA. Always set your SLAs looser than your SLOs to give yourself breathing room.

Think of it this way:

SLI = What you measure
SLO = What you promise yourself
SLA = What you promise customers (with lawyers involved)
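To make the "breathing room" between an SLO and an SLA concrete, here's a quick back-of-the-envelope calculation using the 99.9% and 99.5% targets from above. The monthly volume is a made-up number and the function name is hypothetical.

```python
from fractions import Fraction


def allowed_failures(target_percent: str, total_events: int) -> int:
    """Failures a success-rate target (e.g. "99.9") tolerates out of total_events."""
    failure_share = 1 - Fraction(target_percent) / 100  # exact, no float rounding
    return int(failure_share * total_events)


monthly_payments = 1_000_000  # hypothetical volume
slo_allowance = allowed_failures("99.9", monthly_payments)  # internal objective
sla_allowance = allowed_failures("99.5", monthly_payments)  # contractual promise
print(slo_allowance, sla_allowance, sla_allowance - slo_allowance)
```

With these numbers you can miss your own objective by a few thousand failed payments before any customer is owed service credits, which is the whole point of keeping the SLA looser.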

The two faces of SLIs: Infrastructure vs Application

Here's what many engineers miss: SLIs exist at multiple layers. You need both infrastructure SLIs and application functionality SLIs to get the complete picture.

Infrastructure SLIs answer the question: "Is my service reachable and responsive?"

API endpoint availability
Response time
Error rates
Throughput

Application Functionality SLIs answer the question: "Is my service doing what users expect?"

Business transactions completed successfully
Data consistency maintained
User workflows functioning correctly
Business rules properly enforced

For PaymentPro, here's the difference:

# Infrastructure SLI: "Can users reach the payment API?"
sum(rate(http_requests_total{status!~"5.."}[5m])) / 
sum(rate(http_requests_total[5m]))

# Application SLI: "Can users actually complete payments?"
sum(rate(payments_completed_total{status="success"}[5m])) /
sum(rate(payments_initiated_total[5m]))

The API might be up (infrastructure ✓) but payments could be failing due to a third-party service issue (application ✗).
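Here's that divergence with some numbers plugged in. This is a toy calculation with invented counts for a five-minute window, not real PaymentPro data.

```python
from typing import Optional


def ratio(good: float, total: float) -> Optional[float]:
    """Good/total ratio; None when there was no traffic to judge."""
    return good / total if total > 0 else None


# Hypothetical 5-minute window during a third-party processor outage.
http_total, http_5xx = 12_000, 6
payments_initiated, payments_succeeded = 3_000, 2_100

infra_sli = ratio(http_total - http_5xx, http_total)
app_sli = ratio(payments_succeeded, payments_initiated)
print(f"infrastructure: {infra_sli:.4f}, application: {app_sli:.4f}")
```

The API looks almost perfect at 0.9995 while only 70% of payments actually go through. An infrastructure-only dashboard would stay green through this incident.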

The SLI discovery process: Just ask questions

Before you write a single PromQL query, you need to understand what actually matters to your users and business. Here's a framework for discovering the right SLIs through conversations with different stakeholders.

Questions for product folks

"What does success look like for a user of this service?"
- Expected answer: "Users can process payments quickly and reliably"
- Your follow-up: "Define 'quickly' - is that 200ms? 2 seconds? What's the threshold where users complain?"
"What are the critical user journeys?"
- Map out step-by-step what users do
- Identify which steps are mandatory vs optional
- Understand dependencies between steps
"What complaints do we get from users?"
- Look for patterns: "payments are slow" → latency SLI
- "I can't see my payment history" → data availability SLI
- "payments sometimes disappear" → consistency SLI
"What would make a customer leave us for a competitor?"
- This reveals your true SLAs
- Often different from what engineering thinks

Questions for developers

"What keeps you up at night about this service?"

   Developer: "The payment webhook system. If it fails, merchants don't get notifications."
   You: "How often does it fail? How do we know when it fails?"

"What are the critical code paths?"
- Have them trace through the code
- Identify external dependencies
- Understand retry logic and failure modes
"What metrics do you check during an incident?"
- These are your SLI candidates
- Ask: "Why this metric and not others?"
"What's the difference between the service being 'up' vs 'working correctly'?"
- This reveals the gap between infrastructure and application SLIs

Questions for customer support

"What are the top 3 issues customers report?"
- Quantify: how many tickets per week?
- Severity: how angry are customers?
"How do you know when something is wrong before customers complain?"
- They often have informal monitoring
- These instincts can become SLIs

The metric audit worksheet

Create a spreadsheet during discovery:

Metric	Source	Type	User Impact	Measurable?	Candidate?
Payment success rate	Devs	Application	Direct - payment fails	Yes - payment_status	✓
API latency	Ops	Infrastructure	Indirect - UX degraded	Yes - histogram	✓
Database replication lag	Ops	Infrastructure	Indirect - stale data	Yes - seconds_behind	Maybe
Background job queue depth	Devs	Application	Delayed - notifications late	Yes - queue_size	✓
Memory usage	Monitoring	Infrastructure	None until OOM	Yes - percentage	✗

The SLI selection framework

Not every metric should become an SLI. Here's a decision framework to evaluate each candidate:

Step 1: Does it have user impact?

Yes: Users directly experience this when it fails
No: It's purely internal with no user visibility
If no → Stop here, it's not an SLI

Step 2: Can we measure it reliably?

Yes: We have consistent, accurate data
No: Data is sporadic, estimated, or manually collected
If no → Stop here, find a measurable proxy

Step 3: Is it actionable?

Yes: When it degrades, we know what to fix
No: It's just an interesting number with no clear response
If no → Stop here, it won't drive improvements

Step 4: Is it a cause or just a symptom?

Cause: It's the root cause of user issues
Symptom: It's a symptom of deeper problems
If symptom → Consider tracking the underlying cause instead

Step 5: Does it cover critical user journeys?

Yes: It represents core functionality users depend on
No: It's a nice-to-have feature
If yes → Strong SLI candidate!

How this looks in practice:

# Many teams codify this decision tree for consistency
MAYBE = "maybe"  # third verdict: needs human judgment

def should_be_sli(metric):
    """Return (verdict, reason); verdict is True, False, or MAYBE."""
    # Step 1: User impact
    if not metric.has_user_impact:
        return False, "No direct user impact"

    # Step 2: Measurability
    if not metric.is_measurable:
        return False, "Cannot measure reliably"

    # Step 3: Actionability
    if not metric.is_actionable:
        return False, "No clear action when it degrades"

    # Step 4: Is it a symptom or a cause?
    if metric.is_symptom:
        # Symptoms can be indicators, but prefer causes
        return MAYBE, "Consider the underlying cause instead"

    # Step 5: Coverage
    if metric.covers_critical_user_journey:
        return True, "Covers critical functionality"

    return MAYBE, "Evaluate against other candidates"

Real-world example: Payment service SLI discovery

Let's walk through discovering SLIs for our PaymentPro service:

Session with Product Manager:

You: "What does success look like for payment processing?"
PM: "Merchants can accept payments 24/7, funds arrive quickly, and they get real-time notifications."

You: "Let's break that down. What does '24/7' mean exactly?"
PM: "The API should always accept payment attempts. Even if a payment fails due to insufficient funds, the API itself should respond."

You: "How quickly should funds arrive?"
PM: "For card payments, authorization within 2 seconds. Settlement is daily batch, not real-time."

You: "What about notifications?"
PM: "Webhooks should fire within 30 seconds of payment completion. Merchants build their whole flow around these."

Extracted SLI candidates:

API availability (infrastructure)
Payment authorization latency (application)
Webhook delivery time (application)
Webhook delivery success rate (application)

Session with Lead Developer:

You: "Walk me through a payment request."
Dev: "Request hits API gateway → auth service → fraud check → payment processor → webhook queue → notification service."

You: "What can go wrong at each step?"
Dev: "Gateway: rare issues. Auth: tokens expire. Fraud check: sometimes times out. Payment processor: most failures here. Webhook: queue can back up."

You: "How do you know the payment actually processed?"
Dev: "We check the payment_status in the database, but also reconcile with the processor daily."

Additional SLI candidates:

Payment reconciliation accuracy (application)
Fraud check availability (infrastructure)
Queue processing time (application)

Categorizing your SLIs

Once you have candidates, organize them into three main categories:

Infrastructure SLIs - The foundation layer:

API Availability: Is the payment API responding to requests?
- Metric: HTTP request success rate
- Threshold: Less than 0.1% 5xx errors
API Latency: How fast does the API respond?
- Metric: Response time distribution
- Threshold: 95th percentile under 200ms

Application SLIs - The business logic layer:

Payment Success Rate: Are payments completing successfully?
- Metric: Ratio of completed vs initiated payments
- Threshold: 99.9% success rate
Webhook Delivery: Are merchants getting notifications?
- Metric: Time from payment to webhook delivery
- Threshold: 99th percentile under 30 seconds
Payment Reconciliation: Do our records match the payment processor?
- Metric: Daily reconciliation mismatches
- Threshold: Zero tolerance for mismatches

Business SLIs - The customer experience layer:

Merchant Activation Time: How quickly can new merchants start accepting payments?
- Metric: Time from signup to first successful payment
- Threshold: 90% activated within 24 hours

How teams typically document this:

# sli-inventory.yaml - Many teams maintain their SLI catalog in version control
sli_inventory:
  infrastructure:
    - name: api_availability
      description: "Payment API responding to requests"
      metric: "http_requests_total"
      threshold: "status < 500"

    - name: api_latency  
      description: "API response time"
      metric: "http_request_duration_seconds"
      threshold: "p95 < 200ms"

  application:
    - name: payment_success_rate
      description: "Payments completed successfully"
      metric: "payment_status"
      threshold: "status = 'completed'"

    - name: webhook_delivery
      description: "Webhooks delivered within SLA"
      metric: "webhook_delivery_duration_seconds"
      threshold: "p99 < 30s"

    - name: payment_reconciliation
      description: "Payment records match processor"
      metric: "reconciliation_mismatches_total"
      threshold: "mismatches = 0"

  business:
    - name: merchant_activation_time
      description: "Time from signup to first payment"
      metric: "merchant_activation_hours"
      threshold: "p90 < 24h"

This structured format makes it easy to review SLIs with stakeholders and import into monitoring tools.

The metric instrumentation checklist

For each chosen SLI, ensure you can answer:

What's the exact definition? (e.g., "successful payment" means status=completed AND reconciled=true)
Where is it measured? (at the API gateway? In the application? At the database?)
What are the edge cases? (retries? partial failures? timeouts?)
How do we handle missing data? (should gaps in metrics fail open or closed?)
What's the unit and precision? (milliseconds? percentage to 2 decimal places?)
Can we simulate failures? (for testing alerts and dashboards)
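Answering these questions usually ends with the definition written down as code. The sketch below is one way to do it under a few assumptions: attempts arrive as dicts in chronological order, retries share a payment_id, and only the final attempt counts. The field names are hypothetical.

```python
from typing import Optional


def is_successful(payment: dict) -> bool:
    """Exact definition from the checklist: completed AND reconciled."""
    return payment.get("status") == "completed" and payment.get("reconciled") is True


def payment_success_sli(attempts: list[dict]) -> Optional[float]:
    """Success ratio counting each payment once, by its final attempt.

    Returns None when there is no data, so the caller has to decide
    explicitly whether a gap counts as healthy or unhealthy.
    """
    final_attempt: dict[str, dict] = {}
    for attempt in attempts:  # assumed chronological
        final_attempt[attempt["payment_id"]] = attempt  # retries overwrite earlier tries
    if not final_attempt:
        return None
    good = sum(is_successful(p) for p in final_attempt.values())
    return good / len(final_attempt)


attempts = [
    {"payment_id": "p1", "status": "failed", "reconciled": False},
    {"payment_id": "p1", "status": "completed", "reconciled": True},   # retry succeeded
    {"payment_id": "p2", "status": "completed", "reconciled": False},  # not reconciled yet
    {"payment_id": "p3", "status": "timeout"},                         # field missing
]
print(payment_success_sli(attempts))  # one good payment out of three
print(payment_success_sli([]))        # no data: None, not 0% or 100%
```

Each checklist question shows up as a line of code here: the definition, the retry edge case, and the missing-data behavior are all explicit instead of implied.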

Choosing meaningful SLIs for your SaaS

Now that we've discovered what matters through stakeholder interviews, let's implement both infrastructure and application SLIs for PaymentPro.

The four golden signals (with both perspectives)

Google's SRE book identifies four golden signals. Let's implement each from both infrastructure and application viewpoints:

1. Latency

Infrastructure perspective: How fast does the API respond?

# Infrastructure latency - API response time
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

Application perspective: How fast do payments complete end-to-end?

# Application latency - Full payment processing time
histogram_quantile(0.95,
  sum(rate(payment_processing_duration_seconds_bucket[5m])) by (le)
)

2. Traffic

Infrastructure perspective: How many requests are we receiving?

# Infrastructure traffic - Requests per second
sum(rate(http_requests_total[5m])) by (service)

Application perspective: How many business transactions are happening?

# Application traffic - Payment attempts per minute (rate() is per second, so multiply by 60)
sum(rate(payment_attempts_total[1m])) by (payment_type, merchant_tier) * 60

3. Errors

Infrastructure perspective: HTTP errors

# Infrastructure errors - 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / 
sum(rate(http_requests_total[5m]))

Application perspective: Business logic failures

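# Application errors - Share of payment attempts failing, per failure reason (including valid declines)
# The two sides have different labels, so tell PromQL to ignore failure_reason when matching
sum(rate(payment_attempts_total{result!="success"}[5m])) by (failure_reason)
  / ignoring(failure_reason) group_left
sum(rate(payment_attempts_total[5m]))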

4. Saturation

Infrastructure perspective: Resource utilization

# Infrastructure saturation - Database connection pool
(db_connections_active / db_connections_max) > 0.8

Application perspective: Business capacity limits

# Application saturation - Merchant transaction limits
(sum(rate(payment_volume_dollars[1h])) by (merchant_id) / 
 merchant_transaction_limit) > 0.9

Implementing comprehensive SLIs

Let's create a complete SLI implementation that captures both layers. For each SLI, we need to define:

For ratio-based SLIs (like availability):

Good events: Requests that succeeded
Total events: All requests attempted

For threshold-based SLIs (like latency):

Threshold metric: The specific boundary we're measuring against

Here's how to structure your SLIs:

Infrastructure SLIs:

API Availability
- Good events: All non-5xx responses
- Total events: All HTTP requests
API Latency
- Threshold: 95th percentile < 200ms
- Measured at: Load balancer or API gateway

Application SLIs:

Payment Success Rate
- Good events: Payments with status "completed"
- Total events: All payment attempts
Payment Consistency
- Good events: Reconciliation checks that match
- Total events: All reconciliation checks
- Note: Run every 30 minutes for accuracy
Webhook Delivery
- Good events: Webhooks delivered within 30s
- Total events: All webhook attempts
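Notice that the webhook SLI turns a latency threshold into a good/total ratio. Here's a small sketch of that conversion, assuming "within 30s" means 30 seconds or less and that webhooks which were never delivered show up as None. The observed values are invented.

```python
from typing import Optional

WEBHOOK_THRESHOLD_SECONDS = 30.0  # threshold from the webhook delivery SLI


def webhook_delivery_sli(delivery_seconds: list[Optional[float]]) -> Optional[float]:
    """Share of webhooks delivered within the threshold.

    None entries are webhooks that were never delivered; they count
    toward total events but not toward good events.
    """
    if not delivery_seconds:
        return None
    good = sum(
        1 for d in delivery_seconds
        if d is not None and d <= WEBHOOK_THRESHOLD_SECONDS
    )
    return good / len(delivery_seconds)


observed = [1.2, 4.8, 29.9, 31.0, None, 0.7, 12.5, 45.3]
print(webhook_delivery_sli(observed))  # 5 of 8 attempts were good
```

Counting undelivered webhooks in the total is a deliberate choice: dropping them would make the SLI look better exactly when things are at their worst.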

In practice, teams often define these in a structured format for their SLO tools:

# payment-service-slis.yaml
# Illustrative format: tools like Sloth, Pyrra, or OpenSLO each define their own schema around the same good/total idea
service: payment-processor
slis:
  # Infrastructure SLIs
  - name: api_availability
    category: infrastructure
    description: "API endpoint reachability"
    implementation:
      good_events: |
        sum(rate(http_requests_total{status!~"5.."}[5m]))
      total_events: |
        sum(rate(http_requests_total[5m]))

  - name: api_latency
    category: infrastructure  
    description: "API response time for 95th percentile"
    implementation:
      threshold_metric: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
        ) < bool 0.2  # 200ms threshold

  # Application SLIs
  - name: payment_success_rate
    category: application
    description: "Successful payment completion rate"
    implementation:
      good_events: |
        sum(rate(payments_total{status="completed"}[5m]))
      total_events: |
        sum(rate(payments_total[5m]))

  - name: payment_consistency
    category: application
    description: "Payment data consistency with processor"
    implementation:
      good_events: |
        sum(rate(payment_reconciliation_matches_total[30m]))
      total_events: |
        sum(rate(payment_reconciliation_checks_total[30m]))

  - name: webhook_delivery_sli
    category: application
    description: "Webhooks delivered within 30 seconds"
    implementation:
      good_events: |
        sum(rate(webhook_deliveries_total{delivered_within_sla="true"}[5m]))
      total_events: |
        sum(rate(webhook_deliveries_total[5m]))

This structured approach ensures consistency across teams and makes it easy to generate dashboards and alerts automatically.

Advanced application-level SLIs

Some SLIs require custom business logic to calculate:

// Instrumenting application-specific SLIs in your code
package payment

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var (
    // Business transaction SLI
    paymentCompleteness = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "payment_business_completeness_total",
            Help: "Tracks if payment completed all business requirements",
        },
        []string{"complete", "missing_step"},
    )

    // Data quality SLI
    paymentDataQuality = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "payment_data_quality_score",
            Help: "Score indicating payment data completeness",
        },
        []string{"merchant_id"},
    )
)

func ProcessPayment(payment *Payment) error {
    // Track infrastructure metric (httpDuration is assumed to be a
    // prometheus.Histogram defined elsewhere in this package)
    startTime := time.Now()
    defer func() {
        httpDuration.Observe(time.Since(startTime).Seconds())
    }()

    // Business logic
    if err := validatePayment(payment); err != nil {
        return err
    }

    // Process payment...
    result := processWithProvider(payment)

    // Track application metrics
    if result.Success {
        // Check business completeness
        if result.HasReceipt && result.HasWebhook && result.IsReconciled {
            paymentCompleteness.WithLabelValues("true", "none").Inc()
        } else {
            missing := identifyMissingSteps(result)
            paymentCompleteness.WithLabelValues("false", missing).Inc()
        }

        // Calculate data quality score
        qualityScore := calculateDataQuality(payment, result)
        paymentDataQuality.WithLabelValues(payment.MerchantID).Set(qualityScore)
    }

    return nil
}

// Helper to calculate business-specific quality metrics
func calculateDataQuality(payment *Payment, result *Result) float64 {
    score := 0.0
    maxScore := 5.0

    if payment.HasFullAddress() { score += 1.0 }
    if payment.HasValidEmail() { score += 1.0 }
    if payment.HasPhoneNumber() { score += 1.0 }
    if result.FraudScoreAvailable() { score += 1.0 }
    if result.Has3DSecure() { score += 1.0 }

    return score / maxScore
}

Composite SLIs for complex user journeys

Real user experiences often span multiple services. Here's how to create SLIs that reflect this:

# End-to-end payment flow SLI
# User can: authenticate → create payment → receive confirmation
(
  # Auth service available
  avg_over_time(sli:auth_availability:5m[30s]) *
  # Payment service can process
  avg_over_time(sli:payment_success_rate:5m[30s]) *
  # Notification service delivers
  avg_over_time(sli:notification_delivery:5m[30s])
) > 0.995  # The combined journey must succeed at least 99.5% of the time
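To see why the composite is stricter than any single service, here is a minimal sketch of the same multiplication in Python. It assumes the three steps fail independently, and the success rates are made-up illustration values, not measurements from PaymentPro:

```python
from math import prod

def journey_success(step_success_rates):
    """Probability that every step succeeds, assuming steps fail independently."""
    return prod(step_success_rates)

# Hypothetical per-service success rates for one user journey
rates = {"auth": 0.999, "payment": 0.998, "notification": 0.999}
composite = journey_success(rates.values())
print(f"composite={composite:.4f}, meets 99.5% target: {composite > 0.995}")
```

Note that three services sitting at 99.8% each would already miss the 99.5% journey target, even though each one looks healthy on its own.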

Choosing between infrastructure and application SLIs

Use this decision matrix:

Scenario	Infrastructure SLI	Application SLI	Both
API is up but returning empty data	✗	✓	Better
Payments process but webhooks fail	✗	✓	Better
Database is slow	✓	✓	Best
Third-party API degraded	✗	✓	Better
Memory leak causing OOM	✓	✗	OK
Business logic bug	✗	✓	Better

Pro tip: Start with infrastructure SLIs (easier to implement) but quickly add application SLIs (more meaningful).

Setting realistic SLOs (and not shooting yourself in the foot)

Here's where many teams mess up: they set SLOs based on current system performance rather than user needs. Just because your system can deliver 99.99% availability doesn't mean you should promise it.

The cost of nines

Let's do some error budget math. Error budget is the amount of unreliability you're allowed:

Error Budget = 100% - SLO Target

For different SLO targets over 30 days:

99% = 7.2 hours of downtime allowed
99.9% = 43.2 minutes of downtime allowed
99.95% = 21.6 minutes of downtime allowed
99.99% = 4.32 minutes of downtime allowed
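
The figures above fall straight out of the error budget formula. A minimal sketch of the arithmetic, assuming a 30-day window (the function name `allowed_downtime_minutes` is just an illustrative choice):

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime permitted by an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    error_budget = (100.0 - slo_percent) / 100.0  # Error Budget = 100% - SLO target
    return window_minutes * error_budget

for target in (99.0, 99.9, 99.95, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes")
```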

Each additional nine roughly costs 10x more in engineering effort. That jump from 99.9% to 99.99%? That's the difference between "we can deploy during business hours" and "every deployment requires a war room."

Practical SLO setting

For PaymentPro, let's be smart about our SLOs:

# payment-service-slos.yaml
service: payment-service
slos:
  - name: payment-availability
    sli: 
      description: "Percentage of successful payment API calls"
    targets:
      - level: production
        objective: 99.9  # 43 minutes of error budget per month
      - level: staging  
        objective: 99.0  # More relaxed for testing

  - name: payment-latency
    sli:
      description: "95th percentile API response time"
    targets:
      - level: production
        objective: 200ms
      - level: staging
        objective: 500ms

Implementing error budgets that actually work

Error budgets transform the traditional dev vs. ops conflict into a shared objective. Here's the revolutionary idea: unreliability is a feature, not a bug. You "spend" your error budget on:

Feature deployments
Infrastructure migrations
Experiments
Accepted risk

When the budget is exhausted, you stop feature work and focus on reliability. It's that simple.

Let's implement error budget tracking:

# Error budget remaining (30-day window)
# Assuming 99.9% SLO
(
  (0.001) -  # Total budget (1 - 0.999)
  (1 - (sum(rate(payment_requests_total{status!~"5.."}[30d])) / 
        sum(rate(payment_requests_total[30d]))))
) / 0.001 * 100  # Percentage of budget remaining
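
If the PromQL is hard to read, the same calculation can be sketched in Python with plain request counts. The function name and the sample numbers are assumptions for illustration only:

```python
def error_budget_remaining_pct(good: float, total: float, slo: float = 0.999) -> float:
    """Share of the error budget still unspent, in percent (negative once overspent)."""
    if total == 0:
        return 100.0  # no traffic in the window, so nothing has been spent
    budget = 1.0 - slo                       # total budget, e.g. 0.001 for 99.9%
    observed_error_ratio = 1.0 - good / total
    return (budget - observed_error_ratio) / budget * 100.0

# Hypothetical month: 1,000,000 requests, 400 of them failed
print(round(error_budget_remaining_pct(good=999_600, total=1_000_000), 1))
```

With 400 failures out of a million requests, the observed error ratio is 0.04% against a 0.1% budget, so about 60% of the budget is left.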

Burn rate alerts - Your new best friend

Traditional threshold alerts are like smoke detectors that only go off when the house is already on fire. Burn rate alerts warn you when you're consuming error budget too quickly:

groups:
  - name: payment-slo-alerts
    interval: 30s
    rules:
      # Alert if we're burning budget 10x too fast
      # This would exhaust the monthly budget in 3 days
      - alert: PaymentHighErrorBudgetBurn
        expr: |
          (
            # 1-hour burn rate
            (1 - sum(rate(payment_requests_total{status!~"5.."}[1h])) / 
                 sum(rate(payment_requests_total[1h]))) > (10 * 0.001)
          ) and (
            # 5-minute burn rate (to avoid flapping)
            (1 - sum(rate(payment_requests_total{status!~"5.."}[5m])) / 
                 sum(rate(payment_requests_total[5m]))) > (10 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Payment service burning error budget 10x too fast"
          dashboard: "https://grafana.paymentpro.com/d/slo-payment"

The magic of multi-window burn rates: combining short (5m) and long (1h) windows prevents alert flapping while ensuring quick detection.
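
Here is a rough Python sketch of the decision that alert rule makes, assuming you already have error and total counts for each window. The function names and the sample counts are hypothetical:

```python
def burn_rate(errors: int, total: int, slo: float = 0.999) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)

def should_page(long_window, short_window, threshold: float = 10.0, slo: float = 0.999) -> bool:
    """Page only when both the long (1h) and short (5m) windows exceed the threshold.

    Each window is an (errors, total) tuple.
    """
    return (burn_rate(*long_window, slo=slo) > threshold
            and burn_rate(*short_window, slo=slo) > threshold)

# Ongoing incident: both windows are burning fast
print(should_page(long_window=(150, 10_000), short_window=(20, 1_000)))
# Incident already over: the 1h window is still high, the 5m window has recovered
print(should_page(long_window=(150, 10_000), short_window=(0, 1_000)))
```

The second call is the interesting one: the long window alone would keep the alert firing long after recovery, and the short window is what lets it clear.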

Recording rules for SLI efficiency

Calculating SLIs on every dashboard load is expensive. Recording rules pre-calculate and store results:

groups:
  - name: sli_recording_rules
    interval: 30s
    rules:
      # 5-minute payment success rate
      - record: sli:payment_success_rate:5m
        expr: |
          sum(rate(payment_requests_total{status!~"5.."}[5m])) by (service, environment) /
          sum(rate(payment_requests_total[5m])) by (service, environment)

      # 30-day payment success rate for error budget
      - record: sli:payment_success_rate:30d
        expr: |
          sum(rate(payment_requests_total{status!~"5.."}[30d])) by (service, environment) /
          sum(rate(payment_requests_total[30d])) by (service, environment)

      # 95th percentile latency
      - record: sli:payment_latency_p95:5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(payment_request_duration_seconds_bucket[5m])) by (service, environment, le)
          )

      # Error budget consumption rate
      - record: error_budget:payment:consumption_rate
        expr: |
          (1 - sli:payment_success_rate:5m) / 0.001  # 0.001 = 1 - 0.999 SLO

Aggregating service SLOs into product SLOs

PaymentPro isn't just one service - it's multiple microservices working together. How do we create product-level SLOs?

Critical path aggregation

For user journeys that require multiple services:

# User can complete payment only if ALL services work,
# so the journey is only as healthy as its weakest service
min by (environment) (
  sli:payment_success_rate:5m{service=~"auth|fraud-check|payment-processor|notification"}
)

Weighted aggregation

When services have different importance:

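# Weighted by request volume
# Both sides of the multiplication are grouped by (service, environment)
# so their labels match the recording rule
sum(
  sli:payment_success_rate:5m *
  sum(rate(payment_requests_total[5m])) by (service, environment)
) / sum(rate(payment_requests_total[5m]))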
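
The weighting itself is plain arithmetic. A small sketch in Python, where the service names, success rates, and request volumes are invented for illustration:

```python
def weighted_success_rate(services):
    """services maps name -> (success_rate, requests_per_second)."""
    total_volume = sum(volume for _, volume in services.values())
    if total_volume == 0:
        return None  # no traffic, so there is no meaningful rate
    return sum(rate * volume for rate, volume in services.values()) / total_volume

# Hypothetical traffic mix
services = {
    "auth": (0.9995, 300),
    "payment-processor": (0.999, 500),
    "notification": (0.995, 200),
}
print(f"{weighted_success_rate(services):.5f}")
```

High-volume services dominate the result, which is the point: a flaky low-traffic service moves the product-level number less than a small regression in the busiest one.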

Building a complete SLO hierarchy

# product-slos.yaml
product: PaymentPro
components:
  - service: auth
    weight: 0.3
    criticality: high
    slo: 99.95

  - service: payment-processor  
    weight: 0.5
    criticality: critical
    slo: 99.9

  - service: notification
    weight: 0.2  
    criticality: medium
    slo: 99.5

product_slo:
  availability: 99.9  # Target; validate it against the component SLOs above
  latency_p95: 300ms  # End-to-end latency
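
Before committing to a product-level number, it's worth checking what the component SLOs can actually support. The sketch below uses the weights and SLOs from the YAML above and compares two assumed aggregation models, a weighted average and a strict serial path where every component must succeed. Which model fits depends on your architecture, so treat both as illustrations:

```python
from math import prod

components = {
    "auth": {"weight": 0.3, "slo": 99.95},
    "payment-processor": {"weight": 0.5, "slo": 99.9},
    "notification": {"weight": 0.2, "slo": 99.5},
}

def weighted_slo(parts):
    """Weighted average of component SLOs, in percent."""
    return sum(c["weight"] * c["slo"] for c in parts.values())

def serial_slo(parts):
    """SLO of a journey that needs every component, assuming independent failures."""
    return prod(c["slo"] / 100.0 for c in parts.values()) * 100.0

print(f"weighted: {weighted_slo(components):.3f}%")
print(f"serial:   {serial_slo(components):.3f}%")
```

Under either model the components land below 99.9% (roughly 99.84% weighted, roughly 99.35% serial), so a 99.9% product target would need either tighter component SLOs or redundancy that these simple models don't capture.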

Practical implementation with sloth

Sloth generates Prometheus rules from simple SLO definitions. Here's a complete example:

# sloth-config.yaml
version: "prometheus/v1"
service: "payment-processor"
labels:
  team: "payments"
  product: "paymentpro"

slos:
  - name: "payment-requests-availability"
    objective: 99.9
    description: "PaymentPro API availability"
    sli:
      events:
        # Sloth substitutes {{.window}} with each window it evaluates
        error_query: |
          sum(rate(payment_requests_total{job="payment-api",code=~"5.."}[{{.window}}]))
        total_query: |
          sum(rate(payment_requests_total{job="payment-api"}[{{.window}}]))
    alerting:
      name: PaymentAvailabilityAlert
      page_alert:
        labels:
          team: payments
          severity: critical
      ticket_alert:
        labels:
          team: payments  
          severity: warning
    # Sloth derives its recording and burn-rate alert windows
    # (5m, 30m, 1h, 2h, 6h, 1d, 3d, 30d) from the SLO period,
    # so they are not listed per SLO.

Generate Prometheus rules:

sloth generate -i sloth-config.yaml -o prometheus-rules.yaml

Moving from reactive to proactive

Here's pseudocode for implementing SLO-based decision making:

def should_deploy_new_feature():
    error_budget = calculate_error_budget_remaining()

    if error_budget < 0.1:  # Less than 10% remaining
        log.warning("Error budget critical - no deployments")
        return False

    if error_budget < 0.3:  # Less than 30% remaining  
        if is_critical_security_fix():
            return True
        log.info("Error budget low - only critical fixes")
        return False

    # Healthy error budget
    risk_score = calculate_deployment_risk()
    budget_cost = estimate_error_budget_cost(risk_score)

    if budget_cost < error_budget * 0.1:  # Use max 10% of remaining
        return True

    return requires_architecture_review()

Implementing SLO-based on-call

Route alerts to on-call by burn-rate severity, so only fast budget burn pages a human. In Alertmanager:

# Alert routing based on burn rate
route:
  group_by: ['alertname', 'team']
  routes:
    # Critical: Will exhaust budget in < 6 hours
    - match:
        severity: critical
        burn_rate: high
      receiver: pagerduty-immediate

    # Warning: Will exhaust budget in < 3 days  
    - match:
        severity: warning
        burn_rate: medium
      receiver: slack-channel

    # Info: Budget consumption above normal
    - match:
        severity: info
        burn_rate: low
      receiver: email-daily-digest

Negotiating SLAs that don't bankrupt you

When sales promises 99.99% availability without asking engineering:

def calculate_sla_penalty(availability, sla_target, monthly_revenue):
    """
    Real-world SLA penalty calculation.
    availability and sla_target are percentages (e.g. 99.9).
    """
    if availability >= sla_target:
        return 0

    breach_severity = sla_target - availability  # percentage points below target

    if breach_severity <= 0.1:  # Minor breach
        return monthly_revenue * 0.1
    elif breach_severity <= 0.5:  # Major breach
        return monthly_revenue * 0.25
    else:  # Severe breach
        return monthly_revenue * 0.5

def realistic_sla_target(internal_slo, budget_multiplier=5.0):
    """
    Set SLA based on SLO with safety margin:
    the external SLA allows budget_multiplier times the internal error budget.
    """
    internal_error_budget = 100.0 - internal_slo
    return 100.0 - internal_error_budget * budget_multiplier

# Example: 99.9% SLO -> 99.5% SLA

Advanced patterns for mature teams

SLO-based capacity planning

# Predict when you'll hit capacity based on growth
predict_linear(
  sum(rate(payment_requests_total[1d])) by (service)[30d:1d],
  86400 * 30  # 30 days forward
) > bool 1000  # Capacity limit

Composite SLIs for complex user journeys

# Complete payment flow SLI
(
  # User can log in
  sli:auth_success_rate:5m *
  # Payment is processed
  sli:payment_success_rate:5m *  
  # User receives confirmation
  sli:notification_delivery_rate:5m *
  # All within acceptable time
  (sli:end_to_end_latency_p95:5m < bool 2.0)
)

Error budget policies

error_budget_policy:
  triggers:
    - budget_remaining: 75%
      actions:
        - normal_deployments
        - feature_experiments

    - budget_remaining: 50%
      actions:
        - deployments_require_approval
        - no_experiments
        - increase_testing

    - budget_remaining: 25%
      actions:
        - only_critical_fixes
        - mandatory_postmortems
        - reliability_sprint_planning

    - budget_remaining: 10%
      actions:
        - deployment_freeze
        - all_hands_reliability
        - executive_escalation
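
A policy like this is easy to automate once you decide how the thresholds apply. The sketch below assumes each tier kicks in when the remaining budget is at or below its threshold, and that anything above 50% counts as normal. That interpretation is mine, so adjust it to match your own policy wording:

```python
# (budget_remaining_pct threshold, actions), strictest tier first
POLICY_TIERS = [
    (10, ["deployment_freeze", "all_hands_reliability", "executive_escalation"]),
    (25, ["only_critical_fixes", "mandatory_postmortems", "reliability_sprint_planning"]),
    (50, ["deployments_require_approval", "no_experiments", "increase_testing"]),
]
NORMAL = ["normal_deployments", "feature_experiments"]

def actions_for(budget_remaining_pct: float):
    """Return the actions of the strictest tier whose threshold has been reached."""
    for threshold, actions in POLICY_TIERS:
        if budget_remaining_pct <= threshold:
            return actions
    return NORMAL

print(actions_for(80))  # healthy budget
print(actions_for(20))  # budget running low
```

A function like this can sit in a CI/CD gate that reads the remaining budget from Prometheus before allowing a deploy.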

Putting it all together: Your SLO journey

Week 1-2: Instrument your code

// Add metrics to your service
var (
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "payment_request_duration_seconds",
            Help: "Payment request duration",
            Buckets: []float64{0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0},
        },
        []string{"method", "status"},
    )
)

func processPayment(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    status := "success"

    defer func() {
        duration := time.Since(start).Seconds()
        requestDuration.WithLabelValues(r.Method, status).Observe(duration)
    }()

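    // Payment logic here; set status to a failure value (e.g. "error")
    // before returning on failure so the deferred observation records it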
}

Week 3-4: Define your first SLOs

Start with availability (it's easiest)
Add latency once you understand percentiles
Don't overthink the targets - you'll adjust them

Week 5-6: Implement error budgets

Create dashboards showing budget consumption
Set up burn rate alerts
Practice saying "no" when budget is low

Week 7-8: Automate policies

Integrate with CI/CD
Create runbooks for budget exhaustion
Train the team on the new process

Month 3: Iterate and improve

Review SLO targets against user feedback
Adjust based on business needs
Consider product-level SLOs

The SLI validation framework

Before committing to an SLI, validate it meets these criteria:

The VALID test for SLIs

def validate_sli_candidate(metric):
    """
    V - Valuable to users
    A - Actionable by the team  
    L - Logically complete
    I - Implementable with current tools
    D - Defensible to stakeholders
    """

    checks = {
        "Valuable": does_metric_impact_user_experience(metric),
        "Actionable": can_team_improve_this_metric(metric),
        "Logically_complete": does_metric_cover_all_failure_modes(metric),
        "Implementable": can_we_measure_this_accurately(metric),
        "Defensible": can_we_explain_why_this_matters(metric)
    }

    score = sum(1 for check in checks.values() if check)

    if score == 5:
        return "Strong SLI candidate"
    elif score >= 3:
        return "Consider with modifications"
    else:
        return "Not suitable as SLI"

Example validation process

Let's validate two potential SLIs:

Candidate 1: CPU Usage < 80%

Valuable? No: users don't care about CPU
Actionable? Yes: can optimize code or scale
Logically complete? No: high CPU doesn't always mean problems
Implementable? Yes: easy to measure
Defensible? No: hard to justify to business
Verdict: Not suitable (2/5)

Candidate 2: Payment success rate > 99.9%

Valuable? Yes: direct user impact
Actionable? Yes: can fix bugs, improve retry logic
Logically complete? Yes: covers all payment failures
Implementable? Yes: clear success/failure states
Defensible? Yes: easy to explain revenue impact
Verdict: Strong SLI candidate (5/5)

The SLI hierarchy of needs

Not all SLIs are created equal. Prioritize them:

Level 1: Core Functionality (Must have)
├── Service reachable (availability)
├── Basic operations work (success rate)
└── Acceptable performance (latency)

Level 2: Data Integrity (Should have)
├── Data consistency
├── Durability guarantees
└── Reconciliation accuracy

Level 3: User Experience (Nice to have)
├── Feature completeness
├── Cross-service workflows
└── Advanced functionality

Implement Level 1 SLIs first, then expand based on user feedback and incidents.

Your next steps

Start the conversation - Schedule interviews with product, development, and support teams
Audit your metrics - List everything you currently measure and categorize as infrastructure vs application
Pick one service - Don't boil the ocean
Implement both perspectives - Start with infrastructure SLIs, quickly add application SLIs
Measure current performance - You need a baseline for both layers
Set conservative SLOs - You can always tighten later
Implement burn rate alerts - Replace those threshold alerts
Track error budgets - Make them visible to everyone
Iterate based on reality - SLOs are living documents

Remember: Perfect is the enemy of good. Your first SLOs will be wrong, and that's okay. The goal isn't perfection; it's continuous improvement based on user outcomes rather than system metrics.

Managing Shared Environments with Grace

Soumyajyoti Mahalanobish — Wed, 14 May 2025 16:55:34 +0000

1. The Problem: A Single Test Environment, Too Many Developers

Every day, developers from multiple squads need access to a shared environment for staging deployments and end‑to‑end tests. Conflicts are routine. People hop into the environment mid‑session or, worse, forget to release it.

Imagine this: Developer A is debugging a flaky integration test. Mid‑way, Developer B deploys a different branch, unaware that the environment is in use. Chaos ensues. Slack threads grow long. Tempers rise. No one's to blame, but everyone's affected.

This scenario calls for:

A way to reserve the environment
Visibility into who's using it
Notifications for the next person in line

2. The Idea: A Self‑Service Queue with Visibility

This prototype envisions a lean and elegant system:

Developers click to join the queue
When their turn comes, they get notified on Slack
They can lock the environment with a single click
A reason is recorded to inform others of usage

No separate access control, no complicated UIs—just a small, robust system that integrates into developer tools and Slack.

3. System Architecture

The architecture comprises:

Twirp‑powered Go backend – lightweight RPCs over HTTP
Redis – queue state and locking with TTL
Slack – direct user notifications
Frontend – embedded in internal dashboards

Redis stores per‑application queues as lists (e.g., queue:env:shared) and lock state as a hash (e.g., lock:env:shared). Queue position, TTLs, and Slack IDs are all stored as structured JSON entries.

4. Protobuf Definitions

Interfaces are defined via buf.gen.sh:

message QueueEntry {
  string user_email = 1;
  string user_name  = 2;
  string reason     = 3;
  string slack_id   = 4;
  int64  timestamp  = 5;
}

message JoinQueueRequest {
  string     app_name = 1;
  QueueEntry entry    = 2;
}

message LeaveQueueRequest {
  string app_name   = 1;
  string user_email = 2;
}

message LockState {
  string user_email = 1;
  string reason     = 2;
  int64  timestamp  = 3;
  int64  expires_at = 4;
}

service Deployments {
  rpc JoinQueue(JoinQueueRequest) returns (JoinQueueResponse);
  rpc LeaveQueue(LeaveQueueRequest) returns (LeaveQueueResponse);
  rpc GetQueueStatus(GetQueueStatusRequest) returns (GetQueueStatusResponse);
}

5. Go Implementation Highlights

The backend implements Twirp RPCs and Redis atomic operations.

Joining the Queue

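val, err := json.Marshal(req.Entry)
if err != nil {
  return nil, err
}
// WATCH the queue key so the duplicate check and the push commit atomically;
// if another client changes the queue in between, the transaction fails
// with redis.TxFailedErr and the caller can retry.
err = s.redis.Watch(ctx, func(tx *redis.Tx) error {
  entries, err := tx.LRange(ctx, queueKey, 0, -1).Result()
  if err != nil {
    return err
  }
  for _, item := range entries {
    var existing deployment.QueueEntry
    if json.Unmarshal([]byte(item), &existing) == nil && existing.UserEmail == req.Entry.UserEmail {
      return errors.New("already in queue")
    }
  }
  // Only push once we know the user is not queued yet
  _, err = tx.TxPipelined(ctx, func(pipe redis.Pipeliner) error {
    pipe.RPush(ctx, queueKey, val)
    return nil
  })
  return err
}, queueKey)
if err != nil {
  return nil, err
}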

Promoting Lock on Leave

pipe := s.redis.TxPipeline()
pipe.LRem(ctx, queueKey, 0, leavingEntry)
next := pipe.LIndex(ctx, queueKey, 0)
// Exec reports redis.Nil when LIndex finds an empty queue; that is not a failure here
if _, err := pipe.Exec(ctx); err != nil && !errors.Is(err, redis.Nil) {
  return nil, err
}
if nextEntryJSON := next.Val(); nextEntryJSON != "" {
  var nextEntry deployment.QueueEntry
  if err := json.Unmarshal([]byte(nextEntryJSON), &nextEntry); err != nil {
    return nil, err
  }
  s.redis.HSet(ctx, lockKey, map[string]interface{}{
    "user_email": nextEntry.UserEmail,
    "reason":     nextEntry.Reason,
    "timestamp":  time.Now().Unix(),
  })
  notifySlack(nextEntry)
}

6. Slack Integration

Slack alerts are essential to ensure seamless transitions.

Notifications are enhanced with Block Kit:

blocks := []BlockElement{
  TextBlock(fmt.Sprintf(
    "You are now first in the test queue! Reason: *%s*", entry.Reason)),
  Button("Take Lock", lockURL),
}
SendSlackMessage(entry.SlackId, blocks)

Retries and exponential back‑off are handled via a wrapper:

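var err error
for i := 0; i < 3; i++ {
  if err = sendToSlackAPI(payload); err == nil {
    break
  }
  if i < 2 {
    time.Sleep(time.Duration(1<<i) * time.Second) // back off 1s, then 2s
  }
}
// err is non-nil here only if all three attempts failed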

7. Frontend Integration

A single button toggles lock intent:

const handleToggle = async () => {
  const res = await fetch(`/api/toggle-lock`, { method: 'POST' });
  if (res.status === 200) refreshQueue();
};

UX states:

Not in queue → Join
First in queue → Take lock
Has lock → Release and notify next
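
These states boil down to a small decision function. The sketch below is in Python for brevity and is not the prototype's actual code. The "Wait" state for users who are queued but not yet first is my own addition, since the list above doesn't name it:

```python
from typing import Optional, Sequence

def next_action(user: str, queue: Sequence[str], lock_holder: Optional[str]) -> str:
    """Return the button label to show for this user."""
    if lock_holder == user:
        return "Release and notify next"
    if user not in queue:
        return "Join"
    if queue[0] == user and lock_holder is None:
        return "Take lock"
    return "Wait"  # assumed state: queued, but someone is ahead or holds the lock

print(next_action("alice@example.com", [], None))
print(next_action("alice@example.com", ["alice@example.com", "bob@example.com"], None))
```

Keeping this logic in one pure function makes the state transitions easy to unit test without Redis or Slack in the loop.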

Real‑time updates are supported via Redis pub/sub.

8. Testing: Critical for Internal Developer Platforms

Thorough testing is essential for internal developer platforms for several reasons:

Trust and Adoption: Developers will only use tools they can rely on. A flaky queue management system will be abandoned quickly.
Failure Amplification: A bug in a developer platform affects the productivity of entire engineering teams, not just individual users.
Complex State Transitions: Queue management involves multiple state transitions that must be tested rigorously to prevent deadlocks or lost queue positions.
Distributed Consistency: Ensuring consistent state between Redis, the application, and Slack notifications requires robust integration testing.
Error Recovery: Systems must be resilient to network issues, Redis failures, or Slack API outages.

Unit + integration tests:

func TestLockTransition(t *testing.T) {
  // Test Redis LRem, HSet, LIndex chain
  // Validate Slack calls were triggered
}

func TestQueueConcurrency(t *testing.T) {
  // Validate queue integrity with concurrent join/leave operations
}

func TestSlackNotificationRetries(t *testing.T) {
  // Ensure backoff strategy properly handles temporary Slack outages
}

Conclusion

This setup demonstrates how simple components like Redis, Slack, and Go can come together to orchestrate shared environment access with minimal friction. The queue model introduces fairness and predictability to what was once a chaotic workflow. If your developers are wrestling for staging access, it might be time to queue up—with grace.

The 3 AM Alert That Wasn't Actually a Problem

Soumyajyoti Mahalanobish — Thu, 08 May 2025 16:03:55 +0000

By Soumyajyoti, Senior Software Engineer @ ProtectAI

It's 3 AM. Your phone buzzes aggressively with an alert notification:

DatasourceNoData | 𝗙𝗜𝗥𝗜𝗡𝗚
→ Memory usage of [no value] in [no value] namespace is at %!f(string=)% (>80%)

Half-asleep, you grab your laptop, VPN into your infrastructure, and... everything's fine. No pods are crashing, no services are down, and CPU/memory usage across the cluster is normal. You've just been woken up by a false alert caused by a temporary blip in your monitoring system.

If this sounds familiar, you're not alone. Many Kubernetes operators and SREs face this exact challenge: how do you create robust monitoring alerts that catch real issues but don't wake you up for temporary glitches?

In this post, I'll share practical techniques our team developed for building reliable Grafana alerts for Kubernetes environments that strike the perfect balance between sensitivity and resilience.

The Hidden Costs of Alert Fatigue

Before diving into solutions, let's acknowledge why this problem matters.

False alerts aren't just annoying—they're expensive:

Engineer time and focus: Each unnecessary interruption costs ~23 minutes of recovery time to get back to productive work
Diminished alert credibility: Teams start ignoring alerts after too many false positives
SRE burnout: Nobody wants to be the on-call engineer for a system that cries wolf

At our organization, we discovered that over 60% of our after-hours alerts were false positives caused by monitoring system hiccups rather than actual infrastructure issues. That's why we invested time in refining our alerting strategy.

The Four Pillars of Reliable Kubernetes Alerts

Through trial and error, we've identified four key strategies that have dramatically reduced our false-positive rate while maintaining our ability to catch real issues.

1. Targeting the Right Workloads: Filtering for What Matters

When monitoring a Kubernetes cluster with dozens of namespaces and hundreds of deployments, alerting on everything is a recipe for noise. Instead, focus on what truly matters.

Here's how we built targeted CPU utilization alerts for critical deployments:

# CPU usage as percentage of limits for critical containers
sum by(namespace, container) (
  rate(container_cpu_usage_seconds_total{container=~"app-api|myworker"}[5m])
) / (
  sum by(namespace, container) (
    kube_pod_container_resource_limits{resource="cpu"}
  )
) * 100
or vector(0)  # Prevent "no data" scenarios

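This query targets specific containers that your application depends on. The or vector(0) ensures the query always returns data, even when no matches are found. Because the main expression carries namespace and container labels, the unlabeled 0 series is always appended to the result; it sits below any positive threshold, so it never fires on its own.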

2. Beyond Simple Thresholds: Intelligent Alert Conditions

Alerting when any pod hits 80% CPU for a split second will bury you in notifications. Instead, use these techniques for more intelligent conditions:

Add time requirements: Require conditions to persist for a meaningful period
Consider rate of change: Alert on rapid increases, not just absolute values
Use contextual thresholds: Different workloads have different "normal" ranges

In our Grafana configuration, we set a 2-minute "Pending period" before firing alerts. This means a threshold must be exceeded continuously for 2 minutes before an alert fires, eliminating most transient spikes.
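
To make the pending-period behaviour concrete, here is a simplified Python model of it. This is a sketch of the idea, not Grafana's implementation, and the threshold and sample values are invented:

```python
def alert_state(samples, threshold=80.0, pending_seconds=120):
    """Return 'Normal', 'Pending' or 'Alerting' as of the last sample.

    samples is a time-ordered list of (timestamp_seconds, value) tuples.
    """
    breach_started = None
    state = "Normal"
    for ts, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = ts  # first sample of this breach
            state = "Alerting" if ts - breach_started >= pending_seconds else "Pending"
        else:
            breach_started = None  # any dip below the threshold resets the clock
            state = "Normal"
    return state

print(alert_state([(0, 50), (30, 90), (60, 50)]))    # brief spike
print(alert_state([(0, 90), (60, 92), (120, 95)]))   # sustained breach
```

The brief spike never leaves "Pending" before it resolves, while the sustained breach reaches "Alerting" once it has lasted the full two minutes.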

3. The `or vector(0)` Pattern: Ensuring Queries Always Return Data

One of the most frustrating Grafana alert issues occurs when your query returns no data, resulting in [no value] placeholders in alert notifications. This commonly happens when:

Metrics are temporarily unavailable
The data source connection experiences a brief hiccup
A label selector matches nothing (e.g., a pod is no longer running)

The solution? Add or vector(0) to your PromQL queries:

# Without vector(0) - can lead to "no data" alerts:
sum(kube_pod_status_phase{namespace="production", phase="Pending"})

# With vector(0) - always returns data:
sum(kube_pod_status_phase{namespace="production", phase="Pending"}) or vector(0)

This pattern ensures your query always returns values (with zeroes when no match), preventing those cryptic [no value] notifications.

4. The Secret Weapon: Proper No-Data and Error Handling

Even with perfect queries, temporary issues with Prometheus or Grafana can cause alert evaluation to fail. That's where Grafana's built-in error handling comes in.

In your alert configuration, find the "Configure no data and error handling" section and set:

"Alert state if no data or all values are null" to "Normal"
"Alert state if execution error or timeout" to "Normal"

This configuration tells Grafana to treat temporary data issues as normal conditions rather than alerting triggers. It's like telling your alert system, "If you're not sure, don't wake me up."
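
The effect of those two settings can be sketched as a tiny decision function. This is a simplified Python model with hypothetical names, not how Grafana is implemented internally:

```python
def resolve_state(value, error=None, threshold=80.0,
                  no_data_state="Normal", error_state="Normal"):
    """Map one evaluation result to an alert state.

    value is None when the query returned no data; error is set when the
    query failed or timed out.
    """
    if error is not None:
        return error_state
    if value is None:
        return no_data_state
    return "Alerting" if value > threshold else "Normal"

print(resolve_state(None))                          # blip in the data source
print(resolve_state(None, no_data_state="NoData"))  # the 3 AM DatasourceNoData case
print(resolve_state(91.0))                          # a real breach still alerts
```

The trade-off is that a monitoring outage no longer alerts through these rules, so it's worth keeping a separate, deliberate check on the health of the data source itself.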

Conclusion: Sleep Better with Robust Alerting

Remember that alert tuning is an iterative process. Start with your most problematic alerts, implement these patterns, and observe the results before proceeding to others.

By implementing these patterns, you'll create a monitoring system that respects your team's time and attention while still providing the critical safety net you need for production systems.

The next time you're on call, you might actually get some sleep!