Soumyajyoti Mahalanobish

A Young Engineer's Guide to SLIs, SLOs, SLAs

If you're an early-career engineer drowning in observability dashboards, wrestling with Prometheus queries in Grafana's Explore page, and wondering what exactly lies beyond your observability stack, this guide is for you. It's a primer on proactive reliability engineering using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).

We'll use a fictional SaaS platform - PaymentPro, a payment processing service - and walk through implementing solid reliability practices step by step.

But before diving into PromQL queries and YAML configs, let's clarify what these acronyms actually mean:

SLIs (Service Level Indicators) are the metrics that matter to your users. Instead of CPU usage, think "percentage of payments processed successfully." They're quantitative measurements of service behavior as experienced by users.

SLOs (Service Level Objectives) are your internal promises. "We aim to process 99.9% of payments successfully" - that's an SLO. It's what you're shooting for, not what you're contractually obligated to deliver.

SLAs (Service Level Agreements) are the legal promises you make to customers, usually with financial consequences. "We guarantee 99.5% payment success rate or you get service credits" - that's an SLA. Always set your SLAs looser than your SLOs to give yourself breathing room.

Think of it this way:

  • SLI = What you measure
  • SLO = What you promise yourself
  • SLA = What you promise customers (with lawyers involved)

The two faces of SLIs: Infrastructure vs Application

Here's what many engineers miss: SLIs exist at multiple layers. You need both infrastructure SLIs and application functionality SLIs to get the complete picture.

Infrastructure SLIs answer the question: "Is my service reachable and responsive?"

  • API endpoint availability
  • Response time
  • Error rates
  • Throughput

Application Functionality SLIs answer the question: "Is my service doing what users expect?"

  • Business transactions completed successfully
  • Data consistency maintained
  • User workflows functioning correctly
  • Business rules properly enforced

For PaymentPro, here's the difference:

# Infrastructure SLI: "Can users reach the payment API?"
sum(rate(http_requests_total{status!~"5.."}[5m])) / 
sum(rate(http_requests_total[5m]))

# Application SLI: "Can users actually complete payments?"
sum(rate(payments_completed_total{status="success"}[5m])) /
sum(rate(payments_initiated_total[5m]))

The API might be up (infrastructure ✓) but payments could be failing due to a third-party service issue (application ✗).

The SLI discovery process: Just ask questions

Before you write a single PromQL query, you need to understand what actually matters to your users and business. Here's a framework for discovering the right SLIs through conversations with different stakeholders.

Questions for product folks

  1. "What does success look like for a user of this service?"

    • Expected answer: "Users can process payments quickly and reliably"
    • Your follow-up: "Define 'quickly' - is that 200ms? 2 seconds? What's the threshold where users complain?"
  2. "What are the critical user journeys?"

    • Map out step-by-step what users do
    • Identify which steps are mandatory vs optional
    • Understand dependencies between steps
  3. "What complaints do we get from users?"

    • Look for patterns: "payments are slow" → latency SLI
    • "I can't see my payment history" → data availability SLI
    • "payments sometimes disappear" → consistency SLI
  4. "What would make a customer leave us for a competitor?"

    • This reveals your true SLAs
    • Often different from what engineering thinks

Questions for developers

  1. "What keeps you up at night about this service?"
   Developer: "The payment webhook system. If it fails, merchants don't get notifications."
   You: "How often does it fail? How do we know when it fails?"
  1. "What are the critical code paths?"

    • Have them trace through the code
    • Identify external dependencies
    • Understand retry logic and failure modes
  2. "What metrics do you check during an incident?"

    • These are your SLI candidates
    • Ask: "Why this metric and not others?"
  3. "What's the difference between the service being 'up' vs 'working correctly'?"

    • This reveals the gap between infrastructure and application SLIs

Questions for customer support

  1. "What are the top 3 issues customers report?"

    • Quantify: how many tickets per week?
    • Severity: how angry are customers?
  2. "How do you know when something is wrong before customers complain?"

    • They often have informal monitoring
    • These instincts can become SLIs

The metric audit worksheet

Create a spreadsheet during discovery:

| Metric | Source | Type | User Impact | Measurable? | Candidate? |
|---|---|---|---|---|---|
| Payment success rate | Devs | Application | Direct - payment fails | Yes - payment_status | Yes |
| API latency | Ops | Infrastructure | Indirect - UX degraded | Yes - histogram | Yes |
| Database replication lag | Ops | Infrastructure | Indirect - stale data | Yes - seconds_behind | Maybe |
| Background job queue depth | Devs | Application | Delayed - notifications late | Yes - queue_size | Maybe |
| Memory usage | Monitoring | Infrastructure | None until OOM | Yes - percentage | No |

The SLI selection framework

Not every metric should become an SLI. Here's a decision framework to evaluate each candidate:

Step 1: Does it have user impact?

    • Yes - users directly experience this when it fails
    • No - it's purely internal with no user visibility
  • If no → Stop here, it's not an SLI

Step 2: Can we measure it reliably?

    • Yes - we have consistent, accurate data
    • No - data is sporadic, estimated, or manually collected
  • If no → Stop here, find a measurable proxy

Step 3: Is it actionable?

    • Yes - when it degrades, we know what to fix
    • No - it's just an interesting number with no clear response
  • If no → Stop here, it won't drive improvements

Step 4: Is it a cause or just a symptom?

    • Cause - it's the root cause of user issues
    • Symptom - it's a symptom of deeper problems
  • If symptom → Consider tracking the underlying cause instead

Step 5: Does it cover critical user journeys?

    • Yes - it represents core functionality users depend on
    • No - it's a nice-to-have feature
  • If yes → Strong SLI candidate!

How this looks in practice:

# Many teams codify this decision tree for consistency
def should_be_sli(metric):
    """Return (verdict, reason), where verdict is "yes", "no", or "maybe"."""
    # Step 1: User impact
    if not metric.has_user_impact:
        return "no", "No direct user impact"

    # Step 2: Measurability
    if not metric.is_measurable:
        return "no", "Cannot measure reliably"

    # Step 3: Actionability
    if not metric.is_actionable:
        return "no", "No clear action when it degrades"

    # Step 4: Is it a symptom or a cause?
    if metric.is_symptom:
        # Symptoms can be indicators, but prefer causes
        return "maybe", "Consider the underlying cause instead"

    # Step 5: Coverage
    if metric.covers_critical_user_journey:
        return "yes", "Covers critical functionality"

    return "maybe", "Evaluate against other candidates"

Real-world example: Payment service SLI discovery

Let's walk through discovering SLIs for our PaymentPro service:

Session with Product Manager:

You: "What does success look like for payment processing?"
PM: "Merchants can accept payments 24/7, funds arrive quickly, and they get real-time notifications."

You: "Let's break that down. What does '24/7' mean exactly?"
PM: "The API should always accept payment attempts. Even if a payment fails due to insufficient funds, the API itself should respond."

You: "How quickly should funds arrive?"
PM: "For card payments, authorization within 2 seconds. Settlement is daily batch, not real-time."

You: "What about notifications?"
PM: "Webhooks should fire within 30 seconds of payment completion. Merchants build their whole flow around these."

Extracted SLI candidates:

  1. API availability (infrastructure)
  2. Payment authorization latency (application)
  3. Webhook delivery time (application)
  4. Webhook delivery success rate (application)

Session with Lead Developer:

You: "Walk me through a payment request."
Dev: "Request hits API gateway → auth service → fraud check → payment processor → webhook queue → notification service."

You: "What can go wrong at each step?"
Dev: "Gateway: rare issues. Auth: tokens expire. Fraud check: sometimes times out. Payment processor: most failures here. Webhook: queue can back up."

You: "How do you know the payment actually processed?"
Dev: "We check the payment_status in the database, but also reconcile with the processor daily."

Additional SLI candidates:

  1. Payment reconciliation accuracy (application)
  2. Fraud check availability (infrastructure)
  3. Queue processing time (application)

Categorizing your SLIs

Once you have candidates, organize them into three main categories:

Infrastructure SLIs - The foundation layer:

  • API Availability: Is the payment API responding to requests?

    • Metric: HTTP request success rate
    • Threshold: Less than 0.1% 5xx errors
  • API Latency: How fast does the API respond?

    • Metric: Response time distribution
    • Threshold: 95th percentile under 200ms

Application SLIs - The business logic layer:

  • Payment Success Rate: Are payments completing successfully?

    • Metric: Ratio of completed vs initiated payments
    • Threshold: 99.9% success rate
  • Webhook Delivery: Are merchants getting notifications?

    • Metric: Time from payment to webhook delivery
    • Threshold: 99th percentile under 30 seconds
  • Payment Reconciliation: Do our records match the payment processor?

    • Metric: Daily reconciliation mismatches
    • Threshold: Zero tolerance for mismatches

Business SLIs - The customer experience layer:

  • Merchant Activation Time: How quickly can new merchants start accepting payments?
    • Metric: Time from signup to first successful payment
    • Threshold: 90% activated within 24 hours

How teams typically document this:

# sli-inventory.yaml - Many teams maintain their SLI catalog in version control
sli_inventory:
  infrastructure:
    - name: api_availability
      description: "Payment API responding to requests"
      metric: "http_requests_total"
      threshold: "status < 500"

    - name: api_latency  
      description: "API response time"
      metric: "http_request_duration_seconds"
      threshold: "p95 < 200ms"

  application:
    - name: payment_success_rate
      description: "Payments completed successfully"
      metric: "payment_status"
      threshold: "status = 'completed'"

    - name: webhook_delivery
      description: "Webhooks delivered within SLA"
      metric: "webhook_delivery_duration_seconds"
      threshold: "p99 < 30s"

    - name: payment_reconciliation
      description: "Payment records match processor"
      metric: "reconciliation_mismatches_total"
      threshold: "mismatches = 0"

  business:
    - name: merchant_activation_time
      description: "Time from signup to first payment"
      metric: "merchant_activation_hours"
      threshold: "p90 < 24h"

This structured format makes it easy to review SLIs with stakeholders and import into monitoring tools.

The metric instrumentation checklist

For each chosen SLI, ensure you can answer:

  • What's the exact definition? (e.g., "successful payment" means status=completed AND reconciled=true - see the sketch after this checklist)
  • Where is it measured? (at the API gateway? In the application? At the database?)
  • What are the edge cases? (retries? partial failures? timeouts?)
  • How do we handle missing data? (should gaps in metrics fail open or closed?)
  • What's the unit and precision? (milliseconds? percentage to 2 decimal places?)
  • Can we simulate failures? (for testing alerts and dashboards)
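
For example, the "exact definition" can be baked into the instrumentation itself rather than left to the query. Here's a minimal sketch (the metric and helper names are hypothetical) using the Python prometheus_client library:

# Encode the SLI definition in the metric: "successful payment" = completed AND reconciled
from prometheus_client import Counter

payments_total = Counter(
    "payments_total",
    "Payment attempts, labelled so the SLI definition is unambiguous",
    ["status", "reconciled"],
)

def record_payment_outcome(status: str, reconciled: bool) -> None:
    # The SLI's "good events" are then simply status="completed", reconciled="true"
    payments_total.labels(status=status, reconciled=str(reconciled).lower()).inc()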

Choosing meaningful SLIs for your SaaS

Now that we've discovered what matters through stakeholder interviews, let's implement both infrastructure and application SLIs for PaymentPro.

The four golden signals (with both perspectives)

Google's SRE book identifies four golden signals. Let's implement each from both infrastructure and application viewpoints:

1. Latency

Infrastructure perspective: How fast does the API respond?

# Infrastructure latency - API response time
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

Application perspective: How fast do payments complete end-to-end?

# Application latency - Full payment processing time
histogram_quantile(0.95,
  sum(rate(payment_processing_duration_seconds_bucket[5m])) by (le)
)

2. Traffic

Infrastructure perspective: How many requests are we receiving?

# Infrastructure traffic - Requests per second
sum(rate(http_requests_total[5m])) by (service)

Application perspective: How many business transactions are happening?

# Application traffic - Payment attempts per minute
sum(rate(payment_attempts_total[1m])) by (payment_type, merchant_tier)

3. Errors

Infrastructure perspective: HTTP errors

# Infrastructure errors - 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / 
sum(rate(http_requests_total[5m]))

Application perspective: Business logic failures

# Application errors - Payment failures (including valid declines)
sum(rate(payment_attempts_total{result!="success"}[5m])) by (failure_reason) /
sum(rate(payment_attempts_total[5m]))

4. Saturation

Infrastructure perspective: Resource utilization

# Infrastructure saturation - Database connection pool
(db_connections_active / db_connections_max) > 0.8

Application perspective: Business capacity limits

# Application saturation - Merchant transaction limits
(sum(rate(payment_volume_dollars[1h])) by (merchant_id) / 
 merchant_transaction_limit) > 0.9

Implementing comprehensive SLIs

Let's create a complete SLI implementation that captures both layers. For each SLI, we need to define:

For ratio-based SLIs (like availability):

  • Good events: Requests that succeeded
  • Total events: All requests attempted

For threshold-based SLIs (like latency):

  • Threshold metric: The specific boundary we're measuring against

Here's how to structure your SLIs:

Infrastructure SLIs:

  1. API Availability

    • Good events: All non-5xx responses
    • Total events: All HTTP requests
  2. API Latency

    • Threshold: 95th percentile < 200ms
    • Measured at: Load balancer or API gateway

Application SLIs:

  1. Payment Success Rate

    • Good events: Payments with status "completed"
    • Total events: All payment attempts
  2. Payment Consistency

    • Good events: Reconciliation checks that match
    • Total events: All reconciliation checks
    • Note: Run every 30 minutes for accuracy
  3. Webhook Delivery

    • Good events: Webhooks delivered within 30s
    • Total events: All webhook attempts

In practice, teams often define these in a structured format for their SLO tools:

# payment-service-slis.yaml
# This format works with tools like Sloth, Pyrra, or OpenSLO
service: payment-processor
slis:
  # Infrastructure SLIs
  - name: api_availability
    category: infrastructure
    description: "API endpoint reachability"
    implementation:
      good_events: |
        sum(rate(http_requests_total{status!~"5.."}[5m]))
      total_events: |
        sum(rate(http_requests_total[5m]))

  - name: api_latency
    category: infrastructure  
    description: "API response time for 95th percentile"
    implementation:
      threshold_metric: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
        ) < bool 0.2  # 200ms threshold

  # Application SLIs
  - name: payment_success_rate
    category: application
    description: "Successful payment completion rate"
    implementation:
      good_events: |
        sum(rate(payments_total{status="completed"}[5m]))
      total_events: |
        sum(rate(payments_total[5m]))

  - name: payment_consistency
    category: application
    description: "Payment data consistency with processor"
    implementation:
      good_events: |
        sum(rate(payment_reconciliation_matches_total[30m]))
      total_events: |
        sum(rate(payment_reconciliation_checks_total[30m]))

  - name: webhook_delivery_sli
    category: application
    description: "Webhooks delivered within 30 seconds"
    implementation:
      good_events: |
        sum(rate(webhook_deliveries_total{delivered_within_sla="true"}[5m]))
      total_events: |
        sum(rate(webhook_deliveries_total[5m]))

This structured approach ensures consistency across teams and makes it easy to generate dashboards and alerts automatically.
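
As a rough sketch of that automation (the file name, rule-naming convention, and 5-minute window are my assumptions, not a standard), a small script can turn the SLI catalog above into Prometheus recording rules:

# sli_rules.py - generate recording rules from the SLI catalog (illustrative only)
import yaml

def slis_to_recording_rules(path="payment-service-slis.yaml"):
    spec = yaml.safe_load(open(path))
    rules = []
    for sli in spec["slis"]:
        impl = sli["implementation"]
        if "good_events" in impl:
            # Ratio-based SLI: good events over total events
            expr = f'({impl["good_events"].strip()}) / ({impl["total_events"].strip()})'
        else:
            # Threshold-based SLI: use the threshold expression directly
            expr = impl["threshold_metric"].strip()
        rules.append({"record": f'sli:{sli["name"]}:5m', "expr": expr})
    return {"groups": [{"name": f'{spec["service"]}_slis', "rules": rules}]}

if __name__ == "__main__":
    print(yaml.safe_dump(slis_to_recording_rules(), sort_keys=False))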

Advanced application-level SLIs

Some SLIs require custom business logic to calculate:

// Instrumenting application-specific SLIs in your code
package payment

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// Assumes httpDuration, validatePayment, processWithProvider, identifyMissingSteps,
// and the Payment/Result types are defined (and metrics registered) elsewhere in this package.

var (
    // Business transaction SLI
    paymentCompleteness = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "payment_business_completeness_total",
            Help: "Tracks if payment completed all business requirements",
        },
        []string{"complete", "missing_step"},
    )

    // Data quality SLI
    paymentDataQuality = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "payment_data_quality_score",
            Help: "Score indicating payment data completeness",
        },
        []string{"merchant_id"},
    )
)

func ProcessPayment(payment *Payment) error {
    // Track infrastructure metric
    startTime := time.Now()
    defer func() {
        httpDuration.Observe(time.Since(startTime).Seconds())
    }()

    // Business logic
    if err := validatePayment(payment); err != nil {
        return err
    }

    // Process payment...
    result := processWithProvider(payment)

    // Track application metrics
    if result.Success {
        // Check business completeness
        if result.HasReceipt && result.HasWebhook && result.IsReconciled {
            paymentCompleteness.WithLabelValues("true", "none").Inc()
        } else {
            missing := identifyMissingSteps(result)
            paymentCompleteness.WithLabelValues("false", missing).Inc()
        }

        // Calculate data quality score
        qualityScore := calculateDataQuality(payment, result)
        paymentDataQuality.WithLabelValues(payment.MerchantID).Set(qualityScore)
    }

    return nil
}

// Helper to calculate business-specific quality metrics
func calculateDataQuality(payment *Payment, result *Result) float64 {
    score := 0.0
    maxScore := 5.0

    if payment.HasFullAddress() { score += 1.0 }
    if payment.HasValidEmail() { score += 1.0 }
    if payment.HasPhoneNumber() { score += 1.0 }
    if result.FraudScoreAvailable() { score += 1.0 }
    if result.Has3DSecure() { score += 1.0 }

    return score / maxScore
}

Composite SLIs for complex user journeys

Real user experiences often span multiple services. Here's how to create SLIs that reflect this:

# End-to-end payment flow SLI
# User can: authenticate → create payment → receive confirmation
(
  # Auth service available
  avg_over_time(sli:auth_availability:5m[30s]) *
  # Payment service can process
  avg_over_time(sli:payment_success_rate:5m[30s]) *
  # Notification service delivers
  avg_over_time(sli:notification_delivery:5m[30s])
) > 0.995  # All three must work 99.5% of the time

Choosing between infrastructure and application SLIs

Use this decision matrix:

| Scenario | Infrastructure SLI | Application SLI | Both |
|---|---|---|---|
| API is up but returning empty data | | Better | |
| Payments process but webhooks fail | | Better | |
| Database is slow | | | Best |
| Third-party API degraded | | Better | |
| Memory leak causing OOM | OK | | |
| Business logic bug | | Better | |

Pro tip: Start with infrastructure SLIs (easier to implement) but quickly add application SLIs (more meaningful).

Setting realistic SLOs (and not shooting yourself in the foot)

Here's where many teams mess up: they set SLOs based on current system performance rather than user needs. Just because your system can deliver 99.99% availability doesn't mean you should promise it.

The cost of nines

Let's do some error budget math. Error budget is the amount of unreliability you're allowed:

Error Budget = 100% - SLO Target

For different SLO targets over 30 days:

  • 99% = 7.2 hours of downtime allowed
  • 99.9% = 43.2 minutes of downtime allowed
  • 99.95% = 21.6 minutes of downtime allowed
  • 99.99% = 4.32 minutes of downtime allowed

Each additional nine roughly costs 10x more in engineering effort. That jump from 99.9% to 99.99%? That's the difference between "we can deploy during business hours" and "every deployment requires a war room."
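
If you want to sanity-check those numbers, the arithmetic is just the error budget multiplied by the window:

# Allowed downtime for a given SLO over a 30-day window
def allowed_downtime_minutes(slo_percent, window_days=30):
    error_budget = 1 - slo_percent / 100
    return window_days * 24 * 60 * error_budget

for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.2f} minutes per month")
# 99.0% -> 432.00 (7.2 h), 99.9% -> 43.20, 99.95% -> 21.60, 99.99% -> 4.32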

Practical SLO setting

For PaymentPro, let's be smart about our SLOs:

# payment-service-slos.yaml
service: payment-service
slos:
  - name: payment-availability
    sli: 
      description: "Percentage of successful payment API calls"
    targets:
      - level: production
        objective: 99.9  # 43 minutes of error budget per month
      - level: staging  
        objective: 99.0  # More relaxed for testing

  - name: payment-latency
    sli:
      description: "95th percentile API response time"
    targets:
      - level: production
        objective: 200ms
      - level: staging
        objective: 500ms

Implementing error budgets that actually work

Error budgets transform the traditional dev vs. ops conflict into a shared objective. Here's the revolutionary idea: unreliability is a feature, not a bug. You "spend" your error budget on:

  • Feature deployments
  • Infrastructure migrations
  • Experiments
  • Accepted risk

When the budget is exhausted, you stop feature work and focus on reliability. It's that simple.

Let's implement error budget tracking:

# Error budget remaining (30-day window)
# Assuming 99.9% SLO
(
  (0.001) -  # Total budget (1 - 0.999)
  (1 - (sum(rate(payment_requests_total{status!~"5.."}[30d])) / 
        sum(rate(payment_requests_total[30d]))))
) / 0.001 * 100  # Percentage of budget remaining

Burn rate alerts - Your new best friend

Traditional threshold alerts are like smoke detectors that only go off when the house is already on fire. Burn rate alerts warn you when you're consuming error budget too quickly:

groups:
  - name: payment-slo-alerts
    interval: 30s
    rules:
      # Alert if we're burning budget 10x too fast
      # This would exhaust the monthly budget in 3 days
      - alert: PaymentHighErrorBudgetBurn
        expr: |
          (
            # 1-hour burn rate
            (1 - sum(rate(payment_requests_total{status!~"5.."}[1h])) / 
                 sum(rate(payment_requests_total[1h]))) > (10 * 0.001)
          ) and (
            # 5-minute burn rate (to avoid flapping)
            (1 - sum(rate(payment_requests_total{status!~"5.."}[5m])) / 
                 sum(rate(payment_requests_total[5m]))) > (10 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Payment service burning error budget 10x too fast"
          dashboard: "https://grafana.paymentpro.com/d/slo-payment"

The magic of multi-window burn rates: combining short (5m) and long (1h) windows prevents alert flapping while ensuring quick detection.
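
The "10x" factor translates directly into time-to-exhaustion. A back-of-the-envelope sketch, assuming the same 30-day budget window as above:

# Hours until the error budget is gone at a given burn rate (30-day window)
def hours_to_exhaustion(burn_rate, window_days=30):
    return window_days * 24 / burn_rate

print(hours_to_exhaustion(10))    # 72.0 hours - the "3 days" mentioned in the alert comment
print(hours_to_exhaustion(14.4))  # 50.0 hours - a burn rate often used for immediate paging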

Recording rules for SLI efficiency

Calculating SLIs on every dashboard load is expensive. Recording rules pre-calculate and store results:

groups:
  - name: sli_recording_rules
    interval: 30s
    rules:
      # 5-minute payment success rate
      - record: sli:payment_success_rate:5m
        expr: |
          sum(rate(payment_requests_total{status!~"5.."}[5m])) by (service, environment) /
          sum(rate(payment_requests_total[5m])) by (service, environment)

      # 30-day payment success rate for error budget
      - record: sli:payment_success_rate:30d
        expr: |
          sum(rate(payment_requests_total{status!~"5.."}[30d])) by (service, environment) /
          sum(rate(payment_requests_total[30d])) by (service, environment)

      # 95th percentile latency
      - record: sli:payment_latency_p95:5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(payment_request_duration_seconds_bucket[5m])) by (service, environment, le)
          )

      # Error budget consumption rate
      - record: error_budget:payment:consumption_rate
        expr: |
          (1 - sli:payment_success_rate:5m) / 0.001  # 0.001 = 1 - 0.999 SLO

Aggregating service SLOs into product SLOs

PaymentPro isn't just one service - it's multiple microservices working together. How do we create product-level SLOs?

Critical path aggregation

For user journeys that require multiple services:

# User can complete payment only if ALL services work
min by (environment) (
  sli:payment_success_rate:5m{service=~"auth|fraud-check|payment-processor|notification"}
)

Weighted aggregation

When services have different importance:

# Weighted by request volume
sum(
  sli:payment_success_rate:5m
  * on (service) group_left
  sum by (service) (rate(payment_requests_total[5m]))
)
/
sum(rate(payment_requests_total[5m]))

Building a complete SLO hierarchy

# product-slos.yaml
product: PaymentPro
components:
  - service: auth
    weight: 0.3
    criticality: high
    slo: 99.95

  - service: payment-processor  
    weight: 0.5
    criticality: critical
    slo: 99.9

  - service: notification
    weight: 0.2  
    criticality: medium
    slo: 99.5

product_slo:
  availability: 99.9  # Calculated from components
  latency_p95: 300ms  # End-to-end latency
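How is the product availability "calculated from components"? One simple model (an assumption on my part - it treats every component as a serial dependency) multiplies the component availabilities, and it also shows why services outside the critical path, like notifications, are often excluded from the product SLO:

# Serial composition: the product is only as available as all critical-path components together
components = {"auth": 99.95, "payment-processor": 99.9, "notification": 99.5}

product_availability = 100.0
for slo in components.values():
    product_availability *= slo / 100

print(f"{product_availability:.2f}%")  # ~99.35% - looser than any single component's SLO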

Practical implementation with Sloth

Sloth generates Prometheus rules from simple SLO definitions. Here's a complete example:

# sloth-config.yaml
version: "prometheus/v1"
service: "payment-processor"
labels:
  team: "payments"
  product: "paymentpro"

slos:
  - name: "payment-requests-availability"
    objective: 99.9
    description: "PaymentPro API availability"
    sli:
      events:
        error_query: |
          sum(rate(payment_requests_total{job="payment-api",code=~"5.."}[5m]))
        total_query: |
          sum(rate(payment_requests_total{job="payment-api"}[5m]))
    alerting:
      name: PaymentAvailabilityAlert
      page_alert:
        labels:
          team: payments
          severity: critical
      ticket_alert:
        labels:
          team: payments  
          severity: warning
    windows:
      - duration: 5m
      - duration: 30m
      - duration: 1h
      - duration: 2h
      - duration: 6h
      - duration: 1d
      - duration: 3d
      - duration: 30d

Generate Prometheus rules:

sloth generate -i sloth-config.yaml -o prometheus-rules.yaml

Moving from reactive to proactive

Here's pseudocode for implementing SLO-based decision making:

def should_deploy_new_feature():
    error_budget = calculate_error_budget_remaining()

    if error_budget < 0.1:  # Less than 10% remaining
        log.warning("Error budget critical - no deployments")
        return False

    if error_budget < 0.3:  # Less than 30% remaining  
        if is_critical_security_fix():
            return True
        log.info("Error budget low - only critical fixes")
        return False

    # Healthy error budget
    risk_score = calculate_deployment_risk()
    budget_cost = estimate_error_budget_cost(risk_score)

    if budget_cost < error_budget * 0.1:  # Use max 10% of remaining
        return True

    return requires_architecture_review()

Implementing SLO-based on-call

Route alerts based on how quickly the error budget is burning:

# Alert routing based on burn rate
route:
  group_by: ['alertname', 'team']
  routes:
    # Critical: Will exhaust budget in < 6 hours
    - match:
        severity: critical
        burn_rate: high
      receiver: pagerduty-immediate

    # Warning: Will exhaust budget in < 3 days  
    - match:
        severity: warning
        burn_rate: medium
      receiver: slack-channel

    # Info: Budget consumption above normal
    - match:
        severity: info
        burn_rate: low
      receiver: email-daily-digest

Negotiating SLAs that don't bankrupt you

When sales promises 99.99% availability without asking engineering:

def calculate_sla_penalty(availability, sla_target, monthly_revenue):
    """
    Illustrative SLA penalty calculation (values in percent, e.g. 99.4 vs 99.5)
    """
    if availability >= sla_target:
        return 0

    breach_severity = sla_target - availability

    if breach_severity <= 0.1:  # Minor breach
        return monthly_revenue * 0.1
    elif breach_severity <= 0.5:  # Major breach
        return monthly_revenue * 0.25
    else:  # Severe breach
        return monthly_revenue * 0.5

def realistic_sla_target(internal_slo, budget_multiplier=5):
    """
    Set the SLA looser than the SLO by granting the SLA a larger error budget
    """
    error_budget = 100 - internal_slo  # e.g. 0.1 for a 99.9% SLO
    return 100 - error_budget * budget_multiplier

# Example: realistic_sla_target(99.9) -> 99.5% SLA

Advanced patterns for mature teams

SLO-based capacity planning

# Predict when you'll hit capacity based on growth
predict_linear(
  sum by (service) (rate(payment_requests_total[1d]))[30d:1d],
  86400 * 30  # 30 days forward
) > bool 1000  # Capacity limit

Composite SLIs for complex user journeys

# Complete payment flow SLI
(
  # User can log in
  sli:auth_success_rate:5m *
  # Payment is processed
  sli:payment_success_rate:5m *  
  # User receives confirmation
  sli:notification_delivery_rate:5m *
  # All within acceptable time
  (sli:end_to_end_latency_p95:5m < bool 2.0)
)

Error budget policies

error_budget_policy:
  triggers:
    - budget_remaining: 75%
      actions:
        - normal_deployments
        - feature_experiments

    - budget_remaining: 50%
      actions:
        - deployments_require_approval
        - no_experiments
        - increase_testing

    - budget_remaining: 25%
      actions:
        - only_critical_fixes
        - mandatory_postmortems
        - reliability_sprint_planning

    - budget_remaining: 10%
      actions:
        - deployment_freeze
        - all_hands_reliability
        - executive_escalation
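A small sketch of enforcing this policy in tooling (the trigger values mirror the YAML above; the function name is illustrative):

# Map remaining error budget to the actions that currently apply
POLICY = [
    (75, ["normal_deployments", "feature_experiments"]),
    (50, ["deployments_require_approval", "no_experiments", "increase_testing"]),
    (25, ["only_critical_fixes", "mandatory_postmortems", "reliability_sprint_planning"]),
    (10, ["deployment_freeze", "all_hands_reliability", "executive_escalation"]),
]

def current_actions(budget_remaining_percent):
    actions = POLICY[0][1]  # default when the budget is still above every trigger
    for threshold, tier_actions in POLICY:
        if budget_remaining_percent <= threshold:
            actions = tier_actions
    return actions

print(current_actions(60))  # ['normal_deployments', 'feature_experiments']
print(current_actions(8))   # ['deployment_freeze', 'all_hands_reliability', 'executive_escalation']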

Putting it all together: Your SLO journey

Week 1-2: Instrument your code

// Add metrics to your service
package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var (
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "payment_request_duration_seconds",
            Help:    "Payment request duration",
            Buckets: []float64{0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0},
        },
        []string{"method", "status"},
    )
)

func init() {
    // Register the metric so Prometheus can scrape it
    prometheus.MustRegister(requestDuration)
}

func processPayment(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    status := "success"

    defer func() {
        duration := time.Since(start).Seconds()
        requestDuration.WithLabelValues(r.Method, status).Observe(duration)
    }()

    // Payment logic here; set status = "error" on failure so the label stays accurate
}

Week 3-4: Define your first SLOs

  • Start with availability (it's easiest)
  • Add latency once you understand percentiles
  • Don't overthink the targets - you'll adjust them

Week 5-6: Implement error budgets

  • Create dashboards showing budget consumption
  • Set up burn rate alerts
  • Practice saying "no" when budget is low

Week 7-8: Automate policies

  • Integrate with CI/CD (see the gate sketch after this list)
  • Create runbooks for budget exhaustion
  • Train the team on the new process
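
One way to wire the CI/CD integration is a pipeline gate that checks the error budget before deploying. This is a hedged sketch: the Prometheus URL and the error_budget:payment:remaining recording rule are assumptions you'd adapt to your own setup.

# ci_error_budget_gate.py - fail the pipeline when the error budget is too low
import sys
import requests

PROM_URL = "https://prometheus.paymentpro.internal/api/v1/query"
QUERY = "error_budget:payment:remaining"   # hypothetical recording rule, 0.0-1.0
MIN_BUDGET = 0.10                          # block deploys below 10% remaining

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]

remaining = float(result[0]["value"][1]) if result else 0.0
if remaining < MIN_BUDGET:
    print(f"Error budget at {remaining:.1%} - blocking deployment")
    sys.exit(1)
print(f"Error budget at {remaining:.1%} - deployment allowed")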

Month 3: Iterate and improve

  • Review SLO targets against user feedback
  • Adjust based on business needs
  • Consider product-level SLOs

The SLI validation framework

Before committing to an SLI, validate it meets these criteria:

The VALID test for SLIs

def validate_sli_candidate(metric):
    """
    V - Valuable to users
    A - Actionable by the team  
    L - Logically complete
    I - Implementable with current tools
    D - Defensible to stakeholders
    """

    checks = {
        "Valuable": does_metric_impact_user_experience(metric),
        "Actionable": can_team_improve_this_metric(metric),
        "Logically_complete": does_metric_cover_all_failure_modes(metric),
        "Implementable": can_we_measure_this_accurately(metric),
        "Defensible": can_we_explain_why_this_matters(metric)
    }

    score = sum(1 for check in checks.values() if check)

    if score == 5:
        return "Strong SLI candidate"
    elif score >= 3:
        return "Consider with modifications"
    else:
        return "Not suitable as SLI"

Example validation process

Let's validate two potential SLIs:

Candidate 1: CPU Usage < 80%

  • Valuable? No - users don't care about CPU
  • Actionable? Yes - can optimize code or scale
  • Logically complete? No - high CPU doesn't always mean problems
  • Implementable? Yes - easy to measure
  • Defensible? No - hard to justify to business

Verdict: Not suitable (2/5)

Candidate 2: Payment success rate > 99.9%

  • Valuable? Yes - direct user impact
  • Actionable? Yes - can fix bugs, improve retry logic
  • Logically complete? Yes - covers all payment failures
  • Implementable? Yes - clear success/failure states
  • Defensible? Yes - easy to explain revenue impact

Verdict: Strong SLI candidate (5/5)

The SLI hierarchy of needs

Not all SLIs are created equal. Prioritize them:

Level 1: Core Functionality (Must have)
├── Service reachable (availability)
├── Basic operations work (success rate)
└── Acceptable performance (latency)

Level 2: Data Integrity (Should have)
├── Data consistency
├── Durability guarantees
└── Reconciliation accuracy

Level 3: User Experience (Nice to have)
├── Feature completeness
├── Cross-service workflows
└── Advanced functionality

Implement Level 1 SLIs first, then expand based on user feedback and incidents.

Your next steps

  1. Start the conversation - Schedule interviews with product, development, and support teams
  2. Audit your metrics - List everything you currently measure and categorize as infrastructure vs application
  3. Pick one service - Don't boil the ocean
  4. Implement both perspectives - Start with infrastructure SLIs, quickly add application SLIs
  5. Measure current performance - You need a baseline for both layers
  6. Set conservative SLOs - You can always tighten later
  7. Implement burn rate alerts - Replace those threshold alerts
  8. Track error budgets - Make them visible to everyone
  9. Iterate based on reality - SLOs are living documents

Remember: Perfect is the enemy of good. Your first SLOs will be wrong, and that's okay. The goal isn't perfection; it's continuous improvement based on user outcomes rather than system metrics.
