If you're an early-career engineer drowning in observability dashboards, wrestling with Prometheus queries in Grafana's Explore page, and wondering what exactly lies beyond your observability stack setup, this guide is for you. It's a primer on proactive reliability engineering using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).
We'll use a fictional SaaS platform - PaymentPro, a payment processing service - and work through implementing solid reliability practices step by step.
But before diving into PromQL queries and YAML configs, let's clarify what these acronyms actually mean:
SLIs (Service Level Indicators) are the metrics that matter to your users. Instead of CPU usage, think "percentage of payments processed successfully." They're quantitative measurements of service behavior as experienced by users.
SLOs (Service Level Objectives) are your internal promises. "We aim to process 99.9% of payments successfully" - that's an SLO. It's what you're shooting for, not what you're contractually obligated to deliver.
SLAs (Service Level Agreements) are the legal promises you make to customers, usually with financial consequences. "We guarantee 99.5% payment success rate or you get service credits" - that's an SLA. Always set your SLAs looser than your SLOs to give yourself breathing room.
Think of it this way:
- SLI = What you measure
- SLO = What you promise yourself
- SLA = What you promise customers (with lawyers involved)
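To make the relationship concrete, here's a toy sketch with made-up numbers for PaymentPro (the values are illustrative, not benchmarks):
measured_sli = 99.95    # what monitoring says actually happened this month
internal_slo = 99.9     # the target the team holds itself to
contractual_sla = 99.5  # the number written into customer contracts

# The SLI floats with reality; the SLO and SLA are fixed promises.
print(f"SLO met: {measured_sli >= internal_slo}")    # True -> no internal alarm
print(f"SLA met: {measured_sli >= contractual_sla}") # True -> no service credits
print(f"Headroom between SLO and SLA: {internal_slo - contractual_sla:.1f} points")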
The two faces of SLIs: Infrastructure vs Application
Here's what many engineers miss: SLIs exist at multiple layers. You need both infrastructure SLIs and application functionality SLIs to get the complete picture.
Infrastructure SLIs answer the question: "Is my service reachable and responsive?"
- API endpoint availability
- Response time
- Error rates
- Throughput
Application Functionality SLIs answer the question: "Is my service doing what users expect?"
- Business transactions completed successfully
- Data consistency maintained
- User workflows functioning correctly
- Business rules properly enforced
For PaymentPro, here's the difference:
# Infrastructure SLI: "Can users reach the payment API?"
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
# Application SLI: "Can users actually complete payments?"
sum(rate(payments_completed_total{status="success"}[5m])) /
sum(rate(payments_initiated_total[5m]))
The API might be up (infrastructure ✓) but payments could be failing due to a third-party service issue (application ✗).
The SLI discovery process: Just ask questions
Before you write a single PromQL query, you need to understand what actually matters to your users and business. Here's a framework for discovering the right SLIs through conversations with different stakeholders.
Questions for product folks
- "What does success look like for a user of this service?"
  - Expected answer: "Users can process payments quickly and reliably"
  - Your follow-up: "Define 'quickly' - is that 200ms? 2 seconds? What's the threshold where users complain?"
- "What are the critical user journeys?"
  - Map out step-by-step what users do
  - Identify which steps are mandatory vs optional
  - Understand dependencies between steps
- "What complaints do we get from users?"
  - Look for patterns: "payments are slow" → latency SLI
  - "I can't see my payment history" → data availability SLI
  - "payments sometimes disappear" → consistency SLI
- "What would make a customer leave us for a competitor?"
  - This reveals your true SLAs
  - Often different from what engineering thinks
Questions for developers
- "What keeps you up at night about this service?"
  - Developer: "The payment webhook system. If it fails, merchants don't get notifications."
  - You: "How often does it fail? How do we know when it fails?"
- "What are the critical code paths?"
  - Have them trace through the code
  - Identify external dependencies
  - Understand retry logic and failure modes
- "What metrics do you check during an incident?"
  - These are your SLI candidates
  - Ask: "Why this metric and not others?"
- "What's the difference between the service being 'up' vs 'working correctly'?"
  - This reveals the gap between infrastructure and application SLIs
Questions for customer support
- "What are the top 3 issues customers report?"
  - Quantify: how many tickets per week?
  - Severity: how angry are customers?
- "How do you know when something is wrong before customers complain?"
  - They often have informal monitoring
  - These instincts can become SLIs
The metric audit worksheet
Create a spreadsheet during discovery:
Metric | Source | Type | User Impact | Measurable? | Candidate? |
---|---|---|---|---|---|
Payment success rate | Devs | Application | Direct - payment fails | Yes - payment_status | ✓ |
API latency | Ops | Infrastructure | Indirect - UX degraded | Yes - histogram | ✓ |
Database replication lag | Ops | Infrastructure | Indirect - stale data | Yes - seconds_behind | Maybe |
Background job queue depth | Devs | Application | Delayed - notifications late | Yes - queue_size | ✓ |
Memory usage | Monitoring | Infrastructure | None until OOM | Yes - percentage | ✗ |
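If a spreadsheet feels too informal, the same audit can live in code alongside your SLO definitions. A minimal sketch - the field names and rows here are my own, not a standard schema:
from dataclasses import dataclass

@dataclass
class MetricAuditRow:
    """One row of the discovery worksheet; fields are illustrative."""
    name: str
    source: str        # who surfaced it: devs, ops, monitoring...
    layer: str         # "infrastructure" or "application"
    user_impact: str   # direct / indirect / none
    measurable: bool
    candidate: bool

audit = [
    MetricAuditRow("Payment success rate", "devs", "application", "direct", True, True),
    MetricAuditRow("Memory usage", "monitoring", "infrastructure", "none", True, False),
]

# Only rows with user impact and a reliable measurement move forward.
shortlist = [row.name for row in audit if row.candidate]
print(shortlist)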
The SLI selection framework
Not every metric should become an SLI. Here's a decision framework to evaluate each candidate:
Step 1: Does it have user impact?
- ✓ Users directly experience this when it fails
- ✗ It's purely internal with no user visibility
- If no → Stop here, it's not an SLI
Step 2: Can we measure it reliably?
- ✓ We have consistent, accurate data
- ✗ Data is sporadic, estimated, or manually collected
- If no → Stop here, find a measurable proxy
Step 3: Is it actionable?
- ✓ When it degrades, we know what to fix
- ✗ It's just an interesting number with no clear response
- If no → Stop here, it won't drive improvements
Step 4: Is it a cause or just a symptom?
- ✓ It's the root cause of user issues
- ✗ It's a symptom of deeper problems
- If symptom → Consider tracking the underlying cause instead
Step 5: Does it cover critical user journeys?
- ✓ It represents core functionality users depend on
- ✗ It's a nice-to-have feature
- If yes → Strong SLI candidate!
How this looks in practice:
# Many teams codify this decision tree for consistency
YES, NO, MAYBE = "yes", "no", "maybe"

def should_be_sli(metric):
    # Step 1: User impact
    if not metric.has_user_impact:
        return NO, "No direct user impact"
    # Step 2: Measurability
    if not metric.is_measurable:
        return NO, "Cannot measure reliably"
    # Step 3: Actionability
    if not metric.is_actionable:
        return NO, "No clear action when it degrades"
    # Step 4: Symptom or cause?
    if metric.is_symptom:
        # Symptoms can be indicators, but prefer causes
        return MAYBE, "Consider the underlying cause instead"
    # Step 5: Coverage
    if metric.covers_critical_user_journey:
        return YES, "Covers critical functionality"
    return MAYBE, "Evaluate against other candidates"
Real-world example: Payment service SLI discovery
Let's walk through discovering SLIs for our PaymentPro service:
Session with Product Manager:
You: "What does success look like for payment processing?"
PM: "Merchants can accept payments 24/7, funds arrive quickly, and they get real-time notifications."
You: "Let's break that down. What does '24/7' mean exactly?"
PM: "The API should always accept payment attempts. Even if a payment fails due to insufficient funds, the API itself should respond."
You: "How quickly should funds arrive?"
PM: "For card payments, authorization within 2 seconds. Settlement is daily batch, not real-time."
You: "What about notifications?"
PM: "Webhooks should fire within 30 seconds of payment completion. Merchants build their whole flow around these."
Extracted SLI candidates:
- API availability (infrastructure)
- Payment authorization latency (application)
- Webhook delivery time (application)
- Webhook delivery success rate (application)
Session with Lead Developer:
You: "Walk me through a payment request."
Dev: "Request hits API gateway → auth service → fraud check → payment processor → webhook queue → notification service."
You: "What can go wrong at each step?"
Dev: "Gateway: rare issues. Auth: tokens expire. Fraud check: sometimes times out. Payment processor: most failures here. Webhook: queue can back up."
You: "How do you know the payment actually processed?"
Dev: "We check the payment_status in the database, but also reconcile with the processor daily."
Additional SLI candidates:
- Payment reconciliation accuracy (application)
- Fraud check availability (infrastructure)
- Queue processing time (application)
Categorizing your SLIs
Once you have candidates, organize them into three main categories:
Infrastructure SLIs - The foundation layer:
- API Availability: Is the payment API responding to requests?
  - Metric: HTTP request success rate
  - Threshold: Less than 0.1% 5xx errors
- API Latency: How fast does the API respond?
  - Metric: Response time distribution
  - Threshold: 95th percentile under 200ms
Application SLIs - The business logic layer:
- Payment Success Rate: Are payments completing successfully?
  - Metric: Ratio of completed vs initiated payments
  - Threshold: 99.9% success rate
- Webhook Delivery: Are merchants getting notifications?
  - Metric: Time from payment to webhook delivery
  - Threshold: 99th percentile under 30 seconds
- Payment Reconciliation: Do our records match the payment processor?
  - Metric: Daily reconciliation mismatches
  - Threshold: Zero tolerance for mismatches
Business SLIs - The customer experience layer:
- Merchant Activation Time: How quickly can new merchants start accepting payments?
  - Metric: Time from signup to first successful payment
  - Threshold: 90% activated within 24 hours
How teams typically document this:
# sli-inventory.yaml - Many teams maintain their SLI catalog in version control
sli_inventory:
  infrastructure:
    - name: api_availability
      description: "Payment API responding to requests"
      metric: "http_requests_total"
      threshold: "status < 500"
    - name: api_latency
      description: "API response time"
      metric: "http_request_duration_seconds"
      threshold: "p95 < 200ms"
  application:
    - name: payment_success_rate
      description: "Payments completed successfully"
      metric: "payment_status"
      threshold: "status = 'completed'"
    - name: webhook_delivery
      description: "Webhooks delivered within SLA"
      metric: "webhook_delivery_duration_seconds"
      threshold: "p99 < 30s"
    - name: payment_reconciliation
      description: "Payment records match processor"
      metric: "reconciliation_mismatches_total"
      threshold: "mismatches = 0"
  business:
    - name: merchant_activation_time
      description: "Time from signup to first payment"
      metric: "merchant_activation_hours"
      threshold: "p90 < 24h"
This structured format makes it easy to review SLIs with stakeholders and import into monitoring tools.
The metric instrumentation checklist
For each chosen SLI, ensure you can answer:
- What's the exact definition? (e.g., "successful payment" means status=completed AND reconciled=true)
- Where is it measured? (at the API gateway? In the application? At the database?)
- What are the edge cases? (retries? partial failures? timeouts?)
- How do we handle missing data? (should gaps in metrics fail open or closed?)
- What's the unit and precision? (milliseconds? percentage to 2 decimal places?)
- Can we simulate failures? (for testing alerts and dashboards)
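One way to force those answers is to write the definition down as something executable. Here's a hedged sketch of what "successful payment" could mean for PaymentPro, covering the retry, missing-data, and latency edge cases - every field name here is hypothetical:
def is_good_payment_event(event: dict) -> bool | None:
    """
    Classify one payment event against the SLI definition.
    Returns True (good), False (bad), or None (exclude from the SLI).
    """
    # Edge case: client retries should not count as separate user attempts
    if event.get("is_retry"):
        return None
    # Edge case: missing data fails closed - we count it against ourselves
    if "status" not in event or "reconciled" not in event:
        return False
    # Exact definition: completed AND reconciled, within the latency budget
    completed = event["status"] == "completed" and event["reconciled"]
    within_budget = event.get("duration_ms", float("inf")) <= 2000
    return completed and within_budget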
Choosing meaningful SLIs for your SaaS
Now that we've discovered what matters through stakeholder interviews, let's implement both infrastructure and application SLIs for PaymentPro.
The four golden signals (with both perspectives)
Google's SRE book identifies four golden signals. Let's implement each from both infrastructure and application viewpoints:
1. Latency
Infrastructure perspective: How fast does the API respond?
# Infrastructure latency - API response time
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
Application perspective: How fast do payments complete end-to-end?
# Application latency - Full payment processing time
histogram_quantile(0.95,
sum(rate(payment_processing_duration_seconds_bucket[5m])) by (le)
)
2. Traffic
Infrastructure perspective: How many requests are we receiving?
# Infrastructure traffic - Requests per second
sum(rate(http_requests_total[5m])) by (service)
Application perspective: How many business transactions are happening?
# Application traffic - Payment attempts per minute
sum(rate(payment_attempts_total[1m])) by (payment_type, merchant_tier)
3. Errors
Infrastructure perspective: HTTP errors
# Infrastructure errors - 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
Application perspective: Business logic failures
# Application errors - Payment failures (including valid declines)
sum(rate(payment_attempts_total{result!="success"}[5m])) by (failure_reason) /
sum(rate(payment_attempts_total[5m]))
4. Saturation
Infrastructure perspective: Resource utilization
# Infrastructure saturation - Database connection pool
(db_connections_active / db_connections_max) > 0.8
Application perspective: Business capacity limits
# Application saturation - Merchant transaction limits
(sum(rate(payment_volume_dollars[1h])) by (merchant_id) /
merchant_transaction_limit) > 0.9
Implementing comprehensive SLIs
Let's create a complete SLI implementation that captures both layers. For each SLI, we need to define:
For ratio-based SLIs (like availability):
- Good events: Requests that succeeded
- Total events: All requests attempted
For threshold-based SLIs (like latency):
- Threshold metric: The specific boundary we're measuring against
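Both shapes boil down to simple arithmetic, whether you compute them in PromQL or offline. A quick sketch with made-up numbers:
# Ratio-based SLI: good events over total events
good_payments, total_payments = 99_870, 100_000
availability_sli = good_payments / total_payments            # 0.9987 -> 99.87%

# Threshold-based SLI: share of observations under the boundary
latencies_ms = [120, 95, 180, 240, 160, 90, 300, 150]        # sample request latencies
threshold_ms = 200
latency_sli = sum(l <= threshold_ms for l in latencies_ms) / len(latencies_ms)

print(f"availability SLI: {availability_sli:.2%}")
print(f"latency SLI (<= {threshold_ms}ms): {latency_sli:.2%}")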
Here's how to structure your SLIs:
Infrastructure SLIs:
- API Availability
  - Good events: All non-5xx responses
  - Total events: All HTTP requests
- API Latency
  - Threshold: 95th percentile < 200ms
  - Measured at: Load balancer or API gateway
Application SLIs:
- Payment Success Rate
  - Good events: Payments with status "completed"
  - Total events: All payment attempts
- Payment Consistency
  - Good events: Reconciliation checks that match
  - Total events: All reconciliation checks
  - Note: Run every 30 minutes for accuracy
- Webhook Delivery
  - Good events: Webhooks delivered within 30s
  - Total events: All webhook attempts
In practice, teams often define these in a structured format for their SLO tools:
# payment-service-slis.yaml
# This format works with tools like Sloth, Pyrra, or OpenSLO
service: payment-processor
slis:
  # Infrastructure SLIs
  - name: api_availability
    category: infrastructure
    description: "API endpoint reachability"
    implementation:
      good_events: |
        sum(rate(http_requests_total{status!~"5.."}[5m]))
      total_events: |
        sum(rate(http_requests_total[5m]))
  - name: api_latency
    category: infrastructure
    description: "API response time for 95th percentile"
    implementation:
      threshold_metric: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
        ) < bool 0.2 # 200ms threshold
  # Application SLIs
  - name: payment_success_rate
    category: application
    description: "Successful payment completion rate"
    implementation:
      good_events: |
        sum(rate(payments_total{status="completed"}[5m]))
      total_events: |
        sum(rate(payments_total[5m]))
  - name: payment_consistency
    category: application
    description: "Payment data consistency with processor"
    implementation:
      good_events: |
        sum(rate(payment_reconciliation_matches_total[30m]))
      total_events: |
        sum(rate(payment_reconciliation_checks_total[30m]))
  - name: webhook_delivery_sli
    category: application
    description: "Webhooks delivered within 30 seconds"
    implementation:
      good_events: |
        sum(rate(webhook_deliveries_total{delivered_within_sla="true"}[5m]))
      total_events: |
        sum(rate(webhook_deliveries_total[5m]))
This structured approach ensures consistency across teams and makes it easy to generate dashboards and alerts automatically.
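A rough sketch of the "automatically" part: read the catalog and emit Prometheus recording rules for the ratio-style SLIs. This assumes the YAML layout shown above and PyYAML; in practice a tool like Sloth or Pyrra would do this for you:
import yaml  # PyYAML, assumed available

def recording_rules_from_catalog(path: str) -> dict:
    """Turn ratio-style SLI definitions into Prometheus recording rules."""
    with open(path) as f:
        catalog = yaml.safe_load(f)

    rules = []
    for sli in catalog["slis"]:
        impl = sli.get("implementation", {})
        if "good_events" in impl and "total_events" in impl:
            rules.append({
                "record": f"sli:{sli['name']}:ratio_5m",
                "expr": f"({impl['good_events'].strip()}) / ({impl['total_events'].strip()})",
                "labels": {"category": sli["category"]},
            })
    return {"groups": [{"name": "generated_sli_rules", "rules": rules}]}

# print(yaml.safe_dump(recording_rules_from_catalog("payment-service-slis.yaml")))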
Advanced application-level SLIs
Some SLIs require custom business logic to calculate:
// Instrumenting application-specific SLIs in your code
package payment

var (
	// Business transaction SLI
	paymentCompleteness = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "payment_business_completeness_total",
			Help: "Tracks if payment completed all business requirements",
		},
		[]string{"complete", "missing_step"},
	)

	// Data quality SLI
	paymentDataQuality = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "payment_data_quality_score",
			Help: "Score indicating payment data completeness",
		},
		[]string{"merchant_id"},
	)
)

func ProcessPayment(payment *Payment) error {
	// Track infrastructure metric
	startTime := time.Now()
	defer func() {
		httpDuration.Observe(time.Since(startTime).Seconds())
	}()

	// Business logic
	if err := validatePayment(payment); err != nil {
		return err
	}

	// Process payment...
	result := processWithProvider(payment)

	// Track application metrics
	if result.Success {
		// Check business completeness
		if result.HasReceipt && result.HasWebhook && result.IsReconciled {
			paymentCompleteness.WithLabelValues("true", "none").Inc()
		} else {
			missing := identifyMissingSteps(result)
			paymentCompleteness.WithLabelValues("false", missing).Inc()
		}

		// Calculate data quality score
		qualityScore := calculateDataQuality(payment, result)
		paymentDataQuality.WithLabelValues(payment.MerchantID).Set(qualityScore)
	}
	return nil
}

// Helper to calculate business-specific quality metrics
func calculateDataQuality(payment *Payment, result *Result) float64 {
	score := 0.0
	maxScore := 5.0

	if payment.HasFullAddress() { score += 1.0 }
	if payment.HasValidEmail() { score += 1.0 }
	if payment.HasPhoneNumber() { score += 1.0 }
	if result.FraudScoreAvailable() { score += 1.0 }
	if result.Has3DSecure() { score += 1.0 }

	return score / maxScore
}
Composite SLIs for complex user journeys
Real user experiences often span multiple services. Here's how to create SLIs that reflect this:
# End-to-end payment flow SLI
# User can: authenticate → create payment → receive confirmation
(
# Auth service available
avg_over_time(sli:auth_availability:5m[30s]) *
# Payment service can process
avg_over_time(sli:payment_success_rate:5m[30s]) *
# Notification service delivers
avg_over_time(sli:notification_delivery:5m[30s])
) > 0.995 # Combined journey must succeed 99.5% of the time
Choosing between infrastructure and application SLIs
Use this decision matrix:
Scenario | Infrastructure SLI catches it | Application SLI catches it | Using both |
---|---|---|---|
API is up but returning empty data | ✗ | ✓ | Better |
Payments process but webhooks fail | ✗ | ✓ | Better |
Database is slow | ✓ | ✓ | Best |
Third-party API degraded | ✗ | ✓ | Better |
Memory leak causing OOM | ✓ | ✗ | OK |
Business logic bug | ✗ | ✓ | Better |
Pro tip: Start with infrastructure SLIs (easier to implement) but quickly add application SLIs (more meaningful).
Setting realistic SLOs (and not shooting yourself in the foot)
Here's where many teams mess up: they set SLOs based on current system performance rather than user needs. Just because your system can deliver 99.99% availability doesn't mean you should promise it.
The cost of nines
Let's do some error budget math. Error budget is the amount of unreliability you're allowed:
Error Budget = 100% - SLO Target
For different SLO targets over 30 days:
- 99% = 7.2 hours of downtime allowed
- 99.9% = 43.2 minutes of downtime allowed
- 99.95% = 21.6 minutes of downtime allowed
- 99.99% = 4.32 minutes of downtime allowed
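The arithmetic behind those numbers is worth being able to reproduce for any target and window:
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """How much downtime an SLO target leaves you over the window."""
    error_budget_fraction = 1 - slo_percent / 100
    return window_days * 24 * 60 * error_budget_fraction

for target in (99.0, 99.9, 99.95, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per 30 days")
# 99.0% -> 432.00, 99.9% -> 43.20, 99.95% -> 21.60, 99.99% -> 4.32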
Each additional nine roughly costs 10x more in engineering effort. That jump from 99.9% to 99.99%? That's the difference between "we can deploy during business hours" and "every deployment requires a war room."
Practical SLO setting
For PaymentPro, let's be smart about our SLOs:
# payment-service-slos.yaml
service: payment-service
slos:
  - name: payment-availability
    sli:
      description: "Percentage of successful payment API calls"
    targets:
      - level: production
        objective: 99.9 # 43 minutes of error budget per month
      - level: staging
        objective: 99.0 # More relaxed for testing
  - name: payment-latency
    sli:
      description: "95th percentile API response time"
    targets:
      - level: production
        objective: 200ms
      - level: staging
        objective: 500ms
Implementing error budgets that actually work
Error budgets transform the traditional dev vs. ops conflict into a shared objective. Here's the revolutionary idea: unreliability is a feature, not a bug. You "spend" your error budget on:
- Feature deployments
- Infrastructure migrations
- Experiments
- Accepted risk
When the budget is exhausted, you stop feature work and focus on reliability. It's that simple.
Let's implement error budget tracking:
# Error budget remaining (30-day window)
# Assuming 99.9% SLO
(
(0.001) - # Total budget (1 - 0.999)
(1 - (sum(rate(payment_requests_total{status!~"5.."}[30d])) /
sum(rate(payment_requests_total[30d]))))
) / 0.001 * 100 # Percentage of budget remaining
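The same calculation in plain numbers, to sanity-check the query - the 99.95% success rate is just an example:
slo = 0.999                   # 99.9% target
measured_success = 0.9995     # 30-day success rate, illustrative
total_budget = 1 - slo        # 0.001
spent = 1 - measured_success  # 0.0005
remaining_pct = (total_budget - spent) / total_budget * 100
print(f"{remaining_pct:.0f}% of the error budget remains")  # 50%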
Burn rate alerts - Your new best friend
Traditional threshold alerts are like smoke detectors that only go off when the house is already on fire. Burn rate alerts warn you when you're consuming error budget too quickly:
groups:
  - name: payment-slo-alerts
    interval: 30s
    rules:
      # Alert if we're burning budget 10x too fast
      # This would exhaust the monthly budget in 3 days
      - alert: PaymentHighErrorBudgetBurn
        expr: |
          (
            # 1-hour burn rate
            (1 - sum(rate(payment_requests_total{status!~"5.."}[1h])) /
                 sum(rate(payment_requests_total[1h]))) > (10 * 0.001)
          ) and (
            # 5-minute burn rate (to avoid flapping)
            (1 - sum(rate(payment_requests_total{status!~"5.."}[5m])) /
                 sum(rate(payment_requests_total[5m]))) > (10 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Payment service burning error budget 10x too fast"
          dashboard: "https://grafana.paymentpro.com/d/slo-payment"
The magic of multi-window burn rates: combining short (5m) and long (1h) windows prevents alert flapping while ensuring quick detection.
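Under the hood, a burn rate is simply the observed error rate divided by the error rate your SLO allows, and time-to-exhaustion follows directly. A small sketch assuming a 30-day budget window:
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than 'allowed' we are failing."""
    allowed_error_rate = 1 - slo
    return observed_error_rate / allowed_error_rate

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, when does the window's budget run out?"""
    return window_days / rate

r = burn_rate(observed_error_rate=0.01, slo=0.999)  # 1% errors vs 0.1% allowed
print(f"burn rate {r:.0f}x -> budget gone in {days_to_exhaustion(r):.0f} days")
# burn rate 10x -> budget gone in 3 days, matching the alert's intent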
Recording rules for SLI efficiency
Calculating SLIs on every dashboard load is expensive. Recording rules pre-calculate and store results:
groups:
  - name: sli_recording_rules
    interval: 30s
    rules:
      # 5-minute payment success rate
      - record: sli:payment_success_rate:5m
        expr: |
          sum(rate(payment_requests_total{status!~"5.."}[5m])) by (service, environment) /
          sum(rate(payment_requests_total[5m])) by (service, environment)
      # 30-day payment success rate for error budget
      - record: sli:payment_success_rate:30d
        expr: |
          sum(rate(payment_requests_total{status!~"5.."}[30d])) by (service, environment) /
          sum(rate(payment_requests_total[30d])) by (service, environment)
      # 95th percentile latency
      - record: sli:payment_latency_p95:5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(payment_request_duration_seconds_bucket[5m])) by (service, environment, le)
          )
      # Error budget consumption rate
      - record: error_budget:payment:consumption_rate
        expr: |
          (1 - sli:payment_success_rate:5m) / 0.001 # 0.001 = 1 - 0.999 SLO
Aggregating service SLOs into product SLOs
PaymentPro isn't just one service - it's multiple microservices working together. How do we create product-level SLOs?
Critical path aggregation
For user journeys that require multiple services:
# User can complete payment only if ALL services work
min by (environment) (
  sli:payment_success_rate:5m{service=~"auth|fraud-check|payment-processor|notification"}
)
Weighted aggregation
When services have different importance:
# Weighted by request volume
sum(
sli:payment_success_rate:5m *
sum(rate(payment_requests_total[5m])) by (service)
) / sum(sum(rate(payment_requests_total[5m])) by (service))
Building a complete SLO hierarchy
# product-slos.yaml
product: PaymentPro
components:
  - service: auth
    weight: 0.3
    criticality: high
    slo: 99.95
  - service: payment-processor
    weight: 0.5
    criticality: critical
    slo: 99.9
  - service: notification
    weight: 0.2
    criticality: medium
    slo: 99.5
product_slo:
  availability: 99.9 # Calculated from components
  latency_p95: 300ms # End-to-end latency
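How that "Calculated from components" number comes about deserves a caveat: there is no single formula. Two common approximations are a serial product (every component sits on the critical path) and a weighted average, and the sketch below runs both on the component numbers above. They disagree, which is exactly why the roll-up method should be an explicit team decision rather than an afterthought:
components = {          # slo, weight - taken from the YAML above
    "auth":              (0.9995, 0.3),
    "payment-processor": (0.999,  0.5),
    "notification":      (0.995,  0.2),
}

# Serial approximation: the user journey needs every component to work
serial = 1.0
for slo, _ in components.values():
    serial *= slo

# Weighted average: components matter in proportion to their weight
weighted = sum(slo * w for slo, w in components.values())

print(f"serial product:   {serial:.4%}")   # ~99.35% - pessimistic
print(f"weighted average: {weighted:.4%}") # ~99.84% - optimistic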
Practical implementation with sloth
Sloth generates Prometheus rules from simple SLO definitions. Here's a complete example:
# sloth-config.yaml
version: "prometheus/v1"
service: "payment-processor"
labels:
  team: "payments"
  product: "paymentpro"
slos:
  - name: "payment-requests-availability"
    objective: 99.9
    description: "PaymentPro API availability"
    sli:
      events:
        error_query: |
          sum(rate(payment_requests_total{job="payment-api",code=~"5.."}[5m]))
        total_query: |
          sum(rate(payment_requests_total{job="payment-api"}[5m]))
    alerting:
      name: PaymentAvailabilityAlert
      page_alert:
        labels:
          team: payments
          severity: critical
      ticket_alert:
        labels:
          team: payments
          severity: warning
    windows:
      - duration: 5m
      - duration: 30m
      - duration: 1h
      - duration: 2h
      - duration: 6h
      - duration: 1d
      - duration: 3d
      - duration: 30d
Generate Prometheus rules:
sloth generate -i sloth-config.yaml -o prometheus-rules.yaml
Moving from reactive to proactive
Here's pseudocode for implementing SLO-based decision making:
def should_deploy_new_feature():
    error_budget = calculate_error_budget_remaining()

    if error_budget < 0.1:  # Less than 10% remaining
        log.warning("Error budget critical - no deployments")
        return False

    if error_budget < 0.3:  # Less than 30% remaining
        if is_critical_security_fix():
            return True
        log.info("Error budget low - only critical fixes")
        return False

    # Healthy error budget
    risk_score = calculate_deployment_risk()
    budget_cost = estimate_error_budget_cost(risk_score)

    if budget_cost < error_budget * 0.1:  # Use max 10% of remaining
        return True
    return requires_architecture_review()
Implementing SLO-based on-call
Route pages and tickets based on how quickly the error budget is burning:
# Alert routing based on burn rate
route:
  group_by: ['alertname', 'team']
  routes:
    # Critical: Will exhaust budget in < 6 hours
    - match:
        severity: critical
        burn_rate: high
      receiver: pagerduty-immediate
    # Warning: Will exhaust budget in < 3 days
    - match:
        severity: warning
        burn_rate: medium
      receiver: slack-channel
    # Info: Budget consumption above normal
    - match:
        severity: info
        burn_rate: low
      receiver: email-daily-digest
Negotiating SLAs that don't bankrupt you
When sales promises 99.99% availability without asking engineering, it helps to have the math ready:
def calculate_sla_penalty(availability, sla_target, monthly_revenue):
    """
    Real-world SLA penalty calculation (availability and target in percent)
    """
    if availability >= sla_target:
        return 0

    breach_severity = sla_target - availability  # in percentage points
    if breach_severity <= 0.1:    # Minor breach
        return monthly_revenue * 0.1
    elif breach_severity <= 0.5:  # Major breach
        return monthly_revenue * 0.25
    else:                         # Severe breach
        return monthly_revenue * 0.5

def realistic_sla_target(internal_slo, budget_multiplier=5):
    """
    Set the SLA looser than the SLO by granting yourself a multiple
    of the internal error budget as safety margin (values as fractions).
    """
    error_budget = 1 - internal_slo
    return 1 - error_budget * budget_multiplier

# Example: realistic_sla_target(0.999) -> 0.995, i.e. 99.9% SLO -> 99.5% SLA
Advanced patterns for mature teams
SLO-based capacity planning
# Predict when you'll hit capacity based on growth
predict_linear(
sum(rate(payment_requests_total[1d])) by (service)[30d:1d],
86400 * 30 # 30 days forward
) > bool 1000 # Capacity limit
Composite SLIs with latency gates
# Complete payment flow SLI
(
# User can log in
sli:auth_success_rate:5m *
# Payment is processed
sli:payment_success_rate:5m *
# User receives confirmation
sli:notification_delivery_rate:5m *
# All within acceptable time
(sli:end_to_end_latency_p95:5m < bool 2.0)
)
Error budget policies
error_budget_policy:
  triggers:
    - budget_remaining: 75%
      actions:
        - normal_deployments
        - feature_experiments
    - budget_remaining: 50%
      actions:
        - deployments_require_approval
        - no_experiments
        - increase_testing
    - budget_remaining: 25%
      actions:
        - only_critical_fixes
        - mandatory_postmortems
        - reliability_sprint_planning
    - budget_remaining: 10%
      actions:
        - deployment_freeze
        - all_hands_reliability
        - executive_escalation
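For the policy to bite, something has to evaluate it - typically a CI/CD gate or a chat bot. A minimal sketch under one reading of those thresholds (each trigger fires once the remaining budget drops to its level; the action names simply mirror the YAML):
POLICY = [  # (trigger level, actions), loosest first
    (0.75, ["normal_deployments", "feature_experiments"]),
    (0.50, ["deployments_require_approval", "no_experiments", "increase_testing"]),
    (0.25, ["only_critical_fixes", "mandatory_postmortems", "reliability_sprint_planning"]),
    (0.10, ["deployment_freeze", "all_hands_reliability", "executive_escalation"]),
]

def actions_for(budget_remaining: float) -> list[str]:
    """Return the actions of the strictest tier that has been triggered."""
    current = POLICY[0][1]                # nothing triggered yet: business as usual
    for trigger, actions in POLICY:
        if budget_remaining <= trigger:   # this tier (and possibly stricter ones) fired
            current = actions
    return current

print(actions_for(0.60))  # only the 75% tier fired -> still normal deployments
print(actions_for(0.08))  # every tier fired -> deployment freeze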
Putting it all together: Your SLO journey
Week 1-2: Instrument your code
// Add metrics to your service
package payment

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "payment_request_duration_seconds",
			Help:    "Payment request duration",
			Buckets: []float64{0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0},
		},
		[]string{"method", "status"},
	)
)

func processPayment(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	status := "success"
	defer func() {
		duration := time.Since(start).Seconds()
		requestDuration.WithLabelValues(r.Method, status).Observe(duration)
	}()

	// Payment logic here - set status = "error" on failure so the label stays accurate
}
Week 3-4: Define your first SLOs
- Start with availability (it's easiest)
- Add latency once you understand percentiles
- Don't overthink the targets - you'll adjust them
Week 5-6: Implement error budgets
- Create dashboards showing budget consumption
- Set up burn rate alerts
- Practice saying "no" when budget is low
Week 7-8: Automate policies
- Integrate with CI/CD
- Create runbooks for budget exhaustion
- Train the team on the new process
Month 3: Iterate and improve
- Review SLO targets against user feedback
- Adjust based on business needs
- Consider product-level SLOs
The SLI validation framework
Before committing to an SLI, validate it meets these criteria:
The VALID test for SLIs
def validate_sli_candidate(metric):
    """
    V - Valuable to users
    A - Actionable by the team
    L - Logically complete
    I - Implementable with current tools
    D - Defensible to stakeholders
    """
    checks = {
        "Valuable": does_metric_impact_user_experience(metric),
        "Actionable": can_team_improve_this_metric(metric),
        "Logically_complete": does_metric_cover_all_failure_modes(metric),
        "Implementable": can_we_measure_this_accurately(metric),
        "Defensible": can_we_explain_why_this_matters(metric),
    }

    score = sum(1 for check in checks.values() if check)
    if score == 5:
        return "Strong SLI candidate"
    elif score >= 3:
        return "Consider with modifications"
    else:
        return "Not suitable as SLI"
Example validation process
Let's validate two potential SLIs:
Candidate 1: CPU usage < 80%
- Valuable? ✗ Users don't care about CPU
- Actionable? ✓ Can optimize code or scale
- Logically complete? ✗ High CPU doesn't always mean problems
- Implementable? ✓ Easy to measure
- Defensible? ✗ Hard to justify to the business
Verdict: Not suitable (2/5)
Candidate 2: Payment success rate > 99.9%
- Valuable? ✓ Direct user impact
- Actionable? ✓ Can fix bugs, improve retry logic
- Logically complete? ✓ Covers all payment failures
- Implementable? ✓ Clear success/failure states
- Defensible? ✓ Easy to explain revenue impact
Verdict: Strong SLI candidate (5/5)
The SLI hierarchy of needs
Not all SLIs are created equal. Prioritize them:
Level 1: Core Functionality (Must have)
├── Service reachable (availability)
├── Basic operations work (success rate)
└── Acceptable performance (latency)
Level 2: Data Integrity (Should have)
├── Data consistency
├── Durability guarantees
└── Reconciliation accuracy
Level 3: User Experience (Nice to have)
├── Feature completeness
├── Cross-service workflows
└── Advanced functionality
Implement Level 1 SLIs first, then expand based on user feedback and incidents.
Your next steps
- Start the conversation - Schedule interviews with product, development, and support teams
- Audit your metrics - List everything you currently measure and categorize as infrastructure vs application
- Pick one service - Don't boil the ocean
- Implement both perspectives - Start with infrastructure SLIs, quickly add application SLIs
- Measure current performance - You need a baseline for both layers
- Set conservative SLOs - You can always tighten later
- Implement burn rate alerts - Replace those threshold alerts
- Track error budgets - Make them visible to everyone
- Iterate based on reality - SLOs are living documents
Remember: Perfect is the enemy of good. Your first SLOs will be wrong, and that's okay. The goal isn't perfection; it's continuous improvement based on user outcomes rather than system metrics.