Samson Tanimawo

SLOs That Product Managers Actually Understand

The SLO Translation Problem

You define an SLO: 99.95% availability with p99 latency under 200ms. Engineering loves it. Product managers glaze over.

The problem isn't the SLO. It's how we communicate it.

Speaking Product Language

Translate technical SLOs into business impact:

Technical SLO:                  Product translation:
───────────────                 ──────────────────────
99.95% availability             "22 minutes of downtime per month max"
p99 latency < 200ms             "99 of every 100 requests finish in under 0.2s"
99.9% error-free transactions   "For every 1000 purchases, at most 1 fails"

Suddenly, the product manager can make informed tradeoffs.
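The arithmetic behind those translations is simple enough to script. Here is a minimal sketch, assuming a 30-day month; the function names are my own for illustration, not part of any library:

```python
def allowed_downtime_minutes(target_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability target permits over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - target_percent / 100) * window_minutes


def allowed_failures(target_percent: float, total_events: int) -> float:
    """How many of total_events may fail before a success-rate target is missed."""
    return (1 - target_percent / 100) * total_events


# 99.95% availability over 30 days
print(round(allowed_downtime_minutes(99.95), 1))
# 99.9% error-free transactions, per 1000 purchases
print(round(allowed_failures(99.9, 1000), 2))
```

Running it for 99.95% gives roughly 21.6 minutes, which is where the "22 minutes" in the table comes from.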

The SLO Negotiation Framework

SLOs should be negotiated between engineering and product. Here's my framework:

Step 1: Measure Current Performance

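def current_performance(service, window_days=30):
    # query_prometheus is assumed to be your own helper: it runs an instant
    # query and returns the single resulting value as a float in [0, 1].
    # sum() collapses per-instance series so both sides of the ratio match.
    availability = query_prometheus(f'''
        avg_over_time(
            (1 - sum(rate(http_errors_total{{service="{service}"}}[5m]))
             / sum(rate(http_requests_total{{service="{service}"}}[5m])))
            [{window_days}d:1h]
        )
    ''')
    return {
        'availability': f"{availability * 100:.3f}%",
        # Scaled to a 30-day month regardless of the measurement window
        'monthly_downtime_minutes': round((1 - availability) * 30 * 24 * 60, 1)
    }

# Example output:
# {'availability': '99.847%', 'monthly_downtime_minutes': 66.1}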

Step 2: Present the Cost-Reliability Tradeoff

Reliability Level | Monthly Downtime | Eng Investment  | Feature Impact
──────────────────┼──────────────────┼─────────────────┼───────────────
99.5%  (current)  | 3.6 hours       | Baseline        | None
99.9%  (good)     | 43 minutes      | +1 SRE          | -10% velocity
99.95% (great)    | 22 minutes      | +2 SREs         | -20% velocity
99.99% (amazing)  | 4.3 minutes     | +4 SREs         | -40% velocity

This makes the cost explicit. In my experience, most product teams land on 99.9% to 99.95%.

Step 3: Define SLIs That Map to User Journeys

Don't define SLOs per service. Define them per user journey:

slo_definitions:
  - name: "Checkout Success"
    description: "Users can complete a purchase"
    sli: |
      successful_checkouts / total_checkout_attempts
    target: 99.9%
    window: 30 days
    owner: payments-team
    product_owner: "@sarah"  # quoted: a plain YAML value cannot start with @

  - name: "Search Responsiveness"
    description: "Search results appear quickly"
    sli: |
      search_requests{latency < 500ms} / total_search_requests
    target: 99.5%
    window: 30 days
    owner: search-team
    product_owner: "@mike"

  - name: "Login Reliability"
    description: "Users can log into their accounts"
    sli: |
      successful_logins / total_login_attempts
    target: 99.99%  # Higher because login blocks everything
    window: 30 days
    owner: identity-team
    product_owner: "@lisa"
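Each definition above boils down to good events divided by total events, compared against a target. A minimal sketch of that check follows; the `JourneySLO` class, the `evaluate` function, and the event counts are all hypothetical, chosen only to illustrate the idea:

```python
from dataclasses import dataclass


@dataclass
class JourneySLO:
    name: str
    target: float  # fraction of good events, e.g. 0.999 for 99.9%


def evaluate(slo: JourneySLO, good_events: int, total_events: int) -> dict:
    """Compute the SLI for one user journey and compare it to the target."""
    if total_events == 0:
        # No traffic in the window: nothing to measure, so report no SLI.
        return {"journey": slo.name, "sli": None, "met": None}
    sli = good_events / total_events
    return {"journey": slo.name, "sli": sli, "met": sli >= slo.target}


# Illustrative counts, not real data.
checkout = evaluate(JourneySLO("Checkout Success", 0.999), 99_920, 100_000)
search = evaluate(JourneySLO("Search Responsiveness", 0.995), 99_400, 100_000)
print(checkout)
print(search)
```

With these made-up counts, checkout meets its 99.9% target (99.92%) while search misses its 99.5% target (99.4%).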

Step 4: The Monthly SLO Review

We run a 30-minute monthly meeting with engineering leads AND product managers:

Agenda:
1. SLO status dashboard review (5 min)
   - Which SLOs are healthy? (green)
   - Which are at risk? (yellow)
   - Which were breached? (red)

2. Budget impact (10 min)
   - Error budget consumed per SLO
   - Projected budget at current burn rate
   - Feature freeze triggers

3. Tradeoff decisions (15 min)
   - Feature X requires relaxing SLO Y — approve?
   - Incident Z consumed 40% of budget — invest in fix?
   - New service launching — what SLO target?
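The numbers for agenda item 2 can come from a small calculation. The sketch below assumes traffic is spread roughly evenly across the window, and it uses "burn rate of 1.0 or higher" as the feature freeze trigger; that threshold is my assumption for illustration, so pick whatever your teams agree on:

```python
def error_budget_report(target, good, total, days_elapsed, window_days=30):
    """Summarise error budget use for one SLO partway through its window.

    target is a fraction (0.999 for 99.9%). Assumes roughly even traffic.
    """
    if total <= 0 or days_elapsed <= 0:
        raise ValueError("total and days_elapsed must be positive")

    error_rate = (total - good) / total
    # Burn rate 1.0 means the budget lasts exactly the full window.
    burn_rate = error_rate / (1 - target)
    # Fraction of the whole window's budget already spent.
    consumed = burn_rate * days_elapsed / window_days

    if burn_rate > 0:
        days_left = max(0.0, (1 - consumed) * window_days / burn_rate)
    else:
        days_left = None  # no errors, so the budget is not being depleted

    return {
        "budget_consumed": consumed,
        "burn_rate": burn_rate,
        "days_until_exhausted": days_left,
        # Assumed policy: freeze if the budget will not last the window.
        "freeze_recommended": burn_rate >= 1.0,
    }


# Illustrative: 99.9% target, 2,000 failures in 1,000,000 requests, day 10 of 30.
report = error_budget_report(0.999, 998_000, 1_000_000, days_elapsed=10)
print(report)
```

In this made-up case the service is burning budget at about twice the sustainable rate, has used about two thirds of it, and would run out in roughly five days.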

The Dashboard That Changed Everything

We built a single-page SLO dashboard with three views:

  1. Executive view: Traffic lights per user journey. Green/Yellow/Red.
  2. Product view: Error budget remaining + projected depletion date.
  3. Engineering view: Burn rate charts + contributing incidents.

Same data, different lens. Everyone gets what they need.
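To show what "same data, different lens" can look like, here is a rough sketch that derives all three views from one set of inputs. The function names, the 25% "at risk" threshold, and the incident ID are hypothetical, not a description of our actual dashboard code:

```python
from datetime import date, timedelta


def traffic_light(budget_remaining: float, at_risk_below: float = 0.25) -> str:
    """Map remaining error budget (fraction) to a status colour."""
    if budget_remaining <= 0:
        return "red"      # budget exhausted: SLO breached
    if budget_remaining < at_risk_below:
        return "yellow"   # assumed threshold for "at risk"
    return "green"


def build_views(journey, budget_remaining, days_until_exhausted,
                burn_rate, incidents, today):
    """Build executive, product, and engineering views from the same data."""
    depletion = today + timedelta(days=int(days_until_exhausted))
    return {
        "executive": {"journey": journey,
                      "status": traffic_light(budget_remaining)},
        "product": {"journey": journey,
                    "budget_remaining_pct": round(budget_remaining * 100, 1),
                    "projected_depletion": depletion.isoformat()},
        "engineering": {"journey": journey,
                        "burn_rate": burn_rate,
                        "incidents": list(incidents)},
    }


views = build_views("Checkout Success", budget_remaining=0.33,
                    days_until_exhausted=5, burn_rate=2.0,
                    incidents=["INC-1"], today=date(2024, 1, 10))
print(views["executive"])
print(views["product"])
print(views["engineering"])
```

The point of keeping one function is that the three audiences can never disagree about the underlying numbers.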

Key Insight

SLOs are a communication tool first, a technical tool second. If only engineers understand your SLOs, they're not working.

If you want SLOs that automatically track, alert, and report in plain language, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
