- SRE is not about hope-based reliability.
- It is about numbers, thresholds, and consequences.
- At the core of SRE are SLIs, SLOs, and SLAs.
- If you cannot measure it, you cannot make it reliable.
Big Picture (Mental Model)
- SLIs → What we measure
- SLOs → What we aim for
- SLAs → What we promise legally
SLIs feed both SLOs and SLAs, and each SLO is set tighter than the SLA it protects.
Service Level Indicators (SLIs)
Raw measurements of service behavior
SLIs are quantitative metrics that describe how a service behaves from the user's perspective.
Common SLI Types
| SLI Type | What It Measures |
|---|---|
| Availability | Was the service reachable |
| Latency | How fast responses are |
| Error Rate | How many requests failed |
| Throughput | Requests per second |
| Freshness | Data staleness |
Example SLIs (API Service)
Availability SLI = Successful requests / Total requests
Latency SLI = % of requests under 300ms
Error Rate SLI = 5xx responses / Total requests
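The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a production collector: the `Request` record and `compute_slis` function are hypothetical names, and the sketch assumes that any non-5xx response counts as "successful".

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code returned to the client
    duration_ms: float  # time taken to serve the request


def compute_slis(requests, latency_threshold_ms=300.0):
    """Return the three example SLIs as ratios between 0 and 1."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in the measurement window")
    # Assumption: any non-5xx response counts as successful.
    errors = sum(1 for r in requests if r.status >= 500)
    fast = sum(1 for r in requests if r.duration_ms < latency_threshold_ms)
    return {
        "availability": (total - errors) / total,
        "latency": fast / total,
        "error_rate": errors / total,
    }


sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(503, 90.0),
    Request(200, 210.0),
]
slis = compute_slis(sample)
print(slis)  # {'availability': 0.75, 'latency': 0.75, 'error_rate': 0.25}
```

Note that the function only reports ratios. It has no idea whether 0.75 is good or bad, which is exactly the point of an SLI.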
SLIs do not define targets.
They only provide truthful signals.
Service Level Objectives (SLOs)
Target reliability goals
SLOs define how reliable the service must be.
They are engineering targets, not legal contracts.
Example SLOs
Availability SLO: 99.9% monthly uptime
Latency SLO: 95% of requests under 300ms
Error Rate SLO: Less than 0.1% failed requests
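One way to picture the difference between an SLI and an SLO: the SLO is just a threshold applied to a measured ratio. The sketch below encodes the three example targets and checks a set of measurements against them. `SLO_TARGETS` and `evaluate_slos` are hypothetical names, and the sketch assumes availability is measured as a success ratio rather than wall-clock uptime.

```python
# Each target is (direction, threshold), with ratios between 0 and 1.
# "min" means the measured value must be at least the threshold;
# "max" means it must stay strictly below it ("less than 0.1%").
SLO_TARGETS = {
    "availability": ("min", 0.999),
    "latency": ("min", 0.95),
    "error_rate": ("max", 0.001),
}


def evaluate_slos(measured, targets=SLO_TARGETS):
    """Return a dict mapping each SLO name to True (met) or False (missed)."""
    results = {}
    for name, (direction, threshold) in targets.items():
        value = measured[name]
        if direction == "min":
            results[name] = value >= threshold
        else:
            results[name] = value < threshold
    return results


# Hypothetical measurements for one month.
measured = {"availability": 0.9985, "latency": 0.97, "error_rate": 0.0015}
print(evaluate_slos(measured))
# {'availability': False, 'latency': True, 'error_rate': False}
```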
SLOs are based on user expectations, not perfection.
Error Budget (Why SLOs Matter)
99.9% uptime = 43.2 minutes of allowed downtime in a 30-day month
That downtime is your error budget.
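The 43.2-minute figure is plain arithmetic: a 30-day month has 30 × 24 × 60 = 43,200 minutes, and 0.1% of that is 43.2. A small sketch, with `error_budget_minutes` as a hypothetical helper name, makes it easy to try other targets:

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime in minutes for an availability SLO over `days` days."""
    total_minutes = days * 24 * 60
    return (1 - slo) * total_minutes


print(round(error_budget_minutes(0.999), 1))  # 43.2
print(round(error_budget_minutes(0.995), 1))  # 216.0
```

The second line shows how much looser a 99.5% target is: five times the downtime allowance of 99.9%.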
| If error budget exists | If error budget is exhausted |
|---|---|
| Ship features | Freeze releases |
| Experiment | Focus on stability |
This is SRE discipline in action.
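The ship-or-freeze rule can be expressed as a tiny policy function. The sketch below is one possible shape, assuming a request-based budget (allowed failures = (1 − SLO) × total requests). `budget_remaining` and `release_policy` are hypothetical names, and real policies usually add more gradations than these two states.

```python
def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo < 1:
        # A 100% target leaves no budget to spend at all.
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


def release_policy(remaining):
    """Map the remaining budget to the two outcomes in the table above."""
    return "ship features" if remaining > 0 else "freeze releases"


# Hypothetical month: 1,000,000 requests against a 99.9% SLO,
# so roughly 1,000 failures are allowed.
print(release_policy(budget_remaining(0.999, 1_000_000, 400)))   # ship features
print(release_policy(budget_remaining(0.999, 1_000_000, 1200)))  # freeze releases
```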
Service Level Agreements (SLAs)
Legal and business commitments
SLAs are contracts with customers.
They reference SLIs and define:
- Acceptable performance
- Measurement windows
- Penalties or credits
Example SLA Clause
The service will maintain 99.5% monthly availability.
If availability falls below 99.5%, customers receive a 10% service credit.
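To make the clause concrete, here is a rough sketch of how the credit might be computed. `service_credit` and the `monthly_fee` parameter are assumptions for illustration; a real contract would spell out how availability is measured and what the credit applies to.

```python
def service_credit(measured_availability, monthly_fee, sla=0.995, credit_rate=0.10):
    """Credit owed to the customer for one month under the example clause."""
    # The clause only triggers when availability falls below the SLA.
    if measured_availability < sla:
        return monthly_fee * credit_rate
    return 0.0


# Hypothetical customer paying 1000 per month.
print(service_credit(0.993, 1000.0))  # below 99.5%: a 10% credit is owed
print(service_credit(0.997, 1000.0))  # SLA met: no credit
```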
SLAs are intentionally less strict than SLOs.
Why?
Because breaking an SLA costs money and trust.
How SLIs, SLOs, and SLAs Connect
SLI → Measured data
SLA → Contractual minimums using SLIs
SLO → Internal reliability target set above SLA
Visual Flow
SLIs (metrics)
↓
SLAs (legal promises)
↓
SLOs (engineering goals)
Real-World Example (E-commerce App)
SLIs
- Availability: Successful HTTP responses
- Latency: Request duration
- Error Rate: 5xx responses
SLA (Customer-Facing)
99.5% monthly availability
SLO (Engineering Target)
99.9% monthly availability
95% of requests < 250ms
Error rate < 0.1%
Why higher than SLA?
- Buffer for incidents
- Protect customer trust
- Avoid financial penalties
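The gap between the 99.9% SLO and the 99.5% SLA is what gives engineers room to react before customers are owed anything. A minimal sketch of that three-zone view, using a hypothetical `classify` function and the e-commerce numbers above:

```python
def classify(availability, slo=0.999, sla=0.995):
    """Place a measured monthly availability into one of three zones."""
    if availability < sla:
        return "SLA breached"
    if availability < slo:
        return "SLO missed, SLA still met"
    return "healthy"


print(classify(0.9993))  # healthy
print(classify(0.9970))  # SLO missed, SLA still met
print(classify(0.9940))  # SLA breached
```

The middle zone is the buffer: the team has missed its own target and should react, but no contractual penalty has been triggered yet.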
Common Mistakes (Callout)
- Setting SLOs without SLIs
- 100% uptime targets
- SLAs tighter than SLOs
- Measuring system metrics instead of user experience
SRE Golden Rules
- Measure what users feel
- Target less than perfect
- Use error budgets to guide decisions
- Protect engineers from endless firefighting
Final Takeaway
SLIs tell the truth
SLOs define reliability goals
SLAs define consequences
This trio is what turns reliability from wishful thinking into engineering discipline.