Shiva Charan

Posted on Jan 20

🎯 SLI, SLO, SLA Explained 🎯

#beginners #devops #monitoring #softwareengineering

SRE is not about hope-based reliability.
It is about numbers, thresholds, and consequences.
At the core of SRE are SLIs, SLOs, and SLAs.
If you cannot measure it, you cannot make it reliable.

🎯 Big Picture (Mental Model)

🟢 SLIs → What we measure
🟡 SLOs → What we aim for
🔴 SLAs → What we promise legally

SLIs feed both SLOs and SLAs, and the SLA sets the floor that SLOs must stay above

📏 Service Level Indicators (SLIs)

✔ Raw measurements of service behavior

SLIs are quantitative metrics that describe how a service behaves from the user's perspective.

🔍 Common SLI Types

SLI Type	What It Measures
🟢 Availability	Whether the service responded successfully
⚡ Latency	How quickly responses are returned
❌ Error Rate	Share of requests that failed
📦 Throughput	Requests handled per second
🔄 Freshness	How up to date the data is

✅ Example SLIs (API Service)

Availability SLI = Successful requests / Total requests
Latency SLI = Requests served in under 300ms / Total requests
Error Rate SLI = 5xx responses / Total requests
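The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the `Request` record and the `compute_slis` helper are hypothetical names, and the sketch assumes that only 5xx responses count as failures and that "successful" means any non-5xx response.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code returned to the client
    duration_ms: float  # time taken to serve the request


def compute_slis(requests, latency_threshold_ms=300.0):
    """Return the three example SLIs as ratios between 0 and 1."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in the measurement window")

    # Assumption: only 5xx responses count as failures; 4xx are client errors.
    server_errors = sum(1 for r in requests if r.status >= 500)
    fast = sum(1 for r in requests if r.duration_ms < latency_threshold_ms)

    return {
        "availability": (total - server_errors) / total,
        "latency": fast / total,
        "error_rate": server_errors / total,
    }


# Hypothetical sample window with four requests.
sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(503, 90.0),
    Request(404, 30.0),
]
slis = compute_slis(sample)
print(slis)
```

In this sample one of the four requests is a 5xx and one is slower than 300ms, so availability and latency both come out at 0.75 and the error rate at 0.25.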

📌 SLIs do not define targets.
They only provide truthful signals.

🎯 Service Level Objectives (SLOs)

✔ Target reliability goals

SLOs define how reliable the service must be.

They are engineering targets, not legal contracts.

🔢 Example SLOs

Availability SLO: 99.9% monthly uptime
Latency SLO: 95% of requests under 300ms
Error Rate SLO: Less than 0.1% failed requests
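As a rough sketch, the example SLOs above can be written as thresholds and compared against measured SLIs. The names `SLO_TARGETS` and `evaluate_slos` are hypothetical, and the measured values are made up for illustration.

```python
# Targets taken from the example SLOs above, expressed as ratios.
SLO_TARGETS = {
    "availability": 0.999,  # at least 99.9% of requests succeed
    "latency": 0.95,        # at least 95% of requests under 300ms
    "error_rate": 0.001,    # fewer than 0.1% of requests fail
}


def evaluate_slos(slis, targets=SLO_TARGETS):
    """Return a dict mapping each SLO name to True (met) or False (missed)."""
    return {
        "availability": slis["availability"] >= targets["availability"],
        "latency": slis["latency"] >= targets["latency"],
        # Error rate is a ceiling, so it must stay strictly below the target.
        "error_rate": slis["error_rate"] < targets["error_rate"],
    }


# Hypothetical measurements for one month.
measured = {"availability": 0.9995, "latency": 0.93, "error_rate": 0.0005}
results = evaluate_slos(measured)
print(results)
```

Here the availability and error rate objectives are met, while the latency objective is missed because only 93% of requests were fast enough.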

📌 SLOs are based on user expectations, not perfection.

🔥 Error Budget (Why SLOs Matter)

99.9% uptime = 43.2 minutes of allowed downtime in a 30-day month

That downtime is your error budget.
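The arithmetic behind that number is simply the window length multiplied by the allowed failure fraction. A small sketch, assuming a hypothetical helper `error_budget_minutes` that takes the SLO as a decimal string so the calculation stays exact:

```python
from fractions import Fraction


def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window.

    `slo` is a decimal string such as "0.999" so the arithmetic stays exact.
    """
    target = Fraction(slo)
    if not 0 < target <= 1:
        raise ValueError("slo must be greater than 0 and at most 1")
    window_minutes = window_days * 24 * 60
    return float(window_minutes * (1 - target))


print(error_budget_minutes("0.999"))      # 30-day month
print(error_budget_minutes("0.999", 31))  # 31-day month
```

A 30-day month has 43,200 minutes, and 0.1% of that is 43.2 minutes. A 31-day month gives a slightly larger budget of 44.64 minutes, which is why the measurement window should always be stated alongside the target.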

If error budget exists	If error budget is exhausted
🚀 Ship features	🛑 Freeze releases
🧪 Experiment	🔧 Focus on stability
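The table above can be read as a simple decision rule. The sketch below assumes a request-based budget (failed requests rather than downtime minutes) and uses hypothetical helper names; real teams usually apply more nuanced policies than a single ship or freeze switch.

```python
from fractions import Fraction


def budget_remaining(total_requests, failed_requests, slo="0.999"):
    """Fraction of the request-based error budget still unspent.

    The result goes negative once the budget is overspent.
    """
    target = Fraction(slo)
    if not 0 < target < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = total_requests * (1 - target)
    return 1 - Fraction(failed_requests) / allowed_failures


def release_decision(total_requests, failed_requests, slo="0.999"):
    """Map the remaining budget to the two columns of the table above."""
    if budget_remaining(total_requests, failed_requests, slo) > 0:
        return "ship features"
    return "freeze releases"


# Hypothetical month: 1,000,000 requests, so a 99.9% SLO allows 1,000 failures.
print(release_decision(1_000_000, 400))
print(release_decision(1_000_000, 1_200))
```

With 400 failures, 60% of the budget is still available and the rule says to keep shipping. With 1,200 failures the budget is overspent and the rule switches to a release freeze.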

This is SRE discipline in action.

📜 Service Level Agreements (SLAs)

✔ Legal and business commitments

SLAs are contracts with customers.

They reference SLIs and define:

Acceptable performance
Measurement windows
Penalties or credits

🧾 Example SLA Clause

The service will maintain 99.5% monthly availability.
If availability falls below 99.5%, customers receive a 10% service credit.
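To make the clause concrete, here is a minimal sketch of how the credit could be computed from a month's downtime. The function names are hypothetical, the window is assumed to be a 30-day month, and availability is assumed to be time-based; an actual contract would define each of these precisely.

```python
from fractions import Fraction


def monthly_availability(downtime_minutes, window_days=30):
    """Time-based availability for the window, as an exact fraction."""
    window_minutes = window_days * 24 * 60
    return 1 - Fraction(downtime_minutes) / window_minutes


def service_credit_percent(downtime_minutes, sla="0.995",
                           credit_percent=10, window_days=30):
    """Return the credit owed: 10% if availability falls below the SLA, else 0."""
    if monthly_availability(downtime_minutes, window_days) < Fraction(sla):
        return credit_percent
    return 0


print(service_credit_percent(60))   # one hour of downtime
print(service_credit_percent(217))  # just past the 99.5% threshold
```

At 99.5% the allowance is 216 minutes in a 30-day month, so one hour of downtime earns no credit while 217 minutes triggers the 10% credit.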

📌 SLAs are intentionally less strict than SLOs.

Why?
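Because breaking an SLA costs money and trust, so the contractual threshold is set where the team can still meet it even after missing an internal SLO.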

🔗 How SLIs, SLOs, and SLAs Connect

SLI → Measured data
SLA → Contractual minimums using SLIs
SLO → Internal reliability target set above SLA

🧠 Visual Flow

📊 SLIs (metrics)
      ↓
📜 SLAs (legal promises)
      ↓
🎯 SLOs (engineering goals)

🏗️ Real-World Example (E-commerce App)

📊 SLIs

Availability: Successful HTTP responses
Latency: Request duration
Error Rate: 5xx responses

📜 SLA (Customer-Facing)

99.5% monthly availability

🎯 SLO (Engineering Target)

99.9% monthly availability
95% of requests < 250ms
Error rate < 0.1%

Why is the SLO higher than the SLA?

✔ Buffer for incidents
✔ Protect customer trust
✔ Avoid financial penalties
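The size of that buffer can be estimated directly from the two availability targets. The sketch below is illustrative only: `sla_buffer_minutes` is a hypothetical helper, it assumes a 30-day month, and it rejects configurations where the SLO is not stricter than the SLA.

```python
from fractions import Fraction

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def sla_buffer_minutes(slo="0.999", sla="0.995"):
    """Extra downtime per 30-day month between missing the SLO and breaching the SLA."""
    slo_target, sla_target = Fraction(slo), Fraction(sla)
    if slo_target <= sla_target:
        raise ValueError("the SLO should be stricter than the SLA")
    return float(MINUTES_PER_30_DAY_MONTH * (slo_target - sla_target))


print(sla_buffer_minutes())
```

For this example the SLO allows 43.2 minutes of downtime and the SLA allows 216 minutes, so the team has 172.8 minutes of headroom after an SLO miss before customers are owed credits.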

❌ Common Mistakes

🚫 Setting SLOs without SLIs
🚫 100% uptime targets
🚫 SLAs tighter than SLOs
🚫 Measuring system metrics instead of user experience

✅ SRE Golden Rules

Measure what users feel
Target less than perfect
Use error budgets to guide decisions
Protect engineers from endless firefighting

🏁 Final Takeaway

SLIs tell the truth
SLOs define reliability goals
SLAs define consequences

This trio is what turns reliability from wishful thinking into engineering discipline 💪

DEV Community