Aisalkyn Aidarova

THE MOST IMPORTANT CONCEPT: MEASURING RELIABILITY: SLO, SLA, SLI

Site Reliability Engineering is not just monitoring or fixing servers.

It is:

Applying software engineering principles to operations to make systems reliable at scale.

That means:

  • You don’t manually fix things → you automate
  • You don’t guess → you measure
  • You don’t react → you design for failure

Core mindset

A normal engineer asks:

“Is the system working?”

An SRE asks:

“How well is it working, how often does it fail, and how much failure is acceptable?”


Before SRE existed, companies said:

System should be reliable

That means nothing.

SRE changed that to:

Reliability must be measurable

This is where SLI, SLO, SLA come in.


🧠 PART 3 — SLI (SERVICE LEVEL INDICATOR)

What it really is

An SLI is:

A real measurement of user experience

An SLI is not an internal system metric such as CPU usage; it is a metric the user actually feels.


Examples

Instead of:

CPU = 70%

We measure:

Request success rate
Request latency
Error rate

Real example

Imagine your API:

  • 1000 requests
  • 990 succeed
  • 10 fail

Your SLI:

Success rate = 99%
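
The arithmetic above can be sketched in a few lines of Python. The function name `success_rate` is an illustrative assumption, not part of any monitoring library.

```python
def success_rate(total_requests: int, failed_requests: int) -> float:
    """Return the share of successful requests as a percentage."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    succeeded = total_requests - failed_requests
    return 100.0 * succeeded / total_requests


# The example from the text: 1000 requests, 10 of them failed.
sli = success_rate(total_requests=1000, failed_requests=10)
print(f"Success rate SLI: {sli:.1f}%")  # Success rate SLI: 99.0%
```

In a real system the two counts would come from request logs or metrics, not hard-coded numbers.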

Important rule

SLI must reflect USER experience

If users are unhappy while the SLI looks healthy → your SLI is measuring the wrong thing


🧠 PART 4 — SLO (SERVICE LEVEL OBJECTIVE)

What it really is

An SLO is:

A target value for an SLI, measured over a defined time window


Example

You define:

99.9% of requests must succeed

That is your SLO.


Why SLO exists

Because perfection is impossible.

So instead of:

System must never fail ❌

We say:

System can fail within limits ✅

Another example

Latency SLO:

95% of requests < 200ms

Key idea

SLO defines acceptable reliability
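
As a rough sketch, checking an SLO comes down to comparing a measured SLI against the target. The function names and the latency samples below are assumptions made up for illustration.

```python
def percent_under(latencies_ms: list[float], threshold_ms: float) -> float:
    """Percentage of requests that finished faster than threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return 100.0 * fast / len(latencies_ms)


def meets_slo(measured_percent: float, target_percent: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return measured_percent >= target_percent


# Success-rate SLO of 99.9% against a measured SLI of 99.0%.
print(meets_slo(99.0, 99.9))  # False

# Latency SLO: 95% of requests under 200 ms (hypothetical samples).
samples = [120.0] * 96 + [450.0] * 4
print(meets_slo(percent_under(samples, 200.0), 95.0))  # True
```

The same comparison works for any SLI expressed as a percentage of good events.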

🧠 PART 5 — SLA (SERVICE LEVEL AGREEMENT)

What it really is

An SLA is:

A business contract with customers, usually built on the SLO and backed by penalties


Example

If uptime < 99.9% → customer gets refund

Important difference

Concept   Purpose
SLI       measurement
SLO       internal goal
SLA       external contract

🧠 PART 6 — ERROR BUDGET (THIS IS SENIOR LEVEL)

This is the most important concept in SRE.


What it is

If your SLO is:

99.9% uptime

Then:

0.1% failure is allowed

That is your error budget


In terms of time

~43 minutes of downtime per month (0.1% of a 30-day month)
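
The conversion from an SLO percentage to minutes can be sketched as follows. The function name and the 30-day default window are assumptions for illustration.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime the SLO allows within the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


# 99.9% over 30 days: 43,200 minutes in the window, 0.1% of them allowed.
print(f"{error_budget_minutes(99.9):.1f} minutes")  # 43.2 minutes
```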

Why it matters

It creates balance:

Developers → want speed
SREs → want stability

Error budget decides:

If budget remains → deploy
If exhausted → stop releases

Real rule

No error budget = no deployments
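
A deploy gate based on the error budget might look like the sketch below. It counts the budget in failed requests, not minutes; the function names and traffic numbers are hypothetical, and real teams often apply a softer policy than a hard stop.

```python
def remaining_budget(slo_percent: float, total_requests: int,
                     failed_requests: int) -> float:
    """Failures still allowed in this window (negative means overspent)."""
    allowed_failures = total_requests * (100.0 - slo_percent) / 100.0
    return allowed_failures - failed_requests


def can_deploy(slo_percent: float, total_requests: int,
               failed_requests: int) -> bool:
    """Allow releases only while some error budget is left."""
    return remaining_budget(slo_percent, total_requests, failed_requests) > 0


# With a 99.9% SLO, 1,000,000 requests allow about 1,000 failures.
print(can_deploy(99.9, 1_000_000, 400))    # True: budget remains
print(can_deploy(99.9, 1_000_000, 1_200))  # False: budget exhausted
```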

🧠 PART 7 — HOW WE MEASURE AVAILABILITY

Formula

Availability =

(Total time - downtime) / total time

Example

30 days = 720 hours
Downtime = 2 hours

(720 - 2) / 720 = 99.72%
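
The formula translates directly into code. This is a minimal sketch, and `availability_percent` is an illustrative name.

```python
def availability_percent(total_hours: float, downtime_hours: float) -> float:
    """Availability = (total time - downtime) / total time, as a percentage."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100.0 * (total_hours - downtime_hours) / total_hours


# The example from the text: 2 hours of downtime in a 720-hour month.
print(f"{availability_percent(720, 2):.2f}%")  # 99.72%
```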

SRE levels

Level     Meaning
99%       basic
99.9%     production
99.99%    critical
99.999%   extreme
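
To make these levels concrete, the sketch below converts each one into allowed downtime per year. It assumes a 365-day year, and the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Downtime per year that still satisfies the availability level."""
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


for level in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(level)
    print(f"{level}% -> {minutes:.1f} minutes/year ({minutes / 60:.2f} hours)")
```

Each extra nine cuts the allowed downtime by a factor of ten: roughly 3.65 days, 8.8 hours, 53 minutes, and 5 minutes per year.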

🧠 PART 8 — LATENCY (WHY AVERAGE IS WRONG)

Average lies.


Example

99 requests take 100 ms each
1 request takes 10 seconds

The average is about 199 ms, which looks fine, but one user waited 10 seconds.


Solution

Use percentiles:

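  • P50 → median; what a typical user sees
  • P95 → 95% of requests are at least this fast; shows slow users
  • P99 → 99% of requests are at least this fast; shows the worst tail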

Real SLO

95% of requests < 200ms
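
The sketch below compares the mean with percentiles using the nearest-rank method. The 1,000 latency samples are made up for illustration, and `percentile` is a hand-written helper, not a library function.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical traffic: most requests are fast, a few are very slow.
latencies_ms = [100.0] * 940 + [300.0] * 45 + [10_000.0] * 15

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean} ms")                          # mean = 257.5 ms
print(f"P50  = {percentile(latencies_ms, 50)} ms")  # P50  = 100.0 ms
print(f"P95  = {percentile(latencies_ms, 95)} ms")  # P95  = 300.0 ms
print(f"P99  = {percentile(latencies_ms, 99)} ms")  # P99  = 10000.0 ms
```

Here the mean hides both the typical experience (100 ms) and the 10-second tail, while P95 = 300 ms shows right away that the "95% of requests < 200ms" SLO is missed.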

🧠 PART 9 — MONITORING (WHAT SRE ACTUALLY WATCHES)

Golden Signals (Google SRE)

  1. Latency
  2. Traffic
  3. Errors
  4. Saturation

What this means

You monitor:

How fast?
How many?
How broken?
How loaded?
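
As a rough illustration, the four signals can be computed from a window of request records. The record type, the field names, and the use of busy workers as the saturation measure are all assumptions; real systems would pull these values from tools such as Prometheus or CloudWatch.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestRecord:
    latency_ms: float
    ok: bool


def golden_signals(records: list[RequestRecord], window_seconds: float,
                   busy_workers: int, total_workers: int) -> dict[str, float]:
    """Summarize one time window into the four golden signals."""
    if not records:
        raise ValueError("need at least one request record")
    errors = sum(1 for record in records if not record.ok)
    return {
        "latency_ms_p50": median(r.latency_ms for r in records),  # how fast?
        "traffic_rps": len(records) / window_seconds,             # how many?
        "error_percent": 100.0 * errors / len(records),           # how broken?
        "saturation_percent": 100.0 * busy_workers / total_workers,  # how loaded?
    }


# Hypothetical one-minute window: 60 requests, 3 of them failed.
window = [RequestRecord(100.0, True)] * 57 + [RequestRecord(900.0, False)] * 3
signals = golden_signals(window, window_seconds=60, busy_workers=8, total_workers=10)
print(signals)
```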

Tools

  • Prometheus
  • Grafana
  • CloudWatch
  • ELK

🧠 PART 10 — ALERTING (VERY IMPORTANT)

Bad alert:

CPU > 80%

Good alert:

Error rate > 5% for 5 minutes

Rule

Alert only when users are impacted
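
The "error rate > 5% for 5 minutes" alert can be sketched as a check over per-minute error rates. The function name and the sample data are assumptions; in practice this logic usually lives in the alerting tool's rule configuration.

```python
def should_alert(error_rates_percent: list[float],
                 threshold_percent: float = 5.0,
                 window_minutes: int = 5) -> bool:
    """Fire only if every one of the last window_minutes samples is above the threshold."""
    if len(error_rates_percent) < window_minutes:
        return False  # not enough history yet
    recent = error_rates_percent[-window_minutes:]
    return all(rate > threshold_percent for rate in recent)


# One short spike: no alert, users were barely affected.
print(should_alert([1.0, 9.0, 1.0, 1.0, 1.0, 1.0]))  # False

# Sustained errors for five minutes: alert.
print(should_alert([1.0, 6.0, 7.0, 8.0, 6.0, 9.0]))  # True
```

Requiring the condition to hold for the whole window filters out brief spikes that would otherwise page someone for nothing.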

🧠 PART 11 — INCIDENT MANAGEMENT

Incident = system failure affecting users


SRE process

  1. Detect
  2. Respond
  3. Fix
  4. Learn

Postmortem

Must be:

Blameless

You document

  • timeline
  • root cause
  • impact
  • fix
  • prevention

🧠 PART 12 — RELIABILITY ENGINEERING

You design systems that:

Expect failure

Example

Instead of 1 server:

ALB → multiple EC2 → DB replicas

Goal

No single point of failure
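
On the client side, "expect failure" can be as simple as trying the next backend when one is down. This is a minimal sketch; the backend functions are hypothetical stand-ins for real servers or database replicas.

```python
from typing import Callable, Sequence


def call_with_failover(backends: Sequence[Callable[[], str]]) -> str:
    """Try each backend in order and return the first successful response."""
    last_error = None
    for backend in backends:
        try:
            return backend()
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next backend
    raise RuntimeError("all backends failed") from last_error


def primary() -> str:
    # Hypothetical backend that is currently down.
    raise ConnectionError("primary is unreachable")


def replica() -> str:
    # Hypothetical healthy backend.
    return "ok from replica"


print(call_with_failover([primary, replica]))  # ok from replica
```

A load balancer such as an ALB does the equivalent automatically by routing traffic only to healthy targets.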

🧠 PART 13 — SCALING

Vertical

bigger machine

Horizontal

more machines

SRE prefers

Horizontal scaling

🧠 PART 14 — NETWORKING (WHAT YOU DID)

You must understand:

  • VPC
  • routing
  • NAT gateway vs. internet gateway (IGW)
  • Transit Gateway (TGW)
  • PrivateLink

🧠 PART 15 — AUTOMATION

Rule:

If you repeat it → automate it

Tools

  • Terraform
  • Bash
  • Python

🧠 PART 16 — CI/CD

You must know:

  • pipelines
  • deployments
  • rollback

Strategies

  • rolling
  • blue/green
  • canary
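
For the canary strategy in the list above, the promotion decision can be sketched as an error-rate comparison between the current version and the canary. The function names, the 0.5-point tolerance, and the traffic numbers are assumptions for illustration.

```python
def error_percent(errors: int, total: int) -> float:
    """Error rate as a percentage of all requests."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * errors / total


def promote_canary(baseline_errors: int, baseline_total: int,
                   canary_errors: int, canary_total: int,
                   tolerance_percent: float = 0.5) -> bool:
    """Promote only if the canary's error rate is not noticeably worse."""
    baseline = error_percent(baseline_errors, baseline_total)
    canary = error_percent(canary_errors, canary_total)
    return canary <= baseline + tolerance_percent


# Baseline at 0.5% errors, canary at 0.6%: close enough, promote.
print(promote_canary(50, 10_000, 3, 500))   # True

# Canary at 5.0% errors: roll back.
print(promote_canary(50, 10_000, 25, 500))  # False
```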

🧠 FINAL UNDERSTANDING

SRE is:

Measure → Define → Monitor → Improve → Automate

💬 PERFECT INTERVIEW ANSWER

SRE focuses on maintaining system reliability by defining measurable objectives like SLOs, monitoring system health, managing incidents, and automating infrastructure while balancing system stability with development velocity.
