Aisalkyn Aidarova

Posted on May 7

THE MOST IMPORTANT CONCEPT: MEASURING RELIABILITY: SLO, SLA, SLI

Site Reliability Engineering is not just monitoring or fixing servers.

It is:

Applying software engineering principles to operations to make systems reliable at scale.

That means:

You don’t manually fix things → you automate
You don’t guess → you measure
You don’t react → you design for failure

Core mindset

A normal engineer asks:

“Is the system working?”

An SRE asks:

“How well is it working, how often does it fail, and how much failure is acceptable?”

Before SRE existed, companies said:

System should be reliable

That means nothing.

SRE changed that to:

Reliability must be measurable

This is where SLI, SLO, SLA come in.

🧠 PART 3 — SLI (SERVICE LEVEL INDICATOR)

What it really is

An SLI is:

A real measurement of user experience

Not internal system metrics like CPU usage, but user-facing metrics.

Examples

Instead of:

CPU = 70%

We measure:

Request success rate
Request latency
Error rate

Real example

Imagine your API:

1000 requests
990 succeed
10 fail

Your SLI:

Success rate = 99%
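
In code, that calculation is tiny. Here is a minimal sketch; the function name `success_rate_sli` is made up for this post, and in practice the counts would come from your metrics system rather than hard-coded numbers.

```python
def success_rate_sli(total_requests: int, successful_requests: int) -> float:
    """Return the success-rate SLI as a percentage of all requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * successful_requests / total_requests


# The example above: 1000 requests, 990 succeed, 10 fail.
sli = success_rate_sli(total_requests=1000, successful_requests=990)
print(f"Success rate = {sli:.1f}%")  # prints "Success rate = 99.0%"
```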

Important rule

SLI must reflect USER experience

If users are unhappy while your SLI looks healthy → your SLI is wrong

🧠 PART 4 — SLO (SERVICE LEVEL OBJECTIVE)

What it really is

SLO is:

A target value you set for an SLI, measured over a time window

Example

You define:

99.9% of requests must succeed

That is your SLO.

Why SLO exists

Because perfection is impossible.

So instead of:

System must never fail ❌

We say:

System can fail within limits ✅

Another example

Latency SLO:

95% of requests < 200ms
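
A latency SLO like this can be checked directly against request samples. The sketch below assumes you already have a list of latencies in milliseconds; `meets_latency_slo` and the sample data are hypothetical, made up for illustration.

```python
def meets_latency_slo(latencies_ms, threshold_ms=200.0, target_fraction=0.95):
    """True if at least target_fraction of requests finished under threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms) >= target_fraction


# Hypothetical samples: 96 fast requests and 4 slow ones.
samples = [120.0] * 96 + [450.0] * 4
print(meets_latency_slo(samples))  # True: 96% of requests are under 200 ms
```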

Key idea

SLO defines acceptable reliability

🧠 PART 5 — SLA (SERVICE LEVEL AGREEMENT)

What it really is

SLA is:

A business contract with customers, based on an SLO, with consequences if it is missed

Example

If uptime < 99.9% → customer gets refund

Important difference

Concept	Purpose
SLI	measurement
SLO	internal goal
SLA	external contract

🧠 PART 6 — ERROR BUDGET (THIS IS SENIOR LEVEL)

This is the most important concept in SRE.

What it is

If your SLO is:

99.9% uptime

Then:

0.1% failure is allowed

That is your error budget

In real time

~43 minutes downtime per month
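
Where does ~43 minutes come from? It is 0.1% of a 30-day month. A small sketch, assuming a 30-day window (the function name is hypothetical):

```python
def downtime_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed by an uptime SLO over the window."""
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return window_minutes * (100.0 - slo_percent) / 100.0


print(round(downtime_budget_minutes(99.9), 1))  # 43.2
```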

Why it matters

It creates balance:

Developers → want speed
SREs → want stability

Error budget decides:

If budget remains → deploy
If exhausted → stop releases

Real rule

No error budget = no deployments
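
The deploy rule can be sketched as a simple gate. This is a simplified illustration, not a standard tool: `error_budget_remaining` and `can_deploy` are hypothetical names, and the budget here is counted in failed requests rather than minutes.

```python
def error_budget_remaining(slo_percent: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = total_requests * (100.0 - slo_percent) / 100.0
    if allowed_failures <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


def can_deploy(slo_percent: float, total_requests: int, failed_requests: int) -> bool:
    """Allow releases only while some error budget is left."""
    return error_budget_remaining(slo_percent, total_requests, failed_requests) > 0.0


# With a 99.9% SLO and 1,000,000 requests, about 1,000 failures are allowed.
print(can_deploy(99.9, 1_000_000, 400))   # True: budget remains
print(can_deploy(99.9, 1_000_000, 1200))  # False: budget exhausted
```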

🧠 PART 7 — HOW WE MEASURE AVAILABILITY

Formula

Availability =

(Total time - downtime) / total time

Example

30 days = 720 hours
Downtime = 2 hours

(720 - 2) / 720 = 99.72%
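
The same formula as a short sketch (the function name is made up for illustration):

```python
def availability_percent(total_hours: float, downtime_hours: float) -> float:
    """Availability = (total time - downtime) / total time, as a percentage."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100.0 * (total_hours - downtime_hours) / total_hours


# 30 days = 720 hours, with 2 hours of downtime.
print(f"{availability_percent(720, 2):.2f}%")  # prints "99.72%"
```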

SRE levels

Level	Meaning	Allowed downtime per 30 days
99%	basic	~7.2 hours
99.9%	production	~43 minutes
99.99%	critical	~4.3 minutes
99.999%	extreme	~26 seconds

🧠 PART 8 — LATENCY (WHY AVERAGE IS WRONG)

Average lies.

Example

99 requests = 100ms
1 request = 10 seconds

The average is about 199ms, which looks fine, but 1 in 100 users waited 10 seconds.

Solution

Use percentiles:

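P50 → typical request (half of requests are faster)
P95 → slow users (the slowest 5% are at or above this)
P99 → slowest users (the slowest 1% are at or above this)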
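
To see the gap between the average and the percentiles, here is a minimal sketch using the nearest-rank method. The `percentile` helper and the latency data are hypothetical; monitoring tools usually estimate percentiles from histograms, so their numbers can differ slightly.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 930 fast requests, 50 slow ones, 20 very slow ones.
latencies_ms = [100] * 930 + [400] * 50 + [10_000] * 20

print(f"mean = {mean(latencies_ms):.0f} ms")        # mean = 313 ms
print(f"P50  = {percentile(latencies_ms, 50)} ms")  # P50  = 100 ms
print(f"P95  = {percentile(latencies_ms, 95)} ms")  # P95  = 400 ms
print(f"P99  = {percentile(latencies_ms, 99)} ms")  # P99  = 10000 ms
```

The mean hides the fact that 2% of requests took 10 seconds; P99 shows it immediately.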

Real SLO

95% of requests < 200ms

🧠 PART 9 — MONITORING (WHAT SRE ACTUALLY WATCHES)

The Four Golden Signals (Google SRE)

Latency
Traffic
Errors
Saturation

What this means

You monitor:

How fast?
How many?
How broken?
How loaded?

Tools

Prometheus
Grafana
CloudWatch
ELK

🧠 PART 10 — ALERTING (VERY IMPORTANT)

Bad alert:

CPU > 80%

Good alert:

Error rate > 5% for 5 minutes
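
The "for 5 minutes" part matters: it filters out short spikes. A rough sketch of that logic, assuming one error-rate sample per minute (`should_alert` is a hypothetical helper, not the API of any real alerting tool):

```python
def should_alert(error_rates_percent, threshold_percent=5.0, sustained_minutes=5):
    """Fire only if every sample in the last sustained_minutes exceeds the threshold.

    error_rates_percent: one error-rate sample per minute, oldest first.
    """
    if len(error_rates_percent) < sustained_minutes:
        return False  # not enough history yet
    recent = error_rates_percent[-sustained_minutes:]
    return all(rate > threshold_percent for rate in recent)


print(should_alert([1, 2, 6, 7, 8, 9, 6]))  # True: above 5% for the last 5 minutes
print(should_alert([6, 7, 8, 2, 9]))        # False: the error rate dipped, no sustained breach
```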

Rule

Alert only when users are impacted

🧠 PART 11 — INCIDENT MANAGEMENT

Incident = system failure affecting users

SRE process

Detect
Respond
Fix
Learn

Postmortem

Must be:

Blameless

You document

timeline
root cause
impact
fix
prevention

🧠 PART 12 — RELIABILITY ENGINEERING

You design systems that:

Expect failure

Example

Instead of 1 server:

ALB → multiple EC2 → DB replicas

Goal

No single point of failure
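
Why redundancy helps can be shown with simple arithmetic. This sketch assumes replicas fail independently, which real systems only approximate (shared networks, shared deploys, shared bugs), so treat the result as an upper bound. The function name is made up for illustration.

```python
def combined_availability(single_availability: float, replicas: int) -> float:
    """Probability that at least one of N independent replicas is up."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single_availability) ** replicas


# One server at 99% vs. two independent servers at 99% each.
print(f"{combined_availability(0.99, 1):.4f}")  # 0.9900
print(f"{combined_availability(0.99, 2):.4f}")  # 0.9999
```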

🧠 PART 13 — SCALING

Vertical

bigger machine

Horizontal

more machines

SRE prefers

Horizontal scaling

🧠 PART 14 — NETWORKING (WHAT YOU DID)

You must understand:

VPC
routing
NAT gateway vs Internet Gateway (IGW)
Transit Gateway (TGW)
PrivateLink

🧠 PART 15 — AUTOMATION

Rule:

If you repeat it → automate it

Tools

Terraform
Bash
Python

🧠 PART 16 — CI/CD

You must know:

pipelines
deployments
rollback

Strategies

rolling
blue/green
canary

🧠 FINAL UNDERSTANDING

SRE is:

Measure → Define → Monitor → Improve → Automate

💬 PERFECT INTERVIEW ANSWER

SRE focuses on maintaining system reliability by defining measurable objectives like SLOs, monitoring system health, managing incidents, and automating infrastructure while balancing system stability with development velocity.