Site Reliability Engineering is not just monitoring or fixing servers.
It is:
Applying software engineering principles to operations to make systems reliable at scale.
That means:
- You don’t manually fix things → you automate
- You don’t guess → you measure
- You don’t react → you design for failure
Core mindset
A normal engineer asks:
“Is the system working?”
An SRE asks:
“How well is it working, how often does it fail, and how much failure is acceptable?”
Before SRE existed, companies said:
System should be reliable
That means nothing.
SRE changed that to:
Reliability must be measurable
This is where SLI, SLO, SLA come in.
🧠 PART 3 — SLI (SERVICE LEVEL INDICATOR)
What it really is
An SLI is:
A real measurement of user experience
Not internal system metrics like CPU, but user-facing metrics.
Examples
Instead of:
CPU = 70%
We measure:
Request success rate
Request latency
Error rate
Real example
Imagine your API:
- 1000 requests
- 990 succeed
- 10 fail
Your SLI:
Success rate = 99%
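The arithmetic above can be sketched in a few lines of Python. The function name `success_rate` is a hypothetical helper for illustration, not part of any monitoring tool:

```python
def success_rate(succeeded: int, total: int) -> float:
    """Return the share of successful requests as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * succeeded / total


# Numbers from the example above: 1000 requests, 990 succeed.
sli = success_rate(succeeded=990, total=1000)
print(f"Success rate SLI: {sli:.1f}%")
```

In a real system the two counts would come from your request logs or metrics backend rather than hard-coded numbers.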
Important rule
SLI must reflect USER experience
If users are unhappy while the SLI looks healthy → the SLI is measuring the wrong thing
🧠 PART 4 — SLO (SERVICE LEVEL OBJECTIVE)
What it really is
SLO is:
A target value you set for an SLI over a time window
Example
You define:
99.9% of requests must succeed
That is your SLO.
Why SLO exists
Because perfection is impossible.
So instead of:
System must never fail ❌
We say:
System can fail within limits ✅
Another example
Latency SLO:
95% of requests < 200ms
Key idea
SLO defines acceptable reliability
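As a rough sketch, checking the two example SLOs above (99.9% success, 95% of requests under 200ms) might look like this. The function names and default values are assumptions chosen to match the examples, not a standard API:

```python
def meets_success_slo(succeeded: int, total: int, target_pct: float = 99.9) -> bool:
    """True if the measured success rate reaches the SLO target."""
    return 100.0 * succeeded / total >= target_pct


def meets_latency_slo(latencies_ms, threshold_ms: float = 200.0,
                      target_pct: float = 95.0) -> bool:
    """True if at least target_pct of requests finished under threshold_ms."""
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return 100.0 * fast / len(latencies_ms) >= target_pct


# 99% success (the SLI example) does not meet a 99.9% SLO.
print(meets_success_slo(succeeded=990, total=1000))

# Hypothetical sample: 96 fast requests and 4 slow ones.
print(meets_latency_slo([100] * 96 + [500] * 4))
```

The SLI is the measured number; the SLO is the comparison against a target.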
🧠 PART 5 — SLA (SERVICE LEVEL AGREEMENT)
What it really is
SLA is:
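A business contract with customers that attaches consequences to missing a promised reliability level. The SLA threshold is usually set looser than the internal SLO.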
Example
If uptime < 99.9% → customer gets refund
Important difference
| Concept | Purpose |
|---|---|
| SLI | measurement |
| SLO | internal goal |
| SLA | external contract |
🧠 PART 6 — ERROR BUDGET (THIS IS SENIOR LEVEL)
This is the most important concept in SRE.
What it is
If your SLO is:
99.9% uptime
Then:
0.1% failure is allowed
That is your error budget
In real time
~43 minutes of downtime per 30-day month
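The 43-minute figure follows directly from the SLO. A minimal sketch, assuming a 30-day month (the function name `error_budget_minutes` is hypothetical):

```python
def error_budget_minutes(slo_pct: float, days: int = 30) -> float:
    """Allowed downtime in minutes for a given uptime SLO over `days` days."""
    total_minutes = days * 24 * 60  # 30 days = 43,200 minutes
    return total_minutes * (100.0 - slo_pct) / 100.0


budget = error_budget_minutes(99.9)
print(f"Error budget at 99.9%: {budget:.1f} minutes per 30 days")
```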
Why it matters
It creates balance:
Developers → want speed
SREs → want stability
Error budget decides:
If budget remains → deploy
If exhausted → stop releases
Real rule
No error budget = no deployments
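One way to picture this rule is a simple release gate. The sketch below counts the budget in failed requests rather than downtime minutes, which is an assumption made for illustration; the function names are hypothetical and real policies usually have more nuance:

```python
def remaining_error_budget(slo_pct: float, total_requests: int,
                           failed_requests: int) -> float:
    """Failures still allowed in this window before the SLO is violated."""
    allowed_failures = total_requests * (100.0 - slo_pct) / 100.0
    return allowed_failures - failed_requests


def can_deploy(slo_pct: float, total_requests: int, failed_requests: int) -> bool:
    """Allow releases only while some error budget is left."""
    return remaining_error_budget(slo_pct, total_requests, failed_requests) > 0


# With a 99.9% SLO and 1,000,000 requests, about 1000 failures are allowed.
print(can_deploy(99.9, 1_000_000, 400))   # budget remains
print(can_deploy(99.9, 1_000_000, 1200))  # budget exhausted
```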
🧠 PART 7 — HOW WE MEASURE AVAILABILITY
Formula
Availability =
(Total time - downtime) / total time
Example
30 days = 720 hours
Downtime = 2 hours
(720 - 2) / 720 = 99.72%
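The same formula as a small Python sketch, using the numbers from the example (the function name `availability_pct` is a hypothetical helper):

```python
def availability_pct(total_hours: float, downtime_hours: float) -> float:
    """Availability = (total time - downtime) / total time, as a percentage."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100.0 * (total_hours - downtime_hours) / total_hours


# 30 days = 720 hours, with 2 hours of downtime.
print(f"Availability: {availability_pct(720, 2):.2f}%")
```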
SRE levels
| Level | Meaning | Allowed downtime per 30-day month |
|---|---|---|
| 99% | basic | ~7.2 hours |
| 99.9% | production | ~43 minutes |
| 99.99% | critical | ~4.3 minutes |
| 99.999% | extreme | ~26 seconds |
🧠 PART 8 — LATENCY (WHY AVERAGE IS WRONG)
Average lies.
Example
99 requests = 100ms
1 request = 10 seconds
The average is about 199ms, which looks fine, but one user waited 10 seconds.
Solution
Use percentiles:
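- P50 → the median, what a typical user sees
- P95 → 95% of requests are at least this fast; the slowest 5% are worse
- P99 → 99% of requests are at least this fast; the slowest 1% are worse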
Real SLO
95% of requests < 200ms
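To see how the average hides slow requests, here is a minimal sketch using the nearest-rank percentile method. The sample data (95 requests at 100ms, 5 at 2000ms) is a hypothetical variation on the example above, and `percentile` is an illustrative helper, not a library function:

```python
import statistics


def percentile(values, p: int):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p% of n
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 95 fast requests and 5 very slow ones (milliseconds).
latencies_ms = [100] * 95 + [2000] * 5

print("mean:", statistics.mean(latencies_ms))
print("P50: ", percentile(latencies_ms, 50))
print("P95: ", percentile(latencies_ms, 95))
print("P99: ", percentile(latencies_ms, 99))
```

Here the mean stays under 200ms while P99 exposes the 2-second requests, which is why latency SLOs are written against percentiles. Monitoring tools may use interpolated percentile definitions that give slightly different values.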
🧠 PART 9 — MONITORING (WHAT SRE ACTUALLY WATCHES)
Golden Signals (Google SRE)
- Latency
- Traffic
- Errors
- Saturation
What this means
You monitor:
How fast?
How many?
How broken?
How loaded?
Tools
- Prometheus
- Grafana
- CloudWatch
- ELK
🧠 PART 10 — ALERTING (VERY IMPORTANT)
Bad alert:
CPU > 80%
Good alert:
Error rate > 5% for 5 minutes
Rule
Alert only when users are impacted
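The "good alert" above has two parts: a threshold and a duration. A minimal sketch of that logic, assuming one error-rate sample per minute (the function name and sampling interval are assumptions, not how any specific alerting tool works):

```python
def should_alert(error_rates_pct, threshold_pct: float = 5.0, window: int = 5) -> bool:
    """Fire only if the last `window` samples are all above the threshold."""
    if len(error_rates_pct) < window:
        return False  # not enough data yet to say the problem is sustained
    return all(rate > threshold_pct for rate in error_rates_pct[-window:])


# One sample per minute (hypothetical data).
print(should_alert([1, 6, 7, 8, 9, 10]))  # above 5% for the last 5 minutes
print(should_alert([6, 7, 8, 2, 9]))      # a dip resets the condition
```

Requiring the condition to hold for the whole window avoids paging someone for a brief spike that users barely notice.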
🧠 PART 11 — INCIDENT MANAGEMENT
Incident = system failure affecting users
SRE process
- Detect
- Respond
- Fix
- Learn
Postmortem
Must be:
Blameless
You document
- timeline
- root cause
- impact
- fix
- prevention
🧠 PART 12 — RELIABILITY ENGINEERING
You design systems that:
Expect failure
Example
Instead of 1 server:
ALB → multiple EC2 instances → DB replicas
Goal
No single point of failure
🧠 PART 13 — SCALING
Vertical
bigger machine
Horizontal
more machines
SRE prefers
Horizontal scaling, because extra machines add redundancy as well as capacity
🧠 PART 14 — NETWORKING (WHAT YOU DID)
You must understand:
- VPC
- routing
- NAT gateway vs internet gateway (IGW)
- Transit Gateway (TGW)
- PrivateLink
🧠 PART 15 — AUTOMATION
Rule:
If you repeat it → automate it
Tools
- Terraform
- Bash
- Python
🧠 PART 16 — CI/CD
You must know:
- pipelines
- deployments
- rollback
Strategies
- rolling
- blue/green
- canary
🧠 FINAL UNDERSTANDING
SRE is:
Measure → Define → Monitor → Improve → Automate
💬 PERFECT INTERVIEW ANSWER
SRE focuses on maintaining system reliability by defining measurable objectives like SLOs, monitoring system health, managing incidents, and automating infrastructure while balancing system stability with development velocity.