What is SRE? A Beginner's Guide to Site Reliability Engineering

Jitul Kumar Laphong — Mon, 15 Jun 2026 03:15:41 +0000

Why This Matters: The 2 AM Problem

It's 2 AM. Your phone rings. Your production database is down. Customers can't log in. Revenue is dropping by the second.

You call the Ops team. They restart the server. Downtime: 45 minutes. Cost: $100K in lost sales. Root cause? Unknown.

Some version of this plays out at companies worldwide every week.

The question isn't "Will your system break?" It's "When it breaks, are you ready?"

That's where SRE comes in.

What is SRE? (The Real Definition)

SRE (Site Reliability Engineering) = Applying software engineering principles to build reliable, scalable infrastructure and systems.

It's not just about keeping servers running. It's about:

Reliability: Systems that don't break unexpectedly
Scalability: Systems that handle growth without collapsing
Infrastructure: Automating how systems are built, deployed, and monitored
Measurability: Knowing exactly how your system is performing at any moment

Traditional operations manages infrastructure reactively — when something breaks, you fix it.

SRE manages infrastructure proactively — you engineer it so it rarely breaks, and when it does, it heals itself.

The Key Insight: Reliability is Engineered, Not Hoped For

Here's the critical shift in thinking:

Old mindset: "Let's build this system and hope it doesn't break."

SRE mindset: "Let's measure what 'reliable' means, design the system to achieve that, and automate the monitoring and recovery."

But reliability isn't just uptime. It includes:

Uptime: Is the system available?
Latency: How fast does it respond? (A slow system is effectively broken)
Error rate: What percentage of requests fail?
Throughput: Can it handle the traffic?
User experience: Does the system meet user expectations?

All of these are engineered and measured.
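
To make "measured" concrete, here is a minimal sketch of turning raw request records into three of the signals above: error rate, latency, and throughput. The `Request` record shape, the nearest-rank percentile helper, and the sample data are assumptions made for illustration, not the output of any particular monitoring tool.

```python
from dataclasses import dataclass
import math


@dataclass
class Request:
    """One served request (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(requests, window_seconds):
    """Turn raw request records into error rate, p95 latency, and throughput."""
    total = len(requests)
    failures = sum(1 for r in requests if not r.ok)
    return {
        "error_rate": failures / total,
        "p95_latency_ms": percentile([r.latency_ms for r in requests], 95),
        "throughput_rps": total / window_seconds,
    }


# Illustrative data: 20 requests over a 10-second window,
# with one failure and one slow outlier.
sample = [Request(latency_ms=100.0, ok=True) for _ in range(18)]
sample.append(Request(latency_ms=900.0, ok=True))
sample.append(Request(latency_ms=120.0, ok=False))

report = summarize(sample, window_seconds=10)
print(report)
```

In a real system these numbers would normally come from your metrics pipeline rather than a list in memory, but the arithmetic is the same.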

A Simple Analogy: The Bridge

Imagine you're managing a bridge.

Traditional approach:

Engineers patrol daily, react to problems, work around the clock fixing issues

SRE approach:

Engineers design monitoring that alerts you to problems early
They automate repairs and maintenance
They engineer the bridge so problems are rare
Engineers focus on prevention, not firefighting

Same outcome (a working bridge), different philosophy.

How SLI, SLO, and SLA Work Together

These three concepts are the backbone of SRE. They work as a connected flow:

SLI (Service Level Indicator) → SLO (Service Level Objective) → SLA (Service Level Agreement)

Real Example: An E-commerce Platform

SLI (What you measure):
"99.92% of checkout requests succeeded today"

SLO (What you target internally):
"We aim for 99.95% checkout success rate"

SLA (What you promise customers):
"We guarantee 99.9% checkout availability"

Why three different numbers?

SLI = reality (what actually happened)
SLO = internal target (stricter than SLA, gives you a buffer)
SLA = customer promise (contractual)

If your SLI shows 99.92% — you're between your SLO and SLA. Safe, but watch it.
If it drops to 99.88% — you're breaking your SLA promise. Stop, investigate, fix.
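
The comparison above can be sketched in a few lines. The thresholds are the ones from the example (99.95% SLO, 99.9% SLA); the request counts and the function names are made up for illustration.

```python
def success_rate_sli(succeeded, total):
    """SLI: percentage of requests that succeeded."""
    return 100.0 * succeeded / total


def classify(sli, slo=99.95, sla=99.9):
    """Place a measured SLI relative to the internal target and the customer promise."""
    if sli >= slo:
        return "meeting SLO"
    if sli >= sla:
        return "below SLO, within SLA"
    return "SLA breached"


# Hypothetical day: 99,920 of 100,000 checkout requests succeeded.
sli_today = success_rate_sli(succeeded=99_920, total=100_000)
print(classify(sli_today))   # below SLO, within SLA
print(classify(99.88))       # SLA breached
```

The middle state is the useful one: it warns you while the customer promise is still intact.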

Key SRE Terminologies

Error Budget
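How much downtime or failure you can afford while still meeting your reliability target (in SRE practice the budget is usually derived from the SLO, the stricter internal number)

If the target is 99.9%, you get ~43 minutes of downtime in a 30-day month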
Once used up, you pause risky deployments and focus on stability
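
Here is the arithmetic behind the ~43 minutes, as a small sketch that assumes a 30-day month. The `can_deploy` gate is a hypothetical illustration of the "pause risky deployments" rule, not a standard API.

```python
def error_budget_minutes(target_percent, days=30):
    """Allowed downtime in minutes for an availability target over a period."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - target_percent) / 100.0


def can_deploy(target_percent, downtime_minutes_so_far, days=30):
    """Simple gate: risky deploys are allowed only while budget remains."""
    return downtime_minutes_so_far < error_budget_minutes(target_percent, days)


budget = error_budget_minutes(99.9)
print(round(budget, 1))        # 43.2
print(can_deploy(99.9, 30))    # True: budget remains
print(can_deploy(99.9, 50))    # False: budget is spent
```

The same function shows how quickly the budget shrinks as the target tightens: a 99.95% target over the same month leaves about 21.6 minutes.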

Toil
Manual, repetitive work that doesn't improve the system long-term

Example of toil: Manually restarting failed services every week
Example of not toil: Writing automation to restart services automatically
SRE goal: Eliminate toil through engineering
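
As a rough sketch of what "writing automation to restart services" might look like, the watchdog below restarts a service after several consecutive failed health checks. The `is_healthy` and `restart` callables are hypothetical hooks into whatever health endpoint and service manager you use, and `FakeService` exists only so the example runs on its own.

```python
import time


def watchdog(is_healthy, restart, checks, failures_before_restart=3, interval_seconds=0.0):
    """Run health checks and restart after consecutive failures.

    Returns the number of restarts performed.
    """
    consecutive_failures = 0
    restarts = 0
    for _ in range(checks):
        if is_healthy():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failures_before_restart:
                restart()
                restarts += 1
                consecutive_failures = 0
        time.sleep(interval_seconds)
    return restarts


class FakeService:
    """Simulated service: healthy twice, fails three times, healthy afterwards."""

    def __init__(self):
        self.results = [True, True, False, False, False]
        self.restart_count = 0

    def is_healthy(self):
        # Once the scripted results run out, the service reports healthy.
        return self.results.pop(0) if self.results else True

    def restart(self):
        self.restart_count += 1


service = FakeService()
restarts = watchdog(service.is_healthy, service.restart, checks=8)
print(restarts)   # 1
```

Requiring several consecutive failures before acting is a design choice: it keeps a single flaky check from triggering a restart.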

On-Call
Being responsible for the system outside normal hours

When systems break after hours, on-call engineers get paged
Good SRE: Systems auto-heal; minimal pages at 2 AM

Incident
When something goes wrong in production

SRE engineers are trained in rapid response
Goal: Fix it fast, then prevent it from happening again

Post-Mortem (Blameless Review)
The learning session after an incident

"What happened and why?"
"What can we automate to prevent this next time?"
Mindset: No blame, just learning

Real-World Scenario: How SRE Differs from Traditional Operations

Scenario: Your database query is slow, causing checkout delays.

Traditional Operations Approach:

Users complain about slow checkout
Ops team gets paged
They add more servers/resources (quick fix)
System speeds up temporarily
A week later, same problem returns
Repeat cycle: more servers, more cost, same root cause

Cost: $50K in extra infrastructure/month, constant firefighting

SRE Approach:

Monitoring (SLI) detects latency increase before users notice
On-call engineer gets paged
They restore service first and leave the diagnosis for later
Then, during business hours, they investigate: Why did this happen?
They find: The query is inefficient. They optimize it.
They add an automated alert on the latency SLI so the next degradation is caught early
Root cause solved. Problem unlikely to return.

Cost: One engineer, 4 hours of work, root cause fixed

The difference: Traditional Ops reacts to symptoms. SRE engineers the root cause away.
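
The first step of the SRE approach, detecting a latency increase before users complain, can be sketched as a rolling-window alert. The 300 ms threshold, the window size, and the sample latencies are all assumptions for illustration. Production alerting would usually live in your monitoring system rather than in application code.

```python
from collections import deque


class LatencyAlert:
    """Fire when the rolling average latency rises above a threshold."""

    def __init__(self, threshold_ms, window_size):
        self.threshold_ms = threshold_ms
        self.window = deque(maxlen=window_size)

    def observe(self, latency_ms):
        """Record one sample; return True if the alert should fire."""
        self.window.append(latency_ms)
        window_full = len(self.window) == self.window.maxlen
        average = sum(self.window) / len(self.window)
        return window_full and average > self.threshold_ms


alert = LatencyAlert(threshold_ms=300, window_size=5)
samples = [120, 130, 125, 140, 135, 400, 450, 500, 520, 480]
fired_at = [i for i, s in enumerate(samples) if alert.observe(s)]
print(fired_at)   # [7, 8, 9]
```

Averaging over a window means one slow request does not page anyone, while a sustained slowdown does, in this case three samples after the degradation begins.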

DevOps vs SRE: What's the Real Difference?

DevOps = A culture and philosophy

"Breaking down walls between developers and operations"
Mindset: Developers should understand infrastructure. Ops should understand code.
Goal: Faster, safer deployments through collaboration

SRE = An engineering discipline

Specific practices: Measurement, automation, incident response, error budgets
Methodology: How you actually implement the DevOps philosophy
Goal: Reliable, scalable systems engineered to prevent failure

The relationship: DevOps is the what (we should collaborate). SRE is the how (here's the discipline to do it).

Many successful DevOps transformations are powered by SRE practices.

My Journey: Performance Testing → DevOps → SRE

I didn't start as an SRE. My progression shows how these are interconnected:

Phase 1: Performance Testing
I ran load tests: "Can this system handle 10,000 concurrent users?"

I found bottlenecks and failure modes under stress
Key insight: Understanding system behavior under load is critical

Phase 2: DevOps
I automated deployments and managed infrastructure

I built CI/CD pipelines and infrastructure-as-code
Key insight: Automation prevents manual errors, but it's not enough

Phase 3: SRE
I realized these connect:

Performance testing data informs SLO definition
SLOs drive automation decisions in DevOps
Monitoring feeds back into performance testing

The lesson: These aren't separate disciplines. They're interconnected. Performance testing → tells you system limits → informs SLO definition → drives DevOps automation → creates reliable systems.
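
One simple way performance-test data can inform an SLO is to check which load levels actually met a candidate latency target. The load-test numbers and the 400 ms target below are invented for illustration, and the helper assumes latency only gets worse as load grows.

```python
def max_load_meeting_target(results, target_p95_ms):
    """Return the highest tested load whose p95 latency met the target, or None."""
    passing = [users for users, p95_ms in results.items() if p95_ms <= target_p95_ms]
    return max(passing) if passing else None


# Hypothetical load-test results: concurrent users -> observed p95 latency (ms).
load_test = {1_000: 180, 5_000: 240, 10_000: 390, 20_000: 1_150}

print(max_load_meeting_target(load_test, target_p95_ms=400))   # 10000
print(max_load_meeting_target(load_test, target_p95_ms=100))   # None
```

If the target you want can only be met at a load well below your real traffic, that is a sign to fix the system or relax the target before you commit to the SLO.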

The SRE Mindset

If you're considering SRE, you need:

Engineering mindset: "How do I automate this?" not "How do I fix this faster?"
Measurement obsession: "If I can't measure it, I don't understand it"
Ownership: "This system is my responsibility — it should not break on my watch"
Systems thinking: Reliability is about the whole system, not individual components

The Bottom Line

SRE is not about eliminating all failures. It's about engineering systems to fail gracefully and recover automatically.

You can't prevent every outage. But you can:

Measure reliability precisely
Know when you're about to break customer promises
Automate recovery so most 2 AM incidents don't require a human
Learn from failures and prevent repeats

This is what separates companies that have reliable systems from companies that get paged at 2 AM.

What's Next?

You now understand what SRE is. The next article dives deeper: "SRE Terminologies Deep Dive: SLI, SLO, SLA, and Error Budgets Explained."

But first, ask yourself: When your system breaks next, will you fix the symptom or engineer away the root cause?

That's the SRE question.

Key Takeaways

SRE = Engineering discipline for building reliable, scalable systems
Reliability = Uptime + latency + error rate + throughput + user experience (all measured)
SLI/SLO/SLA = Connected flow: Measure → Target → Promise
Toil elimination drives automation and system improvement
Performance Testing → DevOps → SRE are interconnected disciplines
SRE is a discipline of proactive engineering, not reactive firefighting

DEV Community: Jitul Kumar Laphong