DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

How I Found $12K/Year in AWS Waste Across 4 Accounts — Without Touching Production

Comments
12 min read
The Pre-Flight Checklist: 9 Things to Analyze Before Cutting Any AWS Cost

1
Comments
14 min read
FinOps for SREs: Cutting Costs Without Breaking Things

1
Comments
3 min read
The Silent Process

1
Comments
3 min read
How We Stopped Fighting Enterprise Auth and Read Calendars With a URL

1
Comments
8 min read
When Everything Is On Fire: Incident Communication That Engineers (and Users) Can Trust

Comments
5 min read
Circuit Breakers for LLM APIs: Applying SRE Patterns to AI Infrastructure

Comments
6 min read
Monitoring Tools Comparison 2026: VigilOps vs Zabbix vs Prometheus vs Datadog

3
Comments
2 min read
The Worlds of Distributed Systems — Align Your Team’s Mental Model

Comments
5 min read
Why LeetCode Habits Get Senior Engineers Rejected in Google SRE Coding Rounds

1
Comments
4 min read
Chapter 1 — Thinking About Rollback in Distributed Systems Through Three Worlds (RML-1/2/3)

Comments
6 min read
A 10% traffic spike took down a stable system in 3 minutes and 47 seconds.

3
Comments
3 min read
The Big Tech Reality Check: Why "Senior" Architecture Fails at Global Scale

Comments 1
3 min read
What Actually Happens When You Put an AI Agent on Call

9
Comments 2
3 min read
Why is Infrastructure-as-Code so important? Hint: It's correctness

Comments
2 min read