DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

One insight that changed how I design systems:

Comments
1 min read
The Nines Are Lying to You: What 99.9% Uptime Actually Costs

2
Comments 1
4 min read
I built an AI tool for incident investigation (looking for honest feedback)

1
Comments
2 min read
Determinism Series: Siliconizing Decision-Making (Index)

1
Comments
4 min read
Why status page aggregators matter for engineering teams

3
Comments
4 min read
FROM ALERTS TO ACTION: Autonomous Incident Response Agent | Engineering & Devops

1
Comments
6 min read
3 on-call rotation mistakes that burn out your best engineers first

3
Comments 1
2 min read
Go Context Timeouts That Save Real Money

Comments
9 min read
From ticket to PR with agents: how to use Claude to automate platform changes without breaking SLOs

Comments
3 min read
Your Engineers Aren't Slow. Your incident response is. Here's Where the First 20 Minutes Actually Go

10
Comments 1
7 min read
How a Simple Python Validator Prevents Config Outages

Comments 2
3 min read
The Spot Instance That Killed Our Payments Service (And Why It Took Us 47 Minutes to Find It)

4
Comments 2
6 min read
SRE Explained: Because 'It Works on My Machine' is Not an SLO 🎯

3
Comments
9 min read
CloudFormation in Production: What Breaks and How to Fix It

1
Comments
11 min read
Why Your Monitoring Is Failing in Microservices (And What Actually Works)

1
Comments
3 min read