DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

FROM ALERTS TO ACTION: Autonomous Incident Response Agent | Engineering & Devops

1
Comments
6 min read
3 on-call rotation mistakes that burn out your best engineers first

3
Comments 1
2 min read
Go Context Timeouts That Save Real Money

Comments
9 min read
From ticket to PR with agents: how to use Claude to automate platform changes without breaking SLOs

Comments
3 min read
Your Engineers Aren't Slow. Your incident response is. Here's Where the First 20 Minutes Actually Go

10
Comments 1
7 min read
How a Simple Python Validator Prevents Config Outages

Comments 2
3 min read
The Spot Instance That Killed Our Payments Service (And Why It Took Us 47 Minutes to Find It)

4
Comments 2
6 min read
SRE Explained: Because 'It Works on My Machine' is Not an SLO 🎯

3
Comments
9 min read
CloudFormation in Production: What Breaks and How to Fix It

1
Comments
11 min read
Why Your Monitoring Is Failing in Microservices (And What Actually Works)

1
Comments
3 min read
3am Incident Response: What I Learned from 200+ Pages

Comments
2 min read
Runbook Automation: From 45-Minute Fixes to 90-Second Recoveries

Comments
2 min read
Failure Semantics in Distributed Financial Systems: What Does “Failure” Actually Mean?

Comments
4 min read
Error Budgets in Practice: A No-BS Guide

Comments
2 min read