DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

I Asked Claude to Map My Infrastructure. Then I Asked a Purpose-Built Tool.
4 min read
Surviving the region you run in: failover on Aurora DSQL, and what the demo proves
5 min read
What is SRE? A Beginner's Guide to Site Reliability Engineering
5 min read
Ongrid: open-source ops AI agent for RCA and remediation from chat
1 min read
Incident Automation: What to Automate, What to Leave to Humans
2 min read
I built a small tool to answer a question I’ve asked too many times: is this production ready?
2 min read
What Building Website Monitoring Taught Me About Silent Failures
5 min read
Infrastructure Drift: Detecting and Preventing It
2 min read
System Design - 20. Observability: The 3 Pillars, 4 Golden Signals, and How Netflix Debugs 100 Microservices
9 min read
Supercharging Kubernetes: How eBPF is Revolutionizing SRE and Platform Engineering
6 min read
The Engineer Who Owns Nothing: A Cautionary Tale
2 min read
Runbooks in Minutes: An On-Call Incident Copilot with HazelJS
4 min read
A Friday systems thinking thread - worth sitting with:
1 min read
Error Budget Policies That Hold Leadership Accountable
2 min read
Fixing 500 Internal Server Errors at Scale: Expert SRE Guide
5 min read