DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

We list 3 self-host PagerDuty alternatives. None of them are alive. (May 2026)

5 min read
The PagerDuty Migration Playbook

1 min read
subPath ConfigMap Mounts Don't Hot-Reload: Silent Drift in Kubernetes

6 min read
How We Cut Datadog Bills by 60% Without Losing Observability

1 min read
Building Your First Runbook: A Template That Actually Works

1 min read
Why Your DNS Failover Didn't Actually Fail Over

4 min read
Two SQL primitives for when alert clustering gets it wrong

12 min read
AIOps vs Traditional Monitoring: What Actually Changed

1 min read
Chaos Engineering: Building Resilient Systems in Production

2 min read
IRAS: Building a Production-Grade Autonomous Incident Response Agent

4 min read
YOLO Is a Terrible Strategy for Validating Production Changes

2 min read
The Double-Exposure Problem: When AI Agents and AI-Generated Code Fail Together

6 min read
The runbook step I always add: "what does normal look like right now?"

3 min read
Ansible state:latest Broke Payments for 47 Minutes — What Really Happened and How to Prevent It

4 min read
CI/CD Auto-Remediation: The Complete Guide for SRE and Platform Teams (2026)

12 min read