DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Why Your DNS Failover Didn't Actually Fail Over

Comments
4 min read
Two SQL primitives for when alert clustering gets it wrong

Comments
12 min read
AIOps vs Traditional Monitoring: What Actually Changed

Comments
1 min read
Chaos Engineering: Building Resilient Systems in Production

Comments
2 min read
Auto-verifying your AI-SRE's fixes against your real cluster, with mirrord

7
Comments 2
8 min read
IRAS: Building a Production-Grade Autonomous Incident Response Agent

Comments
4 min read
YOLO Is a Terrible Strategy for Validating Production Changes

Comments
2 min read
The Double-Exposure Problem: When AI Agents and AI-Generated Code Fail Together

1
Comments
6 min read
The runbook step I always add: "what does normal look like right now?"

Comments
3 min read
Ansible state:latest Broke Payments for 47 Minutes — What Really Happened and How to Prevent It

Comments
4 min read
CI/CD Auto-Remediation: The Complete Guide for SRE and Platform Teams (2026)

2
Comments 1
12 min read
Building ReefWatch, a Coral-Powered Production Triage Agent

Comments 1
18 min read
Agentic Ops: How I Shipped My Vibe-Coded Game to Production

Comments 2
2 min read
Building an Incident Response Playbook Library

Comments
4 min read
Runbook-Driven Development: A New Way to Ship

Comments 1
2 min read