DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

YOLO Is a Terrible Strategy for Validating Production Changes

2 min read
The Double-Exposure Problem: When AI Agents and AI-Generated Code Fail Together

6 min read
The runbook step I always add: "what does normal look like right now?"

3 min read
Ansible state:latest Broke Payments for 47 Minutes — What Really Happened and How to Prevent It

4 min read
CI/CD Auto-Remediation: The Complete Guide for SRE and Platform Teams (2026)

Comments 1
12 min read
Building an Incident Response Playbook Library

4 min read
Kubernetes Network Policies: Lessons from Production Incidents

4 min read
Building a Production-Grade Observability Platform for the Anvila API with LGTM, SLOs, DORA Metrics, and Game Day Testing

Comments 2
10 min read
The Agent Control Plane is an SRE Problem: Governing the Orchestration Layer Nobody is Watching

5 min read
Why LLMs Can't Replace Your SREs (Yet)

5 min read
We rotated our JWKS without overlap. Here is the 4-minute window that broke prod.

5 min read
Part 9 — Operating the gateway: logs, traces, health, and degraded mode

9 min read
Reducing Toil: The Google SRE Book Applied to Startups

4 min read
How to Write API Integration Tests (That Actually Catch Bugs)

2 min read
[Guide] Stop 502 errors with queues ⚡

1 min read