DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

The Dashboard Audit: Finding and Killing Dead Metrics

Comments
2 min read
Why Fail-Closed Security Matters for Critical Systems

1
Comments
1 min read
Why We Stopped Using Log Aggregation for Everything

Comments
1 min read
The Cost Math Behind Our CI Cache Hit Rate Going From 40% to 91%

Comments
4 min read
Two states weren't enough. Here's why I added WATCH.

Comments
6 min read
How we caught a silent IO storm before it hit production 🌩️

Comments
1 min read
Chaos testing your CI runner fleet when half the jobs call an LLM

Comments 1
4 min read
Cost Attribution in Shared Infrastructure

Comments 2
2 min read
How We Reduced Our Deployment Failure Rate to Under 2%

Comments
1 min read
Terraform Drift Detection and Recovery on Google Cloud: Plan, Import, State, and GitHub Actions

1
Comments
8 min read
The Hidden Cost of Flaky Tests

Comments
1 min read
Why Applications Work Locally But Fail in Production

Comments
4 min read
A Clean-Room Kubernetes CrashLoopBackOff Incident Exercise for SRE/DevOps Learners

Comments
4 min read
Migrating from Opsgenie to All Quiet: A Full Terraform-First Guide

Comments
10 min read
Real-Time Monitoring for SaaS: Metrics, Dashboards & Alerting

Comments
7 min read