DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Chaos testing your CI runner fleet when half the jobs call an LLM

Comments 1
4 min read
Cost Attribution in Shared Infrastructure

Comments 2
2 min read
How We Reduced Our Deployment Failure Rate to Under 2%

Comments
1 min read
Terraform Drift Detection and Recovery on Google Cloud: Plan, Import, State, and GitHub Actions

1
Comments
8 min read
The Hidden Cost of Flaky Tests

Comments
1 min read
Why Applications Work Locally But Fail in Production

Comments
4 min read
Migrating from Opsgenie to All Quiet: A Full Terraform-First Guide

Comments
10 min read
SLA vs SLO vs SLI: what's the difference and why it matters

Comments
9 min read
Real-Time Monitoring for SaaS: Metrics, Dashboards & Alerting

Comments
7 min read
SLO examples for financial services: what good performance looks like in fintech

Comments
6 min read
When CoreDNS Falls Silent: A Kubernetes DNS Disaster Story & The Playbook That Saved Us

Comments
13 min read
Observability for Serverless: What's Different

Comments
2 min read
Why post-deploy verification deserves its own category

Comments
4 min read
Stop Drowning in Log Noise: How Grouping Rules Turn Chaos into Signal

Comments
4 min read
Why Your AI Agent Monitoring is Wrong (And How to Fix It)

Comments 2
3 min read