DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Stop Guessing, Start Profiling: A Dev's Guide to Go Mechanics

2
Comments
5 min read
Your Agent Acts Without Checking Your Error Budget — That's the Failure Mode Nobody Is Tracking

Comments
6 min read
How We Killed Our Worst Alert (And What We Learned)

Comments
2 min read
The Hidden Cost of Downtime: How SRE Error Budgets Protect National Economic Infrastructure

Comments
11 min read
Why Backup Success Does Not Mean Database Recoverability

Comments
2 min read
Game day on our build cluster: killing an AZ to test LLM flake detection

Comments
4 min read
I got tired of writing post-mortems — so I built RCAi for SREs

Comments
1 min read
Diagnosing KubeAPIErrorBudgetBurn: When a 7-Year-Old Disk Takes Down Your Control Plane

Comments
5 min read
The Reliability Roadmap: A 90-Day Plan for New SRE Teams

Comments
2 min read
Zero-Downtime Database Migrations

Comments
2 min read
Our PR-review bot kept hitting 429s. Bifrost key pooling fixed it.

Comments
4 min read
Continuous batching wrecked our p99 latency. Here's the trace.

Comments
4 min read
10 production-grade alert rules for Cosmos validators (with real PromQL)

1
Comments
4 min read
How try-with-resources Quietly Leaked Disk in Production

Comments
2 min read
When one reliability surface has to satisfy everyone

1
Comments
5 min read