DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

We built a system that investigates production incidents automatically

1 min read
Prometheus at Scale: Surviving the Cardinality Cliff

2 min read
SLO Design for Agentic AI Systems — Why Traditional Reliability Metrics Break (and What to Use Instead)

4 min read
DORA metrics for the CFO: making engineering velocity legible

5 min read
Opsgenie 2026: Features, Pricing, EOL & Alternatives

15 min read
The Incident Commander Role: Running Incidents Without Chaos

2 min read
I got tired of writing runbooks after incidents. So I'm building something.

1 min read
Why Your Microservices Need Circuit Breakers (And How to Add Them)

2 min read
The On-Call Handoff That Prevents Dropped Incidents

2 min read
How I Troubleshoot Kubernetes in Production

6 min read
I Got Tired of 35-Minute Incident Reviews — So I Built an AI SRE Copilot

2 min read
SLOs That Product Managers Actually Understand

2 min read
How to Build Systems That Don’t Collapse at Global Scale

2 min read
MTTR Optimization: The 7 Levers That Actually Move the Needle

3 min read
Why Linux Kills Your App Without Warning (The OOM Killer, Explained)

4 min read