DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Delete 40% of your dashboards

Comments
2 min read
Your Datadog bill is 60% DEBUG logs

Comments
2 min read
Effective On-Call Rotations: Lessons From Building Fair Schedules

Comments
3 min read
Why Most Internal Developer Platforms Fail (And What To Do About It)

Comments 1
2 min read
Agents, context, and guardrails on a unified platform

2
Comments
3 min read
We built a system that investigates production incidents automatically

1
Comments
1 min read
Prometheus at Scale: Surviving the Cardinality Cliff

Comments
2 min read
SLO Design for Agentic AI Systems — Why Traditional Reliability Metrics Break (and What to Use Instead)

Comments
4 min read
DORA metrics for the CFO: making engineering velocity legible

Comments
5 min read
Opsgenie 2026: Features, Pricing, EOL & Alternatives

1
Comments
15 min read
The Incident Commander Role: Running Incidents Without Chaos

1
Comments
2 min read
I got tired of writing runbooks after incidents. So I'm building something.

Comments
1 min read
Why Your Microservices Need Circuit Breakers (And How to Add Them)

Comments
2 min read
The On-Call Handoff That Prevents Dropped Incidents

Comments
2 min read
How I Troubleshoot Kubernetes in Production

3
Comments
6 min read