DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Log Management at Scale: How We Cut Costs 70% Without Losing Signal

Comments
2 min read
Why P95 Latency Is the Only Metric That Matters at 3 AM

Comments
4 min read
Canary Deployments: The Pattern That Cut Our Rollback Rate by 80%

Comments 1
2 min read
Platform Engineering: Building an Internal Developer Platform That Teams Actually Use

Comments
2 min read
How We Handle SSL Certificate Expiration Alerts at Scale

Comments
6 min read
This is what separates teams that scale from teams that survive:

1
Comments
1 min read
AWS Summit Seoul 2026: Korean Enterprises And Agentic AI

1
Comments
5 min read
Sentinel Diary #4: From Dashboard to Incident Response — The deterministic path to reliable SRE

Comments
5 min read
Are you using traffic mirroring in production? If not, try it out.

Comments
2 min read
Chaos Engineering for Teams That Aren't Netflix

Comments
3 min read
Building a Self-Healing Kill Switch for AI Infrastructure

Comments
1 min read
Production-Grade Observability: Building a Complete LGTM Stack with SLOs, DORA Metrics, and Intelligent Alerting

2
Comments
10 min read
Your AI Agent Doesn't Have a Feature Problem. It Has an On-Call Rotation Problem.

1
Comments
5 min read
Backpressure in document pipelines is an architecture problem first

Comments
2 min read
Optimizing Cement Kiln Heat Consumption: A Process Engineer's Python Approach

Comments
4 min read