DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Silent Failures: The Bug That Won't Page You

1
Comments
3 min read
Claude Code for the Outer Loop: An AI SRE Playbook to Reduce On-Call Toil

4
Comments
18 min read
If You Were a Server: How to Detect Issues and Keep Things Running Smoothly

1
Comments
10 min read
Infrastructure dilemma

1
Comments
2 min read
Why Most Systems Still Have Hidden Single Points of Failure (SPOF) – Even in 2026

1
Comments
2 min read
Developer autonomy and the work that repeats after ship

3
Comments
3 min read
When should you use canary deployments?

6
Comments
5 min read
Incident Debugging in Production Systems (Part 2)

Comments
3 min read
SRE vs DevOps: the sequencing mistake that burns most startups.

Comments 1
3 min read
The DevOps Journey: 12 Lessons for a More Stable System (and Fewer Nights On Call)

1
Comments
5 min read
Canonical Log Lines Stripe Brilliant Technique for Production Observability

Comments
4 min read
Semantic Kernel for Enterprise AI: Architecting Production-Grade LLM Integration in .NET

1
Comments
15 min read
How to Build Systems That Don’t Collapse at Global Scale

2
Comments
2 min read
Our Production System Went Down at 2:13AM — Here’s Exactly What Happened

1
Comments
1 min read
When Monitoring Saves the Day: How We Optimized Our Production Database Without Increasing Costs

Comments
3 min read