DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Designing Systems That Don’t Lie: How to Build Software That Fails Loudly, Recovers Fast, and Keeps User Trust

6 min read
The Quiet Skill That Prevents Repeat Outages: Writing Incidents Like an Engineer, Not a Courtroom

5 min read
Chapter 2 — RML-1 (Closed World): Build a Room Where Failure Is Safe

7 min read
Building Reliable Software: The Trap of Convenience

7 min read
The AI Incident Report Template I Actually Use for Wrong Answers and Tool Failures

3 min read
When Systems Fail, Trust Is the Real Incident: A Practical Guide to Communication for Engineers and Founders

5 min read
AI Agents in Production: The Future of SRE and DevOps

3 min read
From Stack Trace to Root Cause - Archexa's New Diagnose Command

7 min read
The Pre-Flight Checklist: 9 Things to Analyze Before Cutting Any AWS Cost

14 min read
How I Found $12K/Year in AWS Waste Across 4 Accounts — Without Touching Production

12 min read
FinOps for SREs: Cutting Costs Without Breaking Things

3 min read
The Silent Process

3 min read
How We Stopped Fighting Enterprise Auth and Read Calendars With a URL

8 min read
When Everything Is On Fire: Incident Communication That Engineers (and Users) Can Trust

5 min read
Circuit Breakers for LLM APIs: Applying SRE Patterns to AI Infrastructure

6 min read