DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Noisy alerts burn out on-call: design alerts around SLOs (fewer but better)

Comments
3 min read
Runbook Template Library

Comments
3 min read
Postmortem Framework

Comments
4 min read
Chaos Engineering Toolkit

Comments
4 min read
Platform Developer Portal

Comments
3 min read
Inside the Agentic Loop: How 5 AI Agents Autonomously Investigate IT Incidents

Comments
3 min read
Designing Systems That Don’t Lie: How to Build Software That Fails Loudly, Recovers Fast, and Keeps User Trust

Comments
6 min read
The Quiet Skill That Prevents Repeat Outages: Writing Incidents Like an Engineer, Not a Courtroom

Comments
5 min read
Building Reliable Software: The Trap of Convenience

Comments
7 min read
Chapter 2 — RML-1 (Closed World): Build a Room Where Failure Is Safe

Comments
7 min read
The AI Incident Report Template I Actually Use for Wrong Answers and Tool Failures

Comments
3 min read
When Systems Fail, Trust Is the Real Incident: A Practical Guide to Communication for Engineers and Founders

Comments
5 min read
AI Agents in Production: The Future of SRE and DevOps

Comments 1
3 min read
From Stack Trace to Root Cause - Archexa's New Diagnose Command

Comments
7 min read
The Pre-Flight Checklist: 9 Things to Analyze Before Cutting Any AWS Cost

Comments
14 min read