DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Why AI and Automation Are Not Always the Right Answer in DevOps

Comments
3 min read
Your on-call engineer just got paged. Here's what happens to the postmortem.

Comments
2 min read
Why On-Call Burnout Is an Onboarding Problem (and You Probably Don't See It)

Comments
1 min read
How Architecture Leaves Fingerprints in Latency Data

Comments
2 min read
Incident Management: Building Effective On-Call Rotations and Runbooks

Comments
2 min read
SRE Fundamentals: Defining SLOs, SLIs, and Error Budgets That Actually Work

Comments
2 min read
Running Postgres at Scale: Lessons Learned

Comments
2 min read
ComunicaOps Part 3: Feedback Loops

Comments
3 min read
Why uptime and synthetic monitors still matter in the age of APM

2
Comments
4 min read
I built "sysview" — a beautiful terminal system monitor for developers

Comments
3 min read
The Midnight Incident: When Being On-Call Means Losing Sleep

Comments
2 min read
The Agentic SRE: How Google Cloud NEXT '26 Made AI Feel Less Like a Chatbot and More Like a Teammate

Google Cloud NEXT '26 Challenge Submission

4
Comments
4 min read
From AIOps Anomaly Detection to LLM-Powered RCA: How AI for Incident Response Actually Evolved

132
Comments 9
5 min read
I built an AI that remembers every production incident. Here's what changed.

Comments 1
3 min read
Database Reliability: The SRE Approach to Keeping Data Safe

1
Comments
3 min read