DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

agentic sre is where ai hype meets the pager

Comments
6 min read
Beyond Logs: Implementing Tracing and Golden Signals for Distributed Systems

5
Comments
2 min read
BGP Edge Hygiene at a PCI-Regulated Fintech: IRR + RPKI in Production

3
Comments
7 min read
The Only Prometheus Metrics I Actually Alert On

Comments
7 min read
AWS Cost Isn’t Just Finance — It’s an Engineering Problem

Comments
1 min read
Your AI workload is not your infrastructure’s problem. Until it is.

Comments
4 min read
Agent SRE — SLOs, Error Budgets, and Circuit Breakers for AI Agents

Comments
5 min read
Disk Has Space But Can't Create Files? (Linux Inode Exhaustion)

1
Comments
3 min read
Rust Friction: Production Reality

1
Comments
5 min read
Incident response / On-call: hardening & best practices for secret rotation (symptoms, causes, fixes)

Comments
3 min read
Why AI and Automation Are Not Always the Right Answer in DevOps

Comments
3 min read
Your on-call engineer just got paged. Here's what happens to the postmortem.

Comments
2 min read
Why On-Call Burnout Is an Onboarding Problem (and You Probably Don't See It)

Comments
1 min read
How Architecture Leaves Fingerprints in Latency Data

Comments
2 min read
Incident Management: Building Effective On-Call Rotations and Runbooks

Comments
2 min read