DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

When CoreDNS Falls Silent : A Kubernetes DNS Disaster Story & The Playbook That Saved Us

13 min read
Observability for Serverless: What's Different

2 min read
How a Single NAT Gateway Can Silently Kill Your AWS High Availability

5 min read
Why post-deploy verification deserves its own category

4 min read
Stop Drowning in Log Noise: How Grouping Rules Turn Chaos into Signal

4 min read
Why Your AI Agent Monitoring is Wrong (And How to Fix It)

3 min read
The AI Agent Cost Ceiling Problem: Why Your AWS Bill Is Your Reliability Alert

4 min read
From DevOps to SRE: Making the Transition

2 min read
I Built 20 AI-Powered DevOps Tools Because I Got Tired of Doing This Stuff Manually

3 min read
Building an Autonomous SRE Agent: From Raw Telemetry to Safe, AI-Driven Remediation

8 min read
Building GBIM Observability From Correlation IDs to a Populated k6 Dashboard

7 min read
Using the github actions to automate monitoring dashboards

4 min read
Closed-Loop SRE for Kubernetes: Auto-Remediating Pod Crashloops Before the On-Call Pages

6 min read
Designing for Partial Failure: Why 'Everything is Highly Available' Is a Myth

3 min read
What Site Reliability Engineering Actually Is, and Why It's a National Infrastructure Discipline

10 min read