DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

The Business Case for Chaos Engineering: An ROI Calculator for Testing Application Reliability

2
Comments
6 min read
Cron Jobs That Fix Themselves

1
Comments 1
3 min read
Introducing Nova AI Ops: The AI-Native Operating System for SRE Teams

Comments
3 min read
Observability Engineering in Production Systems: Structured Logging, Metrics, and Distributed Tracing at Scale (Part 2)

Comments
10 min read
Observability Engineering in Production Systems: Structured Logging, Metrics, and Distributed Tracing at Scale (Part 3)

1
Comments 1
12 min read
From Disaster to Recovery: A Practical Case Study on Kubernetes etcd Backups

Comments
11 min read
Chapter 8 — Autonomy in the History World: The Legal–Business–SRE Triangle

Comments
6 min read
How blue/green deployments saved us from out of hours changes and downtime

1
Comments
2 min read
When Software Lies Before It Fails

Comments
5 min read
Alert Fatigue Is Real — Here's What It's Actually Costing Your Team

1
Comments
5 min read
I Put an AI Agent in My Incident Workflow for 7 Days. Here’s What Actually Broke.

9
Comments 18
2 min read
How We Made Next.js ISR Page Cache Efficient with Redis

1
Comments
8 min read
⚔️ Kubernetes Civil War: When VPA Fights the Scheduler (And Your Pods Pay the Price)

Comments
6 min read
🧠 The Hidden Brain of Kubernetes: How Pod Scheduling Really Works (And Why It's Smarter Than You Think)

Comments
4 min read