DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

DevOps Meets Artificial Intelligence — The Pipeline Reinvented

5 min read
🚀 How I Built StackPulse: A One-Command Kubernetes Observability Platform in Go

2 min read
Railway vs AWS: When Leaving Railway Means Owning Reliability

1
Comments 1
14 min read
How AI Is Changing SRE Workflows (Without Replacing SREs)

2 min read
The 32-bit Hidden Countdown in ClickHouse Keeper: How an XID Overflow Gave Us Weekly Read-Only Bursts

7 min read
I Audited 12 Solo Founders' AI Agents in 2026. Here's What I Found.

6 min read
The SIGTERM our build workers ignored, and the 90s that fixed it

4 min read
Why Most Disaster Recovery Tests Don't Test Recovery

4 min read
Security Monitoring for SRE Teams

2 min read
Provider drift broke our regression evals. We pinned versions through Bifrost.

4 min read
Beyond DORA: A Five-Metric Framework for SRE Maturity in Regulated Enterprises

13 min read
The 60-Second Break-Glass Protocol: Hot-Patching Live Production Outages via Local Tunnels

11 min read
Instrumenting Legacy Code Without Rewriting It

2 min read
I Let Claude Design 4 Chaos Experiments via MCP. The 4th Took Down Staging and Found a 6-Month-Old Bug.

1
11 min read
System Design - Availability & Reliability: What "99.9% Uptime" Really Means (And Why It's Not Enough)

6 min read