DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Decoding System Observability: Building Transparent and Resilient Architectures
2 min read
Load Balancer Tuning: Lessons from Production
2 min read
Stop Guessing, Start Profiling: A Dev's Guide to Go Mechanics
5 min read
We open-sourced the SRE judgment that doesn't fit in a system prompt
3 min read
Google Published Their AI SRE Blueprint. Here's the Line-by-Line Mapping to What the Community Has Been Building
3 min read
Beyond Ingress: Why the Kubernetes Gateway API is the Future of Cloud Native Networking
6 min read
How DevOps Engineers Can Use AI to Triage Production Incidents Faster
5 min read
Error budgets when downtime costs money: reliability engineering for payment-critical systems
10 min read
Capacity Planning for Startups
2 min read
Safe Operating Throughput (SOT) as a First-Class SRE Metric: Derivation and Operationalization
17 min read
Why Your AKS Pods Keep Getting OOMKilled Even When CPU Looks Fine
4 min read
I shipped a verdict layer that gated deploys. It quietly broke trust.
6 min read
How We Handled Our First Major Outage (And Survived)
2 min read
Zero to Platform: How GKE Autopilot and Google Cloud Redefine Modern SRE and Platform Engineering
7 min read
How I Built an AI Agent That Fixes Production Errors Using Memory — And Why Memory Changes Everything
6 min read