DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

🏗️ Building the Platform That Empowers Reliability by Design

Comments
3 min read
Modern CTO Podcast: The AI SRE Hype and How to Get it Right

Comments
1 min read
How to reduce on-call friction using AI Voice Agent

Comments 1
4 min read
Tools for Working with Hundreds of Servers

Comments
1 min read
The Samurai Server: Why "Heroic" Systems Always Die

Comments
4 min read
The Complete 2026 and beyond Google SRE Interview Preparation Guide — Frameworks, Scenarios, and Roadmap

Comments
4 min read
Inside the AWS US-East-1 Outage: Why DNS Failure Triggered a Global Cloud Crisis

Comments
5 min read
Your Wiki is Useless Under Pressure: 9 Actionable Steps to Drastically Lower MTTR

Comments
4 min read
AWS Lambda Reload

Comments
2 min read
SREday SF 2025: Human Centered SRE In An AI World

Comments
7 min read
The Role Confusion: SRE vs Cloud vs Platform Engineer (And Why "DevOps Engineer" Misses the Point)

3
Comments
5 min read
Why S3, NFS, and EFS Are Not Block Storage

Comments
2 min read
Beyond Scheduling: How Kubernetes Uses QoS, Priority, and Scoring to Keep Your Cluster Balanced

Comments
2 min read
⚙️ 7 AI-Powered Prompts That Supercharge Your Terraform Workflow

Comments
3 min read
SRE in Action: Understanding How Real Teams Use SLOs, SLIs, and Error Budgets to Stay Reliable Through Case Studies - Part 1

4
Comments
7 min read
Your Observability Bill Just Hit $1M—Here's Why Telemetry Pipelines Aren't Optional Anymore

3
Comments
2 min read
Crash Dumps in Linux Kernel & Application Deep Dive

2
Comments
3 min read
Service metrics and its meanings

Comments
8 min read
Building a Modern Network Observability Stack: Combining Prometheus, Grafana, and Loki for Deep Insight

Comments
6 min read
The Silent Co-Pilot: How AI is redefining the Network and the Network Engineer

Comments
5 min read
VMware Snapshots Explained: Internals, Pitfalls, and Deep Dive into Base + Delta Mechanics

6
Comments
4 min read
The Real State of Helm Chart Reliability (2025): Hidden Risks in 100+ Open‑Source Charts

Comments
23 min read
Self-Healing File-Based Databroker Without The Postgres Headaches

5
Comments 1
2 min read
Thoughts on SLA

3
Comments
3 min read
Our Status Page Lied to Us: 7 Steps to Building a Communication Platform Customers Actually Trust

2
Comments
9 min read