DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Beyond Dashboards: How FinOps and AI-Driven Observability are Reshaping SRE in 2026

Comments
3 min read
🚨 How We Rescued a Dead Azure Linux VM After SSH, Agent, and OS Disk All Broke (A Real Production War Story)

5
Comments
3 min read
10 MCP Servers to Improve DevOps Workflows

Comments
15 min read
Your AI SRE needs better observability, not bigger models.

9
Comments
17 min read
Operability First: Policy, Not Hope

Comments
8 min read
EP 6 - Don't Kill Flaky APIs: The Art of Resilient Retries

Comments
1 min read
Deduce, Don't Store

Comments
3 min read
Utilizing the Go 1.25 Flight Recorder with tracing middleware

Comments
6 min read
Google SRE NALSD Round — A Real Interview Walkthrough

Comments
7 min read
Top 10 SRE Tools Dominating 2026: The Ultimate Toolkit for Reliability Engineers 🚀

5
Comments
3 min read
Top 7 AI Tools Every DevOps and SRE Engineer Needs in 2026 🚀

3
Comments
3 min read
Infra Proverbs

Comments
1 min read
Project: One App — Three Probes — Real Failures

1
Comments
3 min read
Kubernetes In-Place Pod Resize

Comments
3 min read
Introduction to System Design: A Beginner’s Guide

Comments
4 min read
Lessons in Testing, Performance, and Legacy Systems from /dev/mtl 2025

Comments
7 min read
Rightsizing Kubernetes Requests with the In-Place Vertical Pod Autoscaler

2
Comments
3 min read
AWS Security Series: AWS Access Key is Compromised. Now What? An Incident Response Playbook.

Comments
3 min read
Bash Scripting for Non-Coders

Comments
37 min read
What is performance engineering: A Gatling take

Comments
8 min read
What 100+ Production Incidents Taught Me About System Design

9
Comments 5
5 min read
A practical guide to observability TCO and cost reduction

6
Comments
13 min read
Announcing Reliability Delta: Clear, Objective Insight into Whether Your Release Made Your System Better or Worse

Comments
4 min read
Google A2UI: The Future of Agentic AI for DevOps & SRE (Goodbye Text-Only ChatOps)

Comments
4 min read
When AI Writes Your Code, DevOps Becomes the Last Line of Defense

3
Comments
4 min read