DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

On-Call Best Practices: An SRE Guide to Incident Response

2 min read
Kubecost Explained: Kubernetes FinOps That Moves the Bill

12 min read
Your OTel Traces Are Lying to You: Observability for the Reasoning Layer

5 min read
OpenSRE: Build Your Own AI Incident-Investigation Agent

3 min read
Production Log Parsing Patterns That Break Real Kubernetes Clusters (and How to Fix Them)

4 min read
How We Built Our Own Incident Management System

2 min read
The Future Guide for Escaping Single-Provider Administrative Failure

6 min read
I Made 4 LLMs Argue With Each Other to Write Better Runbooks. Here's What Happened.

5 min read
We're hiring a DevOps Content Engineer – Remote LATAM

1 min read
The Runbook Is Already Lying to you.

8 min read
The Role of Platform Engineering in a Startup

2 min read
The Architecture — Prometheus, Grafana, and StatsD for Batch Workloads

5 min read
Debugging Production Alerts Without Chasing The Wrong Problem

2 min read
Why Developers Should Learn How Systems Fail

3 min read
etcd database space exceeded: full recovery guide for on-prem Kubernetes

8 min read