DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Our Status Page Lied to Us: 7 Steps to Building a Communication Platform Customers Actually Trust

2
Comments
9 min read
Stop Losing Launches to “Tiny Bugs”: 7 Engineering Principles Every PM Should Know

Comments
2 min read
Linux Fundamentals for DevOps & SRE: The Only Guide You'll Ever Need

6
Comments
15 min read
The Cost of Confusing SRE, DevOps, and Platform Engineering

Comments
4 min read
The DevOps Engineer's Guide to AWS Cost Explorer

Comments
1 min read
How We Built AI That Prevents Cloud Incidents Before They Happen

Comments
2 min read
Microservices and the Myth of Fault Isolation

Comments
3 min read
Importance of Graceful Shutdown in Kubernetes

Comments
7 min read
The Hidden Cost of AI in SRE: Why Automation Hasn’t Fixed Burnout

1
Comments
2 min read
The Merge Queue Scaling Problem Every Growing Team Hits

Comments
1 min read
Liveness vs Readiness in Kubernetes: The Truth for Frontend Apps

Comments
2 min read
Gonzo - The Go based TUI for log analysis

Comments
1 min read
Why SRE is not for entry-levels

Comments
2 min read
AI-Driven DevOps: How AIOps is Transforming Observability, Incident Response, and Automation

Comments 1
3 min read
Observability: Beyond Monitoring in Modern Systems

Comments 1
3 min read
Why Self-Hosting made me a better engineer

4
Comments
4 min read
Netlify Site + HCP Terraform Remote State

Comments
3 min read
WTF is Site Reliability Engineering?

1
Comments
3 min read
Take Control of your Logs: Top 10 ways using the OpenTelemetry Collector

Comments
2 min read
Amazon Cognito Observability Best Practices with Datadog

1
Comments
5 min read
Root Cause Analysis (RCA): Understanding the Root Cause of Incidents

10
Comments
2 min read
🚀 Mini Monitoring App in Go with Prometheus, Grafana & CI/CD

1
Comments 1
3 min read
The 67-Second OpenTelemetry Problem

Comments
4 min read
The Resilience Playbook: 23 Strategies for Bulletproof Applications 🚀

Comments
4 min read
DSA Won’t Save You in Production

Comments
2 min read