DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

A 10% traffic spike took down a stable system in 3 minutes and 47 seconds.

3
Comments
3 min read
The Big Tech Reality Check: Why "Senior" Architecture Fails at Global Scale

Comments 1
3 min read
What Actually Happens When You Put an AI Agent on Call

9
Comments 2
3 min read
Why is Infrastructure-as-Code so important? Hint: It's correctness

Comments
2 min read
Real-World Incident Automation Using GCP: How I Cut MTTR by 80%

1
Comments
7 min read
Why My Server Became Slow: The Case of the SMR Disk

Comments
2 min read
Scaling SRE Systems with GCP + Kubernetes: Lessons from Running at 10x Traffic

1
Comments
5 min read
OpenTelemetry vs Logstash - Which Logging Tool Is Right for You?

1
Comments
9 min read
OpenTelemetry vs Loki - Choosing the Right Observability Tool

1
Comments
13 min read
OpenTelemetry Events vs Logs - Key Differences Explained

1
Comments
15 min read
Your Kubernetes Cluster Shouldn't Need You at 3am

Comments
1 min read
Building a Production-Grade Observability Stack from Scratch (and What I Learned)

2
Comments
5 min read
Designing Systems That Expect Their Own Assumptions to Break

2
Comments
11 min read
Building a Rate Limiting As A Service: Across Modern Systems Rate Limiter

2
Comments
3 min read
The Economics of Reliability: Cost, Risk, and Architectural Tradeoffs

2
Comments
13 min read