DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Designing Alerts That Matters using Amazon CloudWatch

Comments
4 min read
Why Your Kubernetes Pod Keeps Getting Killed — And It's Not an OOMKill

1
Comments
10 min read
We Tested 30 LLM APIs with 150 Real Calls — 42.7% Failed (And Why That's Good News)

1
Comments 1
3 min read
ObserveX: Building a Centralized Observability Platform for Modern Infrastructure

Comments
12 min read
Building a Production-Grade Observability Platform for the Anvila API with LGTM, SLOs, DORA Metrics, and Game Day Testing

1
Comments 2
10 min read
Energy Grid Observability: What the Power Sector Can Learn from Google SRE

1
Comments
12 min read
How to Choose a European Dedicated Server: Tier III vs Tier II Data Centers Explained

Comments
4 min read
How I took down 30% of production with one TLS fingerprinting rule

Comments
6 min read
Building a Status Page From Scratch vs Using a Service: A Cost Analysis

Comments
4 min read
JA4's split format saved our metrics cardinality

Comments
1 min read
Agentic AI in DevOps: Useful Only After You Add Guardrails

7
Comments 2
4 min read
AIOps That Actually Helps: Start with Telemetry, Correlation, and Safe Automation

8
Comments
5 min read
What Changes and What Stays the Same for SRE with AWS Frontier Agents

2
Comments
12 min read
How I Built an On-Call Agent That Never Forgets a Past Incident

Comments
5 min read
We've Normalized AI Outages, and That Should Bother You

2
Comments 4
2 min read