DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Design DEGRADE (Defer) and Your Agent Becomes “Operations”

1
Comments
7 min read
The Next Frontier of SRE: Agentic Operations and Immutable Trust

Comments
3 min read
I’m looking for a small number of maintainers for NornicDB

Comments
2 min read
Using Graphify to turn Incident Data into a Knowledge Graph

2
Comments 1
3 min read
AWS-native incident investigation PoC

Comments
2 min read
Don’t “Execute” the LLM: Typed Actions + Verifiers for Safe Business Agents

1
Comments
8 min read
Are AI Observability Tools Actually Helping?

10
Comments
1 min read
Something every senior engineer learns the expensive way:

1
Comments
1 min read
A hard-earned rule from incident retrospectives:

1
Comments
1 min read
One insight that changed how I design systems:

Comments
1 min read
The Nines Are Lying to You: What 99.9% Uptime Actually Costs

2
Comments 1
4 min read
I built an AI tool for incident investigation (looking for honest feedback)

1
Comments
2 min read
Determinism Series: Siliconizing Decision-Making (Index)

1
Comments
4 min read
Zero Data Loss Migration: Moving Billions of Rows from SQL Server to Aurora RDS — Architecture, Predictive CDC Monitoring & Lessons from Production

4
Comments 2
7 min read
From ticket to PR with agents: how to use Claude to automate platform changes without breaking SLOs

Comments
3 min read