DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

When Retries Turn Hostile — How Control Logic Kills Production Systems

4 min read
agent-sre on PyPI: what SRE for AI agents actually means

2 min read
Agent Sprawl is Your Next Production Incident: An SRE Response to Datadog's State of AI Engineering 2026

5 min read
Multi-Region Failover: Lessons from Running It Hot

3 min read
How We Design Systems That Keep Working Even When One Part Fails

2 min read
Disaster Recovery Drills That Actually Work

3 min read
AI is a Non-Deterministic Guest in a Deterministic House: Stop Building Chatbots, Start Building Sandboxes

4 min read
Go Circuit Breakers That Fail Friendly: The 94% Cascade Prevention We Measured

13 min read
How to Compute Zero Trust Effectiveness: Four Metrics That Survive a Breach

5 min read
MCP in Production Reality vs the Spec

3 min read
RAG vs MCP is the wrong debate — here's the right framing for production AI systems

4 min read
“But it worked on my machine.”

1 min read
How I Created a DDoS Protection Engine

11 min read