DEV Community

# reliability

General discussions on building and maintaining reliable software systems.

Posts

Runbooks: Anatomy, Examples, and the AI-Executable Format

9 min read
MTTA, MTTR, MTBF, MTTF — The Four Incident Metrics, Compared

6 min read
SLO vs SLA vs SLI: What Each One Means and How to Set Them

8 min read
MTBF Full Form: Mean Time Between Failures — Meaning, Formula, and When It Matters

6 min read
Incident Severity Levels: Sev1–Sev4 with Triage Matrix

6 min read
Beyond DORA: A Five-Metric Framework for SRE Maturity in Regulated Enterprises

13 min read
Why We Restart to Fix It

7 min read
Why staging environments mislead and how to build reliable testing

3 min read
Google's Own Cleanup Job Crashed Cloud Services Across 4 Continents — and Then Made Recovery Worse

11 min read
The Silent-Success Trap: Your Monitoring Is Green and You Still Shipped Nothing

4 min read
Automatic Error Recovery in AI Agent Networks

2 min read
The Hidden Cost of Downtime: How SRE Error Budgets Protect National Economic Infrastructure

11 min read
Eleven silent-failure modes across 36 agent platforms, and the structural feature they share

5 min read
How we survived 218 network transitions with zero data loss: ALEF's self-healing architecture

2 min read
Grafana 'No Data' after migration: 7 reconcilers we had to kill first

8 min read