DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

How we cut alert noise 80% with semantic correlation (and a little LLM RCA)

3
Comments
4 min read
Kiln Crisis Management: Controlling Irregular Raw Meal in CCR Using Python

Comments
3 min read
The Hidden Cost of Reactive AIOps: Why Auto-Remediation Without Memory Fails

3
Comments
9 min read
The logs said everything was fine.

Comments
1 min read
A2A + MCP in Production: The SRE Reliability Framework Nobody Has Written Yet

Comments
8 min read
The Actual Cost of Self-Hosting Your LLM (Nobody Does This Math First)

Comments
4 min read
Incident Communication: The Status Page That Builds Trust

Comments
3 min read
OCI Run Command Advanced Guide: Remote Execution, Object Storage Scripts, and Production Troubleshooting

Comments
4 min read
Load Testing in Production: How We Do It Safely

Comments
3 min read
DORA metrics are a CFO tool, not a dev tool

Comments
2 min read
Delete 40% of your dashboards

Comments
2 min read
Your Datadog bill is 60% DEBUG logs

Comments
2 min read
Effective On-Call Rotations: Lessons From Building Fair Schedules

Comments
3 min read
Why Most Internal Developer Platforms Fail (And What To Do About It)

Comments 1
2 min read
Agents, context, and guardrails on a unified platform

2
Comments
3 min read