ZNY

The Monitoring Stack We Actually Use in Production

Prometheus, Grafana, and three things nobody talks about until they break.

Our Stack

Prometheus for metrics. Grafana for dashboards. PagerDuty for alerts. Sentry for errors. Standard setup.

What Nobody Talks About

Alert fatigue. We had 200 alerts, and 180 of them were noise. Engineers started ignoring all of them, so when real problems came, nobody noticed for 4 hours.

Dashboard rot. Grafana dashboards go stale when nobody updates them. Engineers trust the numbers, but nobody checks whether the queries are still correct.

Correlation is hard. An alert fires. You spend 40 minutes correlating logs, metrics, and traces to find the real cause.

What We Changed

Cut alerts from 200 to 30. Alert only on symptoms, not causes. CPU at 90% is a cause: it may or may not be hurting anyone. An error rate spike is a symptom: users are already seeing failures.
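Here is a minimal Python sketch of what "alert on symptoms" can look like as a decision rule. It is not our actual Prometheus config. The names `WindowCounts` and `should_page`, the 5% threshold, and the 100-request floor are all assumptions picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts over one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_rate(window: WindowCounts) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if window.total_requests == 0:
        return 0.0
    return window.failed_requests / window.total_requests


def should_page(
    window: WindowCounts,
    threshold: float = 0.05,   # assumed: page above 5% errors
    min_requests: int = 100,   # assumed: skip windows with too little traffic
) -> bool:
    """Page on the symptom (users seeing errors), never on CPU or memory."""
    if window.total_requests < min_requests:
        return False
    return error_rate(window) > threshold
```

Note that CPU never appears as an input. High CPU can still go on a dashboard; it just does not wake anyone up unless the error rate moves too.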

Added runbook links to every alert. When it fires, you know exactly what to do.
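The runbook rule is easy to enforce with a small check in CI. The sketch below assumes alert rules are already loaded as Python dicts with `alert` and `annotations` keys, and that the link lives in a `runbook_url` annotation. That key name is a common convention, not something Prometheus requires, and `alerts_missing_runbook` is a hypothetical helper.

```python
from typing import Any


def alerts_missing_runbook(
    rules: list[dict[str, Any]],
    key: str = "runbook_url",  # assumed annotation name
) -> list[str]:
    """Return the names of alert rules with no usable runbook link."""
    missing = []
    for rule in rules:
        annotations = rule.get("annotations") or {}
        url = str(annotations.get(key, "")).strip()
        # Treat anything that is not an http(s) URL as missing.
        if not url.startswith(("http://", "https://")):
            missing.append(rule.get("alert", "<unnamed>"))
    return missing


# Hypothetical rules; the alert names and the URL are placeholders.
sample_rules = [
    {
        "alert": "HighErrorRate",
        "annotations": {"runbook_url": "https://example.com/runbooks/high-error-rate"},
    },
    {"alert": "QueueBacklog", "annotations": {"summary": "Queue is backing up"}},
]

print(alerts_missing_runbook(sample_rules))
```

Failing the build when the returned list is non-empty keeps new alerts from shipping without a runbook.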

Set up a weekly dashboard review. Dead dashboards get archived.
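To give the weekly review a starting list, something like the following Python sketch can flag candidates for archiving. It assumes you can export each dashboard's title and last-viewed date by some means; the `Dashboard` type, the `stale_dashboards` helper, and the 90-day cutoff are assumptions, not part of Grafana.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Dashboard:
    """One exported dashboard record (hypothetical shape)."""
    title: str
    last_viewed: date


def stale_dashboards(
    dashboards: list[Dashboard],
    today: date,
    max_age_days: int = 90,  # assumed cutoff for "dead"
) -> list[str]:
    """Return titles of dashboards nobody has opened within the cutoff."""
    return [
        d.title
        for d in dashboards
        if (today - d.last_viewed).days > max_age_days
    ]
```

A human still makes the call in the review. The list only tells you where to look first.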

The ROI

The on-call rotation went from miserable to manageable. Mean time to detect dropped from 45 minutes to 8 minutes. Mean time to resolve dropped from 3 hours to 45 minutes.
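If you want to track the same two numbers, here is a rough Python sketch of the arithmetic. Definitions vary between teams; this one assumes both times are measured from when the incident started. The `Incident` type and helper names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Timestamps for one incident (hypothetical shape)."""
    started: datetime
    detected: datetime
    resolved: datetime


def mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations and return the result in minutes."""
    if not deltas:
        raise ValueError("need at least one incident")
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


def mttd_minutes(incidents: list[Incident]) -> float:
    """Mean time to detect: start of incident to first detection."""
    return mean_minutes([i.detected - i.started for i in incidents])


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to resolve: start of incident to resolution (assumed)."""
    return mean_minutes([i.resolved - i.started for i in incidents])
```

Feed it incident records from your pager or postmortem tracker and compare the averages before and after an alerting change.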


What is your monitoring setup?
