DEV Community

Yash

How to Design a DevOps Monitoring Strategy That Actually Works

Most monitoring strategies fail for the same reason: they alert on what is easy to measure, not what actually matters.

This is a guide to designing monitoring from first principles, starting with what your users experience, not what your tools can track.

The Wrong Way to Start

The wrong way: open CloudWatch, start creating alarms for every metric that exists. CPU utilization, memory usage, disk space, network IO, ECS task count.

Within a week you have 200 alarms. Within a month, 90% of them are firing regularly and being ignored. Your team has alert fatigue before they even have a mature product.

This is stage 2 of the monitoring maturity model most teams go through:

  • Stage 1: No monitoring, discover problems from user reports
  • Stage 2: Alert on everything, constant noise, low signal
  • Stage 3: Tune aggressively to reduce noise, miss real problems
  • Stage 4: Symptom-based monitoring from user experience, actually useful

Most teams get stuck at stage 3 and call it good enough.

Start With User Experience, Not Infrastructure Metrics

The right first question is not "what can we measure?" It is "what does degraded service look like for our users?"

For a REST API:

  • Requests are taking longer to complete
  • Requests are failing with error responses
  • Requests are being dropped without a response

For a background job processor:

  • Jobs are taking longer to complete than expected
  • Jobs are failing and not being retried
  • Queue depth is growing because jobs arrive faster than they are processed

These are the things that matter. Everything else is a potential cause, not an indicator of actual user impact.
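The queue-depth symptom is worth making concrete. A minimal sketch of the check (function name, rates, and limit are illustrative, not from any particular library):

```python
def minutes_until_queue_limit(depth, enqueue_rate, drain_rate, limit):
    """Estimate minutes until the queue hits its limit.

    Rates are in jobs per minute. Returns None when the queue is
    draining or holding steady, since it will never hit the limit.
    """
    growth = enqueue_rate - drain_rate
    if growth <= 0:
        return None
    return (limit - depth) / growth
```

Alerting on "time until saturated is under 30 minutes" captures the user-facing symptom directly, instead of alerting on a raw depth number that means different things at different traffic levels.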

The Four Signals Worth Measuring

Google SRE introduced the concept of the Four Golden Signals. After years of designing monitoring systems, I believe this is the right starting point.

Latency: how long does it take to service requests? Track both successful and failed request latency separately.

Traffic: how much demand is the system handling? Sudden drops in traffic are often more concerning than sudden increases. Drops can indicate a dependency is rejecting your requests before they even reach your service.

Errors: what is the rate of failed requests? Track both explicit failures such as 5xx responses and implicit failures such as successful responses with incorrect content.

Saturation: how full is your service? For CPU-bound services this is CPU utilization. For IO-bound services it is disk or network throughput. For stateful services it is queue depth or connection pool utilization.
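To make the four signals concrete, here is a minimal sketch that summarizes one window of request records into all four (the `Request` shape, the capacity-based saturation measure, and the naive index-based p99 are simplifying assumptions, not a production implementation):

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code

def golden_signals(requests, window_seconds, capacity_rps):
    """Summarize one window of traffic as the Four Golden Signals."""
    ok = sorted(r.latency_ms for r in requests if r.status < 500)
    failed = sorted(r.latency_ms for r in requests if r.status >= 500)
    traffic_rps = len(requests) / window_seconds
    p99 = lambda xs: xs[int(len(xs) * 0.99)] if xs else 0.0
    return {
        # Latency: successful and failed requests tracked separately,
        # because fast failures drag the combined numbers down.
        "latency_p99_ok_ms": p99(ok),
        "latency_p99_failed_ms": p99(failed),
        "traffic_rps": traffic_rps,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        # Saturation here is demand vs. known capacity; swap in CPU,
        # disk throughput, or queue depth for other service types.
        "saturation": traffic_rps / capacity_rps,
    }
```

In practice you would pull these from your metrics backend rather than compute them in-process, but the shape of the output is the point: four numbers per service, per window.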

Building Your Alert Hierarchy

Not all alerts are equal. Mixing critical and informational alerts in the same channel is how you get alert fatigue.

Tier 1, page immediately:

  • Service is completely unavailable
  • Error rate above 10% for more than 5 minutes
  • p99 latency above SLA threshold for more than 10 minutes

Tier 2, notify during business hours:

  • Error rate above 5% for more than 15 minutes
  • p99 latency trending upward
  • Disk usage above 80%
  • Queue depth above normal range

Tier 3, log for review with no notification:

  • Any metric crossing a soft threshold
  • Scheduled job completion
  • Deployment events

Most teams page on Tier 2 and 3 items. Then they train their engineers to ignore pages.
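The routing decision itself is simple enough to sketch. This function maps the tier thresholds above to a destination channel (channel names are placeholders; wire them to your actual paging and chat integrations):

```python
def route_alert(error_rate, error_minutes, p99_ms, p99_minutes, sla_p99_ms):
    """Route an alert using the tier thresholds above.

    Duration arguments are how long the condition has held, in minutes.
    """
    # Tier 1: page immediately.
    if error_rate > 0.10 and error_minutes > 5:
        return "page"
    if p99_ms > sla_p99_ms and p99_minutes > 10:
        return "page"
    # Tier 2: notify during business hours.
    if error_rate > 0.05 and error_minutes > 15:
        return "business-hours"
    # Tier 3: log for review, no notification.
    return "log-only"
```

The duration conditions matter as much as the thresholds: a 30-second error spike that self-heals should never page anyone.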

Runbooks for Every Alert

Every alert that can wake someone up should have a corresponding runbook. Not because engineers are not smart enough to figure it out, but because at 3am having a starting point saves critical minutes.

A good runbook contains:

  • What this alert means, in plain language
  • The most likely causes, ordered by probability
  • First diagnostic steps
  • The escalation path if the first steps do not resolve it
  • How to silence the alert while working the incident

Most teams write runbooks after an incident. Write them before. Your future on-call self will thank you.
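Concretely, a minimal runbook skeleton might look like this (the alert name, service, causes, and timings are illustrative):

```markdown
## Alert: HighErrorRate (api-service)

**What it means:** More than 10% of requests have failed for 5+ minutes.

**Likely causes (most probable first):**
1. Bad deploy in the last hour — check recent deployment events.
2. Downstream dependency failing — check dependency dashboards.
3. Traffic spike saturating the service — check traffic and saturation.

**First steps:**
- Confirm user impact on the latency/error dashboard.
- If a deploy correlates, roll back first and diagnose second.

**Escalation:** Page the service owner if not resolved in 15 minutes.

**Silencing:** Mute for the incident window only, never indefinitely.
```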

What I Am Building Into Step2Dev

Step2Dev will include opinionated monitoring templates based on service type, not blank-slate dashboards that require you to know what to monitor before you can monitor anything.

Deploy a new web service, get sensible Four Golden Signal monitoring configured automatically. Deploy a background job processor, get queue depth and job success rate monitoring.

Still configurable. But not starting from zero.

More at step2dev.com.

What is your monitoring setup like? Are you at stage 2, 3, or 4?
