If you've been in this industry long enough, you've sat through a post-mortem where someone says "we should have had monitoring for that". Maybe it was an endpoint that went down and nobody knew until a customer reported it. Maybe it was an escalation policy that pointed to someone who left the company six months ago.
The annoying part is that these aren't hard problems. They're just invisible ones. Nobody wakes up and thinks "I should check if our PagerDuty escalation policies all have valid responders". You set things up, they work for a while, and then stuff drifts.
So here's what you should actually check when you audit a monitoring setup. Not the theoretical "you should have observability" stuff - the specific, concrete things that catch fire when you don't look at them.
Escalation policies with nobody home
This one bites more teams than you'd think. Someone leaves, their PagerDuty schedule doesn't get updated, and now there's a gap every third Tuesday where alerts route to someone who's no longer there. The alert fires, gets acknowledged by nobody, and times out.
What to check:
Do all escalation policies have at least two levels?
Are there schedules with gaps (unassigned time blocks)?
Is there a catch-all policy for services that don't match anything specific?
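You can automate the first two checks by pulling your policies and flagging the thin ones. Here's a minimal sketch - the dict shape loosely mirrors PagerDuty's v2 API response (`escalation_rules`, `targets`), but treat the field names as assumptions and adjust to whatever your export actually returns:

```python
# Flag escalation policies that can silently drop alerts.
# Field names mimic PagerDuty's v2 API shape; adjust as needed.

def audit_escalation_policies(policies):
    """Return (policy name, problem) pairs for thin or empty policies."""
    findings = []
    for policy in policies:
        rules = policy.get("escalation_rules", [])
        if len(rules) < 2:
            findings.append((policy["name"], "fewer than two escalation levels"))
        if any(not rule.get("targets") for rule in rules):
            findings.append((policy["name"], "level with no responders"))
    return findings

policies = [
    {"name": "payments", "escalation_rules": [
        {"targets": [{"id": "P1"}]},
        {"targets": [{"id": "P2"}]},
    ]},
    {"name": "webhooks", "escalation_rules": [
        {"targets": []},  # responder left; nobody re-assigned the level
    ]},
]

print(audit_escalation_policies(policies))
# -> [('webhooks', 'fewer than two escalation levels'), ('webhooks', 'level with no responders')]
```

Run it on a nightly export and you'll catch the "every third Tuesday" gaps before they eat an alert.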
Alert rules that don't actually notify anyone
This is shockingly common in Datadog and Grafana setups. Someone creates a monitor, sets the threshold, and forgets to add a notification channel. Or they add a chat channel (e.g., Slack) that got archived. The monitor dutifully fires, nobody sees it, and you find out about the outage from your users.
Look for:
Monitors with no notification targets
Notification channels that point to archived Slack channels or deleted email groups
Monitors in a "no data" state for more than a few days - they're probably broken.
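In Datadog, notifications are @-handles embedded in the monitor message, so a quick pass over your monitor export catches the first two items. A sketch, assuming you've already fetched the monitors as dicts (the channel names here are made up):

```python
import re

# Datadog-style monitors notify whoever is @-mentioned in the message
# (e.g. "@slack-ops", "@pagerduty-payments"). A monitor whose message
# has no handle fires into the void.

HANDLE = re.compile(r"@[\w\-./]+")

def monitors_without_notifications(monitors, dead_channels=()):
    """Split monitors into (no handle at all, handle points at a dead channel)."""
    silent, dead = [], []
    for m in monitors:
        handles = HANDLE.findall(m.get("message", ""))
        if not handles:
            silent.append(m["name"])
        elif any(h in dead_channels for h in handles):
            dead.append(m["name"])
    return silent, dead

monitors = [
    {"name": "api p99 latency", "message": "High latency! @slack-ops"},
    {"name": "queue depth", "message": "Queue is backing up."},
    {"name": "disk usage", "message": "Disk filling @slack-old-infra"},
]

silent, dead = monitors_without_notifications(
    monitors, dead_channels={"@slack-old-infra"})
print(silent)  # -> ['queue depth']
print(dead)    # -> ['disk usage']
```

The `dead_channels` set has to come from somewhere - a list of archived Slack channels is a good start.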
Dashboards with empty panels
Sounds trivial, but dashboards are often the first thing people open during an incident. If half the panels show "no data" or reference metrics that no longer exist, you're flying blind when it matters most. I've seen dashboards that looked great when they were built, but the underlying metric names changed during a migration and nobody updated them.
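One way to catch the renamed-metric problem is to cross-check panel queries against the metric names that actually exist right now. A rough sketch with a simplified dashboard shape - the panel structure and metric names are invented for the example:

```python
# Flag dashboard panels whose query references no currently-live metric.
# "panels" mimics a simplified dashboard export.

def stale_panels(panels, live_metrics):
    """Return titles of panels whose query matches none of the live metrics."""
    return [p["title"] for p in panels
            if not any(m in p["query"] for m in live_metrics)]

live_metrics = {"http.request.duration", "http.request.errors"}
panels = [
    {"title": "p95 latency", "query": "avg:http.request.duration{*}"},
    {"title": "checkout errors", "query": "sum:legacy.checkout.errors{*}"},
]
print(stale_panels(panels, live_metrics))  # -> ['checkout errors']
```

Substring matching is crude, but it's enough to surface panels that went dark after a metric rename.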
The coverage gaps nobody thinks about
Endpoints with no monitoring
Most teams have monitoring on their main API endpoints. But what about that internal service that handles webhooks? The admin panel that processes bulk operations? The cron job that reconciles billing data?
The pattern is always the same: critical business logic gets deployed, someone says "we'll add monitoring later," and later never comes. Then that service goes down on a Friday evening.
To actually find these, you need to compare what's in your codebase against what's in your monitoring tools. If your repo has 40 HTTP endpoints and your Datadog has monitors for 12 of them, those other 28 are blind spots.
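The comparison itself is just a set difference. Here's a sketch with the route list hardcoded - in practice you'd extract it from your framework (e.g. Flask's `app.url_map`) or an OpenAPI spec, and the monitored list from your monitoring tool's API:

```python
# Diff the routes your code declares against the endpoints your
# monitors cover. Route names here are illustrative.

def coverage_gap(declared_routes, monitored_routes):
    """Return (unmonitored routes, coverage percentage)."""
    missing = sorted(set(declared_routes) - set(monitored_routes))
    pct = 100 * (1 - len(missing) / len(declared_routes))
    return missing, round(pct)

declared = ["/api/orders", "/api/webhooks", "/admin/bulk", "/api/users"]
monitored = ["/api/orders", "/api/users"]

missing, pct = coverage_gap(declared, monitored)
print(missing)  # -> ['/admin/bulk', '/api/webhooks']
print(pct)      # -> 50
```

The hard part is the extraction, not the diff - but even a hand-maintained route list beats not knowing.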
Database monitoring that stops at "is it up?"
Checking that Postgres is responding to pings is table stakes. What about slow queries? Connection pool exhaustion? Replication lag? Disk space trending toward full?
The database is usually the last thing to go down and the hardest thing to recover from. "The database is up" and "the database is healthy" are very different statements.
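For Postgres, most of these signals are one query away. A sketch of the checks as SQL-plus-threshold pairs - the SQL uses standard system views (`pg_stat_statements` needs the extension enabled, and `mean_exec_time` is the PG 13+ column name), while the threshold numbers are placeholders you'd tune to your workload:

```python
# Postgres health checks beyond "is it up". Thresholds are placeholders.

CHECKS = {
    "connection_saturation": (
        "SELECT count(*)::float / current_setting('max_connections')::int "
        "FROM pg_stat_activity",
        lambda v: v > 0.8,   # >80% of the connection limit in use
    ),
    "replication_lag_seconds": (
        "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())",
        lambda v: v > 30,    # replica more than 30s behind
    ),
    "slowest_mean_query_ms": (
        "SELECT max(mean_exec_time) FROM pg_stat_statements",
        lambda v: v > 500,   # any query averaging >500ms
    ),
}

def evaluate(results):
    """results: {check_name: value} from running the SQL above."""
    return [name for name, (_, breached) in CHECKS.items()
            if name in results and breached(results[name])]

# Sample values standing in for real query results:
print(evaluate({"connection_saturation": 0.92,
                "replication_lag_seconds": 4.0}))
# -> ['connection_saturation']
```

Wire the query execution up with whatever driver you already use; the point is that "healthy" is a list of thresholds, not a ping.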
Error tracking without thresholds
If you're using Sentry or a similar tool, you're probably capturing errors. But are you alerting on them? A lot of teams have Sentry collecting thousands of events with no alerting rules, so they only check it reactively after something breaks. Set up thresholds - if error rates spike 5x above baseline, that's worth a ping.
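The "5x above baseline" rule is a few lines of code once you have per-interval error counts. A minimal sketch, with a trailing average as the baseline and a floor so quiet services don't alert on a single error:

```python
# Spike detection: flag when the current interval's error count is
# `factor`x the trailing average. All numbers are illustrative.

def error_spike(history, current, factor=5, min_baseline=1.0):
    """history: recent per-interval error counts; current: this interval's."""
    baseline = max(sum(history) / len(history), min_baseline)
    return current >= factor * baseline

print(error_spike(history=[12, 9, 14, 11], current=13))  # -> False
print(error_spike(history=[12, 9, 14, 11], current=70))  # -> True
```

Sentry's own alert rules can express this too - the point is that *some* threshold exists, rather than the dashboard being checked only after things break.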
How to think about coverage
I find it useful to think about monitoring coverage across dimensions rather than as a single metric. A team might have great alert routing but zero database monitoring. If you just look at "do we have monitors?" the answer is yes, but you're still exposed.
The dimensions you should check:
Endpoint monitoring - Are your HTTP endpoints covered? Latency, error rate, availability.
Database monitoring - Connection pools, query performance, replication, disk.
Error tracking - Are errors being captured AND alerted on?
Alert quality - Do alerts have proper thresholds, or are they too noisy/too quiet?
Escalation routing - Does every alert have a path to a human?
Dashboard health - Are dashboards up to date and functional?
Business flows - Are the critical user journeys (checkout, signup, payment) monitored end-to-end?
Infrastructure - CPU, memory, disk, network at the host/container level.
Scoring each dimension separately gives you a much clearer picture than a single "monitoring health" number. You might be at 90% on alerts but 20% on database monitoring - and that 20% is where the next outage is hiding.
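In practice this is just counting checks passed per dimension and refusing to blend the numbers. A sketch, with the dimension names taken from the list above and the counts made up:

```python
# Score each dimension separately instead of one blended number.
# The pass/total counts are illustrative.

def dimension_scores(checks):
    """checks: {dimension: (passed, total)} -> {dimension: percent}"""
    return {dim: round(100 * passed / total)
            for dim, (passed, total) in checks.items()}

scores = dimension_scores({
    "alert_quality":       (18, 20),
    "database_monitoring": (2, 10),
    "endpoint_monitoring": (12, 40),
})
print(scores)
# -> {'alert_quality': 90, 'database_monitoring': 20, 'endpoint_monitoring': 30}
```

A blended average of those three would read as a comfortable-looking 47% - the per-dimension view is what points at the 20%.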
The multi-tool problem
Here's the thing most "monitoring best practices" articles miss: almost nobody uses a single tool. The average team has 3-5 monitoring tools. PagerDuty for alerting, Datadog for metrics, Sentry for errors, Grafana for dashboards, maybe New Relic for APM.
Each tool has its own configuration, its own gaps, and its own way of doing things. An audit that only looks at Datadog misses the PagerDuty escalation gaps. An audit that only looks at PagerDuty misses the Datadog monitors with no notifications or proper thresholds.
The real blind spots live in the spaces between tools. Service A is monitored in Datadog but its alerts route through a PagerDuty policy that hasn't been updated. Service B has a Grafana dashboard but no actual alerting. Service C is in Sentry but nobody set up alert rules.
To actually audit your stack, you need to look across all of them at once.
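Concretely, that means joining per-tool inventories on service name and looking at which tools each service is missing from. A sketch - tool and service names are illustrative, and the real work is normalizing service names across tools so the join lines up:

```python
# Join per-tool inventories to find the between-tool gaps: services
# that exist in one tool but are invisible to another.

def cross_tool_gaps(services, coverage):
    """coverage: {tool: set of services that tool knows about}."""
    gaps = {}
    for svc in services:
        missing = [tool for tool, known in coverage.items() if svc not in known]
        if missing:
            gaps[svc] = missing
    return gaps

services = ["checkout", "webhooks", "billing"]
coverage = {
    "datadog":   {"checkout", "billing"},
    "pagerduty": {"checkout"},
    "sentry":    {"checkout", "webhooks"},
}
print(cross_tool_gaps(services, coverage))
# -> {'webhooks': ['datadog', 'pagerduty'], 'billing': ['pagerduty', 'sentry']}
```

That output is exactly the Service A/B/C story above: each service looks covered in one tool and is a blind spot in the others.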
Making it repeatable
The worst part about monitoring audits is that they rot. You can do a thorough audit today, fix everything, and in three months new services have been deployed without monitoring, someone changed a PagerDuty schedule, and a Slack channel got archived. You're back where you started.
The fix is to make auditing a regular thing - not a quarterly project, but something that runs continuously or at least on every deploy. This is why I built Cova - it connects to your monitoring tools, runs the audit automatically, and shows you exactly where the gaps are. It also scans PRs to catch new endpoints shipping without monitoring, so the drift doesn't happen in the first place.
If you want to see what this looks like on a real setup, try the demo on the landing page - no signup needed. It runs through a sample monitoring stack and shows the kind of findings a typical audit surfaces.
But even without a tool, the checklist above will get you pretty far. Pick one dimension, check it this week. Pick another one next week. You'll be surprised what you find.
The goal isn't perfection - it's not finding out about your blind spots from an angry customer at 3am.