<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: paulg7516</title>
    <description>The latest articles on DEV Community by paulg7516 (@paulg7516).</description>
    <link>https://dev.to/paulg7516</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3825974%2F96950dd4-8e8b-4f4f-a093-49c631fb9c89.png</url>
      <title>DEV Community: paulg7516</title>
      <link>https://dev.to/paulg7516</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/paulg7516"/>
    <language>en</language>
    <item>
      <title>How to Audit Your Monitoring Stack (Before the Next Incident Does It for You)</title>
      <dc:creator>paulg7516</dc:creator>
      <pubDate>Thu, 26 Mar 2026 00:15:09 +0000</pubDate>
      <link>https://dev.to/paulg7516/how-to-audit-your-monitoring-stack-before-the-next-incident-does-it-for-you-fkp</link>
      <guid>https://dev.to/paulg7516/how-to-audit-your-monitoring-stack-before-the-next-incident-does-it-for-you-fkp</guid>
      <description>&lt;p&gt;If you've been in this industry long enough, you've sat through a post-mortem where someone says "we should have had monitoring for that". Maybe it was an endpoint that went down and nobody knew until a customer reported it. Maybe it was an escalation policy that pointed to someone who left the company six months ago.&lt;/p&gt;

&lt;p&gt;The annoying part is that these aren't hard problems. They're just invisible ones. Nobody wakes up and thinks "I should check if our PagerDuty escalation policies all have valid responders". You set things up, they work for a while, and then stuff drifts.&lt;/p&gt;

&lt;p&gt;So here's what you should actually check when you audit a monitoring setup. Not the theoretical "you should have observability" stuff - the specific, concrete things that catch on fire when you don't look at them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Escalation policies with nobody home&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This one bites more teams than you'd think. Someone leaves, their PagerDuty schedule doesn't get updated, and now there's a gap every third Tuesday where alerts go to a person who no longer exists. The alert fires, gets acknowledged by nobody, and times out.&lt;/p&gt;

&lt;p&gt;What to check:&lt;/p&gt;

&lt;p&gt;Do all escalation policies have at least two levels? (this one is easy to script - see the sketch below)&lt;br&gt;
Are there schedules with gaps (unassigned time blocks)?&lt;br&gt;
Is there a catch-all policy for services that don't match anything specific?&lt;/p&gt;
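
&lt;p&gt;If you want to script the first check, PagerDuty's REST API exposes escalation policies directly. A minimal sketch in Python, assuming a read-only API token in a PD_TOKEN environment variable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: flag PagerDuty escalation policies with a single level, or with
# rules that route to nobody. Assumes a read-only token in PD_TOKEN.
import os
import requests

headers = {
    "Authorization": f"Token token={os.environ['PD_TOKEN']}",
    "Accept": "application/vnd.pagerduty+json;version=2",
}

resp = requests.get(
    "https://api.pagerduty.com/escalation_policies",
    headers=headers,
    params={"limit": 100},
)
resp.raise_for_status()

for policy in resp.json()["escalation_policies"]:
    rules = policy.get("escalation_rules", [])
    # A single-level policy has no fallback if the first responder misses the page.
    if len(rules) in (0, 1):
        print(f"Only one escalation level: {policy['summary']}")
    # Rules with an empty targets list route the page to nobody.
    for rule in rules:
        if not rule.get("targets"):
            print(f"Escalation rule with no targets in: {policy['summary']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Schedule gaps take a separate pass over the schedules endpoint, but the pattern is the same: pull the config, assert the invariant, print whatever fails.&lt;/p&gt;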

&lt;p&gt;&lt;em&gt;Alert rules that don't actually notify anyone&lt;/em&gt;&lt;br&gt;
This is shockingly common in Datadog and Grafana setups. Someone creates a monitor, sets the threshold, and forgets to add a notification channel. Or they add a chat channel (e.g. Slack) that got archived. The monitor dutifully fires, nobody sees it, and you find out about the outage from your users.&lt;/p&gt;

&lt;p&gt;Look for:&lt;/p&gt;

&lt;p&gt;Monitors with no notification targets&lt;br&gt;
Notification channels that point to archived Slack channels or deleted email groups&lt;br&gt;
Monitors in a "no data" state for more than a few days - they're probably broken. (the first and third are scriptable - see the sketch below)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Dashboards with empty panels&lt;/em&gt;&lt;br&gt;
Sounds trivial, but dashboards are often the first thing people open during an incident. If half the panels show "no data" or reference metrics that no longer exist, you're flying blind when it matters most. I've seen dashboards that looked great when they were built, but the underlying metric names changed during a migration and nobody updated them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The coverage gaps nobody thinks about&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Endpoints with no monitoring&lt;/em&gt;&lt;br&gt;
Most teams have monitoring on their main API endpoints. But what about that internal service that handles webhooks? The admin panel that processes bulk operations? The cron job that reconciles billing data?&lt;/p&gt;

&lt;p&gt;The pattern is always the same: critical business logic gets deployed, someone says "we'll add monitoring later" and later never comes. Then that service goes down on a Friday evening.&lt;/p&gt;

&lt;p&gt;To actually find these, you need to compare what's in your codebase against what's in your monitoring tools. If your repo has 40 HTTP endpoints and your Datadog has monitors for 12 of them, those other 28 are blind spots.&lt;/p&gt;
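
&lt;p&gt;A crude way to build that list is to diff the routes declared in the repo against whatever paths your monitoring already covers. A sketch for decorator-style frameworks like Flask or FastAPI; the monitored_paths set is a placeholder for whatever export your monitoring tool gives you:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: diff HTTP routes declared in the codebase against monitored paths.
# ROUTE_PATTERN matches decorator-style routes (Flask/FastAPI); adjust it for
# your framework. monitored_paths is a hypothetical placeholder.
import pathlib
import re

ROUTE_PATTERN = re.compile(r"""@\w+\.(?:route|get|post|put|delete)\(\s*["']([^"']+)""")

def declared_routes(repo_root="."):
    routes = set()
    for path in pathlib.Path(repo_root).rglob("*.py"):
        routes.update(ROUTE_PATTERN.findall(path.read_text(errors="ignore")))
    return routes

# Placeholder: the paths you already have synthetic checks or monitors for.
monitored_paths = {"/api/orders", "/api/users", "/healthz"}

for route in sorted(declared_routes() - monitored_paths):
    print(f"No monitor found for endpoint: {route}")
&lt;/code&gt;&lt;/pre&gt;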

&lt;p&gt;&lt;em&gt;Database monitoring that stops at "is it up?"&lt;/em&gt;&lt;br&gt;
Checking that Postgres is responding to pings is table stakes. What about slow queries? Connection pool exhaustion? Replication lag? Disk space trending toward full?&lt;/p&gt;

&lt;p&gt;The database is usually the last thing to go down and the hardest thing to recover from. "The database is up" and "the database is healthy" are very different statements.&lt;/p&gt;
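
&lt;p&gt;If you want a starting point for the "healthy, not just up" checks, most of them are one query away in Postgres. A sketch using psycopg2, assuming a connection string in PG_DSN; the thresholds are purely illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: a few Postgres health checks beyond "is it up?".
# Assumes psycopg2 and a DSN in PG_DSN; thresholds are illustrative.
import os
import psycopg2

conn = psycopg2.connect(os.environ["PG_DSN"])
cur = conn.cursor()

# Connection pool headroom: current sessions vs. max_connections.
cur.execute("SELECT count(*) FROM pg_stat_activity")
used = cur.fetchone()[0]
cur.execute("SHOW max_connections")
limit = int(cur.fetchone()[0])
if used / limit &amp;gt; 0.8:
    print(f"Connection pool at {used}/{limit}")

# Replication lag in seconds (meaningful on a replica; NULL on the primary).
cur.execute("SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))")
lag = cur.fetchone()[0]
if lag is not None and lag &amp;gt; 60:
    print(f"Replication lag: {lag:.0f}s")

# Queries that have been running for more than 30 seconds.
cur.execute(
    "SELECT count(*) FROM pg_stat_activity "
    "WHERE state = 'active' AND now() - query_start &amp;gt; interval '30 seconds'"
)
print(f"Long-running queries in flight: {cur.fetchone()[0]}")
&lt;/code&gt;&lt;/pre&gt;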

&lt;p&gt;&lt;em&gt;Error tracking without thresholds&lt;/em&gt;&lt;br&gt;
If you're using Sentry or a similar tool, you're probably capturing errors. But are you alerting on them? A lot of teams have Sentry collecting thousands of events with no alerting rules, so they only check it reactively after something breaks. Set up thresholds - if error rates spike 5x above baseline, that's worth a ping.&lt;/p&gt;
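
&lt;p&gt;You can script the "collecting but not alerting" part: Sentry's API lists the issue alert rules per project, so you can find projects that capture errors but alert on nothing. A sketch, assuming an auth token in SENTRY_TOKEN and your org slug in SENTRY_ORG:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: Sentry projects that capture errors but have no alert rules.
# Assumes SENTRY_TOKEN (auth token) and SENTRY_ORG (org slug) env vars.
import os
import requests

org = os.environ["SENTRY_ORG"]
headers = {"Authorization": f"Bearer {os.environ['SENTRY_TOKEN']}"}

projects = requests.get(
    f"https://sentry.io/api/0/organizations/{org}/projects/", headers=headers
).json()

for project in projects:
    rules = requests.get(
        f"https://sentry.io/api/0/projects/{org}/{project['slug']}/rules/",
        headers=headers,
    ).json()
    if not rules:
        print(f"Capturing errors, alerting on nothing: {project['slug']}")
&lt;/code&gt;&lt;/pre&gt;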

&lt;p&gt;&lt;strong&gt;How to think about coverage&lt;/strong&gt;&lt;br&gt;
I find it useful to think about monitoring coverage across dimensions rather than as a single metric. A team might have great alert routing but zero database monitoring. If you just look at "do we have monitors?" the answer is yes, but you're still exposed.&lt;/p&gt;

&lt;p&gt;The dimensions you should check:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Endpoint monitoring&lt;/strong&gt; - Are your HTTP endpoints covered? Latency, error rate, availability.&lt;br&gt;
&lt;strong&gt;Database monitoring&lt;/strong&gt; - Connection pools, query performance, replication, disk.&lt;br&gt;
&lt;strong&gt;Error tracking&lt;/strong&gt; - Are errors being captured AND alerted on?&lt;br&gt;
&lt;strong&gt;Alert quality&lt;/strong&gt; - Do alerts have proper thresholds, or are they too noisy/too quiet?&lt;br&gt;
&lt;strong&gt;Escalation routing&lt;/strong&gt; - Does every alert have a path to a human?&lt;br&gt;
&lt;strong&gt;Dashboard health&lt;/strong&gt; - Are dashboards up to date and functional?&lt;br&gt;
&lt;strong&gt;Business flows&lt;/strong&gt; - Are the critical user journeys (checkout, signup, payment) monitored end-to-end?&lt;br&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt; - CPU, memory, disk, network at the host/container level.&lt;/p&gt;

&lt;p&gt;Scoring each dimension separately gives you a much clearer picture than a single "monitoring health" number. You might be at 90% on alerts but 20% on database monitoring - and that 20% is where the next outage is hiding.&lt;/p&gt;
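
&lt;p&gt;The scoring itself doesn't have to be fancy - "checks passed over checks run" per dimension is enough to show where the weak spots are. A sketch of the shape, with made-up numbers:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: per-dimension coverage scores instead of one blended number.
# The dimensions and counts here are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str
    passed: int  # checks that passed
    total: int   # checks run

    @property
    def score(self):
        return round(100 * self.passed / self.total) if self.total else 0

report = [
    Dimension("Endpoint monitoring", passed=12, total=40),
    Dimension("Database monitoring", passed=1, total=5),
    Dimension("Escalation routing", passed=9, total=10),
]

# Sort worst-first so the next outage's likely hiding place is at the top.
for dim in sorted(report, key=lambda d: d.score):
    print(f"{dim.name}: {dim.score}% ({dim.passed}/{dim.total})")
&lt;/code&gt;&lt;/pre&gt;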

&lt;p&gt;&lt;strong&gt;The multi-tool problem&lt;/strong&gt;&lt;br&gt;
Here's the thing most "monitoring best practices" articles miss: almost nobody uses a single tool. The average team has 3-5 monitoring tools. PagerDuty for alerting, Datadog for metrics, Sentry for errors, Grafana for dashboards, maybe New Relic for APM.&lt;/p&gt;

&lt;p&gt;Each tool has its own configuration, its own gaps, and its own way of doing things. An audit that only looks at Datadog misses the PagerDuty escalation gaps. An audit that only looks at PagerDuty misses the Datadog monitors with no notifications or proper thresholds.&lt;/p&gt;

&lt;p&gt;The real blind spots live in the spaces between tools. Service A is monitored in Datadog but its alerts route through a PagerDuty policy that hasn't been updated. Service B has a Grafana dashboard but no actual alerting. Service C is in Sentry but nobody set up alert rules.&lt;/p&gt;

&lt;p&gt;To actually audit your stack, you need to look across all of them at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Making it repeatable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The worst part about monitoring audits is that they rot. You can do a thorough audit today, fix everything, and in three months new services have been deployed without monitoring, someone changed a PagerDuty schedule, and a Slack channel got archived. You're back where you started.&lt;/p&gt;

&lt;p&gt;The fix is to make auditing a regular thing - not a quarterly project, but something that runs continuously or at least on every deploy. This is why I built &lt;a href="https://getcova.ai/" rel="noopener noreferrer"&gt;Cova&lt;/a&gt; - it connects to your monitoring tools, runs the audit automatically, and shows you exactly where the gaps are. It also scans PRs to catch new endpoints shipping without monitoring, so the drift doesn't happen in the first place.&lt;/p&gt;

&lt;p&gt;If you want to see what this looks like on a real setup, &lt;strong&gt;try the demo&lt;/strong&gt; on the landing page - no signup needed. It runs through a sample monitoring stack and shows the kind of findings a typical audit surfaces.&lt;/p&gt;

&lt;p&gt;But even without a tool, the checklist above will get you pretty far. Pick one dimension, check it this week. Pick another one next week. You'll be surprised what you find.&lt;/p&gt;

&lt;p&gt;The goal isn't perfection - it's not finding out about your blind spots from an angry customer at 3am.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>ai</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>I got tired of monitoring blind spots, so I built something to find them</title>
      <dc:creator>paulg7516</dc:creator>
      <pubDate>Sun, 15 Mar 2026 23:04:46 +0000</pubDate>
      <link>https://dev.to/paulg7516/we-audit-our-code-regularly-why-dont-we-audit-our-monitoring-lfh</link>
      <guid>https://dev.to/paulg7516/we-audit-our-code-regularly-why-dont-we-audit-our-monitoring-lfh</guid>
      <description>&lt;p&gt;I've been thinking about this for a while. We have automated checks for code quality, security, and test coverage, but for monitoring we just hope it's fine.&lt;/p&gt;

&lt;p&gt;I've spent years on the other side of this - incident commander, on P1 calls, running RCAs/CAPAs, etc. The pattern was always the same: something breaks, users report it before we even know, and during the post-incident review we realize that proper monitoring would have caught it earlier. We were always reactive - hearing about problems from affected users instead of catching them ourselves. After seeing that happen enough times, I wanted a way to proactively find those gaps before they turn into incidents.&lt;/p&gt;

&lt;p&gt;So I decided to build something. I started working on a tool that connects to your monitoring stack (PagerDuty, Datadog, Grafana, Sentry, New Relic, etc.) and runs a gap analysis. Not checking "are your services up" but rather "do your services actually have alerts configured, and are those alerts going somewhere useful."&lt;/p&gt;

&lt;p&gt;The system pulls configs through each tool's API, then checks for stuff like services with no escalation policy, alert rules with no notification channel attached, monitors that haven't received data in 30+ days, and scheduled searches with alerting disabled. Each issue gets a severity (critical/warning/info) and a concrete fix suggestion generated by AI. It also scores your setup across coverage dimensions - alert coverage, notification routing, dashboard health - so you can see where the biggest gaps are in a single pane of glass.&lt;/p&gt;

&lt;p&gt;I added AI on top to generate prioritized recommendations, and an "Incident Autopilot" that pulls live data from the connected tools when something goes wrong. If you describe a symptom like "checkout is slow" it maps the blast radius across services, identifies who's on call, and builds an investigation playbook. &lt;/p&gt;
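
&lt;p&gt;Some of that is genuinely hard (the blast-radius mapping), but some pieces are plain API plumbing - for example, "who's on call right now" is a single PagerDuty call. A sketch, assuming a read-only token in PD_TOKEN:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: list who is currently on call, per escalation policy.
# Assumes a read-only PagerDuty token in PD_TOKEN.
import os
import requests

resp = requests.get(
    "https://api.pagerduty.com/oncalls",
    headers={"Authorization": f"Token token={os.environ['PD_TOKEN']}"},
)
resp.raise_for_status()

for oncall in resp.json()["oncalls"]:
    policy = oncall["escalation_policy"]["summary"]
    user = oncall["user"]["summary"]
    print(f"{policy}: {user} (level {oncall['escalation_level']})")
&lt;/code&gt;&lt;/pre&gt;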

&lt;p&gt;As I kept working on this system, more use cases came to mind, so the latest thing I added is a PR/MR scanner. It integrates with GitHub/GitLab webhooks, and when someone opens a PR that adds a new API endpoint or database connection, it flags it and suggests what monitors should be added before merging.&lt;/p&gt;
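
&lt;p&gt;The core of that check is conceptually simple: scan the added lines of the diff for new route definitions. A rough sketch of the idea; fetch_pr_diff and post_review_comment are hypothetical helpers, not a real webhook framework:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: flag PR diffs that add HTTP routes so a reviewer can ask for
# monitors before merge. fetch_pr_diff / post_review_comment are
# hypothetical helpers; ROUTE_ADDED matches decorator-style routes.
import re

ROUTE_ADDED = re.compile(
    r"""^\+\s*@\w+\.(?:route|get|post|put|delete)\(\s*["']([^"']+)""",
    re.MULTILINE,
)

def new_endpoints_in_diff(diff_text):
    """Return endpoint paths introduced by added lines in a unified diff."""
    return ROUTE_ADDED.findall(diff_text)

def handle_pull_request(event):
    diff_text = fetch_pr_diff(event["pull_request"]["url"])  # hypothetical
    for path in new_endpoints_in_diff(diff_text):
        post_review_comment(  # hypothetical
            event,
            f"New endpoint {path} has no monitor yet - consider adding "
            "latency and error-rate alerts before merging.",
        )
&lt;/code&gt;&lt;/pre&gt;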

&lt;p&gt;The part I'm stuck on: it works (there's a live demo you can try without creating an account), but I'm not sure if I'm solving a real problem or just my own problem.&lt;/p&gt;

&lt;p&gt;Some questions I keep going back and forth on:&lt;/p&gt;

&lt;p&gt;Is this actually painful enough? Most teams I've talked to know their monitoring has gaps. But is it painful enough that they'd connect a third-party tool to audit it? Or do they just accept the risk?&lt;/p&gt;

&lt;p&gt;A few people have told me "I could write a script for this or for that." And yeah, you could write a script that checks PagerDuty escalation policies. But would you also write one for Datadog monitors, Grafana alert rules, Sentry project configs, and keep them all updated? At some point it's not a script anymore.&lt;/p&gt;

&lt;p&gt;This system requires read-only API tokens to your monitoring tools. I get why that makes people nervous. The tokens are encrypted at rest and never stored in plaintext, but the trust barrier is real.&lt;/p&gt;

&lt;p&gt;If you want to poke at it, the demo is at &lt;a href="https://getcova.ai" rel="noopener noreferrer"&gt;https://getcova.ai&lt;/a&gt; - click "Enter Demo" and explore with synthetic data, no signup needed.&lt;/p&gt;

&lt;p&gt;Would genuinely love to hear from anyone who's dealt with monitoring coverage gaps on their team. How do you handle it today? Is it just tribal knowledge and hope?&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>ai</category>
      <category>devops</category>
      <category>python</category>
    </item>
  </channel>
</rss>
