I've been thinking about this for a while. We have automated checks for code quality, security, and test coverage, but for monitoring we just hope it's fine.
I've spent years on the other side of this: incident commander on P1 calls, running RCAs and CAPAs. The pattern was always the same: something breaks, users report it before we even know, and during the post-incident review we realize that proper monitoring would have caught it earlier. We were always reactive, hearing about problems from affected users instead of catching them ourselves. After seeing that happen enough times, I wanted a way to find those gaps proactively, before they turn into incidents.
So I decided to build something. I started working on a tool that connects to your monitoring stack (PagerDuty, Datadog, Grafana, Sentry, New Relic, etc.) and runs a gap analysis. Not "are your services up," but "do your services actually have alerts configured, and are those alerts going somewhere useful?"
The system pulls configs through each tool's API, then checks for things like services with no escalation policy, alert rules with no notification channel attached, monitors that haven't received data in 30+ days, and scheduled searches with alerting disabled. Each issue gets a severity (critical/warning/info) and a concrete fix suggestion generated by AI. It also scores your setup across coverage dimensions (alert coverage, notification routing, dashboard health) so you can see where the biggest gaps are in a single pane of glass.
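To make the shape of these checks concrete, here's a minimal sketch of one rule: flagging PagerDuty services that have no escalation policy attached. This is illustrative, not the tool's actual code; it assumes a read-only PagerDuty REST API token and omits pagination and error handling:

```python
import json
import urllib.request

def find_escalation_gaps(services):
    """Flag services that lack an escalation policy (severity: critical)."""
    return [
        {
            "service": svc["name"],
            "severity": "critical",
            "issue": "no escalation policy attached",
        }
        for svc in services
        if not svc.get("escalation_policy")
    ]

def fetch_pagerduty_services(token):
    """Pull the service list with a read-only API token (pagination omitted)."""
    req = urllib.request.Request(
        "https://api.pagerduty.com/services",
        headers={"Authorization": f"Token token={token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("services", [])
```

A tool like this would presumably run many rules of this shape per integration and merge the results into one scored report.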
I added AI on top to generate prioritized recommendations, and an "Incident Autopilot" that pulls live data from the connected tools when something goes wrong. If you describe a symptom like "checkout is slow" it maps the blast radius across services, identifies who's on call, and builds an investigation playbook.
As I kept working on it, more use cases came to mind, so the latest thing I added is a PR/MR scanner. It integrates with GitHub/GitLab webhooks, and when someone opens a PR that adds a new API endpoint or database connection, it flags the change and suggests which monitors should be added before merging.
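As a rough sketch of how that kind of scan could work (hypothetical patterns, not the actual implementation), you can walk the added lines of a PR's unified diff and pattern-match for risky changes:

```python
import re

# Illustrative (not exhaustive) patterns for added lines that likely
# introduce a new API endpoint or database connection.
ENDPOINT_PATTERNS = [
    re.compile(r"@app\.route\("),                    # Flask
    re.compile(r"router\.(get|post|put|delete)\("),  # Express
]
DB_PATTERNS = [
    re.compile(r"create_engine\(|psycopg2\.connect\("),
]

def scan_diff(diff_text):
    """Scan a unified diff's added lines for changes that warrant new monitors."""
    findings = []
    for line in diff_text.splitlines():
        # Only look at added lines; skip the '+++' file header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        added = line[1:]
        if any(p.search(added) for p in ENDPOINT_PATTERNS):
            findings.append({
                "change": "new API endpoint",
                "suggestion": "add a latency and error-rate monitor",
            })
        elif any(p.search(added) for p in DB_PATTERNS):
            findings.append({
                "change": "new database connection",
                "suggestion": "add connection-pool and query-time monitors",
            })
    return findings
```

The webhook handler would then fetch the PR's diff from the GitHub/GitLab API and post the findings back as a review comment.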
Here's where I'm stuck: it works (there's a live demo you can try without creating an account), but I'm not sure if I'm solving a real problem or just my problem.
Some questions I keep going back and forth on:
Is this actually painful enough? Most teams I've talked to know their monitoring has gaps. But is it painful enough that they'd connect a third-party tool to audit it? Or do they just accept the risk?
A few people have told me "I could write a script for this." And yeah, you could write a script that checks PagerDuty escalation policies. But would you also write one for Datadog monitors, Grafana alert rules, and Sentry project configs, and keep them all updated? At some point it's not a script anymore.
This system requires read-only API tokens to your monitoring tools. I get why that makes people nervous. The tokens are encrypted at rest and never stored in plaintext, but the trust barrier is real.
If you want to poke at it, the demo is at https://getcova.ai. Click "Enter Demo" and explore with synthetic data; no signup needed.
Would genuinely love to hear from anyone who's dealt with monitoring coverage gaps on their team. How do you handle it today? Is it just tribal knowledge and hope?