<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: paulg7516</title>
    <description>The latest articles on DEV Community by paulg7516 (@paulg7516).</description>
    <link>https://dev.to/paulg7516</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3825974%2F96950dd4-8e8b-4f4f-a093-49c631fb9c89.png</url>
      <title>DEV Community: paulg7516</title>
      <link>https://dev.to/paulg7516</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/paulg7516"/>
    <language>en</language>
    <item>
      <title>How to Audit Your Monitoring Stack (Before the Next Incident Does It for You)</title>
      <dc:creator>paulg7516</dc:creator>
      <pubDate>Thu, 26 Mar 2026 00:15:09 +0000</pubDate>
      <link>https://dev.to/paulg7516/how-to-audit-your-monitoring-stack-before-the-next-incident-does-it-for-you-fkp</link>
      <guid>https://dev.to/paulg7516/how-to-audit-your-monitoring-stack-before-the-next-incident-does-it-for-you-fkp</guid>
      <description>&lt;p&gt;If you've been in this industry long enough, you've sat through a post-mortem where someone says "we should have had monitoring for that". Maybe it was an endpoint that went down and nobody knew until a customer reported it. Maybe it was an escalation policy that pointed to someone who left the company six months ago.&lt;/p&gt;

&lt;p&gt;The annoying part is that these aren't hard problems. They're just invisible ones. Nobody wakes up and thinks "I should check if our PagerDuty escalation policies all have valid responders". You set things up, they work for a while, and then stuff drifts.&lt;/p&gt;

&lt;p&gt;So here's what you should actually check when you audit a monitoring setup. Not the theoretical "you should have observability" stuff - the specific, concrete things that catch on fire when you don't look at them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Escalation policies with nobody home&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This one bites more teams than you'd think. Someone leaves, their PagerDuty schedule doesn't get updated, and now there's a gap every third Tuesday where alerts go to a person who no longer exists. The alert fires, gets acknowledged by nobody, and times out.&lt;/p&gt;

&lt;p&gt;What to check:&lt;/p&gt;

&lt;p&gt;Do all escalation policies have at least two levels? (this one is easy to script - see the sketch below)&lt;br&gt;
Are there schedules with gaps (unassigned time blocks)?&lt;br&gt;
Is there a catch-all policy for services that don't match anything specific?&lt;/p&gt;
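
&lt;p&gt;If you want to script the first check, PagerDuty's REST API exposes escalation policies directly. A minimal sketch in Python, assuming a read-only API token in a PD_TOKEN environment variable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: flag PagerDuty escalation policies with a single level, or with
# rules that route to nobody. Assumes a read-only token in PD_TOKEN.
import os
import requests

headers = {
    "Authorization": f"Token token={os.environ['PD_TOKEN']}",
    "Accept": "application/vnd.pagerduty+json;version=2",
}

resp = requests.get(
    "https://api.pagerduty.com/escalation_policies",
    headers=headers,
    params={"limit": 100},
)
resp.raise_for_status()

for policy in resp.json()["escalation_policies"]:
    rules = policy.get("escalation_rules", [])
    # A single-level policy has no fallback if the first responder misses the page.
    if len(rules) in (0, 1):
        print(f"Only one escalation level: {policy['summary']}")
    # Rules with an empty targets list route the page to nobody.
    for rule in rules:
        if not rule.get("targets"):
            print(f"Escalation rule with no targets in: {policy['summary']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Schedule gaps take a separate pass over the schedules endpoint, but the pattern is the same: pull the config, assert the invariant, print whatever fails.&lt;/p&gt;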

&lt;p&gt;&lt;em&gt;Alert rules that don't actually notify anyone&lt;/em&gt;&lt;br&gt;
This is shockingly common in Datadog and Grafana setups. Someone creates a monitor, sets the threshold, and forgets to add a notification channel. Or they add a chat channel (e.g. Slack) that got archived. The monitor dutifully fires, nobody sees it, and you find out about the outage from your users.&lt;/p&gt;

&lt;p&gt;Look for:&lt;/p&gt;

&lt;p&gt;Monitors with no notification targets&lt;br&gt;
Notification channels that point to archived Slack channels or deleted email groups&lt;br&gt;
Monitors in a "no data" state for more than a few days - they're probably broken. (the first and third are scriptable - see the sketch below)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Dashboards with empty panels&lt;/em&gt;&lt;br&gt;
Sounds trivial, but dashboards are often the first thing people open during an incident. If half the panels show "no data" or reference metrics that no longer exist, you're flying blind when it matters most. I've seen dashboards that looked great when they were built, but the underlying metric names changed during a migration and nobody updated them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The coverage gaps nobody thinks about&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Endpoints with no monitoring&lt;/em&gt;&lt;br&gt;
Most teams have monitoring on their main API endpoints. But what about that internal service that handles webhooks? The admin panel that processes bulk operations? The cron job that reconciles billing data?&lt;/p&gt;

&lt;p&gt;The pattern is always the same: critical business logic gets deployed, someone says "we'll add monitoring later" and later never comes. Then that service goes down on a Friday evening.&lt;/p&gt;

&lt;p&gt;To actually find these, you need to compare what's in your codebase against what's in your monitoring tools. If your repo has 40 HTTP endpoints and your Datadog has monitors for 12 of them, those other 28 are blind spots.&lt;/p&gt;
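
&lt;p&gt;A crude way to build that list is to diff the routes declared in the repo against whatever paths your monitoring already covers. A sketch for decorator-style frameworks like Flask or FastAPI; the monitored_paths set is a placeholder for whatever export your monitoring tool gives you:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: diff HTTP routes declared in the codebase against monitored paths.
# ROUTE_PATTERN matches decorator-style routes (Flask/FastAPI); adjust it for
# your framework. monitored_paths is a hypothetical placeholder.
import pathlib
import re

ROUTE_PATTERN = re.compile(r"""@\w+\.(?:route|get|post|put|delete)\(\s*["']([^"']+)""")

def declared_routes(repo_root="."):
    routes = set()
    for path in pathlib.Path(repo_root).rglob("*.py"):
        routes.update(ROUTE_PATTERN.findall(path.read_text(errors="ignore")))
    return routes

# Placeholder: the paths you already have synthetic checks or monitors for.
monitored_paths = {"/api/orders", "/api/users", "/healthz"}

for route in sorted(declared_routes() - monitored_paths):
    print(f"No monitor found for endpoint: {route}")
&lt;/code&gt;&lt;/pre&gt;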

&lt;p&gt;&lt;em&gt;Database monitoring that stops at "is it up?"&lt;/em&gt;&lt;br&gt;
Checking that Postgres is responding to pings is table stakes. What about slow queries? Connection pool exhaustion? Replication lag? Disk space trending toward full?&lt;/p&gt;

&lt;p&gt;The database is usually the last thing to go down and the hardest thing to recover from. "The database is up" and "the database is healthy" are very different statements.&lt;/p&gt;
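
&lt;p&gt;If you want a starting point for the "healthy, not just up" checks, most of them are one query away in Postgres. A sketch using psycopg2, assuming a connection string in PG_DSN; the thresholds are purely illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: a few Postgres health checks beyond "is it up?".
# Assumes psycopg2 and a DSN in PG_DSN; thresholds are illustrative.
import os
import psycopg2

conn = psycopg2.connect(os.environ["PG_DSN"])
cur = conn.cursor()

# Connection pool headroom: current sessions vs. max_connections.
cur.execute("SELECT count(*) FROM pg_stat_activity")
used = cur.fetchone()[0]
cur.execute("SHOW max_connections")
limit = int(cur.fetchone()[0])
if used / limit &amp;gt; 0.8:
    print(f"Connection pool at {used}/{limit}")

# Replication lag in seconds (meaningful on a replica; NULL on the primary).
cur.execute("SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))")
lag = cur.fetchone()[0]
if lag is not None and lag &amp;gt; 60:
    print(f"Replication lag: {lag:.0f}s")

# Queries that have been running for more than 30 seconds.
cur.execute(
    "SELECT count(*) FROM pg_stat_activity "
    "WHERE state = 'active' AND now() - query_start &amp;gt; interval '30 seconds'"
)
print(f"Long-running queries in flight: {cur.fetchone()[0]}")
&lt;/code&gt;&lt;/pre&gt;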

&lt;p&gt;&lt;em&gt;Error tracking without thresholds&lt;/em&gt;&lt;br&gt;
If you're using Sentry or a similar tool, you're probably capturing errors. But are you alerting on them? A lot of teams have Sentry collecting thousands of events with no alerting rules, so they only check it reactively after something breaks. Set up thresholds - if error rates spike 5x above baseline, that's worth a ping.&lt;/p&gt;
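
&lt;p&gt;You can script the "collecting but not alerting" part: Sentry's API lists the issue alert rules per project, so you can find projects that capture errors but alert on nothing. A sketch, assuming an auth token in SENTRY_TOKEN and your org slug in SENTRY_ORG:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: Sentry projects that capture errors but have no alert rules.
# Assumes SENTRY_TOKEN (auth token) and SENTRY_ORG (org slug) env vars.
import os
import requests

org = os.environ["SENTRY_ORG"]
headers = {"Authorization": f"Bearer {os.environ['SENTRY_TOKEN']}"}

projects = requests.get(
    f"https://sentry.io/api/0/organizations/{org}/projects/", headers=headers
).json()

for project in projects:
    rules = requests.get(
        f"https://sentry.io/api/0/projects/{org}/{project['slug']}/rules/",
        headers=headers,
    ).json()
    if not rules:
        print(f"Capturing errors, alerting on nothing: {project['slug']}")
&lt;/code&gt;&lt;/pre&gt;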

&lt;p&gt;&lt;strong&gt;How to think about coverage&lt;/strong&gt;&lt;br&gt;
I find it useful to think about monitoring coverage across dimensions rather than as a single metric. A team might have great alert routing but zero database monitoring. If you just look at "do we have monitors?" the answer is yes, but you're still exposed.&lt;/p&gt;

&lt;p&gt;The dimensions you should check:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Endpoint monitoring&lt;/strong&gt; - Are your HTTP endpoints covered? Latency, error rate, availability.&lt;br&gt;
&lt;strong&gt;Database monitoring&lt;/strong&gt; - Connection pools, query performance, replication, disk.&lt;br&gt;
&lt;strong&gt;Error tracking&lt;/strong&gt; - Are errors being captured AND alerted on?&lt;br&gt;
&lt;strong&gt;Alert quality&lt;/strong&gt; - Do alerts have proper thresholds, or are they too noisy/too quiet?&lt;br&gt;
&lt;strong&gt;Escalation routing&lt;/strong&gt; - Does every alert have a path to a human?&lt;br&gt;
&lt;strong&gt;Dashboard health&lt;/strong&gt; - Are dashboards up to date and functional?&lt;br&gt;
&lt;strong&gt;Business flows&lt;/strong&gt; - Are the critical user journeys (checkout, signup, payment) monitored end-to-end?&lt;br&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt; - CPU, memory, disk, network at the host/container level.&lt;/p&gt;

&lt;p&gt;Scoring each dimension separately gives you a much clearer picture than a single "monitoring health" number. You might be at 90% on alerts but 20% on database monitoring - and that 20% is where the next outage is hiding.&lt;/p&gt;
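
&lt;p&gt;The scoring itself doesn't have to be fancy - "checks passed over checks run" per dimension is enough to show where the weak spots are. A sketch of the shape, with made-up numbers:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: per-dimension coverage scores instead of one blended number.
# The dimensions and counts here are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str
    passed: int  # checks that passed
    total: int   # checks run

    @property
    def score(self):
        return round(100 * self.passed / self.total) if self.total else 0

report = [
    Dimension("Endpoint monitoring", passed=12, total=40),
    Dimension("Database monitoring", passed=1, total=5),
    Dimension("Escalation routing", passed=9, total=10),
]

# Sort worst-first so the next outage's likely hiding place is at the top.
for dim in sorted(report, key=lambda d: d.score):
    print(f"{dim.name}: {dim.score}% ({dim.passed}/{dim.total})")
&lt;/code&gt;&lt;/pre&gt;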

&lt;p&gt;&lt;strong&gt;The multi-tool problem&lt;/strong&gt;&lt;br&gt;
Here's the thing most "monitoring best practices" articles miss: almost nobody uses a single tool. The average team has 3-5 monitoring tools. PagerDuty for alerting, Datadog for metrics, Sentry for errors, Grafana for dashboards, maybe New Relic for APM.&lt;/p&gt;

&lt;p&gt;Each tool has its own configuration, its own gaps, and its own way of doing things. An audit that only looks at Datadog misses the PagerDuty escalation gaps. An audit that only looks at PagerDuty misses the Datadog monitors with no notifications or proper thresholds.&lt;/p&gt;

&lt;p&gt;The real blind spots live in the spaces between tools. Service A is monitored in Datadog but its alerts route through a PagerDuty policy that hasn't been updated. Service B has a Grafana dashboard but no actual alerting. Service C is in Sentry but nobody set up alert rules.&lt;/p&gt;

&lt;p&gt;To actually audit your stack, you need to look across all of them at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Making it repeatable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The worst part about monitoring audits is that they rot. You can do a thorough audit today, fix everything, and in three months new services have been deployed without monitoring, someone changed a PagerDuty schedule, and a Slack channel got archived. You're back where you started.&lt;/p&gt;

&lt;p&gt;The fix is to make auditing a regular thing - not a quarterly project, but something that runs continuously or at least on every deploy. This is why I built &lt;a href="https://getcova.ai/" rel="noopener noreferrer"&gt;Cova&lt;/a&gt; - it connects to your monitoring tools, runs the audit automatically, and shows you exactly where the gaps are. It also scans PRs to catch new endpoints shipping without monitoring, so the drift doesn't happen in the first place.&lt;/p&gt;

&lt;p&gt;If you want to see what this looks like on a real setup, &lt;strong&gt;try the demo&lt;/strong&gt; on the landing page - no signup needed. It runs through a sample monitoring stack and shows the kind of findings a typical audit surfaces.&lt;/p&gt;

&lt;p&gt;But even without a tool, the checklist above will get you pretty far. Pick one dimension, check it this week. Pick another one next week. You'll be surprised what you find.&lt;/p&gt;

&lt;p&gt;The goal isn't perfection - it's not finding out about your blind spots from an angry customer at 3am.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>ai</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>I got tired of monitoring blind spots, so I built something to find them</title>
      <dc:creator>paulg7516</dc:creator>
      <pubDate>Sun, 15 Mar 2026 23:04:46 +0000</pubDate>
      <link>https://dev.to/paulg7516/we-audit-our-code-regularly-why-dont-we-audit-our-monitoring-lfh</link>
      <guid>https://dev.to/paulg7516/we-audit-our-code-regularly-why-dont-we-audit-our-monitoring-lfh</guid>
      <description>&lt;p&gt;I've been thinking about this for a while. We have automated checks for code quality, security, and test coverage, but for monitoring we just hope it's fine.&lt;/p&gt;

&lt;p&gt;I've spent years on the other side of this - incident commander, on P1 calls, running RCAs/CAPAs, etc. The pattern was always the same: something breaks, users report it before we even know, and during the post-incident review we realize that proper monitoring would have caught it earlier. We were always reactive - hearing about problems from affected users instead of catching them ourselves. After seeing that happen enough times, I wanted a way to proactively find those gaps before they turn into incidents.&lt;/p&gt;

&lt;p&gt;So I decided to build something. I started working on a tool that connects to your monitoring stack (PagerDuty, Datadog, Grafana, Sentry, New Relic, etc.) and runs a gap analysis. Not checking "are your services up" but rather "do your services actually have alerts configured, and are those alerts going somewhere useful."&lt;/p&gt;

&lt;p&gt;The system pulls configs through each tool's API, then checks for stuff like services with no escalation policy, alert rules with no notification channel attached, monitors that haven't received data in 30+ days, and scheduled searches with alerting disabled. Each issue gets a severity (critical/warning/info) and a concrete fix suggestion generated by AI. It also scores your setup across coverage dimensions - alert coverage, notification routing, dashboard health - so you can see where the biggest gaps are in a single pane of glass.&lt;/p&gt;

&lt;p&gt;I added AI on top to generate prioritized recommendations, and an "Incident Autopilot" that pulls live data from the connected tools when something goes wrong. If you describe a symptom like "checkout is slow" it maps the blast radius across services, identifies who's on call, and builds an investigation playbook. &lt;/p&gt;
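
&lt;p&gt;Some of that is genuinely hard (the blast-radius mapping), but some pieces are plain API plumbing - for example, "who's on call right now" is a single PagerDuty call. A sketch, assuming a read-only token in PD_TOKEN:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: list who is currently on call, per escalation policy.
# Assumes a read-only PagerDuty token in PD_TOKEN.
import os
import requests

resp = requests.get(
    "https://api.pagerduty.com/oncalls",
    headers={"Authorization": f"Token token={os.environ['PD_TOKEN']}"},
)
resp.raise_for_status()

for oncall in resp.json()["oncalls"]:
    policy = oncall["escalation_policy"]["summary"]
    user = oncall["user"]["summary"]
    print(f"{policy}: {user} (level {oncall['escalation_level']})")
&lt;/code&gt;&lt;/pre&gt;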

&lt;p&gt;As I kept working on this system, more use cases came to mind, so the latest thing I added is a PR/MR scanner. It integrates with GitHub/GitLab webhooks, and when someone opens a PR that adds a new API endpoint or database connection, it flags it and suggests what monitors should be added before merging.&lt;/p&gt;
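
&lt;p&gt;The core of that check is conceptually simple: scan the added lines of the diff for new route definitions. A rough sketch of the idea; fetch_pr_diff and post_review_comment are hypothetical helpers, not a real webhook framework:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: flag PR diffs that add HTTP routes so a reviewer can ask for
# monitors before merge. fetch_pr_diff / post_review_comment are
# hypothetical helpers; ROUTE_ADDED matches decorator-style routes.
import re

ROUTE_ADDED = re.compile(
    r"""^\+\s*@\w+\.(?:route|get|post|put|delete)\(\s*["']([^"']+)""",
    re.MULTILINE,
)

def new_endpoints_in_diff(diff_text):
    """Return endpoint paths introduced by added lines in a unified diff."""
    return ROUTE_ADDED.findall(diff_text)

def handle_pull_request(event):
    diff_text = fetch_pr_diff(event["pull_request"]["url"])  # hypothetical
    for path in new_endpoints_in_diff(diff_text):
        post_review_comment(  # hypothetical
            event,
            f"New endpoint {path} has no monitor yet - consider adding "
            "latency and error-rate alerts before merging.",
        )
&lt;/code&gt;&lt;/pre&gt;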

&lt;p&gt;The part I'm stuck on: it works (there's a live demo you can try without creating an account), but I'm not sure if I'm solving a real problem or just my own problem.&lt;/p&gt;

&lt;p&gt;Some questions I keep going back and forth on:&lt;/p&gt;

&lt;p&gt;Is this actually painful enough? Most teams I've talked to know their monitoring has gaps. But is it painful enough that they'd connect a third-party tool to audit it? Or do they just accept the risk?&lt;/p&gt;

&lt;p&gt;A few people have told me "I could write a script for this or for that." And yeah, you could write a script that checks PagerDuty escalation policies. But would you also write one for Datadog monitors, Grafana alert rules, Sentry project configs, and keep them all updated? At some point it's not a script anymore.&lt;/p&gt;

&lt;p&gt;This system requires read-only API tokens to your monitoring tools. I get why that makes people nervous. The tokens are encrypted at rest and never stored in plaintext, but the trust barrier is real.&lt;/p&gt;

&lt;p&gt;If you want to poke at it, the demo is at &lt;a href="https://getcova.ai" rel="noopener noreferrer"&gt;https://getcova.ai&lt;/a&gt; - click "Enter Demo" and explore with synthetic data, no signup needed.&lt;/p&gt;

&lt;p&gt;Would genuinely love to hear from anyone who's dealt with monitoring coverage gaps on their team. How do you handle it today? Is it just tribal knowledge and hope?&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>ai</category>
      <category>devops</category>
      <category>python</category>
    </item>
  </channel>
</rss>
