<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mike</title>
    <description>The latest articles on DEV Community by Mike (@mihail_147bfaf4bbb8ec9949).</description>
    <link>https://dev.to/mihail_147bfaf4bbb8ec9949</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3982729%2Fbf2ae330-92bd-4d49-b809-787323a565c9.png</url>
      <title>DEV Community: Mike</title>
      <link>https://dev.to/mihail_147bfaf4bbb8ec9949</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mihail_147bfaf4bbb8ec9949"/>
    <language>en</language>
    <item>
      <title>What Building Website Monitoring Taught Me About Silent Failures</title>
      <dc:creator>Mike</dc:creator>
      <pubDate>Sat, 13 Jun 2026 20:58:18 +0000</pubDate>
      <link>https://dev.to/mihail_147bfaf4bbb8ec9949/what-building-website-monitoring-taught-me-about-silent-failures-1jm</link>
      <guid>https://dev.to/mihail_147bfaf4bbb8ec9949/what-building-website-monitoring-taught-me-about-silent-failures-1jm</guid>
      <description>&lt;p&gt;I assumed when building NorthDuty that website monitoring would focus on the obvious failures.&lt;/p&gt;

&lt;p&gt;The site is down.&lt;/p&gt;

&lt;p&gt;The SSL certificate expired.&lt;/p&gt;

&lt;p&gt;DNS is broken.  &lt;/p&gt;

&lt;p&gt;The server returns &lt;code&gt;500&lt;/code&gt;.  &lt;/p&gt;

&lt;p&gt;That kind of thing.&lt;/p&gt;

&lt;p&gt;And yes, those problems are important. If a site is completely down, you want to know as soon as possible.&lt;/p&gt;

&lt;p&gt;But the more I worked on monitoring, the more I came to realize that the most irritating failures are not the loud ones. They are the silent ones.&lt;/p&gt;

&lt;p&gt;The website responds. The status code is acceptable. The homepage loads, at least technically. Nothing looks broken to a simple HTTP check.&lt;/p&gt;

&lt;p&gt;But for the user, something important is gone.&lt;/p&gt;

&lt;h2&gt;&lt;code&gt;200 OK&lt;/code&gt; can lie to you&lt;/h2&gt;

&lt;p&gt;One of the first lessons was that &lt;code&gt;200 OK&lt;/code&gt; does not mean “the website works.”&lt;/p&gt;

&lt;p&gt;It only means the server returned a response.&lt;/p&gt;

&lt;p&gt;That is simple to say, but easy to lose sight of when you're building monitoring. A lot of uptime checks go no further than that: they make a request, check the status code, maybe look at response time, and call the site healthy.&lt;/p&gt;

&lt;p&gt;The problem is that users do not experience status codes.&lt;/p&gt;

&lt;p&gt;They experience pages.&lt;/p&gt;

&lt;p&gt;A frontend app can return 200 OK and the screen can stay blank because the JavaScript crashed. A dashboard can load the shell but never fetch the data. A pricing page can render, but the CTA button can be invisible because of a CSS change. A checkout flow can fail after the first step, while the homepage looks 100% fine.&lt;/p&gt;

&lt;p&gt;From the monitor’s point of view, everything passed.&lt;/p&gt;

&lt;p&gt;From the user’s point of view, the product is broken.&lt;/p&gt;

&lt;p&gt;That gap is where a lot of silent failures live.&lt;/p&gt;
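
&lt;p&gt;Here is a minimal sketch of what checking beyond the status code could look like. The function name &lt;code&gt;find_silent_failures&lt;/code&gt;, the marker strings, and the length threshold are illustrative assumptions, not how NorthDuty does it. It only inspects the response body, so it cannot see a JavaScript crash in a real browser. It just catches responses that are technically &lt;code&gt;200 OK&lt;/code&gt; but are nearly empty or missing content the page must contain.&lt;/p&gt;

```python
def find_silent_failures(status, body, required_markers, min_length=500):
    """Return a list of problems for a response that may look healthy.

    All names and thresholds here are hypothetical examples.
    """
    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    # A near-empty body often means an app shell that never rendered.
    stripped_length = len(body.strip())
    if min_length > stripped_length:
        problems.append(f"body too short ({stripped_length} chars)")
    # Text the page must contain to be useful, e.g. a CTA label.
    for marker in required_markers:
        if marker not in body:
            problems.append(f"missing expected content: {marker!r}")
    return problems


if __name__ == "__main__":
    # The server answered 200, but the body is almost empty.
    for problem in find_silent_failures(200, "Loading", ["Start free trial"]):
        print(problem)
```

&lt;p&gt;An empty list means nothing suspicious was found. Anything else is worth an alert even though the status code passed.&lt;/p&gt;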

&lt;h2&gt;The page can be alive and broken at the same time&lt;/h2&gt;

&lt;p&gt;This was probably the biggest mental shift for me.&lt;/p&gt;

&lt;p&gt;Before I added deeper checks, I thought of websites in binary terms: up or down.&lt;/p&gt;

&lt;p&gt;Now I think there is a large middle area.&lt;/p&gt;

&lt;p&gt;A site can be “up” but unusable.  &lt;/p&gt;

&lt;p&gt;A page can load but be empty.  &lt;/p&gt;

&lt;p&gt;A form can appear but not submit.  &lt;/p&gt;

&lt;p&gt;A login page can work while the logged-in app is broken.  &lt;/p&gt;

&lt;p&gt;A marketing site can be perfect while the actual product is failing.&lt;/p&gt;

&lt;p&gt;That middle area is dangerous because it does not always trigger alarms.&lt;/p&gt;

&lt;p&gt;If you monitor only the homepage, you can miss the dashboard. If you monitor only the API, you can miss the frontend. If you monitor only response codes, you can miss rendering issues. If you look only at logs, you can miss what the user actually saw.&lt;/p&gt;

&lt;p&gt;This is why I started caring more about screenshots, visual diffs, and user flows.&lt;/p&gt;

&lt;p&gt;Not because they are fancy features, but because they answer a better question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What did the user actually get?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Screenshots are surprisingly honest&lt;/h2&gt;

&lt;p&gt;A screenshot is not perfect, but it is hard to argue with.&lt;/p&gt;

&lt;p&gt;If the page is blank, the screenshot shows it.  &lt;/p&gt;

&lt;p&gt;If the layout exploded, the screenshot shows it.  &lt;/p&gt;

&lt;p&gt;If a cookie banner covers half the screen, the screenshot shows it.  &lt;/p&gt;

&lt;p&gt;If the main content never appears, the screenshot shows it.&lt;/p&gt;

&lt;p&gt;Adding screenshot-based monitoring changed my perspective on failures. Logs are good, but logs say what the system thought happened. Screenshots say what the user likely saw.&lt;/p&gt;

&lt;p&gt;That difference matters.&lt;/p&gt;

&lt;p&gt;Of course, visual monitoring has its own problems. Pages are messy. Dates change. Animations move. Fonts load at slightly different times. Ads and third-party widgets do whatever they want. If you alert on every pixel difference, you will create noise fast.&lt;/p&gt;

&lt;p&gt;Taking screenshots is easy. The hard part is deciding which changes are worth an alert.&lt;/p&gt;

&lt;p&gt;A visual diff that reports “this page changed by 0.3%” is not inherently useful. A visual diff that highlights “the hero section has disappeared”, “the layout has shifted”, or “the app has rendered an error screen” is quite useful.&lt;/p&gt;

&lt;p&gt;That kind of failure is easy to miss with traditional uptime monitoring.&lt;/p&gt;
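
&lt;p&gt;One way to make a visual diff say more than a percentage is to compare named regions instead of the whole page. The sketch below is an illustration under assumptions: screenshots are simplified to grids of grayscale values, and the region names, boxes, tolerance, and threshold are all made up for the example. A real setup would load images with an imaging library.&lt;/p&gt;

```python
def changed_fraction(before, after, box, tolerance=10):
    """Fraction of pixels in box whose grayscale value moved more than tolerance.

    before/after: equal-sized lists of rows of 0-255 ints.
    box: (top, left, bottom, right), with bottom/right exclusive.
    """
    top, left, bottom, right = box
    total = (bottom - top) * (right - left)
    changed = 0
    for y in range(top, bottom):
        for x in range(left, right):
            if abs(before[y][x] - after[y][x]) > tolerance:
                changed += 1
    return changed / total


def diff_by_region(before, after, regions, threshold=0.5):
    """Return names of regions where more than threshold of the pixels changed."""
    return [
        name
        for name, box in regions.items()
        if changed_fraction(before, after, box) > threshold
    ]


if __name__ == "__main__":
    # 100x100 white page with a 10x10 dark "hero" block in the corner.
    before = [[255] * 100 for _ in range(100)]
    for y in range(10):
        for x in range(10):
            before[y][x] = 0
    # After a bad deploy the hero block is gone and the page is all white.
    after = [[255] * 100 for _ in range(100)]

    regions = {"hero": (0, 0, 10, 10), "footer": (90, 0, 100, 100)}
    whole_page = changed_fraction(before, after, (0, 0, 100, 100))
    print(f"whole page changed: {whole_page:.1%}")
    print("regions over threshold:", diff_by_region(before, after, regions))
```

&lt;p&gt;In this toy example the whole page changed by only 1.0%, which is easy to dismiss as noise, while the per-region check flags that the hero region changed completely.&lt;/p&gt;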

&lt;h2&gt;User flows matter more than pages&lt;/h2&gt;

&lt;p&gt;Another useful thing I learned: monitoring a page is good, but monitoring a flow is better.&lt;/p&gt;

&lt;p&gt;Most visitors don't come just to look at a product. They want to do something.&lt;/p&gt;

&lt;p&gt;They want to log in.  &lt;/p&gt;

&lt;p&gt;Create a project.  &lt;/p&gt;

&lt;p&gt;Submit a form.  &lt;/p&gt;

&lt;p&gt;Check a report.  &lt;/p&gt;

&lt;p&gt;Finish checkout.  &lt;/p&gt;

&lt;p&gt;Invite a teammate.  &lt;/p&gt;

&lt;p&gt;If one step in that path breaks, the product is broken for that user.&lt;/p&gt;

&lt;p&gt;This is particularly true for SaaS products. The homepage can look great while the app itself is unusable. The API can be up while authentication is broken. The dashboard can load while the main action button does nothing.&lt;/p&gt;

&lt;p&gt;A single URL check will not catch that.&lt;/p&gt;

&lt;p&gt;That's why I think user-flow monitoring is closer to actual reliability. It doesn't ask, "Does this URL respond?" It asks, "Can someone still complete the thing they came here to do?"&lt;/p&gt;

&lt;p&gt;That is a much better question.&lt;/p&gt;
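
&lt;p&gt;A flow check can be sketched as an ordered list of steps that stops at the first one that fails and reports which step it was. The &lt;code&gt;run_flow&lt;/code&gt; helper and the step names below are hypothetical, and the actions are stand-ins. In practice each action would drive a real browser or call the app.&lt;/p&gt;

```python
def run_flow(steps):
    """Run (name, action) steps in order and stop at the first failure.

    Each action is a zero-argument callable that returns True on success.
    Returns (ok, failed_step_name_or_None).
    """
    for name, action in steps:
        try:
            ok = action()
        except Exception:
            # A crashing step counts as a failed step, not a crashed monitor.
            ok = False
        if not ok:
            return False, name
    return True, None


if __name__ == "__main__":
    # Stand-in actions for illustration only.
    steps = [
        ("open login page", lambda: True),
        ("log in", lambda: True),
        ("create project", lambda: False),  # simulated silent failure
        ("invite teammate", lambda: True),
    ]
    print(run_flow(steps))
```

&lt;p&gt;Reporting the failed step name matters as much as the pass/fail result, because it tells you where in the path the user got stuck.&lt;/p&gt;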

&lt;h2&gt;Silent failures are expensive because users rarely report them&lt;/h2&gt;

&lt;p&gt;This part is easy to underestimate.&lt;/p&gt;

&lt;p&gt;When something breaks, we like to imagine users will tell us.&lt;/p&gt;

&lt;p&gt;Sometimes they do. Most of the time, they do not.&lt;/p&gt;

&lt;p&gt;They refresh.  &lt;/p&gt;

&lt;p&gt;They try again.  &lt;/p&gt;

&lt;p&gt;They assume the product is unreliable.  &lt;/p&gt;

&lt;p&gt;They leave.&lt;/p&gt;

&lt;p&gt;If they were just shopping around, you may never know you lost them. If they're a customer, they may quietly lose faith. If they were ready to buy, they may decide not to.&lt;/p&gt;

&lt;p&gt;That's what makes silent failures so aggravating. They do not necessarily cause an outcry. They just quietly erode confidence.&lt;/p&gt;

&lt;p&gt;And because they are quiet, they can stay hidden longer than they should.&lt;/p&gt;

&lt;h2&gt;What I monitor differently now&lt;/h2&gt;

&lt;p&gt;Building website monitoring made me less satisfied with basic uptime checks.&lt;/p&gt;

&lt;p&gt;I still believe they're required. Status codes, SSL certificates, DNS, response time, and obvious server failures should all be monitored.&lt;/p&gt;

&lt;p&gt;But I would not stop there.&lt;/p&gt;

&lt;p&gt;Now I care about things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Did the page actually render?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is the screen blank?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Did the frontend throw an error?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Did the visual layout change in an unexpected way?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Are important elements still visible?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Can the main user flow still complete?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is the mobile version still usable?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Are third-party scripts slowing down or blocking the page?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These checks are closer to what users actually experience.&lt;/p&gt;

&lt;p&gt;Because at the end of the day, nobody cares that your server technically responded if the page they needed is broken.&lt;/p&gt;

&lt;h2&gt;The main lesson&lt;/h2&gt;

&lt;p&gt;The most valuable lesson that building &lt;a href="https://northduty.com" rel="noopener noreferrer"&gt;NorthDuty&lt;/a&gt; has taught me is that "up" does not mean "working".&lt;/p&gt;

&lt;p&gt;A website can be online and still fail its users.&lt;/p&gt;

&lt;p&gt;That's easy to say, but it changes the way you approach monitoring. You stop focusing only on the infrastructure and start focusing on the experience. You stop asking whether the server responded and start asking whether the product still works from the outside.&lt;/p&gt;

&lt;p&gt;Silent failures are hard because they do not announce themselves clearly.&lt;/p&gt;

&lt;p&gt;You have to go looking for them.&lt;/p&gt;

&lt;p&gt;And once you start looking at sites that way, basic uptime monitoring starts to look like only the first layer.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>monitoring</category>
      <category>sre</category>
      <category>saas</category>
    </item>
  </channel>
</rss>
